
This article discusses the basics of logistic regression and its implementation in Python. Logistic regression is basically a supervised classification algorithm. In a classification problem, the target variable (or output), y, can take only discrete values for a given set of features (or inputs), X.
Contrary to popular belief, logistic regression is a regression model. It builds a regression model to predict the probability that a given data entry belongs to the category numbered "1". Just as linear regression assumes that the data follow a linear function, logistic regression models the data using the sigmoid function.
Logistic regression becomes a classification technique only when a decision threshold is brought into the picture. Setting the threshold value is a very important aspect of logistic regression and depends on the classification problem itself.
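As a minimal sketch of these two steps, regression on the probability scale followed by thresholding, consider the following numpy snippet (the coefficient and feature values are made up purely for illustration):

import numpy as np

def sigmoid(z):
    # logistic (sigmoid) function: maps any real-valued score to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# made-up coefficients and feature values, purely for illustration
beta = np.array([-4.0, 1.5])            # intercept and slope
hours = np.array([1.0, 2.5, 4.0])       # input feature
scores = beta[0] + beta[1] * hours      # linear part of the model
probs = sigmoid(scores)                 # regression step: P(y = 1 | x)
labels = (probs >= 0.5).astype(int)     # classification only appears once a threshold is applied
print(probs)   # probabilities between 0 and 1
print(labels)  # 0/1 class labels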
The choice of the threshold value is mainly affected by the values of precision and recall. Ideally, we want both precision and recall to be 1, but this is seldom the case. In the case of a precision-recall trade-off, the threshold is chosen according to which kind of error is more costly: a lower threshold (high recall, lower precision) when false negatives must be avoided, for example in disease screening, and a higher threshold (high precision, lower recall) when false positives must be avoided.
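One way to inspect this trade-off is to compute precision and recall at every candidate threshold. Below is a minimal sketch using scikit-learn's precision_recall_curve; scikit-learn is not used in the from-scratch implementation later in this article, and the labels and probability scores here are made-up illustrative values:

import numpy as np
from sklearn.metrics import precision_recall_curve

# made-up ground-truth labels and predicted probabilities, purely for illustration
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 1])
y_scores = np.array([0.10, 0.35, 0.40, 0.80, 0.45, 0.60, 0.90, 0.30, 0.75, 0.55])

# precision and recall at every candidate threshold
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")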
Based on the number of categories, logistic regression can be classified as binomial (the target variable can take only two possible values, e.g. "0" or "1"), multinomial (the target variable has three or more possible categories with no ordering) or ordinal (the target variable has three or more ordered categories).
First, we explore the simplest form of logistic regression, i.e. binomial logistic regression. Consider an example dataset which maps the number of hours of study to the result of an exam. The result can take only two values, namely passed (1) or failed (0):
Hours (x)    Pass (y)
0.50 0
0.75 0
1.00 0
1.25 0
1.50 1
1.75 0
2.00 1
2.25 1
2.50 1
2.75 1
3.00 0
3.25 0
3.50 0
3.75 1
4.00 0
4.25 0
4.50 1
4.75 0
5.00 0
5.50 0
So, we have y ∈ {0, 1} for every observation, i.e. y is a categorical target variable which can take only two possible values: "0" or "1".
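As a quick cross-check on this small dataset, the binomial problem can also be fitted directly with scikit-learn; this is only a sketch that assumes scikit-learn is installed, and the from-scratch implementation later in the article does not rely on it:

import numpy as np
from sklearn.linear_model import LogisticRegression

# hours of study and exam outcome, copied from the table above
hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50, 2.75,
                  3.00, 3.25, 3.50, 3.75, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1,
                   0, 0, 0, 1, 0, 0, 1, 0, 0, 0])

model = LogisticRegression()
model.fit(hours, passed)

# estimated probability of passing after 2 and 4 hours of study
print(model.predict_proba([[2.0], [4.0]])[:, 1])

The fitted model returns the estimated probability of passing for any number of study hours, which is the same quantity the hand-written implementation below estimates via gradient descent.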
In order to generalize our model, we assume that the dataset has "p" feature variables and "n" observations. The feature matrix X is then an n x (p + 1) matrix whose i-th row is [1, x_i1, x_i2, ..., x_ip], where x_ij denotes the value of the j-th feature for the i-th observation and the leading 1 is the intercept term x_i0 = 1 (the column of ones stacked onto X in the code below).
Then, in a more compact form, the i-th observation is the vector x_i = [x_i0, x_i1, ..., x_ip]^T and the linear part of the model is β^T x_i, where β = [β_0, β_1, ..., β_p]^T is the vector of regression coefficients.
So, the hypothesis used for prediction is h(x_i) = g(β^T x_i), where g(z) = 1 / (1 + e^(-z)) is called the logistic function or the sigmoid function.
The graph of g(z) is an S-shaped curve. From it, we can infer that g(z) tends towards 1 as z → +∞, tends towards 0 as z → -∞, and is always bounded between 0 and 1, which is exactly what we need for a probability.
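The curve itself can be reproduced with a short matplotlib sketch:

import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-10, 10, 200)
g = 1.0 / (1.0 + np.exp(-z))                     # g(z), the sigmoid function

plt.plot(z, g, label='g(z)')
plt.axhline(0.5, linestyle='--', linewidth=0.8)  # default decision threshold g(z) = 0.5
plt.xlabel('z')
plt.ylabel('g(z)')
plt.legend()
plt.show()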
import csv
import numpy as np
import matplotlib.pyplot as plt


def loadCSV(filename):
    '''
    function to load dataset
    '''
    with open(filename, "r") as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for i in range(len(dataset)):
            dataset[i] = [float(x) for x in dataset[i]]
    return np.array(dataset)


def normalize(X):
    '''
    function to normalize feature matrix, X
    '''
    mins = np.min(X, axis=0)
    maxs = np.max(X, axis=0)
    rng = maxs - mins
    norm_X = 1 - ((maxs - X) / rng)
    return norm_X


def logistic_func(beta, X):
    '''
    logistic (sigmoid) function
    '''
    return 1.0 / (1 + np.exp(-np.dot(X, beta.T)))


def log_gradient(beta, X, y):
    '''
    logistic gradient function
    '''
    first_calc = logistic_func(beta, X) - y.reshape(X.shape[0], -1)
    final_calc = np.dot(first_calc.T, X)
    return final_calc


def cost_func(beta, X, y):
    '''
    cost function, J
    '''
    log_func_v = logistic_func(beta, X)
    y = np.squeeze(y)
    step1 = y * np.log(log_func_v)
    step2 = (1 - y) * np.log(1 - log_func_v)
    final = -step1 - step2
    return np.mean(final)


def grad_desc(X, y, beta, lr=.01, converge_change=.001):
    '''
    gradient descent function
    '''
    cost = cost_func(beta, X, y)
    change_cost = 1
    num_iter = 1
    while change_cost > converge_change:
        old_cost = cost
        beta = beta - (lr * log_gradient(beta, X, y))
        cost = cost_func(beta, X, y)
        change_cost = old_cost - cost
        num_iter += 1
    return beta, num_iter


def pred_values(beta, X):
    '''
    function to predict labels
    '''
    pred_prob = logistic_func(beta, X)
    pred_value = np.where(pred_prob >= .5, 1, 0)
    return np.squeeze(pred_value)


def plot_reg(X, y, beta):
    '''
    function to plot decision boundary
    '''
    # labelled observations
    x_0 = X[np.where(y == 0.0)]
    x_1 = X[np.where(y == 1.0)]

    # plotting points with diff color for diff label
    plt.scatter([x_0[:, 1]], [x_0[:, 2]], c='b', label='y = 0')
    plt.scatter([x_1[:, 1]], [x_1[:, 2]], c='r', label='y = 1')

    # plotting decision boundary
    x1 = np.arange(0, 1, 0.1)
    x2 = -(beta[0, 0] + beta[0, 1] * x1) / beta[0, 2]
    plt.plot(x1, x2, c='k', label='reg line')

    plt.xlabel('x1')
    plt.ylabel('x2')
    plt.legend()
    plt.show()


if __name__ == "__main__":
    # load the dataset
    dataset = loadCSV('dataset1.csv')

    # normalizing feature matrix
    X = normalize(dataset[:, :-1])

    # stacking columns with all ones in feature matrix
    X = np.hstack((np.matrix(np.ones(X.shape[0])).T, X))

    # response vector
    y = dataset[:, -1]

    # initial beta values
    beta = np.matrix(np.zeros(X.shape[1]))

    # beta values after running gradient descent
    beta, num_iter = grad_desc(X, y, beta)

    # estimated beta values and number of iterations
    print("Estimated regression coefficients:", beta)
    print("No. of iterations:", num_iter)

    # predicted labels
    y_pred = pred_values(beta, X)

    # number of correctly predicted labels
    print("Correctly predicted labels:", np.sum(y == y_pred))

    # plotting regression line
    plot_reg(X, y, beta)
Final output:
Estimated regression coefficients: [[ 1.70474504 15.04062212 -20.47216021]]
No. of iterations: 2612
Correctly predicted labels: 100
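Tying this back to the precision/recall discussion at the beginning, the hard labels produced by pred_values can be scored further. Below is a small numpy sketch; the two arrays are made-up stand-ins for the y and y_pred variables computed by the script above:

import numpy as np

def precision_recall(y_true, y_pred):
    # counts of true positives, false positives and false negatives
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fp), tp / (tp + fn)

# made-up stand-ins for the y and y_pred arrays produced by the script above
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
print(precision_recall(y_true, y_pred))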
This article is reproduced from 沐白AI笔记; author: 杨沐白.