1. KNN Classification
Since KNN is based on distances, the data must be scaled before use.
1.1 Feature scaling (StandardScaler)
from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
standardScaler.fit(X_train)
X_train_standard = standardScaler.transform(X_train)
X_test_standard = standardScaler.transform(X_test)
After scaling, feed the data into the model for training:
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=8)
knn_clf.fit(X_train_standard, y_train)
y_predict = knn_clf.predict(X_test_standard)
# The default scoring metric is classification accuracy
knn_clf.score(X_test_standard, y_test)
1.2 網(wǎng)格搜索 GridSearchCV使用網(wǎng)格搜索來(lái)確定KNN算法合適的超參數(shù) from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=8)
knn_classifier.fit(X_train_standard, y_train)
y_predict = knn_clf.predict(X_test_standard)
# 默認(rèn)的預(yù)測(cè)指標(biāo)為分類準(zhǔn)確度
knn_clf.score(X_test, y_test)
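A minimal grid-search sketch; the parameter ranges below are illustrative assumptions, not values from the original notes:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
param_grid = [
    {"weights": ["uniform"], "n_neighbors": list(range(1, 11))},
    {"weights": ["distance"], "n_neighbors": list(range(1, 11)), "p": list(range(1, 6))}
]
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, n_jobs=-1, verbose=1)
grid_search.fit(X_train_standard, y_train)
grid_search.best_params_               # the best hyperparameter combination found
knn_clf = grid_search.best_estimator_  # the classifier refit with those parameters
knn_clf.score(X_test_standard, y_test)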
1.3 交叉驗(yàn)證- GridSearchCV 本身就包括了交叉驗(yàn)證,,也可自己指定參數(shù)cv默認(rèn)GridSearchCV的KFold平分為3份
- 自己指定交叉驗(yàn)證,查看交叉驗(yàn)證成績(jī)from sklearn.model_selection import cross_val_score # 默認(rèn)為分成3份 cross_val_score(knn_clf, X_train, y_train, cv=5)這里默認(rèn)的scoring標(biāo)準(zhǔn)為 accuracy有許多可選的參數(shù),,具體查看官方文檔
- 封裝成函數(shù),在fit完模型之后,,一次性查看多個(gè)評(píng)價(jià)指標(biāo)的成績(jī)這里選的只是針對(duì)分類算法的指標(biāo),,也可以是針對(duì)回歸,聚類算法的評(píng)價(jià)指標(biāo)
def cv_score_train_test(model):
    num_cv = 5
    score_list = ["accuracy", "f1", "neg_log_loss", "roc_auc"]
    for score in score_list:
        print(score, "\t train:", cross_val_score(model, X_train, y_train, cv=num_cv, scoring=score).mean())
        print(score, "\t test:", cross_val_score(model, X_test, y_test, cv=num_cv, scoring=score).mean())
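Usage, assuming the fitted knn_clf from above (the f1, neg_log_loss, and roc_auc metrics assume a binary target here):
cv_score_train_test(knn_clf)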
2. Linear Regression
2.1 Simple linear regression
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)
# Inspect the intercept and coefficients
print(linreg.intercept_)
print(linreg.coef_)
linreg.score(X_test, y_test)
y_predict = linreg.predict(X_test)
2.2 Multiple linear regression fits a "straight line" in a higher-dimensional space, i.e. the data has several features rather than one. The code is the same as for simple linear regression above; a small sketch follows.
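A minimal sketch with hypothetical multi-feature data, only to show that the same API applies (X_multi and y_multi are made up for illustration):
import numpy as np
from sklearn.linear_model import LinearRegression
X_multi = np.random.random(size=(100, 3))                # 100 samples, 3 features
y_multi = X_multi.dot(np.array([2.0, -1.0, 3.0])) + 5.0  # hypothetical linear relationship
linreg = LinearRegression()
linreg.fit(X_multi, y_multi)
print(linreg.coef_)       # one coefficient per feature
print(linreg.intercept_)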
3. Gradient Descent
The data must be standardized before using gradient descent.
3.1 Stochastic gradient descent for linear regression (SGDRegressor)
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(max_iter=100)
sgd_reg.fit(X_train_standard, y_train_boston)
sgd_reg.score(X_test_standard, y_test_boston)
3.2 確定梯度下降計(jì)算的準(zhǔn)確性以多元線性回歸的目標(biāo)函數(shù)(損失函數(shù))為例 比較 使用數(shù)學(xué)推導(dǎo)式(得出具體解析解)的方法和debug的近似方法的比較 # 編寫(xiě)損失函數(shù)
def J(theta, X_b, y):
try:
return np.sum((y - X_b.dot(theta)) ** 2) / len(y)
except:
return float('inf')
# 編寫(xiě)梯度函數(shù)(使用數(shù)學(xué)推導(dǎo)方式得到的)
def dJ_math(theta, X_b, y):
return X_b.T.dot(X_b.dot(theta) - y) * 2.0 / len(y)
# 編寫(xiě)梯度函數(shù)(用來(lái)debug的形式)
def dJ_debug(theta, X_b, y, epsilon=0.01):
res = np.empty(len(theta))
for i in range(len(theta)):
theta_1 = theta.copy()
theta_1[i] += epsilon
theta_2 = theta.copy()
theta_2[i] -= epsilon
res[i] = (J(theta_1, X_b, y) - J(theta_2, X_b, y)) / (2 * epsilon)
return res
# Batch gradient descent to find the optimal theta
def gradient_descent(dJ, X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):
    theta = initial_theta
    i_iter = 0
    while i_iter < n_iters:
        gradient = dJ(theta, X_b, y)
        last_theta = theta
        theta = theta - eta * gradient
        if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
            break
        i_iter += 1
    return theta
# 函數(shù)入口參數(shù)第一個(gè),,要指定dJ函數(shù)是什么樣的
X_b = np.hstack([np.ones((len(X), 1)), X])
initial_theta = np.zeros(X_b.shape[1])
eta = 0.01
# 使用debug方式
theta = gradient_descent(dJ_debug, X_b, y, initial_theta, eta)
# 使用數(shù)學(xué)推導(dǎo)方式
theta = gradient_descent(dJ_math, X_b, y, initial_theta, eta)
# 得出的這兩個(gè)theta應(yīng)該是相同的
4. PCA
Since PCA maximizes variance, it is solved with gradient ascent. Do not standardize the data as a preprocessing step for PCA, or the principal components cannot be found properly.
4.1 Basic workflow
# For a two-dimensional data sample
from sklearn.decomposition import PCA
pca = PCA(n_components=1)  # number of leading components to keep; by default all are kept
pca.fit(X)
For example, before applying the KNN classifier, reduce the dimensionality of the data first:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)  # a float here instead specifies the fraction of variance to keep
pca.fit(X_train)
#訓(xùn)練集和測(cè)試集需要進(jìn)行相同降維處理操作
X_train_reduction = pca.transform(X_train)
X_test_reduction = pca.transform(X_test)
# Once reduced, the data can be fed to the model for fitting
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_reduction, y_train)
knn_clf.score(X_test_reduction, y_test)
4.2 降維的維數(shù)和精度的取舍指定的維數(shù),能解釋原數(shù)據(jù)的方差的比例 pca.explained_variance_ratio_
# Keep all principal components
pca = PCA(n_components=X_train.shape[1])
pca.fit(X_train)
pca.explained_variance_ratio_
# Number of components after the reduction
pca.n_components_
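A common way to choose the dimensionality is to plot the cumulative explained variance; a sketch, assuming the all-components pca fitted just above:
import numpy as np
import matplotlib.pyplot as plt
plt.plot(range(1, X_train.shape[1] + 1),
         np.cumsum(pca.explained_variance_ratio_))
plt.xlabel("number of components")
plt.ylabel("cumulative explained variance")
plt.show()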
把數(shù)據(jù)降維到2維,,可以進(jìn)行scatter的可視化操作 4.3 PCA數(shù)據(jù)降噪先使用pca降維,,之后再反向,升維 from sklearn.decomposition import PCA
pca = PCA(0.7)
pca.fit(X)
pca.n_components_
X_reduction = pca.transform(X)
X_inversed = pca.inverse_transform(X_reduction)
5. 多項(xiàng)式回歸與模型泛化多項(xiàng)式回顧需要指定最高的階數(shù),, degree 擬合的將不再是一條直線 - 只有一個(gè)特征的樣本,,進(jìn)行多項(xiàng)式回歸可以擬合出曲線,并且在二維平面圖上進(jìn)行繪制
- 而對(duì)于具有多個(gè)特征的樣本,,同樣可以進(jìn)行多項(xiàng)式回歸,,但是不能可視化擬合出來(lái)的曲線
5.1 多項(xiàng)式回歸和Pipelinefrom sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
poly_reg = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),
    ("std_scaler", StandardScaler()),
    ("lin_reg", LinearRegression())
])
poly_reg.fit(X, y)
y_predict = poly_reg.predict(X)
# 對(duì)二維數(shù)據(jù)點(diǎn)可以繪制擬合后的圖像
plt.scatter(X, y)
plt.plot(np.sort(x), y_predict[np.argsort(x)], color='r')
plt.show()
#更常用的是,把pipeline寫(xiě)在函數(shù)中
def PolynomialRegression(degree):
    return Pipeline([
        ("poly", PolynomialFeatures(degree=degree)),
        ("std_scaler", StandardScaler()),
        ("lin_reg", LinearRegression())
    ])
poly2_reg = PolynomialRegression(degree=2)
poly2_reg.fit(X, y)
y2_predict = poly2_reg.predict(X)
from sklearn.metrics import mean_squared_error
mean_squared_error(y, y2_predict)
5.2 GridSearchCV and Pipeline
To be clear:
- GridSearchCV is for finding the optimal hyperparameters of a given model.
- Pipeline is for chaining several processing steps together (PolynomialFeatures(), StandardScaler(), LinearRegression()).
To combine the two, pass the Pipeline, together with a param_grid, to GridSearchCV as its estimator, as sketched below.
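A minimal sketch, assuming X_train/y_train and searching over the polynomial degree (the range is illustrative):
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
pipe = Pipeline([
    ("poly", PolynomialFeatures()),
    ("std_scaler", StandardScaler()),
    ("lin_reg", LinearRegression())
])
# Pipeline step parameters are addressed as <step name>__<parameter name>
param_grid = {"poly__degree": [1, 2, 3, 4, 5]}
grid_search = GridSearchCV(pipe, param_grid, cv=5)
grid_search.fit(X_train, y_train)
grid_search.best_params_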
如果非要把上兩者寫(xiě)在一起,應(yīng)該把指定好param_grid參數(shù)的grid_search作為成員,,傳遞給Pipeline 5.3 模型泛化之嶺回歸(Ridge)首先明確: - 模型泛化是為了解決模型過(guò)擬合的問(wèn)題
- 嶺回歸是模型正則化的一種處理方式,,也稱為L2正則化
- 嶺回歸是線性回歸的一種正則化處理后的模型(作為pipeline的成員使用)
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
def RidgeRegression(degree, alpha):
    return Pipeline([
        ("poly", PolynomialFeatures(degree=degree)),
        ("std_scaler", StandardScaler()),
        ("ridge_reg", Ridge(alpha=alpha))
    ])
ridge_reg = RidgeRegression(degree=20, alpha=0.0001)
ridge_reg.fit(X_train, y_train)
y_predict = ridge_reg.predict(X_test)
mean_squared_error(y_test, y_predict)
In this code, alpha is the coefficient in front of the L2 regularization term, with the same meaning as in LASSO regression:
- the smaller alpha is, the more the model tends toward complexity;
- the larger alpha is, the more the model tends toward simplicity.
Difference between Ridge and LASSO:
- Ridge tends to keep the fit a smooth curve;
- LASSO tends to push the fit toward a straight line (it tends to drive some theta values to exactly 0, so it performs feature selection).
5.4 Model generalization: LASSO regression
- LASSO regression is another form of model regularization, also known as L1 regularization.
- LASSO regression is a regularized variant of linear regression (used here as a Pipeline member).
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
def LassoRegression(degree, alpha):
    return Pipeline([
        ("poly", PolynomialFeatures(degree=degree)),
        ("std_scaler", StandardScaler()),
        ("lasso_reg", Lasso(alpha=alpha))
    ])
lasso_reg = LassoRegression(3, 0.01)
lasso_reg.fit(X_train, y_train)
y_predict = lasso_reg.predict(X_test)
mean_squared_error(y_test, y_predict)
6. Logistic Regression
Logistic regression links the sample features to the probability that an event occurs.
- It can be viewed either as a regression algorithm or as a classification algorithm.
- It is usually used as a binary classification algorithm.
6.1 Plotting the decision boundary
# Helper for plotting irregular decision boundaries
def plot_decision_boundary(model, axis):
    x0, x1 = np.meshgrid(
        np.linspace(axis[0], axis[1], int((axis[1] - axis[0]) * 100)).reshape(-1, 1),
        np.linspace(axis[2], axis[3], int((axis[3] - axis[2]) * 100)).reshape(-1, 1)
    )
    X_new = np.c_[x0.ravel(), x1.ravel()]
    y_predict = model.predict(X_new)
    zz = y_predict.reshape(x0.shape)
    from matplotlib.colors import ListedColormap
    custom_cmap = ListedColormap(['#EF9A9A', '#FFF59D', '#90CAF9'])
    plt.contourf(x0, x1, zz, cmap=custom_cmap)
# A plain linear logistic regression here
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
log_reg.score(X_test, y_test)
Plot the decision boundary:
plot_decision_boundary(log_reg, axis=[4, 7.5, 1.5, 4.5])
plt.scatter(X[y==0, 0], X[y==0, 1], color='r')
plt.scatter(X[y==1, 0], X[y==1, 1], color='blue')
plt.show()
6.2 多項(xiàng)式邏輯回歸同樣,,類似于多項(xiàng)式回歸,,需要使用Pipeline構(gòu)造多項(xiàng)式特征項(xiàng) from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
def PolynomialLogisticRegression(degree):
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('std_scaler', StandardScaler()),
        ('log_reg', LogisticRegression())
    ])
poly_log_reg = PolynomialLogisticRegression(degree=2)
poly_log_reg.fit(X, y)
poly_log_reg.score(X, y)
If needed, the decision boundary can be plotted:
plot_decision_boundary(poly_log_reg, axis=[-4, 4, -4, 4])
plt.scatter(X[y==0, 0], X[y==0, 1])
plt.scatter(X[y==1, 0], X[y==1, 1])
plt.show()
6.3 The regularization term and penalty coefficient C in logistic regression
The objective takes the form C * J(θ) + L1 or C * J(θ) + L2, where:
- the larger C is, the weaker the effect of the L1/L2 term, so the model tends toward complexity;
- the smaller C is, the stronger the relative effect of L1/L2 and the weaker that of J(θ), so the model tends toward simplicity.
def PolynomialLogisticRegression(degree, C, penalty='l2'):
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('std_scaler', StandardScaler()),
        ('log_reg', LogisticRegression(C=C, penalty=penalty))
        # LogisticRegression defaults to penalty='l2'
    ])
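A usage sketch (the degree and C values are illustrative):
poly_log_reg2 = PolynomialLogisticRegression(degree=20, C=0.1)
poly_log_reg2.fit(X_train, y_train)
poly_log_reg2.score(X_test, y_test)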
6.4 OvR and OvO
These strategies adapt algorithms that only handle binary classification to multiclass problems. scikit-learn wraps them as the OneVsRestClassifier and OneVsOneClassifier classes, so any binary classifier can be made multiclass with them. In the example, log_reg is an already-constructed binary logistic regression classifier.
from sklearn.multiclass import OneVsRestClassifier
ovr = OneVsRestClassifier(log_reg)
ovr.fit(X_train, y_train)
ovr.score(X_test, y_test)
from sklearn.multiclass import OneVsOneClassifier
ovo = OneVsOneClassifier(log_reg)
ovo.fit(X_train, y_train)
ovo.score(X_test, y_test)
7. 支撐向量機(jī)SVM注意 - 由于涉及到距離的概念,,因此,,在SVM擬合之前,必須先進(jìn)行數(shù)據(jù)標(biāo)準(zhǔn)化
支撐向量機(jī)要滿足的優(yōu)化目標(biāo)是: 使 “最優(yōu)決策邊界” 到與兩個(gè)類別的最近的樣本 的距離最遠(yuǎn) 即,,使得 margin 最大化 分為: - Hard Margin SVM
- Soft Margin SVM
7.1 SVM的正則化為了改善SVM模型的泛化能力,,需要進(jìn)行正則化處理,同樣有L1,、L2正則化 正則化即弱化限定條件,,使得某些樣本可以不再M(fèi)argin區(qū)域內(nèi) 懲罰系數(shù) C 是乘在正則項(xiàng)前面的 min12||w||2+C∑i=1mξi,L1正則項(xiàng)min12||w||2+C∑i=1mξi,L1正則項(xiàng) min12||w||2+C∑i=1mξ2i,L2正則項(xiàng)min12||w||2+C∑i=1mξi2,L2正則項(xiàng) 變化規(guī)律 : - C越大,容錯(cuò)空間越小,,越偏向于Hard Margin
- C越小,,容錯(cuò)空間越大,越偏向于Soft Margin
7.2 Linear SVM
from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
standardScaler.fit(X)
X_standard = standardScaler.transform(X)
from sklearn.svm import LinearSVC
svc = LinearSVC(C=1e9)
svc.fit(X_standard, y)
簡(jiǎn)潔起見(jiàn),,可以用Pipeline包裝起來(lái) from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
def Linear_svc(C=1.0):
    return Pipeline([
        ("std_scaler", StandardScaler()),
        ("linearSVC", LinearSVC(C=C))
    ])
linear_svc = Linear_svc(C=1e5)
linear_svc.fit(X, y)
7.3 多項(xiàng)式特征SVM明確:使用多項(xiàng)式核函數(shù)的目的都是將數(shù)據(jù)升維,,使得原本線性不可分的數(shù)據(jù)變得線性可分 在SVM中使用多項(xiàng)式特征有兩種方式 - 使用線性SVM,通過(guò)pipeline將 **poly ,、std ,、 linear_svc ** 三個(gè)連接起來(lái)
- 使用多項(xiàng)式核函數(shù)SVM, 則Pipeline只用包裝 std 、 kernelSVC 兩個(gè)類
7.3.1 傳統(tǒng)Pipeline多項(xiàng)式SVM# 傳統(tǒng)上使用多項(xiàng)式特征的SVM
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
def PolynomialSVC(degree, C=1.0):
    return Pipeline([
        ("poly", PolynomialFeatures(degree=degree)),
        ("std_standard", StandardScaler()),
        ("linearSVC", LinearSVC(C=C))
    ])
poly_svc = PolynomialSVC(degree=3)
poly_svc.fit(X, y)
7.3.2 多項(xiàng)式核函數(shù)SVM# 使用多項(xiàng)式核函數(shù)的SVM
from sklearn.svm import SVC
def PolynomialKernelSVC(degree, C=1.0):
    return Pipeline([
        ("std_standard", StandardScaler()),
        ("kernelSVC", SVC(kernel='poly', degree=degree, C=C))
    ])
poly_kernel_svc = PolynomialKernelSVC(degree=3)
poly_kernel_svc.fit(X, y)
7.3.3 Gaussian-kernel SVM (RBF)
The RBF kernel implicitly turns the original m×n data into an m×m representation.
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
def RBFkernelSVC(gamma=1.0):
    return Pipeline([
        ("std_standard", StandardScaler()),
        ("svc", SVC(kernel="rbf", gamma=gamma))
    ])
svc = RBFkernelSVC(gamma=1.0)
svc.fit(X, y)
超參數(shù)gamma γγ 規(guī)律: - gamma越大,,高斯核越“窄”,,頭部越“尖”
- gamma越小,高斯核越“寬”,,頭部越“平緩”,,圖形叉得越開(kāi)
若gamma太大,會(huì)造成 過(guò)擬合 若gamma太小,,會(huì)造成 欠擬合 ,,決策邊界變?yōu)?直線 7.4 使用SVM解決回歸問(wèn)題指定margin區(qū)域垂直方向上的距離 ?? epsilon 通用可以分為線性SVR和多項(xiàng)式SVR from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
def StandardLinearSVR(epsilon=0.1):
    return Pipeline([
        ("std_scaler", StandardScaler()),
        ("linearSVR", LinearSVR(epsilon=epsilon))
    ])
svr = StandardLinearSVR()
svr.fit(X_train, y_train)
svr.score(X_test, y_test)
# cross_val_score can be used instead for a more reliable cross-validated score
8. 決策樹(shù)非參數(shù)學(xué)習(xí)算法、天然可解決多分類問(wèn)題,、可解決回歸問(wèn)題(取葉子結(jié)點(diǎn)的平均值),、非常容易產(chǎn)生過(guò)擬合 可以考慮使用網(wǎng)格搜索來(lái)尋找最優(yōu)的超參數(shù) 劃分的依據(jù)有 基于信息熵 、 基于基尼系數(shù) (scikit默認(rèn)用gini,,兩者沒(méi)有特別優(yōu)劣之分) ID3,、C4.5都是使用“entropy"評(píng)判方式 CART(Classification and Regression Tree)使用的是“gini"評(píng)判方式 常用超參數(shù): - max_depth
- min_samples_split (設(shè)置最小的可供繼續(xù)劃分的樣本數(shù)量 )
- min_samples_leaf (指定葉子結(jié)點(diǎn)最小的包含樣本的數(shù)量 )
- max_leaf_nodes (指定,最多能生長(zhǎng)出來(lái)的葉子結(jié)點(diǎn)的數(shù)量 )
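A minimal grid-search sketch over these hyperparameters (the ranges are illustrative assumptions):
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
param_grid = {
    "max_depth": [2, 4, 6, 8],
    "min_samples_split": [2, 10, 20],
    "min_samples_leaf": [1, 5, 10],
    "max_leaf_nodes": [None, 16, 32]
}
tree_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5, n_jobs=-1)
tree_search.fit(X_train, y_train)
tree_search.best_params_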
8.1 Classification
from sklearn.tree import DecisionTreeClassifier
dt_clf = DecisionTreeClassifier(max_depth=2, criterion="gini")
# dt_clf = DecisionTreeClassifier(max_depth=2, criterion="entropy")
dt_clf.fit(X, y)
8.2 Regression
from sklearn.tree import DecisionTreeRegressor
dt_reg = DecisionTreeRegressor()
dt_reg.fit(X_train, y_train)
dt_reg.score(X_test, y_test)
# 計(jì)算的是R2值
9. 集成學(xué)習(xí)和隨機(jī)森林9.1 Hard Voting Classifier把幾種分類模型包裝在一起,,根據(jù)每種模型的投票結(jié)果來(lái)得出最終預(yù)測(cè)類別 可以先使用網(wǎng)格搜索把每種模型的參數(shù)調(diào)至最優(yōu),,再來(lái)Voting from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
voting_clf = VotingClassifier(estimators=[
    ("log_clf", LogisticRegression()),
    ("svm_clf", SVC()),
    ("dt_clf", DecisionTreeClassifier())
], voting='hard')
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)
9.2 Soft Voting Classifier
A more reasonable vote weights each model by how confident it is in its own prediction, so every member model must be able to estimate class probabilities:
- logistic regression
- KNN
- decision trees (a leaf usually contains more than one class, which yields a probability)
- SVC (with the probability parameter set to True)
soft_voting_clf = VotingClassifier(estimators=[
    ("log_clf", LogisticRegression()),
    ("svm_clf", SVC(probability=True)),
    ("dt_clf", DecisionTreeClassifier(random_state=666))
], voting='soft')
soft_voting_clf.fit(X_train, y_train)
soft_voting_clf.score(X_test, y_test)
9.3 Bagging (sampling with replacement)
(1) Bagging (sampling with replacement) vs. Pasting (sampling without replacement) is selected by the bootstrap parameter.
(2) These ensemble methods require a base estimator.
(3) With replacement sampling, about 37% of the samples are never drawn (out-of-bag, oob) and serve naturally as a test set; oob_score=True enables scoring on them. A basic sketch follows this list.
(4) Ways to diversify the sub-models:
- randomly sample only the features: Random Subspaces;
- randomly sample both samples and features: Random Patches.
# Random Subspaces: random sampling of features only
random_subspaces_clf = BaggingClassifier(DecisionTreeClassifier(),
                                         n_estimators=500, max_samples=500,
                                         bootstrap=True, oob_score=True,
                                         n_jobs=-1,
                                         max_features=1, bootstrap_features=True)
random_subspaces_clf.fit(X, y)
random_subspaces_clf.oob_score_
# Random Patches: random sampling of both samples and features
random_patches_clf = BaggingClassifier(DecisionTreeClassifier(),
                                       n_estimators=500, max_samples=100,
                                       bootstrap=True, oob_score=True,
                                       n_jobs=-1,
                                       max_features=1, bootstrap_features=True)
random_patches_clf.fit(X, y)
random_patches_clf.oob_score_
參數(shù)解釋: max_samples: 如果和樣本總數(shù)一致,,則不進(jìn)行樣本隨機(jī)采樣 max_features: 指定隨機(jī)采樣特征的個(gè)數(shù)(應(yīng)小于樣本維數(shù)) bootstrap_features: 指定是否進(jìn)行隨機(jī)特征采樣 oob_score: 指定是都用oob樣本來(lái)評(píng)分 bootstrap: 指定是否進(jìn)行放回取樣
9.4 隨機(jī)森林和Extra-Tree9.4.1 隨機(jī)森林隨機(jī)森林是指定了 Base Estimator為Decision Tree 的Bagging集成學(xué)習(xí)模型 已經(jīng)被scikit封裝好,可以直接使用 from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=500, random_state=666, oob_score=True, n_jobs=-1)
rf_clf.fit(X, y)
rf_clf.oob_score_
#因?yàn)殡S機(jī)森林是基于決策樹(shù)的,,因此,,決策樹(shù)的相關(guān)參數(shù)這里都可以指定修改
rf_clf2 = RandomForestClassifier(n_estimators=500, random_state=666, max_leaf_nodes=16, oob_score=True, n_jobs=-1)
rf_clf2.fit(X, y)
rf_clf.oob_score_
9.4.2 Extra-Trees
Also a Bagging ensemble with decision trees as the base estimator. Its distinguishing feature: at each node split it uses random features and random thresholds. This extra randomness suppresses overfitting (variance) at the cost of a larger bias, and gives faster training.
from sklearn.ensemble import ExtraTreesClassifier
et_clf = ExtraTreesClassifier(n_estimators=500, bootstrap=True, oob_score=True, random_state=666)
et_clf.fit(X, y)
et_clf.oob_score_
9.5 AdaBoost
Each sub-model tries to boost the overall performance; the sample weights are updated through successive model iterations. AdaBoost has no oob samples, so train_test_split is needed, and a base estimator must be specified.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2), n_estimators=500)
ada_clf.fit(X_train, y_train)
ada_clf.score(X_test, y_test)
9.6 Gradient Boosting訓(xùn)練一個(gè)模型m1, 產(chǎn)生錯(cuò)誤e1 針對(duì)e1訓(xùn)練第二個(gè)模型m2,, 產(chǎn)生錯(cuò)誤e2 針對(duì)e2訓(xùn)練第二個(gè)模型m3,, 產(chǎn)生錯(cuò)誤e3 ...... 最終的預(yù)測(cè)模型是:m1+m2+m3+...m1+m2+m3+... Gradient Boosting是基于決策樹(shù)的,不用指定Base Estimator from sklearn.ensemble import GradientBoostingClassifier
gb_clf = GradientBoostingClassifier(max_depth=2, n_estimators=30)
gb_clf.fit(X_train, y_train)
gb_clf.score(X_test, y_test)
總結(jié)上述提到的集成學(xué)習(xí)模型,,不僅可以用于解決分類問(wèn)題,,也可解決回歸問(wèn)題 from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
例子:決策樹(shù)和Ada Boosting回歸問(wèn)題效果對(duì)比 import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
# 構(gòu)造測(cè)試函數(shù)
rng = np.random.RandomState(1)
X = np.linspace(-5, 5, 200)[:, np.newaxis]
y = np.sin(X).ravel() + np.sin(6 * X).ravel() + rng.normal(0, 0.1, X.shape[0])
# A plain regression decision tree
dt_reg = DecisionTreeRegressor(max_depth=4)
# The same regression tree inside an AdaBoost ensemble
ada_dt_reg = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                               n_estimators=200, random_state=rng)
dt_reg.fit(X, y)
ada_dt_reg.fit(X, y)
# 預(yù)測(cè)
y_1 = dt_reg.predict(X)
y_2 = ada_dt_reg.predict(X)
# 畫(huà)圖
plt.figure()
plt.scatter(X, y, c="k", label="training samples")
plt.plot(X, y_1, c="g", label="n_estimators=1", linewidth=2)
plt.plot(X, y_2, c="r", label="n_estimators=200", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Boosted Decision Tree Regression")
plt.legend()
plt.show()
10. K-Means Clustering
Background reading: an article covering the basic principles of k-means and the meaning of the main parameters of the KMeans class in scikit-learn; a source-code walkthrough of scikit-learn's KMeans; and the notebook for this example in the git repo.
Example code:
from matplotlib import pyplot as plt
from sklearn.metrics import accuracy_score
import numpy as np
import seaborn as sns; sns.set()
%matplotlib inline
10.1 傳統(tǒng)K-means聚類構(gòu)造數(shù)據(jù)集 from sklearn.datasets.samples_generator import make_blobs
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
plt.scatter(X[:,0], X[:, 1], s=50)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
繪制聚類結(jié)果, 畫(huà)出聚類中心 plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:,0], centers[:, 1], c='black', s=80, marker='x')
10.2 Clustering with nonlinear boundaries
For an introduction to k-means on nonlinear boundaries, see the Python Data Science Handbook, p. 410. Construct the data:
from sklearn.datasets import make_moons
X, y = make_moons(200, noise=0.05, random_state=0)
Where plain k-means clustering fails:
labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')
應(yīng)用核方法,, 將數(shù)據(jù)投影到更高緯的空間,變成線性可分 from sklearn.cluster import SpectralClustering
model = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', assign_labels='kmeans')
labels = model.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')
10.3 預(yù)測(cè)結(jié)果與真實(shí)標(biāo)簽的匹配手寫(xiě)數(shù)字識(shí)別例子 from sklearn.datasets import load_digits
digits = load_digits()
進(jìn)行聚類 kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
kmeans.cluster_centers_.shape
(10, 64)
可以將這些族中心點(diǎn)看做是具有代表性的數(shù)字 fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)
進(jìn)行眾數(shù)匹配 from scipy.stats import mode
labels = np.zeros_like(clusters)
for i in range(10):
    # Boolean index array marking the samples assigned to cluster i
    mask = (clusters == i)
    # Use the mode of the true targets within the cluster as its real label
    labels[mask] = mode(digits.target[mask])[0]
# With the matched labels, the accuracy can be computed
accuracy_score(digits.target, labels)
0.7935447968836951
10.4 聚類結(jié)果的混淆矩陣from sklearn.metrics import confusion_matrix
mat = confusion_matrix(digits.target, labels)
np.fill_diagonal(mat, 0)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=digits.target_names,
            yticklabels=digits.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label')
10.5 Preprocessing with t-SNE (t-distributed stochastic neighbor embedding)
t-SNE uses manifold learning to project high-dimensional, nonlinear data into a low-dimensional space:
from sklearn.manifold import TSNE
# 投影數(shù)據(jù)
# 此過(guò)程比較耗時(shí)
tsen = TSNE(n_components=2, init='pca', random_state=0)
digits_proj = tsen.fit_transform(digits.data)
#計(jì)算聚類的結(jié)果
kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits_proj)
#將聚類結(jié)果和真實(shí)標(biāo)簽進(jìn)行匹配
labels = np.zeros_like(clusters)
for i in range(10):
mask = (clusters == i)
labels[mask] = mode(digits.target[mask])[0]
# 計(jì)算準(zhǔn)確度
accuracy_score(digits.target, labels)
11. Gaussian Mixture Models (clustering and density estimation)
The non-probabilistic nature of k-means, and the fact that it assigns clusters purely by distance to the cluster centers, limit its performance: it only works well on simple, well-separated, circularly distributed data. All code for this example has been uploaded to the git repo.
11.1 Observing the weaknesses of k-means through an example
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
# 生成數(shù)據(jù)點(diǎn)
from sklearn.datasets.samples_generator import make_blobs
X, y_true = make_blobs(n_samples=400, centers=4,
                       cluster_std=0.60, random_state=0)
X = X[:, ::-1] # flip axes for better plotting
# Plot the labels produced by k-means clustering
from sklearn.cluster import KMeans
kmeans = KMeans(4, random_state=0)
labels = kmeans.fit(X).predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis');
centers = kmeans.cluster_centers_
plt.scatter(centers[:,0], centers[:, 1], c='black', s=80, marker='x')
k-means算法相當(dāng)于在每個(gè)族的中心放置了一個(gè)圓圈,(針對(duì)此處的二維數(shù)據(jù)來(lái)說(shuō)) 半徑是根據(jù)最遠(yuǎn)的點(diǎn)與族中心點(diǎn)的距離算出 下面用一個(gè)函數(shù)將這個(gè)聚類圓圈可視化 from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
def plot_kmeans(kmeans, X, n_clusters=4, rseed=0, ax=None):
    labels = kmeans.fit_predict(X)
    # plot the input data
    ax = ax or plt.gca()
    ax.axis('equal')
    ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)
    # plot the representation of the KMeans model
    centers = kmeans.cluster_centers_
    ax.scatter(centers[:, 0], centers[:, 1], c='black', s=150, marker='x')
    # For each center i = 0, 1, 2, 3, the list comprehension finds the maximum
    # distance from that center to the points of its own cluster:
    # labels == i is a boolean index, so X[labels == i] selects the points of
    # cluster i; cdist(X[labels == i], [center]) computes their distances to the
    # center, and .max() takes the largest one
    radii = [cdist(X[labels == i], [center]).max() for i, center in enumerate(centers)]
    for c, r in zip(centers, radii):
        ax.add_patch(plt.Circle(c, r, fc='#CCCCCC', lw=3, alpha=0.5, zorder=1))
#如果數(shù)據(jù)點(diǎn)不是圓形分布的
k-means算法的聚類效果就會(huì)變差
rng = np.random.RandomState(13)
# Multiplying by a 2x2 matrix applies a rotation and stretch to the space
X_stretched = np.dot(X, rng.randn(2, 2))
kmeans = KMeans(n_clusters=4, random_state=0)
plot_kmeans(kmeans, X_stretched)
11.2 Introducing the Gaussian mixture model
A Gaussian mixture model computes, for each data point, the probability that it belongs to each cluster center. With default parameters and simple, separable data, GMM clusters essentially the same way as k-means:
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=4).fit(X)
labels = gmm.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis');
# The GMM cluster centers are called means_
centers = gmm.means_
plt.scatter(centers[:,0], centers[:, 1], c='black', s=80, marker='x');
得到數(shù)據(jù)的概率分布結(jié)果 probs = gmm.predict_proba(X)
print(probs[:5].round(3))
[[0. 0.469 0. 0.531]
[1. 0. 0. 0. ]
[1. 0. 0. 0. ]
[0. 0. 0. 1. ]
[1. 0. 0. 0. ]]
編寫(xiě)繪制gmm繪制邊界的函數(shù) from matplotlib.patches import Ellipse
def draw_ellipse(position, covariance, ax=None, **kwargs):
    """Draw an ellipse with a given position and covariance"""
    ax = ax or plt.gca()
    # Convert covariance to principal axes
    if covariance.shape == (2, 2):
        U, s, Vt = np.linalg.svd(covariance)
        angle = np.degrees(np.arctan2(U[1, 0], U[0, 0]))
        width, height = 2 * np.sqrt(s)
    else:
        angle = 0
        width, height = 2 * np.sqrt(covariance)
    # Draw the ellipse at 1, 2, and 3 standard deviations
    for nsig in range(1, 4):
        ax.add_patch(Ellipse(position, nsig * width, nsig * height,
                             angle=angle, **kwargs))
def plot_gmm(gmm, X, label=True, ax=None):
    ax = ax or plt.gca()
    labels = gmm.fit(X).predict(X)
    if label:
        ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)
    else:
        ax.scatter(X[:, 0], X[:, 1], s=40, zorder=2)
    ax.axis('equal')
    w_factor = 0.2 / gmm.weights_.max()
    for pos, covar, w in zip(gmm.means_, gmm.covariances_, gmm.weights_):
        draw_ellipse(pos, covar, alpha=w * w_factor)
- 在圓形數(shù)據(jù)上的聚類結(jié)果
gmm = GaussianMixture(n_components=4, random_state=42)
plot_gmm(gmm, X)
- 在偏斜拉伸數(shù)據(jù)上的聚類結(jié)果
gmm = GaussianMixture(n_components=4, covariance_type='full', random_state=42)
plot_gmm(gmm, X_stretched)
11.3 Using GMM for density estimation
A GMM is fundamentally a density estimation algorithm: technically speaking, the result of fitting a GMM is not a clustering model but a generative probability model describing the distribution of the data.
# Build linearly inseparable data
from sklearn.datasets import make_moons
Xmoon, ymoon = make_moons(200, noise=.05, random_state=0)
plt.scatter(Xmoon[:, 0], Xmoon[:, 1]);
With only two mixture components (i.e. n_components=2), the fit is essentially useless:
gmm2 = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
plot_gmm(gmm2, Xmoon)
? 如果設(shè)置為多個(gè)聚類成分 gmm16 = GaussianMixture(n_components=16, covariance_type='full', random_state=0)
plot_gmm(gmm16, Xmoon, label=False)
這里采用 16 個(gè)高斯曲線的混合形式不是為了找到數(shù)據(jù)的分隔的簇,,而是為了對(duì)輸入數(shù)據(jù)的總體分布建模,。 11.4 由分布函數(shù)得到生成模型分布函數(shù)的生成模型可以生成新的,與輸入數(shù)據(jù)類似的隨機(jī)分布函數(shù)(生成新的數(shù)據(jù)點(diǎn)) 用 GMM 擬合原始數(shù)據(jù)獲得的 16 個(gè)成分生成的 400 個(gè)新數(shù)據(jù)點(diǎn) Xnew = gmm16.sample(400)
Xnew[0][:5]
Xnew = gmm16.sample(400)
plt.scatter(Xnew[0][:, 0], Xnew[0][:, 1]);
11.5 How many components?
As a generative model, a GMM provides a principled way to determine the optimal number of components for a dataset:
- Akaike information criterion (AIC)
- Bayesian information criterion (BIC)
n_components = np.arange(1, 21)
models = [GaussianMixture(n, covariance_type='full', random_state=0).fit(Xmoon)
          for n in n_components]
plt.plot(n_components, [m.bic(Xmoon) for m in models], label='BIC')
plt.plot(n_components, [m.aic(Xmoon) for m in models], label='AIC')
plt.legend(loc='best')
plt.xlabel('n_components');
Inspection shows that the AIC is smallest at around 8-12 components.
評(píng)價(jià)指標(biāo)一、分類算法常用指標(biāo)選擇方式 平衡分類問(wèn)題: 分類準(zhǔn)確度,、ROC曲線
類別不平衡問(wèn)題: 精準(zhǔn)率,、召回率
對(duì)于二分類問(wèn)題,常用的指標(biāo)是 f1 ,、 roc_auc 多分類問(wèn)題,,可用的指標(biāo)為 f1_weighted 1.分類準(zhǔn)確度一般用于平衡分類問(wèn)題(每個(gè)類比的可能性相同) from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predict) #(真值,預(yù)測(cè)值)
2. Confusion matrix, precision, recall
- Precision: of all samples predicted as 1, the fraction that truly are 1, i.e. TP / (TP + FP).
- Recall: of all samples that truly are 1, the fraction correctly predicted as 1, i.e. TP / (TP + FN).
# True values first, then predictions
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_log_predict)
from sklearn.metrics import precision_score
precision_score(y_test, y_log_predict)
from sklearn.metrics import recall_score
recall_score(y_test, y_log_predict)
多分類問(wèn)題中的混淆矩陣
from sklearn.metrics import precision_score
precision_score(y_test, y_predict, average="micro")
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_predict)
- 移除對(duì)角線上分類正確的結(jié)果,,可視化查看其它分類錯(cuò)誤的情況同樣,,橫坐標(biāo)為預(yù)測(cè)值,縱坐標(biāo)為真實(shí)值
cfm = confusion_matrix(y_test, y_predict)
row_sums = np.sum(cfm, axis=1)
err_matrix = cfm / row_sums
np.fill_diagonal(err_matrix, 0)
plt.matshow(err_matrix, cmap=plt.cm.gray)
plt.show()
3. F1-score
The F1-score is the harmonic mean of precision and recall: F1 = 2 * precision * recall / (precision + recall).
from sklearn.metrics import f1_score
f1_score(y_test, y_predict)
4.精準(zhǔn)率和召回率的平衡可以通過(guò)調(diào)整閾值,,改變精確率和召回率(默認(rèn)閾值為0) - 拉高閾值,,會(huì)提高精準(zhǔn)率,降低召回率
- 降低閾值,,會(huì)降低精準(zhǔn)率,,提高召回率
# Get the raw decision scores predicted by the model
# (logistic regression is used as the example here)
decision_score = log_reg.decision_function(X_test)
# Move the threshold to 5
y_predict_2 = np.array(decision_score >= 5, dtype='int')
# The result is an array of 0s and 1s
5.精準(zhǔn)率-召回率曲線(PR曲線)from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_test, decision_score)
# decision_score is the object computed above from the model's predictions on X_test
# The precision-recall curve
plt.plot(precisions, recalls)
plt.show()
- 將精準(zhǔn)率和召回率曲線,,繪制在同一張圖中
注意,,當(dāng)取“最大的” threshold值的時(shí)候,精準(zhǔn)率=1,,召回率=0,, 但是,這個(gè)最大的threshold沒(méi)有對(duì)應(yīng)的值 因此thresholds會(huì)少一個(gè)
plt.plot(thresholds, precisions[:-1], color='r')
plt.plot(thresholds, recalls[:-1], color='b')
plt.show()
6. ROC curve (Receiver Operating Characteristic curve)
- TPR: True Positive Rate, TPR = TP / (TP + FN)
- FPR: False Positive Rate, FPR = FP / (TN + FP)
Plot the ROC curve:
from sklearn.metrics import roc_curve
fprs, tprs, thresholds = roc_curve(y_test, decision_score)
plt.plot(fprs, tprs)
plt.show()
計(jì)算ROC曲線下方的面積的函數(shù) roc_ area_ under_ curve_score from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, decision_scores)
The area under the curve can be used to compare two models. In short, the decision_score above is a per-sample score: for a 0/1 binary problem it reflects the probability of predicting each sample as 1 (e.g. a sample with y_test = 1 might have a predicted probability of 0.875), one score per test sample. After fitting, models usually expose a function for these scores, such as model.predict_proba(X_test) or model.decision_function(X_test).
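For classifiers that lack decision_function, the predicted probability of the positive class plays the same role; a sketch, where clf stands for any fitted classifier exposing predict_proba:
probas = clf.predict_proba(X_test)[:, 1]  # probability of class 1 for each test sample
roc_auc_score(y_test, probas)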
II. Regression metrics
1. Mean squared error (MSE)
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_predict)
2.平均絕對(duì)值誤差 MAEfrom sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_predict)
3. Root mean squared error (RMSE)
scikit-learn does not define RMSE separately; take the square root of the MSE yourself, as sketched below.
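A one-line sketch:
import numpy as np
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(y_test, y_predict))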
4. R² score
from sklearn.metrics import r2_score
r2_score(y_test, y_predict)
5.學(xué)習(xí)曲線觀察模型在訓(xùn)練數(shù)據(jù)集和測(cè)試數(shù)據(jù)集上的評(píng)分,,隨著訓(xùn)練數(shù)據(jù)集樣本數(shù)增加的變化趨勢(shì),。 import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
def plot_learning_curve(algo, X_train, X_test, y_train, y_test):
    train_score = []
    test_score = []
    for i in range(1, len(X_train)+1):
        algo.fit(X_train[:i], y_train[:i])
        y_train_predict = algo.predict(X_train[:i])
        train_score.append(mean_squared_error(y_train[:i], y_train_predict))
        y_test_predict = algo.predict(X_test)
        test_score.append(mean_squared_error(y_test, y_test_predict))
    plt.plot([i for i in range(1, len(X_train)+1)], np.sqrt(train_score), label="train")
    plt.plot([i for i in range(1, len(X_train)+1)], np.sqrt(test_score), label="test")
    plt.legend()
    plt.axis([0, len(X_train)+1, 0, 4])
    plt.show()
# 調(diào)用
plot_learning_curve(LinearRegression(), X_train, X_test, y_train, y_test )