[Step-by-Step Tutorial] A Comprehensive Analysis of Stock Data Features with Python

ramdisk, published 2022-06-18 in Shanghai

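Introduction: This article analyzes stock market data from two angles: the distribution of the feature variables and feature importance.

Charts and related methods are used to examine the distribution of each feature and the relationships between features. Machine-learning models are then used to rank feature importance and pick out the features that contribute most to the result, which has a substantial effect on the performance of later models.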

Data Preparation

For how the data is obtained, see the separate article on financial data preparation.
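The data-loading step is not shown in this article. For readers who want to run the later snippets without the original data, here is a minimal sketch that builds a synthetic DataFrame with the same six columns and a daily DatetimeIndex. The random-walk prices, the seed, and the volume range are assumptions for illustration only and are not the author's dataset, so the outputs shown in the article will not be reproduced.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for real market data (assumption: NOT the author's dataset)
rng = np.random.default_rng(0)
dates = pd.bdate_range("2015-12-31", periods=1260)                 # business days only
close = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, len(dates))))   # random-walk prices
open_ = close * (1 + rng.normal(0, 0.005, len(dates)))
high = np.maximum(open_, close) * (1 + rng.uniform(0, 0.01, len(dates)))
low = np.minimum(open_, close) * (1 - rng.uniform(0, 0.01, len(dates)))

df = pd.DataFrame({
    "Open": open_, "High": high, "Low": low,
    "Close": close, "Adj Close": close,
    "Volume": rng.integers(1_000_000, 5_000_000, len(dates)),
}, index=dates)
```

With real data, the same column names and a DatetimeIndex are all the later code relies on.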

df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1260 entries, 2015-12-31 to 2020-12-31
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       1260 non-null   float64
 1   High       1260 non-null   float64
 2   Low        1260 non-null   float64
 3   Close      1260 non-null   float64
 4   Adj Close  1260 non-null   float64
 5   Volume     1260 non-null   int64  
dtypes: float64(5), int64(1)
memory usage: 68.9 KB

Feature Construction

df['H-L'] = df['High'] - df['Low']           # daily high-low range
df['O-C'] = df['Adj Close'] - df['Open']     # adjusted close minus open
# shift(1) makes each moving average use only closes from earlier days
df['3day MA'] = df['Adj Close'].shift(1).rolling(window=3).mean()
df['10day MA'] = df['Adj Close'].shift(1).rolling(window=10).mean()
df['30day MA'] = df['Adj Close'].shift(1).rolling(window=30).mean()
df['Std_dev'] = df['Adj Close'].rolling(5).std()   # 5-day rolling standard deviation
df.dtypes
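To see what shift(1) followed by rolling(...) does, and why these columns start with missing values, here is a small sketch on a hypothetical five-value price series (the numbers are made up for illustration).

```python
import pandas as pd

prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])

# shift(1) moves every value down one row, so row t only sees prices up to t-1
ma3 = prices.shift(1).rolling(window=3).mean()
print(ma3.tolist())               # [nan, nan, nan, 11.0, 12.0]
print(int(ma3.isnull().sum()))    # 3

# Without the shift, a 5-row window leaves the first 4 rows undefined
std5 = prices.rolling(5).std()
print(int(std5.isnull().sum()))   # 4
```

A shifted window of length n therefore produces n leading NaN values, and an unshifted one produces n - 1.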

Descriptive Statistics

df.describe().T

Missing Value Analysis

Checking for missing values

df.isnull().sum()

Open          0
High          0
Low           0
Close         0
Adj Close     0
Volume        0
H-L           0
O-C           0
3day MA       3
10day MA     10
30day MA     30
Std_dev       4
dtype: int64

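Visualizing missing values

Here the plot accessor of the Series is used to draw a bar chart directly.

import matplotlib.pyplot as plt

df_missing_count = df.isnull().sum()
# Setting rcParams is another, less common way to set the figure size
plt.rcParams['figure.figsize'] = (15,8)
df_missing_count.plot.bar()
plt.show()

# Here -1 is treated as the marker for missing data
print("column nunique  NaN")
for column in df:
    print("{0:15} {1:6d} {2:6}".format(
          column, df[column].nunique(),
          (df[column] == -1).sum()))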

column            nunique   NaN
Open              1082      0
High              1083      0
Low               1025      0
Close             1098      0
Adj Close         1173      0
Volume            1250      0
H-L                357      0
O-C               1237      2
3day MA           1240      0
10day MA          1244      0
30day MA          1230      0
Std_dev           1252      0
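Counting values equal to -1 and counting true NaN values are different checks. The sketch below, on a small made-up DataFrame, puts both counts next to nunique() so the difference is visible; the column names a and b are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, np.nan, -1.0, 3.0],
                    "b": [-1.0, -1.0, 2.0, np.nan]})

summary = pd.DataFrame({
    "nunique": toy.nunique(),          # distinct non-NaN values
    "n_nan": toy.isnull().sum(),       # true missing values
    "n_minus_one": (toy == -1).sum(),  # rows equal to the -1 sentinel
})
print(summary)
```

Which of the two counts is the right one depends on how the data source encodes missing entries.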

Correlation Analysis Between Features

import seaborn as sns
# One way to define a custom color palette:
# cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(df.corr(), annot=True, cmap='Blues')
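Each cell of the heatmap is a Pearson correlation coefficient, which is what DataFrame.corr() computes by default. As a rough illustration, the sketch below computes it by hand for two made-up columns and compares the result with pandas.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                    "y": [2.0, 4.1, 5.9, 8.2, 9.8]})

# Pearson r = sum of co-deviations / (root sum of squares of x * root sum of squares of y)
x, y = toy["x"], toy["y"]
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / (np.sqrt((dx ** 2).sum()) * np.sqrt((dy ** 2).sum()))

r_pandas = toy.corr().loc["x", "y"]   # default method is Pearson
print(round(r_manual, 4), round(r_pandas, 4))
```

Values near 1 or -1 indicate a nearly linear relationship, which is worth keeping in mind when several price columns enter the same model.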

Feature Value Distributions

Histograms

columns_multi = [x for x in list(df.columns)]
df.hist(layout = (3,4), column = columns_multi)
# A less common way to adjust the figure size
fig=plt.gcf()
fig.set_size_inches(20,9)

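Density plots

names = columns_multi
df.plot(kind='density', subplots=True, 
        layout=(3,4), sharex=False)

Relationships Between Features

The pairplot function visualizes the pairwise relationships between features.

# height replaces the size argument used by older seaborn versions
sns.pairplot(df, height=3, 
             diag_kind="kde")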

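Feature Importance

Feature importance is evaluated in several ways. The scores each feature receives are averaged, and the features are then sorted by this mean and plotted, giving a direct view of their relative importance.

Importing the modules

import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, RFECV, f_regression
from sklearn.linear_model import (LinearRegression, Ridge, Lasso, LarsCV)
# stability_selection is a separate third-party package, not part of scikit-learn
from stability_selection import (StabilitySelection, RandomizedLasso,
                                 plot_stability_path)
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVR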

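Ranking by Linear Regression Coefficient Magnitude

In a regression equation, a regression coefficient is the parameter that expresses how strongly an independent variable affects the dependent variable. The larger the coefficient, the larger the effect of that independent variable on the dependent variable.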
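One caveat when reading coefficients as importance is that their size depends on the units of each feature. The sketch below uses made-up data in which two features contribute equally to the target but are measured on very different scales; the variable names and numbers are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
x_small = rng.normal(0, 1, n)       # unit-scale feature
x_large = rng.normal(0, 1000, n)    # comparable signal, much larger units
y = 2.0 * x_small + 0.002 * x_large + rng.normal(0, 0.1, n)

X_raw = np.column_stack([x_small, x_large])
coef_raw = LinearRegression().fit(X_raw, y).coef_
coef_std = LinearRegression().fit(StandardScaler().fit_transform(X_raw), y).coef_

print(coef_raw)   # very different magnitudes
print(coef_std)   # similar magnitudes after standardizing
```

On the raw data the second coefficient looks negligible, while after standardizing both features the two coefficients are of similar size. This matters for columns such as Volume, whose scale is far larger than that of the price columns.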

Creating the ranking function

df = df.dropna()
Y = df['Adj Close'].values
# Note: X keeps every column, including the target 'Adj Close' itself
X = df.values
colnames = df.columns
# Dictionary that stores the scores produced by each method
ranks = {}
# Scale a list of scores to [0, 1] and return them keyed by feature name
def ranking(ranks, names, order=1):
    minmax = MinMaxScaler()
    ranks = minmax.fit_transform(
          order*np.array([ranks]).T).T[0]
    ranks = map(lambda x: round(x,2), ranks)
    res = dict(zip(names, ranks))
    return res
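A small worked example may help show what this function returns. The sketch below restates the function in compact form and calls it on made-up scores; the feature names a, b, c are hypothetical. With order=1 larger inputs map to larger outputs, and with order=-1 the scale is flipped, which suits rank positions where 1 is best.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def ranking(scores, names, order=1):
    """Scale scores to [0, 1] and map them to feature names."""
    scaled = MinMaxScaler().fit_transform(order * np.array([scores]).T).T[0]
    return dict(zip(names, (round(float(s), 2) for s in scaled)))

by_score = ranking([0.5, 2.0, 3.5], ["a", "b", "c"])           # higher score is better
by_rank = ranking([1.0, 2.0, 3.0], ["a", "b", "c"], order=-1)  # rank 1 is best

print(by_score)  # {'a': 0.0, 'b': 0.5, 'c': 1.0}
print(by_rank)   # {'a': 1.0, 'b': 0.5, 'c': 0.0}
```

Because every method ends up on the same 0 to 1 scale, the scores from different methods can later be averaged.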

Ranking coefficients from several regression models

# Linear regression (normalize=True from the original call is dropped:
# the parameter was removed in scikit-learn 1.2)
lr = LinearRegression()
lr.fit(X,Y)
ranks["LinReg"] = ranking(np.abs(lr.coef_), colnames)
# Ridge
ridge = Ridge(alpha = 7)
ridge.fit(X,Y)
ranks['Ridge'] = ranking(np.abs(ridge.coef_), colnames)
# Lasso
lasso = Lasso(alpha=.05)
lasso.fit(X, Y)
ranks["Lasso"] = ranking(np.abs(lasso.coef_), colnames)

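Ranking by Random Forest Feature Importance

Feature importance from a random forest is one of the most commonly used approaches, and it works well for both classification and regression tasks. The importance of a feature X in a random forest is computed as follows:

For each decision tree in the forest, compute its error on the corresponding out-of-bag (OOB) data, denoted $\mathrm{errOOB}_1$.
Randomly add noise to feature X in all OOB samples (that is, randomly change the value of X in each sample) and compute the out-of-bag error again, denoted $\mathrm{errOOB}_2$.
If the forest has $N$ trees, the importance of feature X is $\sum(\mathrm{errOOB}_2 - \mathrm{errOOB}_1)/N$. This expression works as an importance measure because, if adding random noise to a feature sharply reduces the out-of-bag accuracy, that feature has a large influence on the prediction and is therefore important.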
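The procedure above can be sketched in a simplified form. The code below is not the per-tree out-of-bag computation: as a stated simplification it measures the error of the whole forest on a held-out split before and after shuffling one feature. The synthetic data and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0, 0.1, 600)   # column 2 is pure noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

err_before = mean_squared_error(y_te, rf.predict(X_te))       # error on intact data
importance = []
for j in range(X.shape[1]):
    X_perm = X_te.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])              # destroy the information in feature j
    err_after = mean_squared_error(y_te, rf.predict(X_perm))  # error after shuffling feature j
    importance.append(err_after - err_before)                 # error increase = importance

print(importance)
```

scikit-learn also ships sklearn.inspection.permutation_importance, which implements the same shuffling idea with repeated permutations. Note that the feature_importances_ attribute used later in the article is an impurity-based measure, not this permutation measure.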

Feature importance for continuous targets

For regression tasks with a continuous target, use the feature_importances_ attribute of RandomForestRegressor.

# `dataset` is a DataFrame with extra derived columns that this article does not construct
X_1 = dataset[['Open', 'High', 'Low', 'Volume', 
               'Increase_Decrease','Buy_Sell_on_Open',
               'Buy_Sell', 'Returns']]
y_1 = dataset['Adj Close']
# Create the random forest regressor
clf = RandomForestRegressor(random_state=0, n_jobs=-1)
# Train the model
model = clf.fit(X_1, y_1)
# Compute the feature importances
importances = model.feature_importances_
# Sort the importances in descending order
indices = np.argsort(importances)[::-1]
# Reorder the feature names to match the sorted importances
names = [X_1.columns[i] for i in indices]
# Create the figure
plt.figure(figsize=(10,6))
# Add a title
plt.title("Feature Importance")
# Add the bars
plt.bar(range(X_1.shape[1]), importances[indices])
# Label the x-axis with the feature names
plt.xticks(range(X_1.shape[1]), names, rotation=90)

Feature importance for classification targets

When the task is a classification problem and a classifier is needed, use the feature_importances_ attribute of RandomForestClassifier.

X2 = dataset[['Open', 'High', 'Low','Adj Close',
              'Volume', 'Buy_Sell_on_Open', 
              'Buy_Sell', 'Returns']]
y2 = dataset['Increase_Decrease']
clf = RandomForestClassifier(random_state=0, n_jobs=-1)
model = clf.fit(X2, y2)
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]
# Take the names from X2 so they line up with the importances
names = [X2.columns[i] for i in indices]
plt.figure(figsize=(10,6))
plt.title("Feature Importance")
plt.bar(range(X2.shape[1]), importances[indices])
plt.xticks(range(X2.shape[1]), names, rotation=90)
plt.show()

This case study uses the regression model

rf = RandomForestRegressor(n_jobs=-1, n_estimators=50, verbose=3)
rf.fit(X,Y)
ranks["RF"] = ranking(rf.feature_importances_, colnames)

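Two top-level feature selection algorithms are introduced next. They are called top-level because both are built on top of model-based feature selection methods such as regression and SVM: they fit models on different subsets and then aggregate the results to determine the final feature scores.

RandomizedLasso

Ranking by stability selection with RandomizedLasso.

Stability selection is a relatively new method that combines subsampling with a selection algorithm, which can be regression, SVM, or a similar method. Its main idea is to run the feature selection algorithm on different subsets of the data and of the features, repeat this many times, and aggregate the selection results. For example, one can compute how often a feature is judged important (the number of times it is selected divided by the number of times a subset containing it is tested). Ideally, important features score close to 100%, somewhat weaker features get non-zero scores, and the least useful features score close to 0.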
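The selection-frequency idea can be sketched with scikit-learn alone. The code below is a simplified illustration, not the stability_selection package: it fits a plain Lasso on random halves of the samples and counts how often each coefficient is non-zero, without the per-feature penalty randomization that a randomized Lasso adds. The synthetic data, alpha=0.1, and the number of rounds are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.5, n)   # only features 0 and 1 matter

n_rounds = 100
selected = np.zeros(p)
for _ in range(n_rounds):
    idx = rng.choice(n, size=n // 2, replace=False)   # random half of the samples
    coef = Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_
    selected += (coef != 0)                           # count the features the Lasso kept

stability_scores = selected / n_rounds                # selection frequency in [0, 1]
print(stability_scores)
```

The two informative features are kept in every round, while the noise features are kept only occasionally, which is the pattern the threshold in stability selection is meant to separate.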

lambda_grid = np.linspace(0.001, 0.5, num=100)
rlasso = RandomizedLasso(alpha=0.04)
selector = StabilitySelection(base_estimator=rlasso, lambda_name='alpha',
                              lambda_grid=lambda_grid, threshold=0.9, verbose=1)
selector.fit(X, Y)
# Rank the features by stability selection with the randomized Lasso
ranks["rlasso/Stability"] = ranking(np.abs(selector.stability_scores_.max(axis=1)), colnames)
print(ranks["rlasso/Stability"])
print('finished')

{'Open': 1.0, 'High': 1.0, 'Low': 0.76, 
'Close': 1.0, 'Adj Close': 0.99, 'Volume': 0.0, 
'H-L': 0.0, 'O-C': 1.0, '3day MA': 1.0, 
'10day MA': 0.27, '30day MA': 0.75, 'Std_dev': 0.0}
finished

Visualizing the stability scores

fig, ax = plot_stability_path(selector)
fig.set_size_inches(15,6)
fig.show()

Viewing the indices and scores of the variables above the threshold

# Get the integer indices of the selected features
selected_variables = selector.get_support(indices=True)
selected_scores = selector.stability_scores_.max(axis=1)
print('Selected variables are:')
print('-----------------------')
for idx, (variable, score) in enumerate(
                zip(selected_variables, 
                    selected_scores[selected_variables])):
    print('Variable %d: [%d], score %.3f' % (idx + 1, variable, score))

Selected variables are:
-----------------------
Variable 1: [0], score 1.000
Variable 2: [1], score 1.000
Variable 3: [3], score 1.000
Variable 4: [4], score 0.990
Variable 5: [7], score 1.000
Variable 6: [8], score 1.000

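Feature Ranking with Recursive Feature Elimination (RFE)

Feature ranking based on recursive feature elimination.

Given an external estimator that assigns weights to features (for example, the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller feature sets.

The main idea is to build a model repeatedly (an SVM or a regression model, for example) and each time pick out the best (or worst) feature, for instance according to the coefficients.

First, the estimator is trained on the initial feature set, and the importance of each feature is obtained through a specific attribute or a callable.
Then, the least important features are removed from the current feature set.
This procedure is repeated recursively on the reduced feature set until the desired number of features is reached.

The order in which features are eliminated is the feature ranking. RFE is therefore a greedy algorithm for finding the best feature subset.

The stability of RFE depends heavily on which underlying model is used in the iterations. For example, if RFE uses ordinary regression, which is unstable without regularization, then RFE is unstable; if it uses Ridge, whose regularized regression is stable, then RFE is stable.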
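The elimination loop described above can be written out by hand. The sketch below does so with a Ridge model on synthetic data and compares the result with the ranking_ produced by scikit-learn's RFE; the data, the alpha value, and the helper variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
# Column 3 is pure noise; columns 0 to 2 have decreasing influence
y = 4.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(0, 0.1, 300)

remaining = list(range(X.shape[1]))
eliminated = []                                       # features in the order they were dropped
while len(remaining) > 1:
    coef = Ridge(alpha=1.0).fit(X[:, remaining], y).coef_
    worst = remaining[int(np.argmin(np.abs(coef)))]   # least important surviving feature
    remaining.remove(worst)
    eliminated.append(worst)

# Rank 1 is the last survivor; larger ranks were eliminated earlier
manual_ranking = np.empty(X.shape[1], dtype=int)
for rank, feat in enumerate(remaining + eliminated[::-1], start=1):
    manual_ranking[feat] = rank

sk_ranking = RFE(Ridge(alpha=1.0), n_features_to_select=1).fit(X, y).ranking_
print(manual_ranking, sk_ranking)
```

On this data the hand-written loop and RFE agree: the noise column is dropped first and the strongest feature survives to the end.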

sklearn.feature_selection.RFE(estimator,
        *, n_features_to_select=None,
        step=1, verbose=0, 
        importance_getter='auto')

estimator Estimator instance
A supervised learning estimator with a fit method that provides information about feature importance (for example coef_ or feature_importances_).

n_features_to_select int or float, default=None
The number of features to select. If None, half of the features are selected. If an integer, it is the absolute number of features to select. If a float between 0 and 1, it is the fraction of features to select.

step int or float, default=1
If greater than or equal to 1, step is the (integer) number of features to remove at each iteration. If within (0.0, 1.0), step is the percentage of features (rounded down) to remove at each iteration.

verbose int, default=0
Controls the verbosity of the output.

importance_getter str or callable, default='auto'
If 'auto', feature importance is taken from the estimator's coef_ or feature_importances_ attribute.

# normalize=True from the original call is dropped: the parameter was removed in scikit-learn 1.2
lr = LinearRegression()
lr.fit(X,Y)
# Stop the search only when a single feature is left
rfe = RFE(lr, n_features_to_select=1, verbose =3)
rfe.fit(X,Y)
ranks["RFE"] = ranking(list(map(float, rfe.ranking_)),
                       colnames, order=-1)

Fitting estimator with 12 features.
...
Fitting estimator with 2 features.

RFECV

Recursive feature elimination with cross-validation.

Besides RFE for feature elimination, scikit-learn also provides RFECV, which ranks features using cross-validation.

# Instantiate the estimator and the feature selector
svr_mod = SVR(kernel="linear")
rfecv = RFECV(svr_mod, cv=5)
# Train the model
rfecv.fit(X, Y)
ranks["RFECV"] = ranking(list(map(float, rfecv.ranking_)), colnames, order=-1)
# Print support and ranking
print(rfecv.support_)
print(rfecv.ranking_)
# X is a NumPy array, so the column names come from colnames
print(colnames)
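Since the SVR-based run above can be slow, here is a quick self-contained sketch of what RFECV reports, using LinearRegression on synthetic data in which only the first two columns carry signal. The data and the choice of estimator are assumptions for illustration and are not the article's setup.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, 300)   # columns 2 to 4 are noise

rfecv = RFECV(LinearRegression(), cv=5).fit(X, y)
print(rfecv.n_features_)   # number of features with the best cross-validated score
print(rfecv.support_)      # boolean mask of the selected features
print(rfecv.ranking_)      # 1 for selected features, larger for earlier eliminations
```

Unlike RFE with n_features_to_select=1, RFECV chooses the number of features itself, so every selected feature gets rank 1 and only the discarded ones are ordered.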

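LarsCV

Least Angle Regression with cross-validation.

# Instantiate (normalize=False from the original call is dropped:
# the parameter was removed in recent scikit-learn releases)
larscv = LarsCV(cv=5)
# Train the model
larscv.fit(X, Y)
# LarsCV has no ranking_ attribute, so the absolute coefficients are used as scores
ranks["LarsCV"] = ranking(np.abs(larscv.coef_), colnames)
# Print R^2 and the estimated alpha
print(larscv.score(X, Y))
print(larscv.alpha_)

These two methods rely on cross-validation and are useful when feature importance has to be assessed more rigorously. They take a while to run, so readers can run them on their own to see the results.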

Building the Feature Ranking Matrix

Create an empty dictionary to store all the scores, then compute the mean score of each feature.

r = {}
for name in colnames:
    r[name] = round(np.mean([ranks[method][name] 
                             for method in ranks.keys()]), 2)
methods = sorted(ranks.keys())
ranks["Mean"] = r
methods.append("Mean")
print("\t%s" % "\t".join(methods))
for name in colnames:
    print("%s\t%s" % (name, "\t".join(map(str, 
                         [ranks[method][name] for method in methods]))))

	Lasso	LinReg	RF	RFE	Ridge	rlasso/Stability	Mean
Open	1.0	1.0	0.02	0.91	0.47	1.0	0.73
High	0.14	0.0	0.1	0.36	0.06	1.0	0.28
Low	0.02	0.0	0.08	0.73	0.05	0.76	0.27
Close	0.14	0.0	0.64	0.55	0.32	1.0	0.44
Adj Close	0.02	1.0	1.0	0.82	1.0	0.99	0.8
Volume	0.0	0.0	0.0	0.0	0.0	0.0	0.0
H-L	0.0	0.0	0.0	0.45	0.01	0.0	0.08
O-C	0.85	1.0	0.0	1.0	0.53	1.0	0.73
3day MA	0.0	0.0	0.0	0.27	0.01	1.0	0.21
10day MA	0.0	0.0	0.02	0.09	0.0	0.27	0.06
30day MA	0.0	0.0	0.0	0.18	0.0	0.75	0.16
Std_dev	0.0	0.0	0.0	0.64	0.01	0.0	0.11
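The same averaging can also be expressed with pandas, which may be easier to read than the nested loops. The sketch below uses a small made-up ranks dictionary with two methods and three features; the scores are hypothetical and are not the values from the table above.

```python
import pandas as pd

# Hypothetical scores from two methods, for illustration only
ranks = {
    "Lasso": {"Open": 1.0, "High": 0.2, "Volume": 0.0},
    "RF":    {"Open": 0.4, "High": 0.6, "Volume": 0.0},
}

table = pd.DataFrame(ranks)                    # rows = features, columns = methods
table["Mean"] = table.mean(axis=1).round(2)    # average score per feature
table = table.sort_values("Mean", ascending=False)
print(table)
```

A dict of dicts maps directly onto a DataFrame, so the mean column and the sort each take one line.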

Plotting the Feature Importance Ranking

Put the mean scores into a DataFrame, sort it from high to low, and visualize the result, so the importance of each feature can be seen at a glance.

meanplot = pd.DataFrame(list(r.items()), columns= ['Feature','Mean Ranking'])
# Sort by mean score
meanplot = meanplot.sort_values('Mean Ranking', ascending=False)
# catplot replaces factorplot, and height replaces size, in current seaborn
g=sns.catplot(x="Mean Ranking", y="Feature", data = meanplot, kind="bar", 
               height=14, aspect=1.9, palette='coolwarm')
