Machine Learning

公彥棟 2017-10-08

http://blog.csdn.net/python_learn/article/details/45008073

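Feature selection is an important step in practical machine learning. Most datasets come with far more features than a model needs, so finding the useful ones deserves attention.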

1. Feature selection: All-relevant selection with the Boruta package

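There are two approaches to feature selection: (1) minimal-optimal feature selection, which identifies a small set of features (ideally the smallest) that gives the best possible classification result; and (2) all-relevant feature selection, which identifies every feature that is relevant to the classification.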

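This article uses the Boruta package, which relies on the random forest classification algorithm to measure the importance (z score) of each feature.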
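
As a rough sketch of what all-relevant selection looks like in practice, the snippet below runs Boruta on the Pima Indians Diabetes data from the mlbench package. The choice of dataset, the seed and the variable names (boruta_result, confirmed, stats) are assumptions made for illustration; the text above does not specify them.

```r
# Assumption: the Boruta and mlbench packages are installed
library(Boruta)
library(mlbench)

# load the data (8 predictors plus the 'diabetes' factor outcome)
data(PimaIndiansDiabetes)

# ensure the results are repeatable
set.seed(7)

# run the all-relevant search; doTrace = 0 keeps the console quiet
boruta_result <- Boruta(diabetes ~ ., data = PimaIndiansDiabetes, doTrace = 0)

# summarize the decision for each feature
print(boruta_result)

# names of the features confirmed as important (tentative ones excluded)
confirmed <- getSelectedAttributes(boruta_result, withTentative = FALSE)
print(confirmed)

# per-feature importance statistics and the final decision
stats <- attStats(boruta_result)
print(stats)
```

getSelectedAttributes lists the features Boruta confirmed, while attStats returns one row per feature with its importance summary and decision.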

2. Using the caret package

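Recursive feature elimination is performed with the rfe function, which takes the following arguments:

x: a matrix or data frame of predictor variables

y: a vector of outcomes (numeric or factor)

sizes: an integer vector of the specific subset sizes to test

rfeControl: a list of options that specifies the prediction model and the resampling method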

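Several predefined sets of functions can be passed as rfeControl$functions, including linear regression (lmFuncs), random forest (rfFuncs), naive Bayes (nbFuncs), bagged trees (treebagFuncs), and functions that work with caret's train function (caretFuncs).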
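
To show how these arguments fit together, here is a minimal sketch that uses lmFuncs on R's built-in mtcars data to predict mpg. The dataset, the subset sizes and the 5-fold cross-validation are assumptions chosen only to keep the example small.

```r
# Assumption: the caret package is installed; mtcars ships with base R
library(caret)

# ensure the results are repeatable
set.seed(7)

# x: data frame of predictors, y: numeric outcome vector
x <- mtcars[, -1]
y <- mtcars$mpg

# rfeControl: linear regression helpers, evaluated with 5-fold cross-validation
ctrl <- rfeControl(functions = lmFuncs, method = "cv", number = 5)

# sizes: the subset sizes to test (the full set is always evaluated as well)
lm_results <- rfe(x, y, sizes = c(2, 4, 6), rfeControl = ctrl)

# summarize the results and list the chosen predictors
print(lm_results)
predictors(lm_results)
```

Swapping lmFuncs for another set such as rfFuncs changes the underlying model without changing the rest of the call.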

1) Removing redundant features

Remove features that are highly correlated with one another.

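The caret R package provides the findCorrelation function, which analyzes the correlation matrix of the features and flags the redundant ones for removal.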
set.seed(7)
# load the library
library(mlbench)
library(caret)
# load the data
data(PimaIndiansDiabetes)
# calculate correlation matrix
correlationMatrix <- cor(PimaIndiansDiabetes[,1:8])
# summarize the correlation matrix
print(correlationMatrix)
# find attributes that are highly correlated (ideally >0.75; a looser cutoff of 0.5 is used here)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.5)
# print indexes of highly correlated attributes
print(highlyCorrelated)
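
findCorrelation only returns column indexes; it does not modify the data. A minimal sketch of actually dropping the flagged columns might look like the following, where predictors_df and reduced are illustrative names that do not appear in the original example.

```r
# Assumption: the caret and mlbench packages are installed
library(mlbench)
library(caret)

# load the data and keep only the 8 predictor columns
data(PimaIndiansDiabetes)
predictors_df <- PimaIndiansDiabetes[, 1:8]

# indexes of the columns flagged as redundant at the 0.5 cutoff
correlationMatrix <- cor(predictors_df)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = 0.5)

# drop the flagged columns; guard against an empty result, because
# negative indexing with an empty vector would select zero columns
if (length(highlyCorrelated) > 0) {
  reduced <- predictors_df[, -highlyCorrelated, drop = FALSE]
} else {
  reduced <- predictors_df
}

# remaining feature names
print(names(reduced))
```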

2) Ranking features by importance

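Feature importance can be estimated by building a model. Some models, such as decision trees, have a built-in mechanism for reporting feature importance. For other models, the importance of each feature is estimated with an ROC curve analysis.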

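The example below loads the Pima Indians Diabetes dataset and builds a Learning Vector Quantization (LVQ) model. varImp is then used to obtain the feature importance. The plot shows that glucose, mass and age are the three most important features, and that insulin is the least important.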

# ensure results are repeatable
set.seed(7)
# load the library
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# prepare training scheme
control <- trainControl(method='repeatedcv', number=10, repeats=3)
# train the model
model <- train(diabetes~., data=PimaIndiansDiabetes, method='lvq', preProcess='scale', trControl=control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
# plot importance
plot(importance)
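
The ROC-based importance mentioned above can also be computed without training a model, using caret's filterVarImp. This is a sketch under the assumption that a per-feature ROC analysis against the diabetes outcome is what is wanted; roc_importance and ranked are illustrative names.

```r
# Assumption: the caret and mlbench packages are installed
library(mlbench)
library(caret)

# load the dataset
data(PimaIndiansDiabetes)

# model-free importance: one ROC analysis per predictor against the outcome
roc_importance <- filterVarImp(x = PimaIndiansDiabetes[, 1:8],
                               y = PimaIndiansDiabetes$diabetes)

# sort the features from most to least important (first column of the result)
ranked <- roc_importance[order(-roc_importance[, 1]), , drop = FALSE]
print(ranked)
```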

3) Feature selection

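Automatic feature selection builds many models on different subsets of the features and identifies which features help produce an accurate model and which contribute little.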

A popular automatic method for feature selection is Recursive Feature Elimination, or RFE.

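The example below applies RFE to the Pima Indians Diabetes dataset. A random forest is used to evaluate the model on each iteration, and the algorithm is configured to try every subset size from 1 to 8 features. The plot shows that 4 features already give results close to the best performance.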

# ensure the results are repeatable
set.seed(7)
# load the library
library(mlbench)
library(caret)
# load the data
data(PimaIndiansDiabetes)
# define the control using a random forest selection function
control <- rfeControl(functions=rfFuncs, method='cv', number=10)
# run the RFE algorithm
results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8), rfeControl=control)
# summarize the results
print(results)
# list the chosen features
predictors(results)
# plot the results
plot(results, type=c('g', 'o'))
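
Reading "nearly as good" off a plot can be made explicit with caret's pickSizeTolerance, which returns the smallest subset size whose performance is within a given percentage of the best. The sketch below is an assumption-laden variant of the run above: it uses 5 folds to keep it quick, and the 2% tolerance and the name small_size are arbitrary illustrative choices.

```r
# Assumption: the caret, mlbench and randomForest packages are installed
library(mlbench)
library(caret)

# ensure the results are repeatable
set.seed(7)

# load the data
data(PimaIndiansDiabetes)

# random forest selection functions with 5-fold cross-validation
control <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
results <- rfe(PimaIndiansDiabetes[, 1:8], PimaIndiansDiabetes[, 9],
               sizes = c(1:8), rfeControl = control)

# resampled accuracy for each subset size
print(results$results[, c("Variables", "Accuracy")])

# smallest subset size whose accuracy is within 2% of the best
small_size <- pickSizeTolerance(results$results, metric = "Accuracy",
                                tol = 2, maximize = TRUE)
print(small_size)
```

A larger tolerance trades a little accuracy for a smaller feature set; the right value depends on the application.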