Spark Machine Learning Notes (2): Data Processing and Feature Extraction with Spark Python

昵稱50318161 2017-12-06


Next, split each line on the '|' character. This produces an RDD in which each record is a Python list made up of five attributes: user ID, age, gender, occupation, and ZIP code. We then count the number of users, genders, occupations, and ZIP codes, which the following code does. The dataset is small, so it is not cached here.

user_fields = user_data.map(lambda line: line.split('|'))
num_users = user_fields.map(lambda fields: fields[0]).count()  # number of users
num_genders = user_fields.map(lambda fields: fields[2]).distinct().count()  # number of distinct genders
num_occupations = user_fields.map(lambda fields: fields[3]).distinct().count()  # number of distinct occupations
num_zipcodes = user_fields.map(lambda fields: fields[4]).distinct().count()  # number of distinct ZIP codes
print 'Users: %d, genders: %d, occupations: %d, ZIP codes: %d' % (num_users, num_genders, num_occupations, num_zipcodes)

Output: Users: 943, genders: 2, occupations: 21, ZIP codes: 795

Plot the distribution of user ages:

%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.pyplot import hist

ages = user_fields.map(lambda x: int(x[1])).collect()
# normed=True normalizes the histogram so that the bars integrate to 1
hist(ages, bins=20, color='lightblue', normed=True)
fig = plt.gcf()
fig.set_size_inches(12, 6)
plt.show()

Plot the distribution of user occupations:

# Plot the distribution of user occupations
import numpy as np

count_by_occupation = user_fields.map(lambda fields: (fields[3], 1)).reduceByKey(lambda x, y: x + y).collect()
print count_by_occupation
x_axis1 = np.array([c[0] for c in count_by_occupation])
y_axis1 = np.array([c[1] for c in count_by_occupation])
# Sort the occupations by ascending count
x_axis = x_axis1[np.argsort(y_axis1)]
y_axis = y_axis1[np.argsort(y_axis1)]

pos = np.arange(len(x_axis))
width = 1.0
ax = plt.axes()
ax.set_xticks(pos + (width) / 2)
ax.set_xticklabels(x_axis)
plt.bar(pos, y_axis, width, color='lightblue')
plt.xticks(rotation=30)
fig = plt.gcf()
fig.set_size_inches(12, 6)
plt.show()

Output:

[(u'administrator', 79), (u'retired', 14), (u'lawyer', 12), (u'none', 9), (u'student', 196), (u'technician', 27), (u'programmer', 66), (u'salesman', 12), (u'homemaker', 7), (u'executive', 32), (u'doctor', 7), (u'entertainment', 18), (u'marketing', 26), (u'writer', 45), (u'scientist', 31), (u'educator', 95), (u'healthcare', 16), (u'librarian', 51), (u'artist', 28), (u'other', 105), (u'engineer', 67)]

Spark provides a convenience function on RDDs called countByValue. It counts how many times each distinct value occurs in the RDD and returns the result to the driver as a Python dict (or as a Map in Scala and Java):

count_by_occupation2 = user_fields.map(lambda fields: fields[3]).countByValue()
print 'Map-reduce approach:'
print dict(count_by_occupation)  # result of the reduceByKey version above
print '========================'
print 'countByValue approach:'
print dict(count_by_occupation2)  # result of countByValue

Output:

Map-reduce approach:
{u'administrator': 79, u'retired': 14, u'lawyer': 12, u'healthcare': 16, u'marketing': 26, u'executive': 32, u'scientist': 31, u'student': 196, u'technician': 27, u'librarian': 51, u'programmer': 66, u'salesman': 12, u'homemaker': 7, u'engineer': 67, u'none': 9, u'doctor': 7, u'writer': 45, u'entertainment': 18, u'other': 105, u'educator': 95, u'artist': 28}
========================
countByValue approach:
{u'administrator': 79, u'writer': 45, u'retired': 14, u'lawyer': 12, u'doctor': 7, u'marketing': 26, u'executive': 32, u'none': 9, u'entertainment': 18, u'healthcare': 16, u'scientist': 31, u'student': 196, u'educator': 95, u'technician': 27, u'librarian': 51, u'programmer': 66, u'artist': 28, u'salesman': 12, u'other': 105, u'homemaker': 7, u'engineer': 67}

2.2 Exploring the movie data

Next, let's look at the characteristics of the movie data. As before, we first take a quick look at one record and then count the total number of movies.

movie_data = sc.textFile('/Users/youwei.tan/ml-100k//u.item')
print movie_data.first()
num_movies = movie_data.count()
print 'Movies: %d' % num_movies

Output:

1|Toy Story (1995)|01-Jan-1995||http://us./M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
Movies: 1682

Plot the distribution of movie ages:

Plotting the distribution of movie ages works much like the earlier handling of user ages and occupations. A movie's age is the number of years between its release year and the present (which in this dataset is 1998). As the code below shows, some of the movie data is malformed, so a function is needed to handle errors that may occur while parsing the release date. Here that function is named convert_year.

# Plot the distribution of movie ages
def convert_year(x):
    try:
        return int(x[-4:])
    except ValueError:
        return 1900  # sentinel for a missing or malformed release date

movie_fields = movie_data.map(lambda lines: lines.split('|'))
years = movie_fields.map(lambda fields: fields[2]).map(lambda x: convert_year(x))
# Drop the records flagged with the 1900 sentinel
years_filtered = years.filter(lambda x: x != 1900)
print years_filtered.count()
movie_ages = years_filtered.map(lambda yr: 1998 - yr).countByValue()
values = movie_ages.values()
bins = movie_ages.keys()
hist(values, bins=bins, color='lightblue', normed=True)
fig = plt.gcf()
fig.set_size_inches(12, 6)
plt.show()

Output:

1681

2.3 Exploring the rating data

Check the number of rating records:

# Check the number of records
rating_data = sc.textFile('/Users/youwei.tan/ml-100k/u.data')
print rating_data.first()
num_ratings = rating_data.count()
print 'Ratings: %d' % num_ratings

Output:

196 242 3 881250949
Ratings: 100000

Next, we compute some basic statistics on the data and then plot a histogram of the rating values. Let's get started:

# Compute some basic statistics on the ratings
rating_data = rating_data.map(lambda line: line.split('\t'))
ratings = rating_data.map(lambda fields: int(fields[2]))
max_rating = ratings.reduce(lambda x, y: max(x, y))
min_rating = ratings.reduce(lambda x, y: min(x, y))
# Note: under Python 2, "/" on two ints is integer division, which is why
# the averages printed below come out as whole numbers (3.00, 106.00, 59.00).
mean_rating = ratings.reduce(lambda x, y: x + y) / num_ratings
median_rating = np.median(ratings.collect())
ratings_per_user = num_ratings / num_users
ratings_per_movie = num_ratings / num_movies
print 'Min rating: %d' % min_rating
print 'max rating: %d' % max_rating
print 'Average rating: %2.2f' % mean_rating
print 'Median rating: %d ' % median_rating
print 'Average # of ratings per user: %2.2f' % ratings_per_user
print 'Average # of ratings per movie: %2.2f' % ratings_per_movie

Output:

Min rating: 1
max rating: 5
Average rating: 3.00
Median rating: 4
Average # of ratings per user: 106.00
Average # of ratings per movie: 59.00

The results above show that the lowest rating is 1 and the highest is 5. This is no surprise, since ratings range from 1 to 5.

Spark also provides a function on RDDs called stats, which computes similar summary statistics for an RDD of numeric values:

ratings.stats()

Output:

(count: 100000, mean: 3.52986, stdev: 1.12566797076, max: 5.0, min: 1.0)
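To make these quantities concrete, here is a minimal plain-Python sketch that computes the same five values in a single pass, without Spark. The function name summary_stats and the sample list are hypothetical, and the sketch assumes the reported stdev is the population standard deviation (dividing by n); that is an assumption about stats(), not something the output above confirms.

```python
import math

def summary_stats(values):
    """Return count, mean, population stdev, max and min in a single pass."""
    count, mean, m2 = 0, 0.0, 0.0
    lo, hi = float("inf"), float("-inf")
    for v in values:
        count += 1
        delta = v - mean
        mean += delta / count
        m2 += delta * (v - mean)  # Welford's running sum of squared deviations
        lo, hi = min(lo, v), max(hi, v)
    # Assumption: population stdev (divide by n), not the sample stdev
    stdev = math.sqrt(m2 / count) if count else float("nan")
    return {"count": count, "mean": mean, "stdev": stdev, "max": hi, "min": lo}

sample = [1, 2, 3, 4, 5]  # hypothetical ratings, not the MovieLens data
result = summary_stats(sample)
```

A single-pass update like this never needs the whole dataset in memory at once, which is the property a distributed summary relies on.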

As can be seen, the mean rating users give to movies is about 3.5, while the median rating is 4. This suggests that the distribution of ratings is skewed somewhat toward higher scores. To verify this, we can create a bar chart of the rating values, in much the same way as before:

count_by_rating = ratings.countByValue()
x_axis = np.array(count_by_rating.keys())
y_axis = np.array([float(c) for c in count_by_rating.values()])
# Normalize the counts so that the bars sum to 1
y_axis_normed = y_axis / y_axis.sum()

pos = np.arange(len(x_axis))
width = 1.0
ax = plt.axes()
ax.set_xticks(pos + (width / 2))
ax.set_xticklabels(x_axis)
plt.bar(pos, y_axis_normed, width, color='lightblue')
plt.xticks(rotation=30)
fig = plt.gcf()
fig.set_size_inches(12, 6)
plt.show()

The resulting chart has the characteristics we expected: the distribution of ratings is indeed skewed toward the middle-to-high range.

Next, compute the number of ratings made by each user:

# Compute the number of ratings made by each user
user_ratings_grouped = rating_data.map(lambda fields: (int(fields[0]), int(fields[2]))).groupByKey()
user_rating_byuser = user_ratings_grouped.map(lambda (k, v): (k, len(v)))
user_rating_byuser.take(5)

Output:

[(2, 62), (4, 24), (6, 211), (8, 59), (10, 184)]

Plot the distribution of the total number of ratings per user:

user_ratings_byuser_local = user_rating_byuser.map(lambda (k, v): v).collect()
hist(user_ratings_byuser_local, bins=200, color='lightblue', normed=True)
fig = plt.gcf()
fig.set_size_inches(12, 6)
plt.show()

The result is shown in the figure. Most users have made fewer than 100 ratings, but the distribution also shows that a fair number of users have made more than a hundred.

Compute the number of ratings each movie received:

# Compute the number of ratings received by each movie
movie_ratings_group = rating_data.map(lambda fields: (int(fields[1]), int(fields[2]))).groupByKey()
movie_ratings_byuser = movie_ratings_group.map(lambda (k, v): (k, len(v)))
print movie_ratings_byuser.take(5)

Output:

[(2, 131), (4, 209), (6, 26), (8, 219), (10, 89)]

Plot the distribution of the number of ratings per movie:

# Plot the distribution of the number of ratings per movie
movie_ratings_byuser_local = movie_ratings_byuser.map(lambda (k, v): v).collect()
hist(movie_ratings_byuser_local, bins=200, color='lightblue', normed=True)
fig = plt.gcf()
fig.set_size_inches(12, 6)
plt.show()

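3. Processing and transforming the data

Now that we have carried out exploratory analysis on the dataset and learned something about the characteristics of the users and movies, what comes next?

To make raw data usable by machine learning algorithms, it first needs to be cleaned and possibly transformed in various ways before useful features can be extracted from it. Data transformation and feature extraction are closely linked; in some cases, a transformation is itself a feature extraction step.

We already saw the need for data cleaning when handling the movie dataset. In general, real-world data contains malformed records, missing data points, and outliers. Ideally we would fix the malformed data, but many datasets come from collection processes that are hard to reproduce (such as web activity data and sensor data), so in practice they are hard to fix. Missing values and outliers are also common and can be handled in ways similar to malformed data. Broadly, the options are as follows.

• Filter out or remove records with malformed or missing values: this is often unavoidable, but it does throw away the good information those records contain.

• Fill in malformed or missing data: malformed or missing values can be filled in based on the rest of the data. Approaches include filling with zero, the global mean, or the median, or interpolating from neighboring or similar data points (typically for time-series data). Choosing the right approach is not easy and depends on the data, the use case, and personal experience.

• Apply robust techniques to outliers: the main problem with outliers is that even though they are extreme values, they are not necessarily wrong, and it is usually hard to tell which is the case. Outliers can be removed or filled in, but there are also statistical techniques (such as robust regression) designed to handle outliers or extreme values.

• Transform potential outliers: another way to handle outliers or extreme values is to apply a transformation. For features that may contain outliers or that cover a very wide range of values, apply a transformation such as a logarithm or a Gaussian kernel. Such transformations help reduce the impact of large jumps in a variable's values and can turn a nonlinear relationship into a linear one.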
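As an illustration of the last approach, the following is a minimal sketch of a logarithmic transformation in plain Python. The function name log_transform and the sample values are hypothetical and are not taken from the MovieLens data; log(1 + x) is used here so that zero is handled, which is one common choice rather than the only one.

```python
import math

def log_transform(values):
    """Apply log(1 + x) to non-negative values to compress large magnitudes."""
    return [math.log1p(v) for v in values]

counts = [1, 10, 100, 10000]  # hypothetical feature with one extreme value
transformed = log_transform(counts)

# Ratio between the largest and smallest value before and after the transform
spread_before = max(counts) / float(min(counts))
spread_after = max(transformed) / min(transformed)
```

The ordering of the values is preserved, while the gap between the extreme value and the rest shrinks considerably.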

Replace bad values and missing values with a specified value:

# Replace bad values and missing values with a specified value
years_pre_processed = movie_fields.map(lambda fields: fields[2]).map(lambda x: convert_year(x)).collect()
years_pre_processed_array = np.array(years_pre_processed)
# Compute the mean and median over the valid years only (1900 marks a bad value)
mean_year = np.mean(years_pre_processed_array[years_pre_processed_array != 1900])
median_year = np.median(years_pre_processed_array[years_pre_processed_array != 1900])
# Overwrite the bad values with the median year
index_bad_data = np.where(years_pre_processed_array == 1900)
years_pre_processed_array[index_bad_data] = median_year
print 'Mean year of release: %d' % mean_year
print 'Median year of release: %d ' % median_year
print "Index of '1900' after assigning median: %s" % np.where(years_pre_processed_array == 1900)[0]

Output:

Mean year of release: 1989
Median year of release: 1995
Index of '1900' after assigning median: []

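4. Extracting useful features from the data

Features can broadly be divided into the following kinds:

• Numerical features: these are usually real numbers or integers, such as the age mentioned in the earlier examples.

• Categorical features: these take exactly one value from a set of possible states. In our dataset, user gender, occupation, and movie genre are of this kind.

• Text features: these are derived from text content in the data, such as movie titles, descriptions, or reviews.

• Other features: most other features are ultimately represented numerically. For example, images, video, and audio can be represented as collections of numeric data, and geographic location can be represented by latitude and longitude or by a geohash.

Here we cover numerical, categorical, and text features.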

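4.1 Numerical features

What is the difference between a raw number and a numerical feature? In practice, any numeric data can serve as an input variable. However, what a machine learning model learns is a vector of weights, one per feature. These weights play an important role in mapping feature values to the output or target variable (in a supervised learning model).

We therefore want to use features that make sense, so that the model can learn the relationship between feature values and the target variable. Age is one such feature: there may be a direct relationship between increasing age and some kind of spending. Similarly, height is a numerical feature that can be used directly.

When a numerical feature is still in raw form, it may be of limited use, but it can be turned into a more useful representation. Location is an example. With raw location data (say, latitude and longitude), the model may fail to learn a useful relationship between that information and an output unless the data points are very dense, which limits its usefulness. Once locations are aggregated or coarsened (for example, to a city or country), however, a relationship with a particular output is more likely to emerge.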

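4.2 Categorical features

When a categorical feature is still in raw form, its value comes from a set of possible values rather than being a number, so it cannot be used as input directly. User occupation in the earlier example is a categorical variable whose possible values include student, programmer, and so on.

Such categorical features are also called nominal variables, meaning there is no ordering among their possible values. By contrast, variables whose values are ordered (such as the ratings mentioned earlier, where by definition a rating of 5 is higher or better than a rating of 1) are called ordinal variables.

To represent a categorical feature numerically, one-hot encoding is commonly used (see the blog post "Machine Learning Series (3): Feature Extraction and Processing"). The raw values of an ordinal variable can often be used directly, but they too are frequently one-hot encoded.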

all_occupations = user_fields.map(lambda fields: fields[3]).distinct().collect()
all_occupations.sort()
# Assign each occupation an index in sorted order
idx = 0
all_occupations_dict = {}
for o in all_occupations:
    all_occupations_dict[o] = idx
    idx += 1
print "Encoding of 'doctor': %d" % all_occupations_dict['doctor']
print "Encoding of 'programmer': %d" % all_occupations_dict['programmer']

Output:

Encoding of 'doctor': 2
Encoding of 'programmer': 14

The code above maps categorical features to numeric values. In data processing, however, categories like these, which have no inherent ordering, should usually be converted to dummy (one-hot) variables:

K = len(all_occupations_dict)
binary_x = np.zeros(K)
# Set the position that corresponds to 'programmer' to 1
k_programmer = all_occupations_dict['programmer']
binary_x[k_programmer] = 1
print 'Binary feature vector: %s' % binary_x
print 'Length of binary vector: %d' % K

Output:

Binary feature vector: [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
Length of binary vector: 21

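4.3 Derived features

Deriving new features from one or more existing variables is often helpful. Ideally, a derived feature carries more information than the raw features it comes from.

For example, we could compute the average of each user's existing movie ratings. This gives the model a feature personalized to each user (in fact, this is commonly done in recommender systems). Earlier, we also created new features from the raw rating data in order to learn a better model.

Examples of features derived from raw data include means, medians, variances, sums, differences, maxima, minima, and counts. Earlier we also saw how a new movie age feature was derived from a movie's release year and the current year. The idea behind such transformations is usually to summarize the numeric data in some way, in the hope that this makes learning easier for the model.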
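The per-user average rating mentioned above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: mean_rating_per_user and the sample pairs are hypothetical names and data, and on an RDD the same aggregation would more naturally be expressed with reduceByKey over (sum, count) pairs.

```python
from collections import defaultdict

def mean_rating_per_user(ratings):
    """ratings: iterable of (user_id, rating) pairs -> {user_id: mean rating}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user_id, rating in ratings:
        totals[user_id] += rating
        counts[user_id] += 1
    return {u: totals[u] / counts[u] for u in totals}

sample = [(1, 5), (1, 3), (2, 4)]  # hypothetical (user, rating) pairs
user_means = mean_rating_per_user(sample)
```

Each user's mean can then be attached to that user's records as an extra numerical feature.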

Converting numerical features into categorical ones is also common, for example by bucketing them into intervals. Variables commonly treated this way include age, geographic location, and time.
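As a small illustration of bucketing, the sketch below maps a numeric age to an interval label. The boundaries in AGE_EDGES and the labels in AGE_LABELS are hypothetical choices made for this example only; suitable boundaries depend on the data and the application.

```python
import bisect

AGE_EDGES = [18, 25, 35, 50]  # hypothetical bucket boundaries
AGE_LABELS = ["<18", "18-24", "25-34", "35-49", "50+"]

def age_bucket(age):
    """Map a numeric age to the label of the interval that contains it."""
    return AGE_LABELS[bisect.bisect_right(AGE_EDGES, age)]

buckets = [age_bucket(a) for a in (7, 18, 24, 35, 73)]
```

The resulting labels form a categorical feature that can be encoded in the same way as the occupations earlier.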

Converting timestamps into categorical features

The following uses the rating timestamps to illustrate how numeric data can be converted into a categorical feature.

First, map is used to convert the timestamp field to a Python int. The extract_datetime function then converts each timestamp into a datetime object, from which the hour is extracted. The code is as follows:

def extract_datetime(ts):
    import datetime
    # fromtimestamp interprets ts in the local time zone of the machine running the code
    return datetime.datetime.fromtimestamp(ts)

timestamps = rating_data.map(lambda fields: int(fields[3]))
hour_of_day = timestamps.map(lambda ts: extract_datetime(ts).hour)
hour_of_day.take(5)

Output:

[23, 3, 15, 13, 13]

Bucket the hours into morning, lunch, afternoon, evening, and night:

def assign_tod(hr):
    times_of_day = {
        'morning': range(7, 12),
        'lunch': range(12, 14),
        'afternoon': range(14, 18),
        'evening': range(18, 23),
        # Note: hour 0 is not listed (and 24 never occurs), so ratings made
        # between 00:00 and 00:59 fall through and are mapped to None.
        'night': [23, 24, 1, 2, 3, 4, 5, 6]
    }
    for k, v in times_of_day.iteritems():
        if hr in v:
            return k

time_of_day = hour_of_day.map(lambda hr: assign_tod(hr))
time_of_day.take(5)

Output:

['night', 'night', 'afternoon', 'lunch', 'lunch']

Then encode these time-of-day labels as dummy variables, for example [0,0,0,0,1], in the same way as for the occupation data earlier:

time_of_day_unique = time_of_day.map(lambda fields: fields).distinct().collect()
time_of_day_unique.sort()
# Assign each time-of-day label an index in sorted order
idx = 0
time_of_day_unique_dict = {}
for o in time_of_day_unique:
    time_of_day_unique_dict[o] = idx
    idx += 1
print "Encoding of 'afternoon': %d" % time_of_day_unique_dict['afternoon']
print "Encoding of 'morning': %d" % time_of_day_unique_dict['morning']
print "Encoding of 'lunch': %d" % time_of_day_unique_dict['lunch']

Output:

Encoding of 'afternoon': 1
Encoding of 'morning': 4
Encoding of 'lunch': 3
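The index mapping is only the first half of dummy encoding. The following plain-Python sketch shows the remaining step of turning a label into a 1-of-k vector. The helper one_hot is a hypothetical name, and the sketch assumes every hour maps to one of exactly five labels, so its indices are not meant to reproduce the ones printed above.

```python
def one_hot(category, categories):
    """Return a 1-of-k list with a single 1 at the category's sorted position."""
    ordered = sorted(categories)
    vector = [0] * len(ordered)
    vector[ordered.index(category)] = 1
    return vector

periods = ["morning", "lunch", "afternoon", "evening", "night"]
night_vec = one_hot("night", periods)
afternoon_vec = one_hot("afternoon", periods)
```

With the labels sorted alphabetically, 'night' is last, which gives the [0,0,0,0,1] pattern mentioned in the text.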

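4.4 Text features

In a sense, text features are also a kind of categorical or derived feature. Take movie descriptions as an example (our dataset does not include them). Even treated as categorical data, the raw text cannot be used directly: if every word is a possible value, the number of possible word combinations is practically unlimited. The model would then almost never see the same feature twice, and learning would suffer. We therefore want to convert raw text into a form better suited to machine learning.

There are many ways to process text. Natural language processing is the field devoted to processing, representing, and modeling text content. A full treatment of text processing is beyond the scope of this book, but we introduce one simple, standard method of text feature extraction, known as the bag-of-words representation.

The bag-of-words approach treats a piece of text as a collection of the words and numbers it contains. The process is as follows.

• Tokenization: first, a tokenization method is applied to split the text into a collection of tokens (typically words, numbers, and so on). One option is whitespace tokenization, which splits the text on whitespace and may also remove punctuation and other non-alphanumeric characters.

• Stop word removal: next, very common words such as the, and, and but (known as stop words) are usually removed.

• Stemming: the next step is stemming, which reduces each token to its base form or stem. A common example is turning plurals into singulars (for example, dogs becomes dog). There are many stemming methods, and text processing libraries often include several.

• Vectorization: the final step is to represent the processed tokens as a vector. A binary vector is perhaps the simplest representation: it uses 1 and 0 to indicate whether a term is present. This is essentially the same as the 1-of-k encoding mentioned earlier, and like 1-of-k it requires a dictionary that maps terms to index numbers. As more terms are encountered, the number of distinct terms can reach into the millions, so using a sparse representation becomes critical. Such a representation records only the terms that actually occur, saving memory, disk space, and computation time.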
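The four steps above can be sketched end to end in plain Python. Everything in this block is illustrative: the stop word list is deliberately tiny, naive_stem is a crude stand-in for a real stemmer, the two sample sentences are made up, and the sparse vector is represented simply as the sorted list of indices of the terms that are present.

```python
import re

STOP_WORDS = {"the", "and", "but", "a", "of"}  # tiny illustrative list

def tokenize(text):
    """Lower-case the text and split it on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def naive_stem(token):
    """Very rough stemming: strip a trailing 's' from words longer than 3 letters."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token

def preprocess(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

def binary_vector(tokens, vocabulary):
    """Sparse binary representation: sorted indices of the terms that are present."""
    return sorted({vocabulary[t] for t in tokens if t in vocabulary})

docs = ["The dogs and the cat", "A cat of many colours"]  # made-up sample text
processed = [preprocess(d) for d in docs]
all_terms = sorted({t for doc in processed for t in doc})
vocab = {term: i for i, term in enumerate(all_terms)}
vectors = [binary_vector(doc, vocab) for doc in processed]
```

Storing only the indices of the terms present keeps each vector small even when the vocabulary grows very large.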

Simple text feature extraction:

Extract the titles

def extract_title(raw):
    import re
    # Find the first parenthesized token (normally the release year, e.g. "(1995)")
    # and keep only the text before it
    grps = re.search(r'\((\w+)\)', raw)
    if grps:
        return raw[:grps.start()].strip()
    else:
        return raw

raw_titles = movie_fields.map(lambda fields: fields[1])
for raw_title in raw_titles.take(5):
    print extract_title(raw_title)

Output:

Toy Story
GoldenEye
Four Rooms
Get Shorty
Copycat

Tokenization

movie_titles = raw_titles.map(lambda m: extract_title(m))
# Split each title on single spaces
title_terms = movie_titles.map(lambda m: m.split(' '))
print title_terms.take(5)

Output:

[[u'Toy', u'Story'], [u'GoldenEye'], [u'Four', u'Rooms'], [u'Get', u'Shorty'], [u'Copycat']]

Then deduplicate the words that appear across all titles to obtain the full list of terms:

all_terms = title_terms.flatMap(lambda x: x).distinct().collect()
# Assign each distinct term an index
idx = 0
all_terms_dict = {}
for term in all_terms:
    all_terms_dict[term] = idx
    idx += 1
print 'Total number of terms: %d' % len(all_terms_dict)
print "Index of term 'Dead': %d" % all_terms_dict['Dead']
print "Index of term 'Rooms': %d" % all_terms_dict['Rooms']

The same mapping can also be built with Spark's built-in zipWithIndex:

all_terms_dict2 = title_terms.flatMap(lambda x: x).distinct().zipWithIndex().collectAsMap()
print "Index of term 'Dead': %d" % all_terms_dict2['Dead']
print "Index of term 'Rooms': %d" % all_terms_dict2['Rooms']

Output:

Index of term 'Dead': 147
Index of term 'Rooms': 1963

The result is the same as above.

At this point we need to think about how to store and use this data. Storing it as dense dummy variables, as we did for the categorical variables earlier, would clearly waste a great deal of space, so here we use a compressed sparse format (csc_matrix).

def create_vector(terms, term_dict):
    from scipy import sparse as sp
    num_terms = len(term_dict)
    # 1 x num_terms sparse row; only the entries that are set get stored
    x = sp.csc_matrix((1, num_terms))
    for t in terms:
        if t in term_dict:
            idx = term_dict[t]
            x[0, idx] = 1
    return x

# Broadcast the term dictionary so that each worker gets a read-only copy
all_terms_bcast = sc.broadcast(all_terms_dict)
term_vectors = title_terms.map(lambda terms: create_vector(terms, all_terms_bcast.value))
term_vectors.take(5)

Output:

[<1x2645 sparse matrix of type '<type 'numpy.float64'>' with 2 stored elements in Compressed Sparse Column format>, <1x2645 sparse matrix of type '<type 'numpy.float64'>' with 1 stored elements in Compressed Sparse Column format>, <1x2645 sparse matrix of type '<type 'numpy.float64'>' with 2 stored elements in Compressed Sparse Column format>, <1x2645 sparse matrix of type '<type 'numpy.float64'>' with 2 stored elements in Compressed Sparse Column format>, <1x2645 sparse matrix of type '<type 'numpy.float64'>' with 1 stored elements in Compressed Sparse Column format>]

4.5 Normalizing features

Normalize a feature vector with numpy:

np.random.seed(42)
x = np.random.randn(10)
norm_x_2 = np.linalg.norm(x)  # 2-norm (Euclidean length) of x
normalized_x = x / norm_x_2
print 'x:\n%s' % x
print '2-Norm of x: %2.4f' % norm_x_2
print 'Normalized x:\n%s' % normalized_x
print '2-Norm of normalized_x: %2.4f' % np.linalg.norm(normalized_x)

Output:

x:
[ 0.49671415 -0.1382643 0.64768854 1.52302986 -0.23415337 -0.23413696 1.57921282 0.76743473 -0.46947439 0.54256004]
2-Norm of x: 2.5908
Normalized x:
[ 0.19172213 -0.05336737 0.24999534 0.58786029 -0.09037871 -0.09037237 0.60954584 0.29621508 -0.1812081 0.20941776]
2-Norm of normalized_x: 1.0000
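For readers who want to see what the 2-norm computation amounts to, here is a minimal sketch in plain Python without numpy. The function name l2_normalize and the sample vector are hypothetical, and returning a zero vector unchanged is a choice made for this sketch rather than anything the code above does.

```python
import math

def l2_normalize(vector):
    """Scale a vector so that its Euclidean (L2) norm equals 1."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0:
        return list(vector)  # a zero vector has no direction; return it unchanged
    return [v / norm for v in vector]

unit = l2_normalize([3.0, 4.0])  # the norm of [3, 4] is 5
unit_norm = math.sqrt(sum(v * v for v in unit))
```

Dividing every component by the same norm changes the vector's length but not its direction.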

Normalize features with MLlib

from pyspark.mllib.feature import Normalizer

normalizer = Normalizer()
# Wrap the single vector in an RDD so that MLlib can transform it
vector = sc.parallelize([x])
normalized_x_mllib = normalizer.transform(vector).first().toArray()
print 'x:\n%s' % x
print '2-Norm of x: %2.4f' % norm_x_2
print 'Normalized x:\n%s' % normalized_x
print 'Normalized x MLlib:\n%s' % normalized_x_mllib
print '2-Norm of normalized_x_mllib: %2.4f' % np.linalg.norm(normalized_x_mllib)

