Source: MachineLearningPlus

pandas is a data analysis library built on top of NumPy. In machine learning work the first step is usually cleaning and editing the data, and pandas greatly reduces the effort this takes; becoming fluent with its everyday operations is the first step toward building machine learning models correctly.

1. How to install pandas

The most common way is through Anaconda, by entering the following command in a terminal or command prompt:

conda install pandas

If Anaconda is not installed, use pip, Python's own package manager:

pip install pandas

2. How to import pandas and check its version

import numpy as np   # pandas is often used together with numpy, so import numpy as well
import pandas as pd  # import pandas
print(pd.__version__)  # print the pandas version
#> 0.23.4

3. pandas data types

pandas has two core data structures: Series and DataFrame.

A Series is a one-dimensional structure in which every element carries an index label, much like a one-dimensional array; the index can be numeric or string-valued.

A DataFrame is a two-dimensional structure that stores data as a table (similar to an Excel sheet) with labelled rows and columns.
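As a quick illustration of the two structures (a minimal sketch; the labels and values below are made up purely for demonstration):

s = pd.Series([0.25, 0.5, 0.75], index=['a', 'b', 'c'])   # 1-D: values plus an index
print(s['b'])          # elements are retrieved by their index label
df_demo = pd.DataFrame({'height': [170, 165], 'weight': [65, 52]}, index=['tom', 'amy'])
print(df_demo['height'])   # each column of a DataFrame is itself a Series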
1. How to build a series from a list, an array and a dict

mylist = list('abcedfghijklmnopqrstuvwxyz')  # list
myarr = np.arange(26)                        # array
mydict = dict(zip(mylist, myarr))            # dict

# build a Series from each of them
ser1 = pd.Series(mylist)
ser2 = pd.Series(myarr)
ser3 = pd.Series(mydict)
print(ser3.head())  # print the first 5 entries
#> a    0
   b    1
   c    2
   d    4
   e    3
   dtype: int64

2. How to convert the index of a series into a column of a dataframe

mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))
ser = pd.Series(mydict)
# convert the series to a dataframe
df = ser.to_frame()
# turn the index into a regular column
df.reset_index(inplace=True)
print(df.head())
#>   index  0
   0     a  0
   1     b  1
   2     c  2
   3     e  3
   4     d  4

3. How to combine several series into a dataframe

# build series 1
ser1 = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))
# build series 2
ser2 = pd.Series(np.arange(26))
# method 1: axis=1 concatenates as columns, axis=0 as rows
df = pd.concat([ser1, ser2], axis=1)
# method 2 additionally sets the column names
df = pd.DataFrame({'col1': ser1, 'col2': ser2})
print(df.head())
#>   col1  col2
   0    a     0
   1    b     1
   2    c     2
   3    e     3
   4    d     4

4. How to assign a name to a series

ser = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))
# give the series a name (it is shown as 'Name' and becomes the column name when the series is turned into a dataframe)
ser.name = 'alphabets'
# show the first 5 entries
ser.head()
#> 0    a
   1    b
   2    c
   3    e
   4    d
   Name: alphabets, dtype: object

5. How to get the elements of series A that are not in series B

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])
# boolean mask that is True where an element of ser1 is not in ser2
ser3 = ~ser1.isin(ser2)
# select the elements of ser1 that are not in ser2
ser1[ser3]
#> 0    1
   1    2
   2    3
   dtype: int64

6. How to get the items that are not common to both series A and series B

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])
# union of ser1 and ser2
ser_u = pd.Series(np.union1d(ser1, ser2))
# intersection of ser1 and ser2
ser_i = pd.Series(np.intersect1d(ser1, ser2))
# the complement of the intersection within the union is exactly the set of items not shared by both
ser_u[~ser_u.isin(ser_i)]
#> 0    1
   1    2
   2    3
   5    6
   6    7
   7    8
   dtype: int64

7. How to get the quartiles of a numeric series

# fix the random seed
state = np.random.RandomState(100)
# draw 25 points from a normal distribution with mean 10 and standard deviation 5
ser = pd.Series(state.normal(10, 5, 25))
# minimum, quartiles and maximum of ser
np.percentile(ser, q=[0, 25, 50, 75, 100])
#> array([ 1.25117263,  7.70986507, 10.92259345, 13.36360403, 18.0949083 ])

8. How to get the frequency count of each unique item in a series

# build a series of 30 letters drawn at random (positions 0-7) from the list 'abcdefgh'
ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))
# count the occurrences of each item
ser.value_counts()
#> d    8
   g    6
   b    6
   a    5
   e    2
   h    2
   f    1
   dtype: int64

9. How to keep the two most frequent items of a series and replace everything else with 'Other'

np.random.RandomState(100)  # note: this RandomState object is created but not actually used below
# sample 12 points uniformly from the integers 1-4
ser = pd.Series(np.random.randint(1, 5, [12]))
# values that are not among the two most frequent items are replaced with 'Other'
ser[~ser.isin(ser.value_counts().index[:2])] = 'Other'
ser
#> 0     Other
   1         4
   2         2
   3         2
   4         4
   5     Other
   6     Other
   7     Other
   8         4
   9         4
   10        4
   11        2
   dtype: object

10. How to split a numeric series into 10 groups of equal size

Put differently: discretize the numeric series into 10 categorical bins, each containing the same number of points.

ser = pd.Series(np.random.random(20))
# cut into 10 equal-sized (decile) bins and show only the first 5 rows
pd.qcut(ser, q=[0, .10, .20, .3, .4, .5, .6, .7, .8, .9, 1],
        labels=['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th']).head()
#> 0    3rd
   1    1st
   2    6th
   3    6th
   4    9th
   dtype: category
   Categories (10, object): [1st < 2nd < 3rd < 4th ... 7th < 8th < 9th < 10th]

11. How to reshape a series into a dataframe of a given shape

ser = pd.Series(np.random.randint(1, 10, 35))
# take the underlying numpy array of the series and reshape it
df = pd.DataFrame(ser.values.reshape(7, 5))
print(df)
#>    0  1  2  3  4
   0  1  2  1  2  5
   1  1  2  4  5  2
   2  1  3  3  2  8
   3  8  6  4  9  6
   4  2  1  1  8  5
   5  3  2  8  5  6
   6  1  5  5  4  6

12. How to find the positions where the values of a series are multiples of 3

ser = pd.Series(np.random.randint(1, 10, 7))
print(ser)
# positions where the value is a multiple of 3
np.argwhere(ser % 3 == 0)
#> 0    6
   1    8
   2    6
   3    7
   4    6
   5    2
   6    4
   dtype: int64
#> array([[0],
          [2],
          [4]])

13. How to extract the items at given positions from a series

ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
index = [0, 4, 8, 14, 20]
# take the elements at the given positions
ser.take(index)
#> 0     a
   4     e
   8     i
   14    o
   20    u
   dtype: object

14. How to stack two series vertically and horizontally

ser1 = pd.Series(range(5))
ser2 = pd.Series(list('abcde'))
# stack vertically (one below the other)
df = pd.concat([ser1, ser2], axis=0)
# stack horizontally (side by side, as columns)
df = pd.concat([ser1, ser2], axis=1)
print(df)
#>    0  1
   0  0  a
   1  1  b
   2  2  c
   3  3  d
   4  4  e

15. How to find the positions in series A of the elements of series B

# ser1 must contain every element of ser2, otherwise the lookups below raise an error
ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])
# method 1
[np.where(i == ser1)[0].tolist()[0] for i in ser2]
# method 2
[pd.Index(ser1).get_loc(i) for i in ser2]
#> [5, 4, 0, 8]

16. How to compute the mean squared error between a truth series and a prediction series

truth = pd.Series(range(10))
pred = pd.Series(range(10)) + np.random.random(10)
# mean squared error
np.mean((truth - pred)**2)
#> 0.25508722434194103

17. How to capitalize the first letter of every element in a series

# the elements of the series are strings
ser = pd.Series(['how', 'to', 'kick', 'ass?'])
# method 1
ser.map(lambda x: x.title())
# method 2: string slicing and concatenation
ser.map(lambda x: x[0].upper() + x[1:])
# method 3
pd.Series([i.title() for i in ser])
#> 0     How
   1      To
   2    Kick
   3    Ass?
   dtype: object

18. How to compute the string length of every element in a series
ser = pd.Series(['how', 'to', 'kick', 'ass?'])
# solution
ser.map(lambda x: len(x))
#> 0    3
   1    2
   2    4
   3    4
   dtype: int64

19. How to compute the first-order and second-order differences of a series

ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])
# first-order differences, converted to a list
print(ser.diff().tolist())
# second-order differences (difference of the differences), converted to a list
print(ser.diff().diff().tolist())
#> [nan, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 8.0]
   [nan, nan, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0]

20. How to convert a series of date strings to a time series

ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])
pd.to_datetime(ser)
#> 0   2010-01-01 00:00:00
   1   2011-02-02 00:00:00
   2   2012-03-03 00:00:00
   3   2013-04-04 00:00:00
   4   2014-05-05 00:00:00
   5   2015-06-06 12:20:00
   dtype: datetime64[ns]

21. How to get the elements of a series that contain at least two vowels

ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])
# solution
from collections import Counter
# Counter is a dict-like type whose keys are the characters and whose values are their counts;
# the mask is True for elements containing at least two vowels
mask = ser.map(lambda x: sum([Counter(x.lower()).get(i, 0) for i in list('aeiou')]) >= 2)
ser[mask]
#> 0     Apple
   1    Orange
   4     Money
   dtype: object

22. How to compute the mean of a series grouped by another series

fruit = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10))
weights = pd.Series(np.linspace(1, 10, 10))
# group weights by fruit
weightsGrouped = weights.groupby(fruit)
print(weightsGrouped.indices)
# mean of each group
weightsGrouped.mean()
#>{'apple': array([0, 3], dtype=int64), 'banana': array([1, 2, 4, 8], dtype=int64), 'carrot': array([5, 6, 7, 9], dtype=int64)}
#> apple     2.50
   banana    4.75
   carrot    7.75
   dtype: float64

23. How to compute the Euclidean distance between two series

p = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
# method 1
sum((p - q)**2)**.5
# method 2
np.linalg.norm(p - q)
#> 18.16590212458495

24. How to find the local maxima (peaks) in a numeric series

A local maximum is a point where the sign of the first difference flips from positive to negative, i.e. where the difference of the signs equals -2.

ser = pd.Series([2, 10, 3, 4, 9, 10, 2, 7, 3])
# difference of the signs of the first difference
dd = np.diff(np.sign(np.diff(ser)))
# positions where this equals -2 are the peaks; shift by 1 to index into the original series
peak_locs = np.where(dd == -2)[0] + 1
peak_locs
#> array([1, 5, 7], dtype=int64)

25. How to replace the spaces in a string with the least frequent character

my_str = 'dbc deb abed gade'
# solution
ser = pd.Series(list('dbc deb abed gade'))
# frequency of each character (including the space)
freq = ser.value_counts()
print(freq)
# the least frequent character
least_freq = freq.dropna().index[-1]
# replace the spaces with it
''.join(ser.replace(' ', least_freq))
#> d    4
        3    # the unlabelled row is the space character
   b    3
   e    3
   a    2
   c    1
   g    1
   dtype: int64
#> 'dbcgdebgabedggade'

26. How to compute the autocorrelations of a numeric series

ser = pd.Series(np.arange(20) + np.random.normal(1, 10, 20))
# autocorrelation of the series for each lag i
autocorrelations = [ser.autocorr(i).round(2) for i in range(11)]
print(autocorrelations[1:])
# lag with the largest absolute autocorrelation
print('Lag having highest correlation: ', np.argmax(np.abs(autocorrelations[1:])) + 1)
#> [0.33, 0.41, 0.48, 0.01, 0.21, 0.16, -0.11, 0.05, 0.34, -0.24]
#> Lag having highest correlation:  3

27. How to do arithmetic on series

# arithmetic between two series
import pandas as pd
series1 = pd.Series([3, 4, 4, 4], ['index1', 'index2', 'index3', 'index4'])
series2 = pd.Series([2, 2, 2, 2], ['index1', 'index2', 'index33', 'index44'])
# addition
series_add = series1 + series2
print(series_add)
# subtraction
series_minus = series1 - series2
# series_minus
# multiplication
series_multi = series1 * series2
# series_multi
# division
series_div = series1 / series2
series_div

Arithmetic between series is index-based: pandas aligns the two series on their index labels before operating, and any label that does not appear in both series gets the value NaN. The results:

# addition:
index1     5.0
index2     6.0
index3     NaN
index33    NaN
index4     NaN
index44    NaN
dtype: float64
# division:
index1     1.5
index2     2.0
index3     NaN
index33    NaN
index4     NaN
index44    NaN
dtype: float64
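If the NaNs produced by non-matching labels are not wanted, the arithmetic methods accept a fill_value argument that substitutes a default for the missing side before operating. A minimal sketch, not part of the original exercise, reusing series1 and series2 from above (treating a missing label as 0 for addition and 1 for multiplication is just this example's choice):

# add, treating a label missing from one series as 0 on that side
print(series1.add(series2, fill_value=0))
# multiply, treating a missing label as 1 on that side
print(series1.mul(series2, fill_value=1))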
The exercises that follow work with dataframes.

1. How to read only the first few rows of a csv file

# read only the first 2 rows and only the specified columns
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv',
                 nrows=2, usecols=['Model', 'Length'])
df

#>      Model  Length
   0  Integra     177
   1   Legend     195

2. How to build a dataframe from every nth row of a csv file

# read the file in chunks of 50 rows and keep the first row of each chunk
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv', chunksize=50)
df2 = pd.DataFrame()
for chunk in df:
    # take the first row of the chunk as a series and append it
    df2 = df2.append(chunk.iloc[0, :])
# show the first 5 rows
print(df2.head())
#>        crim    zn  indus chas                  nox     rm   age  \
   0    0.21977   0.0   6.91    0  0.44799999999999995  5.602  62.0
   1     0.0686   0.0   2.89    0                0.445  7.416  62.5
   2  2.7339700000000002   0.0  19.58    0            0.871  5.597  94.9
   3     0.0315  95.0   1.47    0  0.40299999999999997  6.975  15.3
   4  0.19072999999999998  22.0   5.86    0            0.431  6.718  17.5

         dis  rad  tax  ptratio       b  lstat  medv
   0  6.0877    3  233     17.9  396.90  16.20  19.4
   1  3.4952    2  276     18.0  396.90   6.19  33.2
   2  1.5257    5  403     14.7  351.85  21.45  15.4
   3  7.6534    3  402     17.0  396.90   4.56  34.9
   4  7.8265    7  330     19.1  393.74   6.56  26.2
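An alternative way to sample every 50th row, not part of the original solution: read_csv's skiprows parameter also accepts a callable, so the file can be thinned out directly at read time. A sketch (the exact rows kept differ slightly from the chunk approach, because line 0 of the file is the header):

# keep the header line plus every 50th data line
df3 = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv',
                  skiprows=lambda i: i > 0 and i % 50 != 0)
print(df3.head())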
3. How to change the values of a column while importing a csv file

Change the values of the column 'medv': when the value is <= 25 assign 'Low', and when it is > 25 assign 'High'.

# use the converters parameter to transform the medv column as it is read
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv',
                 converters={'medv': lambda x: 'High' if float(x) > 25 else 'Low'})
print(df.head())
#> (only the last three columns are shown)
          b  lstat  medv
   0  396.90   4.98   Low
   1  396.90   9.14   Low
   2  392.83   4.03  High
   3  394.63   2.94  High
   4  396.90   5.33  High

4. How to import only the specified columns from a csv file

# import only the columns crim and medv
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv', usecols=['crim', 'medv'])
# print the first five rows of the dataframe
print(df.head())
#>       crim  medv
   0  0.00632  24.0
   1  0.02731  21.6
   2  0.02729  34.7
   3  0.03237  33.4
   4  0.06905  36.2

5. How to get the number of rows and columns, the dtype of every column and the summary statistics of a dataframe

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
# number of rows and columns
print(df.shape)
# dtype of each column (first 5 shown)
print(df.dtypes.head())
# count how many columns there are of each dtype, method 1
# (get_dtype_counts is deprecated in newer pandas versions; prefer method 2)
print(df.get_dtype_counts())
# count how many columns there are of each dtype, method 2
# print(df.dtypes.value_counts())
# summary statistics of every numeric column (std, quartiles, etc.)
df_stats = df.describe()
# dataframe to numpy array
df_arr = df.values
# numpy array to nested list
df_list = df.values.tolist()
#> (93, 27)

   Manufacturer     object
   Model            object
   Type             object
   Min.Price       float64
   Price           float64
   dtype: object

   float64    18
   object      9
   dtype: int64

6. How to find the row and column position of the maximum value of the Price column

import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
# print(df)
# row and column indices of the cell equal to the maximum Price
row, col = np.where(df.values == np.max(df.Price))
# retrieve the maximum by row/column position
print(df.iat[row[0], col[0]])
df.iloc[row[0], col[0]]
# retrieve the maximum by row index and column name
df.at[row[0], 'Price']
df.get_value(row[0], 'Price')  # get_value is deprecated in newer pandas; df.at is preferred
#> 61.9

7. How to rename a column of a dataframe

df1 = pd.DataFrame(data=np.array([[18, 50], [19, 51], [20, 55]]),
                   index=['man1', 'man2', 'man3'], columns=['age', 'weight'])
print(df1)
# rename a column
print('\nchange columns :\n')
# method 1: rename returns a new dataframe (assign it back to keep the change)
df1.rename(columns={'weight': 'stress'})
# method 2: modify the column labels in place
df1.columns.values[1] = 'stress'
print(df1)
#>       age  weight
   man1   18      50
   man2   19      51
   man3   20      55
change columns :
         age  stress
   man1   18      50
   man2   19      51
   man3   20      55

8. How to check whether a dataframe has any missing values

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
# returns True if the dataframe contains any missing value
df.isnull().values.any()
#> True

9. How to count the missing values in each column of a dataframe

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
# number of missing values in each column
n_missings_each_col = df.apply(lambda x: x.isnull().sum())
print(n_missings_each_col.head())
#> Manufacturer    4
   Model           1
   Type            3
   Min.Price       7
   Price           2
   dtype: int64

10. How to replace the missing values of a column with the column mean

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', nrows=10)
print(df[['Min.Price', 'Max.Price']].head())
# fill missing values with the mean of each column (the chained assignment also writes the result back into df)
df_out = df[['Min.Price', 'Max.Price']] = df[['Min.Price', 'Max.Price']].apply(lambda x: x.fillna(x.mean()))
print(df_out.head())
#>    Min.Price  Max.Price
   0       12.9       18.8
   1       29.2       38.7
   2       25.9       32.3
   3        NaN       44.6
   4        NaN        NaN
#>    Min.Price  Max.Price
   0       12.9       18.8
   1       29.2       38.7
   2       25.9       32.3
   3       23.0       44.6
   4       23.0       29.9

11. How to pass an external dictionary to apply as an extra argument when processing selected columns

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
print(df[['Min.Price', 'Max.Price']].head())
# dictionary mapping each column to the imputation function to use
d = {'Min.Price': np.nanmean, 'Max.Price': np.nanmedian}
# missing values in Min.Price are replaced with the mean, those in Max.Price with the median
df[['Min.Price', 'Max.Price']] = df[['Min.Price', 'Max.Price']].apply(lambda x, d: x.fillna(d[x.name](x)), args=(d,))
print(df[['Min.Price', 'Max.Price']].head())
#>    Min.Price  Max.Price
   0       12.9       18.8
   1       29.2       38.7
   2       25.9       32.3
   3        NaN       44.6
   4        NaN        NaN

#>    Min.Price  Max.Price
   0  12.900000      18.80
   1  29.200000      38.70
   2  25.900000      32.30
   3  17.118605      44.60
   4  17.118605      19.15

12. How to select specific columns as a dataframe (rather than as a series)

df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))
# print(df)
# selecting with a list of column labels returns a dataframe
type(df[['a']])
type(df.loc[:, ['a']])
print(type(df.iloc[:, [0]]))
# selecting with a single label returns a series
type(df.a)
type(df['a'])
type(df.loc[:, 'a'])
print(type(df.iloc[:, 1]))
#> <class 'pandas.core.frame.DataFrame'>
   <class 'pandas.core.series.Series'>

13. How to swap two columns of a dataframe

df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))
print(df)

# swap the positions of col1 and col2
def switch_columns(df, col1=None, col2=None):
    colnames = df.columns.tolist()
    i1, i2 = colnames.index(col1), colnames.index(col2)
    colnames[i2], colnames[i1] = colnames[i1], colnames[i2]
    return df[colnames]
df1 = switch_columns(df, 'a', 'c')
print(df1)
#>     a   b   c   d   e
   0   0   1   2   3   4
   1   5   6   7   8   9
   2  10  11  12  13  14
   3  15  16  17  18  19
#>     c   b   a   d   e
   0   2   1   0   3   4
   1   7   6   5   8   9
   2  12  11  10  13  14
   3  17  16  15  18  19

14. How to display the values of a dataframe with four decimal places instead of scientific notation

df = pd.DataFrame(np.random.random(4)**10, columns=['random'])
print(df)
# format each row to four decimal places; apply returns the formatted values and does not
# modify df in place, so capture the result if you want to keep it
df.apply(lambda x: '%.4f' % x, axis=1)
print(df)
#>         random
   0  3.539348e-04
   1  3.864140e-10
   2  2.973575e-02
   3  1.414061e-01
#>         random
   0  3.539348e-04
   1  3.864140e-10
   2  2.973575e-02
   3  1.414061e-01

15. How to display all values of a dataframe as percentages

df = pd.DataFrame(np.random.random(4), columns=['random'])
# format as percentages with two decimal places
out = df.style.format({
    'random': '{0:.2%}'.format,
})
out
#>    random
   0  48.54%
   1  91.51%
   2  90.83%
   3  20.45%

16. How to build a dataframe from every nth row of a dataframe

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
# take every 20th row of the dataframe
print(df.iloc[::20, :][['Manufacturer', 'Model', 'Type']])
#>   Manufacturer    Model     Type
   0         Acura  Integra    Small
   20     Chrysler  LeBaron  Compact
   40        Honda  Prelude   Sporty
   60      Mercury   Cougar  Midsize
   80       Subaru   Loyale    Small

17. How to get the row positions of the n largest values of a column

df = pd.DataFrame(np.random.randint(1, 15, 15).reshape(5, -1), columns=list('abc'))
print(df)
# row positions of the 3 largest values of column 'a'
# (the values of the returned series are those row positions)
n = 5
df['a'].argsort()[::-1].iloc[:3]
#>     a   b   c
   0   5   5   2
   1  12   7   1
   2   5   2  12
   3   5  14  12
   4   1  13  13
#> 4    1
   3    3
   2    2
   Name: a, dtype: int64

18. How to get the indices of the last n rows whose row sum is greater than 100

df = pd.DataFrame(np.random.randint(10, 40, 16).reshape(-1, 4))
print(df)
# sum of each row
rowsums = df.apply(np.sum, axis=1)
# indices of the last two rows whose sum exceeds 100
# last_two_rows = df.iloc[np.where(rowsums > 100)[0][-2:], :]
nline = np.where(rowsums > 100)[0][-2:]
nline
#>     0   1   2   3
   0  19  34  15  12
   1  38  35  14  26
   2  39  32  18  20
   3  28  27  36  38
#> array([2, 3], dtype=int64)

19. How to cap the values of a series below the 5th percentile and above the 95th percentile

ser = pd.Series(np.logspace(-2, 2, 30))
# values below the low_perc quantile are set to low, values above the high_perc quantile are set to high
def cap_outliers(ser, low_perc, high_perc):
    low, high = ser.quantile([low_perc, high_perc])
    print(low_perc, '%ile: ', low, '|', high_perc, '%ile: ', high)
    ser[ser < low] = low
    ser[ser > high] = high
    return(ser)
capped_ser = cap_outliers(ser, .05, .95)
#> 0.05 %ile:  0.016049294076965887 | 0.95 %ile:  63.876672220183934

20. How to swap two rows of a dataframe

df = pd.DataFrame(np.arange(9).reshape(3, -1))
print(df)

# helper function
def swap_rows(df, i1, i2):
    a, b = df.iloc[i1, :].copy(), df.iloc[i2, :].copy()
    # swap the two rows through iloc
    df.iloc[i1, :], df.iloc[i2, :] = b, a
    return df
# swap the second and third rows
print(swap_rows(df, 1, 2))
#>    0  1  2
   0  0  1  2
   1  3  4  5
   2  6  7  8
#>    0  1  2
   0  0  1  2
   1  6  7  8
   2  3  4  5

21. How to reverse the rows of a dataframe

df = pd.DataFrame(np.arange(9).reshape(3, -1))
print(df)
# method 1
df.iloc[::-1, :]
# method 2
print(df.loc[df.index[::-1], :])
#>    0  1  2
   0  0  1  2
   1  3  4  5
   2  6  7  8
#>    0  1  2
   2  6  7  8
   1  3  4  5
   0  0  1  2

22. How to one-hot encode a categorical column

df = pd.DataFrame(np.arange(25).reshape(5, -1), columns=list('abcde'))
print(df)
# one-hot encode column 'a' and keep the remaining columns
df_onehot = pd.concat([pd.get_dummies(df['a']), df[list('bcde')]], axis=1)
print(df_onehot)
#>     a   b   c   d   e
   0   0   1   2   3   4
   1   5   6   7   8   9
   2  10  11  12  13  14
   3  15  16  17  18  19
   4  20  21  22  23  24
#>    0  5  10  15  20   b   c   d   e
   0  1  0   0   0   0   1   2   3   4
   1  0  1   0   0   0   6   7   8   9
   2  0  0   1   0   0  11  12  13  14
   3  0  0   0   1   0  16  17  18  19
   4  0  0   0   0   1  21  22  23  24

23. How to find the column that contains the row-wise maximum most often

df = pd.DataFrame(np.random.randint(1, 100, 9).reshape(3, -1))
print(df)
# for each row, find which column holds the maximum, then count how often each column wins
count_series = df.apply(np.argmax, axis=1).value_counts()
print(count_series)
# column that holds the row maximum most often
print('Column with highest row maxes: ', count_series.index[0])
#>     0   1   2
   0  46  31  34
   1  38  13   6
   2   1  18  15
#> (how often each column holds the row-wise maximum)
   0    2
   1    1
   dtype: int64
#> Column with highest row maxes:  0

24. How to get, for each column, its largest correlation with any other column

df = pd.DataFrame(np.random.randint(1, 100, 16).reshape(4, -1), columns=list('pqrs'), index=list('abcd'))
# df
print(df)
# absolute correlation matrix of the four columns
abs_corrmat = np.abs(df.corr())
print(abs_corrmat)
# for each column, the largest correlation with another column
# (the second largest value, since the largest is the column with itself)
max_corr = abs_corrmat.apply(lambda x: sorted(x)[-2])
# show the result
print('Maximum Correlation possible for each column: ', np.round(max_corr.tolist(), 2))
#>     p   q   r   s
   a  59  99   1  34
   b  89  60  97  40
   c  43  35  14   6
   d  70  59  30  53
#>           p         q         r         s
   p  1.000000  0.200375  0.860051  0.744529
   q  0.200375  1.000000  0.236619  0.438541
   r  0.860051  0.236619  1.000000  0.341399
   s  0.744529  0.438541  0.341399  1.000000
#> Maximum Correlation possible for each column:  [0.86 0.44 0.86 0.74]

25. How to create a column holding the ratio of each row's minimum to its maximum

df = pd.DataFrame(np.random.randint(1, 100, 9).reshape(3, -1))
print(df)
# method 1: axis=1 applies the function along each row
min_by_max = df.apply(lambda x: np.min(x)/np.max(x), axis=1)
# method 2
min_by_max = np.min(df, axis=1)/np.max(df, axis=1)
min_by_max
#>     0   1   2
   0  81  68  59
   1  45  73  23
   2  20  22  69
#> 0    0.728395
   1    0.315068
   2    0.289855
   dtype: float64

26. How to create a column holding the second largest value of each row

df = pd.DataFrame(np.random.randint(1, 100, 9).reshape(3, -1))
print(df)
# second largest value of each row
out = df.apply(lambda x: x.sort_values().unique()[-2], axis=1)
# add it as a new column
df['penultimate'] = out
print(df)
#>     0   1   2
   0  28  77   1
   1  43  19  69
   2  29  30  72
#>     0   1   2  penultimate
   0  28  77   1           28
   1  43  19  69           43
   2  29  30  72           30

27. How to normalize all columns of a dataframe

df = pd.DataFrame(np.random.randint(1, 100, 80).reshape(8, -1))
# standardize each column to zero mean and unit variance (z-score)
out1 = df.apply(lambda x: ((x - x.mean())/x.std()).round(2))
print('Solution Q1\n', out1)

# min-max scale each column (note this variant maps the column maximum to 0 and the minimum to 1)
out2 = df.apply(lambda x: ((x.max() - x)/(x.max() - x.min())).round(2))
print('Solution Q2\n', out2)

28. How to compute the correlation between each row and the next row

df = pd.DataFrame(np.random.randint(1, 100, 25).reshape(5, -1))
# correlation of each row with the following row
[df.iloc[i].corr(df.iloc[i+1]).round(2) for i in range(df.shape[0])[:-1]]

29. How to set the main diagonal and the anti-diagonal of a dataframe to 0

df = pd.DataFrame(np.random.randint(1, 100, 25).reshape(5, -1))
print(df)
# zero out both diagonals
for i in range(df.shape[0]):
    df.iat[i, i] = 0
    df.iat[df.shape[0]-i-1, i] = 0
print(df)
#>     0   1   2   3   4
   0  51  35  71  71  79
   1  78  25  71  85  44
   2  90  97  72  14   4
   3  27  91  37  25  48
   4   1  26  68  70  20
#>     0   1   2   3   4
   0   0  35  71  71   0
   1  78   0  71   0  44
   2  90  97   0  14   4
   3  27   0  37   0  48
   4   0  26  68  70   0

30. How to get the mean and the standard deviation of a dataframe grouped by a column

df = pd.DataFrame({'col1': ['apple', 'banana', 'orange'] * 2,
                   'col2': np.random.randint(0, 15, 6),
                   'col3': np.random.randint(0, 15, 6)})
print(df)
# mean of each group defined by col1
df_grouped_mean = df.groupby(['col1']).mean()
print(df_grouped_mean)
# standard deviation of each group defined by col1
df_grouped_std = df.groupby(['col1']).std()
print(df_grouped_std)
#>      col1  col2  col3
   0   apple     2    14
   1  banana    11     8
   2  orange     8    10
   3   apple     5     2
   4  banana     6    12
   5  orange    11    13
#>           col2  col3
   col1
   apple      3.5   8.0
   banana     8.5  10.0
   orange     9.5  11.5
#>               col2      col3
   col1
   apple    2.121320  8.485281
   banana   3.535534  2.828427
   orange   2.121320  2.121320

31. How to get the second largest value of a grouped series

df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 2,
                   'taste': np.random.rand(6),
                   'price': np.random.randint(0, 15, 6)})
print(df)
# group the taste column by fruit
df_grpd = df['taste'].groupby(df.fruit)
# taste values belonging to the banana group
x = df_grpd.get_group('banana')
# sort and take the second largest value
s = x.sort_values().iloc[-2]
print(s)
#>     fruit     taste  price
   0   apple  0.521990      7
   1  banana  0.640444      0
   2  orange  0.460509      9
   3   apple  0.818963      4
   4  banana  0.646138      7
   5  orange  0.917056     12
#> 0.6404436436085967

32. How to compute the grouped mean and keep the grouping column as a regular column

df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 2,
                   'rating': np.random.rand(6),
                   'price': np.random.randint(0, 15, 6)})
# mean price per fruit; as_index=False keeps fruit as a column instead of the index
out = df.groupby('fruit', as_index=False)['price'].mean()
print(out)
#>     fruit  price
   0   apple    4.0
   1  banana    6.5
   2  orange   11.0

33. How to find the positions where the values of two columns are equal

df = pd.DataFrame({'fruit1': np.random.choice(['apple', 'orange', 'banana'], 3),
                   'fruit2': np.random.choice(['apple', 'orange', 'banana'], 3)})
print(df)
# positions of the rows where the two columns are equal
np.where(df.fruit1 == df.fruit2)
#>   fruit1  fruit2
   0   apple  banana
   1   apple   apple
   2  orange   apple
#> (array([1], dtype=int64),)

34. How to create new columns that are shifted (lagged or led) versions of existing columns

df = pd.DataFrame(np.random.randint(1, 100, 20).reshape(-1, 4), columns=list('abcd'))
# column 'a' shifted down by one row (lag)
df['a_lag1'] = df['a'].shift(1)
# column 'b' shifted up by one row (lead)
df['b_lead1'] = df['b'].shift(-1)
print(df)
#>     a   b   c   d  a_lag1  b_lead1
   0  29  90  43  24     NaN     36.0
   1  94  36  67  66    29.0     76.0
   2  81  76  44  49    94.0     97.0
   3  55  97  10  74    81.0     43.0
   4  32  43  62  62    55.0      NaN

35. How to get the frequency of every distinct value in an entire dataframe

df = pd.DataFrame(np.random.randint(1, 10, 20).reshape(-1, 4), columns=list('abcd'))
# flatten the dataframe to a 1-D array and count every value
pd.value_counts(df.values.ravel())
#> 9    3
   7    3
   3    3
   1    3
   6    2
   5    2
   4    2
   8    1
   2    1
   dtype: int64

36. How to split a text column into two separate columns

df = pd.DataFrame(['STD, City State',
                   '33, Kolkata West Bengal',
                   '44, Chennai Tamil Nadu',
                   '40, Hyderabad Telengana',
                   '80, Bangalore Karnataka'], columns=['row'])
print(df)
# expand=True splits each string on the separator into separate columns
df_out = df.row.str.split(',|\t', expand=True)
# use the first row as the new header
new_header = df_out.iloc[0]
# drop that row and assign the header
df_out = df_out[1:]
df_out.columns = new_header
print(df_out)

#>                        row
   0            STD, City State
   1    33, Kolkata West Bengal
   2     44, Chennai Tamil Nadu
   3    40, Hyderabad Telengana
   4    80, Bangalore Karnataka
#> 0 STD           City State
   1  33  Kolkata West Bengal
   2  44   Chennai Tamil Nadu
   3  40  Hyderabad Telengana
   4  80  Bangalore Karnataka

37. How to build a dataframe with a multi-level index

Build the multi-level index from a list of tuples, then define the dataframe with it.

# build the multi-level index from tuples first
import numpy as np
outside = ['A', 'A', 'A', 'B', 'B', 'B']
inside = [1, 2, 3, 1, 2, 3]
my_index = list(zip(outside, inside))
# my_index
# convert the tuples to a pandas MultiIndex
my_index = pd.MultiIndex.from_tuples(my_index)
# my_index
# build the multi-indexed dataframe
df = pd.DataFrame(np.random.randn(6, 2), index=my_index, columns=['fea1', 'fea2'])
df

Accessing data in the multi-indexed dataframe:

df.loc['A'].iloc[1]
#> fea1   -0.794461
   fea2    0.882104
   Name: 2, dtype: float64
df.loc['A'].iloc[1]['fea1']
#> -0.7944609970323794

In machine learning projects, pandas is mainly used for two steps: (1) reading the data files and (2) cleaning and editing the data, a step in which numpy arrays are often used alongside pandas. Hopefully this article gives you a solid start with the library; regular practice is what really matters.
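To close, a minimal sketch tying those two steps together (it reuses the Cars93 dataset from the exercises above; the choice of columns and of mean imputation is only for illustration):

# step 1: read the file
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
# step 2: clean and edit, leaning on numpy where convenient
for col in ['Min.Price', 'Max.Price']:
    df[col] = df[col].fillna(np.nanmean(df[col]))  # impute each numeric column with its mean
X = df[['Min.Price', 'Max.Price']].values          # hand the cleaned columns on as a numpy array
print(X.shape)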
Reference: https://blog.csdn.net/qq_42156420/article/details/82813482