Source: MachineLearningPlus

pandas is a data analysis library built on top of NumPy. In machine learning work the first step is usually cleaning and editing the data, and pandas greatly reduces the effort this takes; becoming fluent with its everyday operations is the first step toward building machine learning models correctly.

1. How to install pandas

The most common way is through Anaconda, by entering the following command in a terminal or command prompt:

conda install pandas

If Anaconda is not installed, use pip, Python's own package manager:

pip install pandas

2. How to import pandas and check its version

import numpy as np   # pandas is often used together with numpy, so import numpy as well
import pandas as pd  # import pandas
print(pd.__version__)  # print the pandas version
#> 0.23.4

3. pandas data types

pandas has two core data structures: Series and DataFrame.

A Series is a one-dimensional structure in which every element carries an index label, much like a one-dimensional array; the index can be numeric or string-valued.

A DataFrame is a two-dimensional structure that stores data as a table (similar to an Excel sheet) with labelled rows and columns.
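As a quick illustration of the two structures (a minimal sketch; the labels and values below are made up purely for demonstration):

s = pd.Series([0.25, 0.5, 0.75], index=['a', 'b', 'c'])   # 1-D: values plus an index
print(s['b'])          # elements are retrieved by their index label
df_demo = pd.DataFrame({'height': [170, 165], 'weight': [65, 52]}, index=['tom', 'amy'])
print(df_demo['height'])   # each column of a DataFrame is itself a Series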
1. How to build a series from a list, an array and a dict

mylist = list('abcedfghijklmnopqrstuvwxyz')  # list
myarr = np.arange(26)                        # array
mydict = dict(zip(mylist, myarr))            # dict

# build a Series from each of them
ser1 = pd.Series(mylist)
ser2 = pd.Series(myarr)
ser3 = pd.Series(mydict)
print(ser3.head())  # print the first 5 entries
#> a    0
   b    1
   c    2
   d    4
   e    3
   dtype: int64

2. How to convert the index of a series into a column of a dataframe

mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))
ser = pd.Series(mydict)
# convert the series to a dataframe
df = ser.to_frame()
# turn the index into a regular column
df.reset_index(inplace=True)
print(df.head())
#>   index  0
   0     a  0
   1     b  1
   2     c  2
   3     e  3
   4     d  4

3. How to combine several series into a dataframe

# build series 1
ser1 = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))
# build series 2
ser2 = pd.Series(np.arange(26))
# method 1: axis=1 concatenates as columns, axis=0 as rows
df = pd.concat([ser1, ser2], axis=1)
# method 2 additionally sets the column names
df = pd.DataFrame({'col1': ser1, 'col2': ser2})
print(df.head())
#>   col1  col2
   0    a     0
   1    b     1
   2    c     2
   3    e     3
   4    d     4

4. How to assign a name to a series

ser = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))
# give the series a name (it is shown as 'Name' and becomes the column name when the series is turned into a dataframe)
ser.name = 'alphabets'
# show the first 5 entries
ser.head()
#> 0    a
   1    b
   2    c
   3    e
   4    d
   Name: alphabets, dtype: object

5. How to get the elements of series A that are not in series B

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])
# boolean mask that is True where an element of ser1 is not in ser2
ser3 = ~ser1.isin(ser2)
# select the elements of ser1 that are not in ser2
ser1[ser3]
#> 0    1
   1    2
   2    3
   dtype: int64

6. How to get the items that are not common to both series A and series B

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])
# union of ser1 and ser2
ser_u = pd.Series(np.union1d(ser1, ser2))
# intersection of ser1 and ser2
ser_i = pd.Series(np.intersect1d(ser1, ser2))
# the complement of the intersection within the union is exactly the set of items not shared by both
ser_u[~ser_u.isin(ser_i)]
#> 0    1
   1    2
   2    3
   5    6
   6    7
   7    8
   dtype: int64

7. How to get the quartiles of a numeric series

# fix the random seed
state = np.random.RandomState(100)
# draw 25 points from a normal distribution with mean 10 and standard deviation 5
ser = pd.Series(state.normal(10, 5, 25))
# minimum, quartiles and maximum of ser
np.percentile(ser, q=[0, 25, 50, 75, 100])
#> array([ 1.25117263,  7.70986507, 10.92259345, 13.36360403, 18.0949083 ])

8. How to get the frequency count of each unique item in a series

# build a series of 30 letters drawn at random (positions 0-7) from the list 'abcdefgh'
ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))
# count the occurrences of each item
ser.value_counts()
#> d    8
   g    6
   b    6
   a    5
   e    2
   h    2
   f    1
   dtype: int64

9. How to keep the two most frequent items of a series and replace everything else with 'Other'

np.random.RandomState(100)  # note: this RandomState object is created but not actually used below
# sample 12 points uniformly from the integers 1-4
ser = pd.Series(np.random.randint(1, 5, [12]))
# values that are not among the two most frequent items are replaced with 'Other'
ser[~ser.isin(ser.value_counts().index[:2])] = 'Other'
ser
#> 0     Other
   1         4
   2         2
   3         2
   4         4
   5     Other
   6     Other
   7     Other
   8         4
   9         4
   10        4
   11        2
   dtype: object

10. How to split a numeric series into 10 groups of equal size

Put differently: discretize the numeric series into 10 categorical bins, each containing the same number of points.

ser = pd.Series(np.random.random(20))
# cut into 10 equal-sized (decile) bins and show only the first 5 rows
pd.qcut(ser, q=[0, .10, .20, .3, .4, .5, .6, .7, .8, .9, 1],
        labels=['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th']).head()
#> 0    3rd
   1    1st
   2    6th
   3    6th
   4    9th
   dtype: category
   Categories (10, object): [1st < 2nd < 3rd < 4th ... 7th < 8th < 9th < 10th]

11. How to reshape a series into a dataframe of a given shape

ser = pd.Series(np.random.randint(1, 10, 35))
# take the underlying numpy array of the series and reshape it
df = pd.DataFrame(ser.values.reshape(7, 5))
print(df)
#>    0  1  2  3  4
   0  1  2  1  2  5
   1  1  2  4  5  2
   2  1  3  3  2  8
   3  8  6  4  9  6
   4  2  1  1  8  5
   5  3  2  8  5  6
   6  1  5  5  4  6

12. How to find the positions where the values of a series are multiples of 3

ser = pd.Series(np.random.randint(1, 10, 7))
print(ser)
# positions where the value is a multiple of 3
np.argwhere(ser % 3 == 0)
#> 0    6
   1    8
   2    6
   3    7
   4    6
   5    2
   6    4
   dtype: int64
#> array([[0],
          [2],
          [4]])

13. How to extract the items at given positions from a series

ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
index = [0, 4, 8, 14, 20]
# take the elements at the given positions
ser.take(index)
#> 0     a
   4     e
   8     i
   14    o
   20    u
   dtype: object

14. How to stack two series vertically and horizontally

ser1 = pd.Series(range(5))
ser2 = pd.Series(list('abcde'))
# stack vertically (one below the other)
df = pd.concat([ser1, ser2], axis=0)
# stack horizontally (side by side, as columns)
df = pd.concat([ser1, ser2], axis=1)
print(df)
#>    0  1
   0  0  a
   1  1  b
   2  2  c
   3  3  d
   4  4  e

15. How to find the positions in series A of the elements of series B

# ser1 must contain every element of ser2, otherwise the lookups below raise an error
ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])
# method 1
[np.where(i == ser1)[0].tolist()[0] for i in ser2]
# method 2
[pd.Index(ser1).get_loc(i) for i in ser2]
#> [5, 4, 0, 8]

16. How to compute the mean squared error between a truth series and a prediction series

truth = pd.Series(range(10))
pred = pd.Series(range(10)) + np.random.random(10)
# mean squared error
np.mean((truth - pred)**2)
#> 0.25508722434194103

17. How to capitalize the first letter of every element in a series

# the elements of the series are strings
ser = pd.Series(['how', 'to', 'kick', 'ass?'])
# method 1
ser.map(lambda x: x.title())
# method 2: string slicing and concatenation
ser.map(lambda x: x[0].upper() + x[1:])
# method 3
pd.Series([i.title() for i in ser])
#> 0     How
   1      To
   2    Kick
   3    Ass?
   dtype: object

18. How to compute the string length of every element in a series
ser = pd.Series(['how', 'to', 'kick', 'ass?'])
# solution
ser.map(lambda x: len(x))
#> 0    3
   1    2
   2    4
   3    4
   dtype: int64

19. How to compute the first-order and second-order differences of a series

ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])
# first-order differences, converted to a list
print(ser.diff().tolist())
# second-order differences (difference of the differences), converted to a list
print(ser.diff().diff().tolist())
#> [nan, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 8.0]
   [nan, nan, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0]

20. How to convert a series of date strings to a time series

ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])
pd.to_datetime(ser)
#> 0   2010-01-01 00:00:00
   1   2011-02-02 00:00:00
   2   2012-03-03 00:00:00
   3   2013-04-04 00:00:00
   4   2014-05-05 00:00:00
   5   2015-06-06 12:20:00
   dtype: datetime64[ns]

21. How to get the elements of a series that contain at least two vowels

ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])
# solution
from collections import Counter
# Counter is a dict-like type whose keys are the characters and whose values are their counts;
# the mask is True for elements containing at least two vowels
mask = ser.map(lambda x: sum([Counter(x.lower()).get(i, 0) for i in list('aeiou')]) >= 2)
ser[mask]
#> 0     Apple
   1    Orange
   4     Money
   dtype: object

22. How to compute the mean of a series grouped by another series

fruit = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10))
weights = pd.Series(np.linspace(1, 10, 10))
# group weights by fruit
weightsGrouped = weights.groupby(fruit)
print(weightsGrouped.indices)
# mean of each group
weightsGrouped.mean()
#>{'apple': array([0, 3], dtype=int64), 'banana': array([1, 2, 4, 8], dtype=int64), 'carrot': array([5, 6, 7, 9], dtype=int64)}
#> apple     2.50
   banana    4.75
   carrot    7.75
   dtype: float64

23. How to compute the Euclidean distance between two series

p = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
# method 1
sum((p - q)**2)**.5
# method 2
np.linalg.norm(p - q)
#> 18.16590212458495

24. How to find the local maxima (peaks) in a numeric series

A local maximum is a point where the sign of the first difference flips from positive to negative, i.e. where the difference of the signs equals -2.

ser = pd.Series([2, 10, 3, 4, 9, 10, 2, 7, 3])
# difference of the signs of the first difference
dd = np.diff(np.sign(np.diff(ser)))
# positions where this equals -2 are the peaks; shift by 1 to index into the original series
peak_locs = np.where(dd == -2)[0] + 1
peak_locs
#> array([1, 5, 7], dtype=int64)

25. How to replace the spaces in a string with the least frequent character

my_str = 'dbc deb abed gade'
# solution
ser = pd.Series(list('dbc deb abed gade'))
# frequency of each character (including the space)
freq = ser.value_counts()
print(freq)
# the least frequent character
least_freq = freq.dropna().index[-1]
# replace the spaces with it
''.join(ser.replace(' ', least_freq))
#> d    4
        3    # the unlabelled row is the space character
   b    3
   e    3
   a    2
   c    1
   g    1
   dtype: int64
#> 'dbcgdebgabedggade'

26. How to compute the autocorrelations of a numeric series

ser = pd.Series(np.arange(20) + np.random.normal(1, 10, 20))
# autocorrelation of the series for each lag i
autocorrelations = [ser.autocorr(i).round(2) for i in range(11)]
print(autocorrelations[1:])
# lag with the largest absolute autocorrelation
print('Lag having highest correlation: ', np.argmax(np.abs(autocorrelations[1:])) + 1)
#> [0.33, 0.41, 0.48, 0.01, 0.21, 0.16, -0.11, 0.05, 0.34, -0.24]
#> Lag having highest correlation:  3

27. How to do arithmetic on series

# arithmetic between two series
import pandas as pd
series1 = pd.Series([3, 4, 4, 4], ['index1', 'index2', 'index3', 'index4'])
series2 = pd.Series([2, 2, 2, 2], ['index1', 'index2', 'index33', 'index44'])
# addition
series_add = series1 + series2
print(series_add)
# subtraction
series_minus = series1 - series2
# series_minus
# multiplication
series_multi = series1 * series2
# series_multi
# division
series_div = series1 / series2
series_div

Arithmetic between series is index-based: pandas aligns the two series on their index labels before operating, and any label that does not appear in both series gets the value NaN. The results:

# addition:
index1     5.0
index2     6.0
index3     NaN
index33    NaN
index4     NaN
index44    NaN
dtype: float64
# division:
index1     1.5
index2     2.0
index3     NaN
index33    NaN
index4     NaN
index44    NaN
dtype: float64
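If the NaNs produced by non-matching labels are not wanted, the arithmetic methods accept a fill_value argument that substitutes a default for the missing side before operating. A minimal sketch, not part of the original exercise, reusing series1 and series2 from above (treating a missing label as 0 for addition and 1 for multiplication is just this example's choice):

# add, treating a label missing from one series as 0 on that side
print(series1.add(series2, fill_value=0))
# multiply, treating a missing label as 1 on that side
print(series1.mul(series2, fill_value=1))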
The exercises that follow work with dataframes.

1. How to read only the first few rows of a csv file

# read only the first 2 rows and only the specified columns
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv',
                 nrows=2, usecols=['Model', 'Length'])
df

#>      Model  Length
   0  Integra     177
   1   Legend     195

2. How to build a dataframe from every nth row of a csv file

# read the file in chunks of 50 rows and keep the first row of each chunk
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv', chunksize=50)
df2 = pd.DataFrame()
for chunk in df:
    # take the first row of the chunk as a series and append it
    df2 = df2.append(chunk.iloc[0, :])
# show the first 5 rows
print(df2.head())
#>        crim    zn  indus chas                  nox     rm   age  \
   0    0.21977   0.0   6.91    0  0.44799999999999995  5.602  62.0
   1     0.0686   0.0   2.89    0                0.445  7.416  62.5
   2  2.7339700000000002   0.0  19.58    0            0.871  5.597  94.9
   3     0.0315  95.0   1.47    0  0.40299999999999997  6.975  15.3
   4  0.19072999999999998  22.0   5.86    0            0.431  6.718  17.5

         dis  rad  tax  ptratio       b  lstat  medv
   0  6.0877    3  233     17.9  396.90  16.20  19.4
   1  3.4952    2  276     18.0  396.90   6.19  33.2
   2  1.5257    5  403     14.7  351.85  21.45  15.4
   3  7.6534    3  402     17.0  396.90   4.56  34.9
   4  7.8265    7  330     19.1  393.74   6.56  26.2
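An alternative way to sample every 50th row, not part of the original solution: read_csv's skiprows parameter also accepts a callable, so the file can be thinned out directly at read time. A sketch (the exact rows kept differ slightly from the chunk approach, because line 0 of the file is the header):

# keep the header line plus every 50th data line
df3 = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv',
                  skiprows=lambda i: i > 0 and i % 50 != 0)
print(df3.head())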
3. How to change the values of a column while importing a csv file

Change the values of the column 'medv': when the value is <= 25 assign 'Low', and when it is > 25 assign 'High'.

# use the converters parameter to transform the medv column as it is read
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv',
                 converters={'medv': lambda x: 'High' if float(x) > 25 else 'Low'})
print(df.head())
#> (only the last three columns are shown)
          b  lstat  medv
   0  396.90   4.98   Low
   1  396.90   9.14   Low
   2  392.83   4.03  High
   3  394.63   2.94  High
   4  396.90   5.33  High

4. How to import only the specified columns from a csv file

# import only the columns crim and medv
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv', usecols=['crim', 'medv'])
# print the first five rows of the dataframe
print(df.head())
#>       crim  medv
   0  0.00632  24.0
   1  0.02731  21.6
   2  0.02729  34.7
   3  0.03237  33.4
   4  0.06905  36.2

5. How to get the number of rows and columns, the dtype of every column and the summary statistics of a dataframe

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
# number of rows and columns
print(df.shape)
# dtype of each column (first 5 shown)
print(df.dtypes.head())
# count how many columns there are of each dtype, method 1
# (get_dtype_counts is deprecated in newer pandas versions; prefer method 2)
print(df.get_dtype_counts())
# count how many columns there are of each dtype, method 2
# print(df.dtypes.value_counts())
# summary statistics of every numeric column (std, quartiles, etc.)
df_stats = df.describe()
# dataframe to numpy array
df_arr = df.values
# numpy array to nested list
df_list = df.values.tolist()
#> (93, 27)

   Manufacturer     object
   Model            object
   Type             object
   Min.Price       float64
   Price           float64
   dtype: object

   float64    18
   object      9
   dtype: int64

6. How to find the row and column position of the maximum value of the Price column

import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
# print(df)
# row and column indices of the cell equal to the maximum Price
row, col = np.where(df.values == np.max(df.Price))
# retrieve the maximum by row/column position
print(df.iat[row[0], col[0]])
df.iloc[row[0], col[0]]
# retrieve the maximum by row index and column name
df.at[row[0], 'Price']
df.get_value(row[0], 'Price')  # get_value is deprecated in newer pandas; df.at is preferred
#> 61.9

7. How to rename a column of a dataframe

df1 = pd.DataFrame(data=np.array([[18, 50], [19, 51], [20, 55]]),
                   index=['man1', 'man2', 'man3'], columns=['age', 'weight'])
print(df1)
# rename a column
print('\nchange columns :\n')
# method 1: rename returns a new dataframe (assign it back to keep the change)
df1.rename(columns={'weight': 'stress'})
# method 2: modify the column labels in place
df1.columns.values[1] = 'stress'
print(df1)
#>       age  weight
   man1   18      50
   man2   19      51
   man3   20      55
change columns :
         age  stress
   man1   18      50
   man2   19      51
   man3   20      55

8. How to check whether a dataframe has any missing values

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
# returns True if the dataframe contains any missing value
df.isnull().values.any()
#> True

9. How to count the missing values in each column of a dataframe

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
# number of missing values in each column
n_missings_each_col = df.apply(lambda x: x.isnull().sum())
print(n_missings_each_col.head())
#> Manufacturer    4
   Model           1
   Type            3
   Min.Price       7
   Price           2
   dtype: int64

10. How to replace the missing values of a column with the column mean

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', nrows=10)
print(df[['Min.Price', 'Max.Price']].head())
# fill missing values with the mean of each column (the chained assignment also writes the result back into df)
df_out = df[['Min.Price', 'Max.Price']] = df[['Min.Price', 'Max.Price']].apply(lambda x: x.fillna(x.mean()))
print(df_out.head())
#>    Min.Price  Max.Price
   0       12.9       18.8
   1       29.2       38.7
   2       25.9       32.3
   3        NaN       44.6
   4        NaN        NaN
#>    Min.Price  Max.Price
   0       12.9       18.8
   1       29.2       38.7
   2       25.9       32.3
   3       23.0       44.6
   4       23.0       29.9

11. How to pass an external dictionary to apply as an extra argument when processing selected columns

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
print(df[['Min.Price', 'Max.Price']].head())
# dictionary mapping each column to the imputation function to use
d = {'Min.Price': np.nanmean, 'Max.Price': np.nanmedian}
# missing values in Min.Price are replaced with the mean, those in Max.Price with the median
df[['Min.Price', 'Max.Price']] = df[['Min.Price', 'Max.Price']].apply(lambda x, d: x.fillna(d[x.name](x)), args=(d,))
print(df[['Min.Price', 'Max.Price']].head())
#>    Min.Price  Max.Price
   0       12.9       18.8
   1       29.2       38.7
   2       25.9       32.3
   3        NaN       44.6
   4        NaN        NaN

#>    Min.Price  Max.Price
   0  12.900000      18.80
   1  29.200000      38.70
   2  25.900000      32.30
   3  17.118605      44.60
   4  17.118605      19.15

12. How to select specific columns as a dataframe (rather than as a series)

df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))
# print(df)
# selecting with a list of column labels returns a dataframe
type(df[['a']])
type(df.loc[:, ['a']])
print(type(df.iloc[:, [0]]))
# selecting with a single label returns a series
type(df.a)
type(df['a'])
type(df.loc[:, 'a'])
print(type(df.iloc[:, 1]))
#> <class 'pandas.core.frame.DataFrame'>
   <class 'pandas.core.series.Series'>

13. How to swap two columns of a dataframe

df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))
print(df)

# swap the positions of col1 and col2
def switch_columns(df, col1=None, col2=None):
    colnames = df.columns.tolist()
    i1, i2 = colnames.index(col1), colnames.index(col2)
    colnames[i2], colnames[i1] = colnames[i1], colnames[i2]
    return df[colnames]
df1 = switch_columns(df, 'a', 'c')
print(df1)
#>     a   b   c   d   e
   0   0   1   2   3   4
   1   5   6   7   8   9
   2  10  11  12  13  14
   3  15  16  17  18  19
#>     c   b   a   d   e
   0   2   1   0   3   4
   1   7   6   5   8   9
   2  12  11  10  13  14
   3  17  16  15  18  19

14. How to display the values of a dataframe with four decimal places instead of scientific notation

df = pd.DataFrame(np.random.random(4)**10, columns=['random'])
print(df)
# format each row to four decimal places; apply returns the formatted values and does not
# modify df in place, so capture the result if you want to keep it
df.apply(lambda x: '%.4f' % x, axis=1)
print(df)
#>         random
   0  3.539348e-04
   1  3.864140e-10
   2  2.973575e-02
   3  1.414061e-01
#>         random
   0  3.539348e-04
   1  3.864140e-10
   2  2.973575e-02
   3  1.414061e-01

15. How to display all values of a dataframe as percentages

df = pd.DataFrame(np.random.random(4), columns=['random'])
# format as percentages with two decimal places
out = df.style.format({
    'random': '{0:.2%}'.format,
})
out
#>    random
   0  48.54%
   1  91.51%
   2  90.83%
   3  20.45%

16. How to build a dataframe from every nth row of a dataframe

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
# take every 20th row of the dataframe
print(df.iloc[::20, :][['Manufacturer', 'Model', 'Type']])
#>   Manufacturer    Model     Type
   0         Acura  Integra    Small
   20     Chrysler  LeBaron  Compact
   40        Honda  Prelude   Sporty
   60      Mercury   Cougar  Midsize
   80       Subaru   Loyale    Small

17. How to get the row positions of the n largest values of a column

df = pd.DataFrame(np.random.randint(1, 15, 15).reshape(5, -1), columns=list('abc'))
print(df)
# row positions of the 3 largest values of column 'a'
# (the values of the returned series are those row positions)
n = 5
df['a'].argsort()[::-1].iloc[:3]
#>     a   b   c
   0   5   5   2
   1  12   7   1
   2   5   2  12
   3   5  14  12
   4   1  13  13
#> 4    1
   3    3
   2    2
   Name: a, dtype: int64

18. How to get the indices of the last n rows whose row sum is greater than 100

df = pd.DataFrame(np.random.randint(10, 40, 16).reshape(-1, 4))
print(df)
# sum of each row
rowsums = df.apply(np.sum, axis=1)
# indices of the last two rows whose sum exceeds 100
# last_two_rows = df.iloc[np.where(rowsums > 100)[0][-2:], :]
nline = np.where(rowsums > 100)[0][-2:]
nline
#>     0   1   2   3
   0  19  34  15  12
   1  38  35  14  26
   2  39  32  18  20
   3  28  27  36  38
#> array([2, 3], dtype=int64)

19. How to cap the values of a series below the 5th percentile and above the 95th percentile

ser = pd.Series(np.logspace(-2, 2, 30))
# values below the low_perc quantile are set to low, values above the high_perc quantile are set to high
def cap_outliers(ser, low_perc, high_perc):
    low, high = ser.quantile([low_perc, high_perc])
    print(low_perc, '%ile: ', low, '|', high_perc, '%ile: ', high)
    ser[ser < low] = low
    ser[ser > high] = high
    return(ser)
capped_ser = cap_outliers(ser, .05, .95)
#> 0.05 %ile:  0.016049294076965887 | 0.95 %ile:  63.876672220183934

20. How to swap two rows of a dataframe

df = pd.DataFrame(np.arange(9).reshape(3, -1))
print(df)

# helper function
def swap_rows(df, i1, i2):
    a, b = df.iloc[i1, :].copy(), df.iloc[i2, :].copy()
    # swap the two rows through iloc
    df.iloc[i1, :], df.iloc[i2, :] = b, a
    return df
# swap the second and third rows
print(swap_rows(df, 1, 2))
#>    0  1  2
   0  0  1  2
   1  3  4  5
   2  6  7  8
#>    0  1  2
   0  0  1  2
   1  6  7  8
   2  3  4  5

21. How to reverse the rows of a dataframe

df = pd.DataFrame(np.arange(9).reshape(3, -1))
print(df)
# method 1
df.iloc[::-1, :]
# method 2
print(df.loc[df.index[::-1], :])
#>    0  1  2
   0  0  1  2
   1  3  4  5
   2  6  7  8
#>    0  1  2
   2  6  7  8
   1  3  4  5
   0  0  1  2

22. How to one-hot encode a categorical column

df = pd.DataFrame(np.arange(25).reshape(5, -1), columns=list('abcde'))
print(df)
# one-hot encode column 'a' and keep the remaining columns
df_onehot = pd.concat([pd.get_dummies(df['a']), df[list('bcde')]], axis=1)
print(df_onehot)
#>     a   b   c   d   e
   0   0   1   2   3   4
   1   5   6   7   8   9
   2  10  11  12  13  14
   3  15  16  17  18  19
   4  20  21  22  23  24
#>    0  5  10  15  20   b   c   d   e
   0  1  0   0   0   0   1   2   3   4
   1  0  1   0   0   0   6   7   8   9
   2  0  0   1   0   0  11  12  13  14
   3  0  0   0   1   0  16  17  18  19
   4  0  0   0   0   1  21  22  23  24

23. How to find the column that contains the row-wise maximum most often

df = pd.DataFrame(np.random.randint(1, 100, 9).reshape(3, -1))
print(df)
# for each row, find which column holds the maximum, then count how often each column wins
count_series = df.apply(np.argmax, axis=1).value_counts()
print(count_series)
# column that holds the row maximum most often
print('Column with highest row maxes: ', count_series.index[0])
#>     0   1   2
   0  46  31  34
   1  38  13   6
   2   1  18  15
#> (how often each column holds the row-wise maximum)
   0    2
   1    1
   dtype: int64
#> Column with highest row maxes:  0

24. How to get, for each column, its largest correlation with any other column

df = pd.DataFrame(np.random.randint(1, 100, 16).reshape(4, -1), columns=list('pqrs'), index=list('abcd'))
# df
print(df)
# absolute correlation matrix of the four columns
abs_corrmat = np.abs(df.corr())
print(abs_corrmat)
# for each column, the largest correlation with another column
# (the second largest value, since the largest is the column with itself)
max_corr = abs_corrmat.apply(lambda x: sorted(x)[-2])
# show the result
print('Maximum Correlation possible for each column: ', np.round(max_corr.tolist(), 2))
#>     p   q   r   s
   a  59  99   1  34
   b  89  60  97  40
   c  43  35  14   6
   d  70  59  30  53
#>           p         q         r         s
   p  1.000000  0.200375  0.860051  0.744529
   q  0.200375  1.000000  0.236619  0.438541
   r  0.860051  0.236619  1.000000  0.341399
   s  0.744529  0.438541  0.341399  1.000000
#> Maximum Correlation possible for each column:  [0.86 0.44 0.86 0.74]

25. How to create a column holding the ratio of each row's minimum to its maximum

df = pd.DataFrame(np.random.randint(1, 100, 9).reshape(3, -1))
print(df)
# method 1: axis=1 applies the function along each row
min_by_max = df.apply(lambda x: np.min(x)/np.max(x), axis=1)
# method 2
min_by_max = np.min(df, axis=1)/np.max(df, axis=1)
min_by_max
#>     0   1   2
   0  81  68  59
   1  45  73  23
   2  20  22  69
#> 0    0.728395
   1    0.315068
   2    0.289855
   dtype: float64

26. How to create a column holding the second largest value of each row

df = pd.DataFrame(np.random.randint(1, 100, 9).reshape(3, -1))
print(df)
# second largest value of each row
out = df.apply(lambda x: x.sort_values().unique()[-2], axis=1)
# add it as a new column
df['penultimate'] = out
print(df)
#>     0   1   2
   0  28  77   1
   1  43  19  69
   2  29  30  72
#>     0   1   2  penultimate
   0  28  77   1           28
   1  43  19  69           43
   2  29  30  72           30

27. How to normalize all columns of a dataframe

df = pd.DataFrame(np.random.randint(1, 100, 80).reshape(8, -1))
# standardize each column to zero mean and unit variance (z-score)
out1 = df.apply(lambda x: ((x - x.mean())/x.std()).round(2))
print('Solution Q1\n', out1)

# min-max scale each column (note this variant maps the column maximum to 0 and the minimum to 1)
out2 = df.apply(lambda x: ((x.max() - x)/(x.max() - x.min())).round(2))
print('Solution Q2\n', out2)

28. How to compute the correlation between each row and the next row

df = pd.DataFrame(np.random.randint(1, 100, 25).reshape(5, -1))
# correlation of each row with the following row
[df.iloc[i].corr(df.iloc[i+1]).round(2) for i in range(df.shape[0])[:-1]]

29. How to set the main diagonal and the anti-diagonal of a dataframe to 0

df = pd.DataFrame(np.random.randint(1, 100, 25).reshape(5, -1))
print(df)
# zero out both diagonals
for i in range(df.shape[0]):
    df.iat[i, i] = 0
    df.iat[df.shape[0]-i-1, i] = 0
print(df)
#>     0   1   2   3   4
   0  51  35  71  71  79
   1  78  25  71  85  44
   2  90  97  72  14   4
   3  27  91  37  25  48
   4   1  26  68  70  20
#>     0   1   2   3   4
   0   0  35  71  71   0
   1  78   0  71   0  44
   2  90  97   0  14   4
   3  27   0  37   0  48
   4   0  26  68  70   0

30. How to get the mean and the standard deviation of a dataframe grouped by a column

df = pd.DataFrame({'col1': ['apple', 'banana', 'orange'] * 2,
                   'col2': np.random.randint(0, 15, 6),
                   'col3': np.random.randint(0, 15, 6)})
print(df)
# mean of each group defined by col1
df_grouped_mean = df.groupby(['col1']).mean()
print(df_grouped_mean)
# standard deviation of each group defined by col1
df_grouped_std = df.groupby(['col1']).std()
print(df_grouped_std)
#>      col1  col2  col3
   0   apple     2    14
   1  banana    11     8
   2  orange     8    10
   3   apple     5     2
   4  banana     6    12
   5  orange    11    13
#>           col2  col3
   col1
   apple      3.5   8.0
   banana     8.5  10.0
   orange     9.5  11.5
#>               col2      col3
   col1
   apple    2.121320  8.485281
   banana   3.535534  2.828427
   orange   2.121320  2.121320

31. How to get the second largest value of a grouped series

df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 2,
                   'taste': np.random.rand(6),
                   'price': np.random.randint(0, 15, 6)})
print(df)
# group the taste column by fruit
df_grpd = df['taste'].groupby(df.fruit)
# taste values belonging to the banana group
x = df_grpd.get_group('banana')
# sort and take the second largest value
s = x.sort_values().iloc[-2]
print(s)
#>     fruit     taste  price
   0   apple  0.521990      7
   1  banana  0.640444      0
   2  orange  0.460509      9
   3   apple  0.818963      4
   4  banana  0.646138      7
   5  orange  0.917056     12
#> 0.6404436436085967

32. How to compute the grouped mean and keep the grouping column as a regular column

df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 2,
                   'rating': np.random.rand(6),
                   'price': np.random.randint(0, 15, 6)})
# mean price per fruit; as_index=False keeps fruit as a column instead of the index
out = df.groupby('fruit', as_index=False)['price'].mean()
print(out)
#>     fruit  price
   0   apple    4.0
   1  banana    6.5
   2  orange   11.0

33. How to find the positions where the values of two columns are equal

df = pd.DataFrame({'fruit1': np.random.choice(['apple', 'orange', 'banana'], 3),
                   'fruit2': np.random.choice(['apple', 'orange', 'banana'], 3)})
print(df)
# positions of the rows where the two columns are equal
np.where(df.fruit1 == df.fruit2)
#>   fruit1  fruit2
   0   apple  banana
   1   apple   apple
   2  orange   apple
#> (array([1], dtype=int64),)

34. How to create new columns that are shifted (lagged or led) versions of existing columns

df = pd.DataFrame(np.random.randint(1, 100, 20).reshape(-1, 4), columns=list('abcd'))
# column 'a' shifted down by one row (lag)
df['a_lag1'] = df['a'].shift(1)
# column 'b' shifted up by one row (lead)
df['b_lead1'] = df['b'].shift(-1)
print(df)
#>     a   b   c   d  a_lag1  b_lead1
   0  29  90  43  24     NaN     36.0
   1  94  36  67  66    29.0     76.0
   2  81  76  44  49    94.0     97.0
   3  55  97  10  74    81.0     43.0
   4  32  43  62  62    55.0      NaN

35. How to get the frequency of every distinct value in an entire dataframe

df = pd.DataFrame(np.random.randint(1, 10, 20).reshape(-1, 4), columns=list('abcd'))
# flatten the dataframe to a 1-D array and count every value
pd.value_counts(df.values.ravel())
#> 9    3
   7    3
   3    3
   1    3
   6    2
   5    2
   4    2
   8    1
   2    1
   dtype: int64

36. How to split a text column into two separate columns

df = pd.DataFrame(['STD, City State',
                   '33, Kolkata West Bengal',
                   '44, Chennai Tamil Nadu',
                   '40, Hyderabad Telengana',
                   '80, Bangalore Karnataka'], columns=['row'])
print(df)
# expand=True splits each string on the separator into separate columns
df_out = df.row.str.split(',|\t', expand=True)
# use the first row as the new header
new_header = df_out.iloc[0]
# drop that row and assign the header
df_out = df_out[1:]
df_out.columns = new_header
print(df_out)

#>                        row
   0            STD, City State
   1    33, Kolkata West Bengal
   2     44, Chennai Tamil Nadu
   3    40, Hyderabad Telengana
   4    80, Bangalore Karnataka
#> 0 STD           City State
   1  33  Kolkata West Bengal
   2  44   Chennai Tamil Nadu
   3  40  Hyderabad Telengana
   4  80  Bangalore Karnataka

37. How to build a dataframe with a multi-level index

Build the multi-level index from a list of tuples, then define the dataframe with it.

# build the multi-level index from tuples first
import numpy as np
outside = ['A', 'A', 'A', 'B', 'B', 'B']
inside = [1, 2, 3, 1, 2, 3]
my_index = list(zip(outside, inside))
# my_index
# convert the tuples to a pandas MultiIndex
my_index = pd.MultiIndex.from_tuples(my_index)
# my_index
# build the multi-indexed dataframe
df = pd.DataFrame(np.random.randn(6, 2), index=my_index, columns=['fea1', 'fea2'])
df

Accessing data in the multi-indexed dataframe:

df.loc['A'].iloc[1]
#> fea1   -0.794461
   fea2    0.882104
   Name: 2, dtype: float64
df.loc['A'].iloc[1]['fea1']
#> -0.7944609970323794

In machine learning projects, pandas is mainly used for two steps: (1) reading the data files and (2) cleaning and editing the data, a step in which numpy arrays are often used alongside pandas. Hopefully this article gives you a solid start with the library; regular practice is what really matters.
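To close, a minimal sketch tying those two steps together (it reuses the Cars93 dataset from the exercises above; the choice of columns and of mean imputation is only for illustration):

# step 1: read the file
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
# step 2: clean and edit, leaning on numpy where convenient
for col in ['Min.Price', 'Max.Price']:
    df[col] = df[col].fillna(np.nanmean(df[col]))  # impute each numeric column with its mean
X = df[['Min.Price', 'Max.Price']].values          # hand the cleaned columns on as a numpy array
print(X.shape)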
Reference: https://blog.csdn.net/qq_42156420/article/details/82813482