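Overview
pandas provides the following main indexing accessors:
loc: label-based indexing, using the names of rows and columns
iloc: integer-position indexing (absolute position), i.e. the n-th row and m-th column, starting from 0
ix: a hybrid of iloc and loc (deprecated in pandas 0.20 and removed in 1.0; use loc or iloc instead)
at: a fast scalar-access shortcut for loc
iat: a fast scalar-access shortcut for iloc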
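To make the difference between the accessors concrete, here is a minimal sketch. The frame `demo` and its string row labels are hypothetical, chosen only so that label lookups and position lookups are visibly distinct.

```python
import pandas as pd

# Hypothetical frame with string row labels (not part of the original post).
demo = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]},
                    index=["r0", "r1", "r2"])

by_label = demo.loc["r1", "a"]       # label-based: row "r1", column "a"
by_position = demo.iloc[1, 0]        # position-based: second row, first column
scalar_label = demo.at["r1", "a"]    # scalar access by label
scalar_position = demo.iat[1, 0]     # scalar access by position

print(by_label, by_position, scalar_label, scalar_position)
```

All four expressions address the same cell; at and iat only accept a single row and a single column, whereas loc and iloc also accept slices, lists, and boolean masks.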
Create a test dataset:
import numpy as np  # needed by the later examples that call np.arange / np.random
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c'],'c': ["A","B","C"]})
print(df)
a b c
0 1 a A
1 2 b B
2 3 c C
Row operations
Select a single row
print(df.loc[1,:])
a 2
b b
c B
Name: 1, dtype: object
Select multiple rows
print(df.loc[1:2,:])  # select rows 1 through 2 (loc includes both endpoints), step 1
a b c
1 2 b B
2 3 c C
print(df.loc[::-1,:])  # select all rows with step -1, so the result is in reverse order
a b c
2 3 c C
1 2 b B
0 1 a A
print(df.loc[0:2:2,:])  # select rows 0 through 2 with step 2; equivalent to print(df.loc[::2,:]) because there are only 3 rows
a b c
0 1 a A
2 3 c C
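One detail worth checking when slicing: loc includes the end label, while iloc follows the usual Python convention and excludes the end position. A small sketch, rebuilding the test frame under the assumed name `df_demo` so it does not disturb `df`:

```python
import pandas as pd

# Same data as the test frame above, under a separate (hypothetical) name.
df_demo = pd.DataFrame({"a": [1, 2, 3],
                        "b": ["a", "b", "c"],
                        "c": ["A", "B", "C"]})

label_slice = df_demo.loc[1:2, :]       # labels 1 and 2: end label is included
position_slice = df_demo.iloc[1:2, :]   # position 1 only: end position is excluded

print(len(label_slice))      # 2
print(len(position_slice))   # 1
```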
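Conditional filtering
Basic conditional filtering
print(df.loc[:,"a"]>2)  # the comparison is evaluated first and yields a boolean Series, which is then used to filter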
0 False
1 False
2 True
Name: a, dtype: bool
print(df.loc[df.loc[:,"a"]>2,:])
a b c
2 3 c C
Conditions can also be combined with the logical operators | (or), & (and), and ~ (not):
In [129]: s = pd.Series(range(-3, 4))
In [132]: s[(s < -1) | (s > 0.5)]
Out[132]:
0 -3
1 -2
4 1
5 2
6 3
dtype: int64
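The session above only shows |. As a sketch of the other two operators (the name `s_demo` is an assumption, used to avoid overwriting `s`), note that each comparison needs its own parentheses because &, | and ~ bind more tightly than comparison operators:

```python
import pandas as pd

s_demo = pd.Series(range(-3, 4))   # values -3, -2, ..., 3

# & : both conditions must hold
both = s_demo[(s_demo > -2) & (s_demo < 2)]

# ~ : negate a condition (keep everything that is NOT negative)
negated = s_demo[~(s_demo < 0)]

print(both.tolist())      # [-1, 0, 1]
print(negated.tolist())   # [0, 1, 2, 3]
```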
isin
Using isin on the values (not the index)
In [141]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')
In [143]: s.isin([2, 4, 6])
Out[143]:
4 False
3 False
2 True
1 False
0 True
dtype: bool
In [144]: s[s.isin([2, 4, 6])]
Out[144]:
2 2
0 4
dtype: int64
Using isin on the index
In [145]: s[s.index.isin([2, 4, 6])]
Out[145]:
4 0
2 2
dtype: int64
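# compare it to the following (output from an older pandas; since pandas 1.0, indexing with a list that contains a missing label raises KeyError, and s.reindex([2, 4, 6]) gives this NaN-filled result instead)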
In [146]: s[[2, 4, 6]]
Out[146]:
2 2.0
4 0.0
6 NaN
dtype: float64
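A sketch of the two behaviours side by side on current pandas, assuming the same Series as above rebuilt under the hypothetical name `s_demo`: filtering with index.isin silently skips labels that do not exist, while reindex keeps every requested label and fills the missing ones with NaN.

```python
import numpy as np
import pandas as pd

# Index is 4, 3, 2, 1, 0 and the values are 0, 1, 2, 3, 4.
s_demo = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype="int64")

# Keep only rows whose label is in the list; label 6 does not exist and is ignored.
present = s_demo[s_demo.index.isin([2, 4, 6])]

# Ask for exactly these labels; label 6 is missing, so it becomes NaN.
aligned = s_demo.reindex([2, 4, 6])

print(present)
print(aligned)
```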
Combining isin with any()/all() to filter on multiple columns
In [151]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],
.....: 'ids2': ['a', 'n', 'c', 'n']})
.....:
In [156]: values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}
In [157]: row_mask = df.isin(values).all(axis=1)
In [158]: df[row_mask]
Out[158]:
ids ids2 vals
0 a a 1
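To see what all() and any() each select, the following sketch reuses the data from the session above under the assumed name `df_demo`. all(axis=1) keeps rows where every column matches its list; any(axis=1) keeps rows where at least one column matches.

```python
import pandas as pd

df_demo = pd.DataFrame({"vals": [1, 2, 3, 4],
                        "ids": ["a", "b", "f", "n"],
                        "ids2": ["a", "n", "c", "n"]})
values = {"ids": ["a", "b"], "ids2": ["a", "c"], "vals": [1, 3]}

hits = df_demo.isin(values)           # boolean frame, checked column by column

all_rows = df_demo[hits.all(axis=1)]  # every column matched
any_rows = df_demo[hits.any(axis=1)]  # at least one column matched

print(all_rows.index.tolist())   # [0]
print(any_rows.index.tolist())   # [0, 1, 2]
```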
where()
In [1]: dates = pd.date_range('1/1/2000', periods=8)
In [2]: df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
In [3]: df
Out[3]:
A B C D
2000-01-01 0.469112 -0.282863 -1.509059 -1.135632
2000-01-02 1.212112 -0.173215 0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929 1.071804
2000-01-04 0.721555 -0.706771 -1.039575 0.271860
2000-01-05 -0.424972 0.567020 0.276232 -1.087401
2000-01-06 -0.673690 0.113648 -1.478427 0.524988
2000-01-07 0.404705 0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312 0.844885
In [162]: df.where(df < 0, -df)
Out[162]:
A B C D
2000-01-01 -2.104139 -1.309525 -0.485855 -0.245166
2000-01-02 -0.352480 -0.390389 -1.192319 -1.655824
2000-01-03 -0.864883 -0.299674 -0.227870 -0.281059
2000-01-04 -0.846958 -1.222082 -0.600705 -1.233203
2000-01-05 -0.669692 -0.605656 -1.169184 -0.342416
2000-01-06 -0.868584 -0.948458 -2.297780 -0.684718
2000-01-07 -2.670153 -0.114722 -0.168904 -0.048048
2000-01-08 -0.801196 -1.392071 -0.048788 -0.808838
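How DataFrame.where() differs from numpy.where(): df.where(cond, other) keeps the values of df where cond is True and takes other elsewhere, so it corresponds to np.where(cond, df, other):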
In [172]: df.where(df < 0, -df) == np.where(df < 0, df, -df)
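The comparison above can be checked on a small deterministic frame. This is only a sketch; `df_demo` and its values are made up for illustration. The other practical difference is the return type: DataFrame.where returns a DataFrame with its labels, np.where returns a plain NumPy array.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

# Keep negative values, replace the rest with their negation.
pandas_result = df_demo.where(df_demo < 0, -df_demo)

# Same logic with NumPy: condition first, then "value if True", "value if False".
numpy_result = np.where(df_demo < 0, df_demo, -df_demo)

print(type(pandas_result).__name__)   # DataFrame
print(type(numpy_result).__name__)    # ndarray
print((pandas_result.to_numpy() == numpy_result).all())
```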
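When a Series calls where(), the result is a Series with the same shape as the original: entries that fail the condition become NaN instead of being dropped, unlike boolean indexing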
In [141]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')
In [159]: s[s > 0]
Out[159]:
3 1
2 2
1 3
0 4
dtype: int64
In [160]: s.where(s > 0)
Out[160]:
4 NaN
3 1.0
2 2.0
1 3.0
0 4.0
dtype: float64
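where() also accepts a replacement value as its second argument, which avoids the NaN and the resulting float conversion. A brief sketch, assuming the same Series rebuilt as `s_demo`:

```python
import numpy as np
import pandas as pd

s_demo = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype="int64")

filtered = s_demo[s_demo > 0]            # drops the failing row
same_shape = s_demo.where(s_demo > 0)    # keeps the row, value becomes NaN
filled = s_demo.where(s_demo > 0, -1)    # keeps the row, value becomes -1

print(len(filtered), len(same_shape), len(filled))
print(filled.tolist())   # [-1, 1, 2, 3, 4]
```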
Random sampling
DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)
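When sampling with weights, entries that have no assigned weight are given a weight of 0; if the weights do not sum to 1, each weight is divided by the total. random_state sets the seed for the sampling. axis can be set to sample columns at random instead of rows.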
In [105]: df2 = pd.DataFrame({'col1':[9,8,7,6], 'weight_column':[0.5, 0.4, 0.1, 0]})
In [106]: df2.sample(n = 3, weights = 'weight_column')
Out[106]:
col1 weight_column
1 8 0.4
0 9 0.5
2 7 0.1
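The other parameters mentioned above can be sketched as follows, reusing `df2` from the session. The seed value 0 is arbitrary; the point is only that the same seed reproduces the same draw.

```python
import pandas as pd

df2 = pd.DataFrame({"col1": [9, 8, 7, 6],
                    "weight_column": [0.5, 0.4, 0.1, 0]})

# Same seed, same rows.
first = df2.sample(n=2, weights="weight_column", random_state=0)
second = df2.sample(n=2, weights="weight_column", random_state=0)
print(first.equals(second))

# frac draws a fraction of the rows instead of a fixed count.
half = df2.sample(frac=0.5, random_state=0)

# axis=1 samples columns instead of rows.
one_col = df2.sample(n=1, axis=1, random_state=0)
print(half.shape, one_col.shape)
```

The row with weight 0 (col1 == 6) cannot be drawn in the weighted samples.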
Add a row
df.loc[3,:]=4
a b c
0 1.0 a A
1 2.0 b B
2 3.0 c C
3 4.0 4 4
Insert a row
pandas has no built-in method for inserting a row at a given index position, so you have to assemble it yourself:
line = pd.DataFrame({df.columns[0]:"--",df.columns[1]:"--",df.columns[2]:"--"},index=[1])
df = pd.concat([df.loc[:0],line,df.loc[1:]]).reset_index(drop=True)  # df.loc[:0] cannot be written as df.loc[0] here, because df.loc[0] returns a Series
a b c
0 1.0 a A
1 -- -- --
2 2.0 b B
3 3.0 c C
4 4.0 4 4
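The same concat approach can be wrapped in a small helper. This is a sketch only: `insert_row` is a hypothetical function name, not a pandas API, and it splits by position with iloc so it does not depend on the index labels.

```python
import pandas as pd

def insert_row(frame, position, row):
    """Return a new frame with `row` (a dict) inserted before `position`."""
    new_line = pd.DataFrame([row], columns=frame.columns)
    top = frame.iloc[:position]       # rows before the insertion point
    bottom = frame.iloc[position:]    # rows from the insertion point onward
    return pd.concat([top, new_line, bottom]).reset_index(drop=True)

df_demo = pd.DataFrame({"a": [1, 2, 3],
                        "b": ["a", "b", "c"],
                        "c": ["A", "B", "C"]})
result = insert_row(df_demo, 1, {"a": "--", "b": "--", "c": "--"})
print(result)
```

The helper returns a new frame and leaves the input untouched, so the caller decides whether to rebind the name.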
Swap rows
df.loc[[1,2],:]=df.loc[[2,1],:].values
a b c
0 1 a A
1 3 c C
2 2 b B
Delete a row
df.drop(0,axis=0,inplace=True)
print(df)
a b c
1 2 b B
2 3 c C
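Note
In a DataFrame indexed by dates, loc still slices by label rather than by integer position: an integer slice such as dfl.loc[2:3] raises a TypeError, while date-like strings are converted to the index type and slice naturally.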
In [39]: dfl = pd.DataFrame(np.random.randn(5,4), columns=list('ABCD'), index=pd.date_range('20130101',periods=5))
In [40]: dfl
Out[40]:
A B C D
2013-01-01 1.075770 -0.109050 1.643563 -1.469388
2013-01-02 0.357021 -0.674600 -1.776904 -0.968914
2013-01-03 -1.294524 0.413738 0.276662 -0.472035
2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061
2013-01-05 0.895717 0.805244 -1.206412 2.565646
In [41]: dfl.loc['20130102':'20130104']
Out[41]:
A B C D
2013-01-02 0.357021 -0.674600 -1.776904 -0.968914
2013-01-03 -1.294524 0.413738 0.276662 -0.472035
2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061
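If positional slicing is what you want on a date-indexed frame, iloc is the accessor to use. A minimal sketch with deterministic values (np.arange instead of random numbers, an assumption made so the result is reproducible):

```python
import numpy as np
import pandas as pd

dfl = pd.DataFrame(np.arange(20).reshape(5, 4),
                   columns=list("ABCD"),
                   index=pd.date_range("20130101", periods=5))

by_label = dfl.loc["20130102":"20130104"]   # date strings, end date included
by_position = dfl.iloc[1:4]                 # positions 1..3, end position excluded

print(by_label.equals(by_position))
```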
Column operations
Select a single column
print(df.loc[:,"a"])
0 1
1 2
2 3
Name: a, dtype: int64
Select multiple columns
print(df.loc[:,"a":"b"])
a b
0 1 a
1 2 b
2 3 c
Add a column (if the column already exists, this assigns new values to it)
df.loc[:,"d"]=4
a b c d
0 1 a A 4
1 2 b B 4
2 3 c C 4
Swap the values of two columns
df.loc[:,['b', 'a']] = df.loc[:,['a', 'b']].values
print(df)
a b c
0 a 1 A
1 b 2 B
2 c 3 C
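The .values in the swap matters: according to the pandas documentation, loc aligns on column labels before assigning, so the right-hand side has to be converted to a plain array for the values to actually change places. A sketch with two hypothetical integer columns, using to_numpy() (the currently recommended spelling of .values), plus the alternative of merely reordering the columns:

```python
import pandas as pd

df_demo = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# Swap the values: the array carries no labels, so nothing is re-aligned.
swapped = df_demo.copy()
swapped.loc[:, ["x", "y"]] = swapped[["y", "x"]].to_numpy()

# Reorder the columns instead: values stay with their labels.
reordered = df_demo[["y", "x"]]

print(swapped)
print(reordered)
```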
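Delete a column
1) Use del directly: del DF['column_name']
2) Use the drop method, which can be written in the following three ways:
DF = DF.drop('column_name', axis=1)  # pandas 2.0 no longer accepts axis as a positional argument
DF.drop('column_name', axis=1, inplace=True)
DF.drop(DF.columns[[0, 1]], axis=1, inplace=True)  # drop by position: removes the first two columns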
df.drop("a",axis=1,inplace=True)
print(df)
b c
0 a A
1 b B
2 c C
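drop also accepts a columns= keyword, which reads more clearly than axis=1. The sketch below uses the non-inplace form on a copy of the test data named `df_demo` (an assumed name) and shows errors="ignore" for labels that may not exist:

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2, 3],
                        "b": ["a", "b", "c"],
                        "c": ["A", "B", "C"]})

without_a = df_demo.drop(columns="a")                              # by name
without_first_two = df_demo.drop(columns=df_demo.columns[[0, 1]])  # by position
missing_ok = df_demo.drop(columns=["zzz"], errors="ignore")        # no KeyError

print(list(without_a.columns))           # ['b', 'c']
print(list(without_first_two.columns))   # ['c']
print(list(df_demo.columns))             # ['a', 'b', 'c'], original unchanged
```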
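Some other features:
Slicing: df.loc[::,::]
Random sampling: df.sample()
De-duplication: .duplicated() (flags duplicate rows; .drop_duplicates() removes them)
Lookup: .lookup (deprecated in pandas 1.2 and removed in 2.0)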
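A short sketch of the last two items. The frame `df_demo` is made up for illustration, and the NumPy-based pairwise lookup is one possible stand-in for the removed .lookup, not an official replacement.

```python
import pandas as pd

df_demo = pd.DataFrame({"k": ["x", "y", "x"], "v": [1, 2, 1]})

dup_mask = df_demo.duplicated()           # True for rows that repeat an earlier row
unique_rows = df_demo.drop_duplicates()   # keeps the first occurrence of each row

# Pairwise lookup: the cell (row 0, column "v") and the cell (row 2, column "k").
rows = [0, 2]
cols = ["v", "k"]
picked = df_demo.to_numpy()[df_demo.index.get_indexer(rows),
                            df_demo.columns.get_indexer(cols)]

print(dup_mask.tolist())           # [False, False, True]
print(unique_rows.index.tolist())  # [0, 1]
print(picked.tolist())             # [1, 'x']
```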