First public release: a pandas cheat sheet refined over three years of use

禁忌石 2022-03-18


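Introduction: Pandas is a powerful toolkit for analyzing structured data. It is built on NumPy (which provides high-performance matrix operations) and is used for data mining and data analysis; it also provides data-cleaning features.
This article collects everyday usage of the Python data-analysis library Pandas and related tools for quick reference, and is updated continuously.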

Author: 李慶輝

Source: 華章科技

Abbreviations:

df: any Pandas DataFrame object
s: any Pandas Series object
Note: some attributes and methods are available on both df and s

Recommended resources:

Online pandas tutorial
https://www./p/pandas-tutorial
Book: 《深入淺出Pandas：利用Python進行數據處理與分析》

01 Environment Setup

# https://mirrors.tuna./anaconda/archive/
# https://mirrors.tuna./anaconda/miniconda/
# https://docs./en/latest/miniconda.html
# Excel packages: xlrd / openpyxl / xlsxwriter
# Web-page parsing packages: requests / lxml / html5lib / BeautifulSoup4
# Numerical package: scipy
pip install jupyter pandas matplotlib
# If the overseas index is slow, point pip at a mirror in China for faster installs
pip install pandas -i https://pypi.tuna./simple

Managing multiple Python versions with Conda:

# Create a new environment: -n <environment name>, then the Python version
conda create -n py39 python=3.9
# Remove the environment
conda remove -n py39 --all
# Enter (activate) the environment
conda activate py39
# Leave the environment
conda deactivate
# List all virtual environments and show the current one
conda info -e

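02 Jupyter Notebook Shortcuts

Start Jupyter Notebook: jupyter notebook

Shortcuts and what they do:

<tab>: code completion
Shift+Enter: run the current cell and move to the next one, creating it if needed
Shift+Tab (1-3 times): show the documentation for a function or method
D, D: press D twice to delete the current cell
A / B: insert a cell above / below
M / Y: switch the cell to Markdown / code mode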

03 Importing Libraries

import pandas as pd  # 1.4.1 was the latest release at the time of writing (2022-02-12)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

04 Importing Data

# Import from a CSV file, supplying the column names
pd.read_csv('file.csv', names=['col1', 'col2'])
# Import from a delimited text file
pd.read_table(filename, header=0)
# Import from Excel, specifying the sheet and the header row
pd.read_excel('file.xlsx', sheet_name='Sheet1', header=0)
# Import from a SQL table or database
pd.read_sql(query, connection_object)
# Import from a JSON-formatted string
pd.read_json(json_string)
# Parse a URL, string, or HTML file and extract its tables
pd.read_html(url)
# Read the clipboard contents and pass them to read_table()
pd.read_clipboard()
# Build from a dict: keys are column names, values are the data
pd.DataFrame(dict)
# Import from a string
from io import StringIO
pd.read_csv(StringIO(web_data.text))
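As a small illustration of the string import above, the sketch below reads header-less CSV text through StringIO. The sample text and the column names are made up for the example.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV text with no header row
raw = "Alice,90\nBob,85\nCara,77\n"

# names= supplies the column labels because the text has no header line
scores = pd.read_csv(StringIO(raw), names=["name", "score"])
print(scores)
```

The same call works for text fetched over HTTP, since read_csv accepts any file-like object.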

05 Exporting Data

# Export to a CSV file
df.to_csv('filename.csv')
# Export to an Excel file
df.to_excel('filename.xlsx', index=True)
# Export to a SQL table
df.to_sql(table_name, connection_object)
# Export as JSON to a text file
df.to_json(filename)
# Others
df.to_html()  # HTML markup
df.to_markdown()  # Markdown markup
df.to_string()  # formatted string
df.to_latex(index=False)  # LaTeX tabular, longtable
df.to_dict('split')  # dict; other orients: list / series / records / index
df.to_clipboard(sep=',', index=False)  # copy to the system clipboard
# Write two tables to one Excel file, one sheet each
writer = pd.ExcelWriter('new.xlsx')
df_1.to_excel(writer, sheet_name='first', index=False)
df_2.to_excel(writer, sheet_name='second', index=False)
writer.save()  # required, otherwise nothing is written to disk (deprecated since pandas 1.5 in favor of writer.close())
# Alternative: the context manager saves on exit
with pd.ExcelWriter('new.xlsx') as writer:
    df1.to_excel(writer, sheet_name='first')
    df2.to_excel(writer, sheet_name='second')
# Exporting through xlsxwriter supports merged cells, colors, charts, and other customization
# https://xlsxwriter.readthedocs.io/working_with_pandas.html
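To see what a few of these exporters actually return, here is a minimal sketch that writes nothing to disk. The DataFrame contents are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({"team": ["A", "B"], "Q1": [80, 65]})

csv_text = df.to_csv(index=False)   # no path given: the CSV comes back as a string
records = df.to_dict("records")     # one dict per row
split = df.to_dict("split")         # separate index, columns, and data lists

print(csv_text)
print(records)
print(split)
```

The 'records' orient is usually the convenient one for JSON APIs, while 'split' keeps the index and column labels apart from the values.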

06 Creating Test Objects

# A DataFrame of random numbers with 20 rows and 5 columns
pd.DataFrame(np.random.rand(20, 5))
# A Series from the iterable my_list
pd.Series(my_list)
# Add a date index
df.index = pd.date_range('1900/1/30', periods=df.shape[0])
# Random dataset (note: pd.util.testing has been deprecated since pandas 1.0)
df = pd.util.testing.makeDataFrame()
# Random datasets with a period / datetime index
df = pd.util.testing.makePeriodFrame()
df = pd.util.testing.makeTimeDataFrame()
# Random dataset with mixed column types
df = pd.util.testing.makeMixedDataFrame()
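A test frame can also be built from public APIs only. The sketch below is one way to do it, assuming a seeded NumPy generator and arbitrary column names A to E; it is not a replacement the original article prescribes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so reruns give the same numbers
df = pd.DataFrame(rng.random((20, 5)), columns=list("ABCDE"))

# One row per day, starting from the same date used above
df.index = pd.date_range("1900-01-30", periods=df.shape[0])
print(df.head())
```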

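07 Inspection, Checks, Statistics, Attributes

df.head(n)  # first n rows
df.tail(n)  # last n rows
df.sample(n)  # n random sample rows
df.shape  # number of rows and columns
df.info()  # index, dtypes, and memory usage
df.describe()  # summary statistics for numeric columns
df.dtypes  # dtype of each column
df.axes  # row labels and column names
df.mean()  # mean of each column
df.mean(1)  # mean of each row; the axis argument works the same way below
df.corr()  # pairwise correlation between columns
df.count()  # number of non-null values in each column
df.max()  # maximum of each column
df.min()  # minimum of each column
df.median()  # median of each column
df.std()  # standard deviation of each column
df.var()  # variance
s.mode()  # mode
s.prod()  # product of all values
s.cumprod()  # cumulative product
df.cumsum(axis=0)  # cumulative sum
s.nunique()  # number of distinct values
df.idxmax()  # index label of each column's maximum
df.idxmin()  # index label of each column's minimum
df.columns  # all column names
df.team.unique()  # distinct values in a column
# Unique values of a Series and their counts; pass normalize=True for proportions
s.value_counts(dropna=False)
# Unique values and counts for every column of a DataFrame
df.apply(pd.Series.value_counts)
df.duplicated()  # flag duplicate rows
df.drop_duplicates()  # drop duplicate rows
# Display settings; see also set_option, reset_option, describe_option
pd.get_option('display.max_rows')
# Maximum number of rows / columns to display; None means no limit
pd.options.display.max_rows = None
pd.options.display.max_columns = None
df.col.argmin()  # integer position of the minimum (.argmax() for the maximum)
df.col.idxmin()  # index label of the minimum (.idxmax() for the maximum)
# Cumulative statistics
ds.cumsum()  # running sum
ds.cumprod()  # running product
ds.cummax()  # running maximum
ds.cummin()  # running minimum
# Window (rolling) calculations
ds.rolling(x).sum()  # sum over each window of x adjacent elements
ds.rolling(x).mean()  # arithmetic mean over each window of x adjacent elements
ds.rolling(x).var()  # variance over each window of x adjacent elements
ds.rolling(x).std()  # standard deviation over each window of x adjacent elements
ds.rolling(x).min()  # minimum over each window of x adjacent elements
ds.rolling(x).max()  # maximum over each window of x adjacent elements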
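The cumulative and rolling helpers, and the difference between argmin and idxmin, are easiest to see on tiny inputs. The values below are invented for illustration.

```python
import pandas as pd

ds = pd.Series([1, 3, 2, 5, 4])

running_sum = ds.cumsum()         # 1, 4, 6, 11, 15
running_max = ds.cummax()         # 1, 3, 3, 5, 5
window_sum = ds.rolling(2).sum()  # NaN, 4, 5, 7, 9 (the first window is incomplete)

# argmin gives the integer position, idxmin gives the index label
s = pd.Series([30, 10, 20], index=["a", "b", "c"])
position = s.argmin()  # 1
label = s.idxmin()     # 'b'
```

The leading NaN from rolling can be avoided with min_periods=1 when partial windows are acceptable.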

08 Data Cleaning

df.columns = ['a', 'b', 'c']  # rename all columns
df.columns = df.columns.str.replace(' ', '_')  # replace spaces in column names with underscores
df.loc[df.AAA >= 5, ['BBB', 'CCC']] = 555  # replace values
df['pf'] = df.site_id.map({2: 'mini program', 7: 'M site'})  # map codes to names
pd.isnull(df)  # check for nulls; returns a Boolean array
pd.notnull(df)  # check for non-nulls; returns a Boolean array
df.drop(['name'], axis=1)  # drop a column
df.drop([0, 10], axis=0)  # drop rows
del df['name']  # drop a column in place
df.dropna()  # drop every row that contains a null
df.dropna(axis=1)  # drop every column that contains a null
df.dropna(axis=1, thresh=n)  # drop every column with fewer than n non-null values
df.fillna(x)  # replace all nulls with x
df.fillna(value={'prov': 'unknown'})  # fill nulls in the given columns only
s.astype(float)  # cast a Series to float
df.index.astype('datetime64[ns]')  # convert the index to datetime
s.replace(1, 'one')  # replace every 1 with 'one'
s.replace([1, 3], ['one', 'three'])  # replace 1 with 'one' and 3 with 'three'
df.rename(columns=lambda x: x + 1)  # rename columns in bulk (here, numeric column labels)
df.rename(columns={'old_name': 'new_name'})  # rename selected columns
df.set_index('column_one')  # change the index column
df.rename(index=lambda x: x + 1)  # rename index labels in bulk
# Rename the header
df.columns = ['UID', 'pending_payout', 'verified_name']
df['withdrawal_account_set'] = df['status']  # copy a column
df.loc[:, ::-1]  # reverse the column order
df.loc[::-1]  # reverse the row order; the line below also rebuilds the index
df.loc[::-1].reset_index(drop=True)
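The thresh argument of dropna is easy to misread, so here is a minimal sketch of a cleaning pass. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "user id": [1, 2, 3],
    "prov": ["Beijing", None, "Shanghai"],
    "mostly_empty": [np.nan, np.nan, 1.0],
})

df.columns = df.columns.str.replace(" ", "_")    # "user id" -> "user_id"
kept = df.dropna(axis=1, thresh=2)               # keeps columns with at least 2 non-null values
filled = kept.fillna(value={"prov": "unknown"})  # fills nulls in the "prov" column only
print(filled)
```

With axis=1 the threshold counts non-null values per column, so "mostly_empty" (one value) is dropped while "prov" (two values) survives.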

09 Data Processing: Filter, Sort

# Round to a number of decimals (round half to even)
df.round(2)  # all columns
df.round({'A': 1, 'C': 2})  # selected columns
df['Name'] = df.Name  # two ways to access a column
df[df.index == 'Jude']  # filter on the index through .index
df[df[col] > 0.5]  # rows where column col is greater than 0.5
# Multiple conditions
df[(df['team'] == 'A') & (df['Q1'] > 80) & df.utype.isin(['returning customer', 'returning visitor'])]
# Rows where a field is null
df[df.order.isnull()]
# Like SQL WHERE ... IN (isin takes a list)
df[df.team.isin(['A', 'B'])]
df[(df.team == 'B') & (df.Q1 == 17)]
df[~(df['team'] == 'A') | (df['Q1'] > 80)]  # not, or
df[df.Name.str.contains('張')]  # substring match
df.sort_values(col1)  # sort by col1, ascending by default
df.col1.sort_values()  # same, but returns a Series
df.sort_values(col2, ascending=False)  # sort by col2, descending
# Sort by col1 ascending, then by col2 descending
df.sort_values([col1, col2], ascending=[True, False])
df2 = pd.get_dummies(df, prefix='t_')  # one-hot encode categorical columns into new columns
df.set_index(col1).plot()  # plot with col1 as the x axis
# MultiIndex handling
dd.set_index(['utype', 'site_id', 'p_day'], inplace=True)
dd.sort_index(inplace=True)  # sort by index
dd.loc['new visitor', 2, '2019-06-22'].plot.barh()  # index levels are passed to loc in order
# First 100 rows; a single row cannot be selected this way, e.g. df[100]
df[:100]
# Select only the given columns
df1 = df.loc[0:, ['designer_id', 'name']]
# Bin ages into 5 intervals with the given edges and labels
ages = np.array([1, 5, 10, 40, 36, 12, 58, 62, 77, 89, 100, 18, 20, 25, 30, 32])
pd.cut(ages, [0, 5, 20, 30, 50, 100], labels=['infant', 'youth', 'middle-aged', 'prime', 'senior'])
daily_index.difference(df_work_day.index)  # index labels in the first that are missing from the second
# Conversion to plain Python objects
df.index.name  # name of the index (str)
df.columns.tolist()
df.values.tolist()
df.population.values.tolist()
data.apply(np.mean)  # apply np.mean to each column
data.apply(np.max, axis=1)  # apply np.max to each row
df.insert(1, 'three', 12, allow_duplicates=False)  # insert a column (position, name, [value])
df.pop('class')  # remove a column and return it
# Append a row (DataFrame.append was removed in pandas 2.0; use pd.concat there)
df.append(pd.DataFrame({'one': 2, 'two': 3, 'three': 4.4}, index=['f']), sort=True)
# Add new columns
iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength']).head()
df.assign(rate=lambda df: df.orders / df.uv)
# shift moves the data by a number of periods
df['growth'] = df['GDP'] - df['GDP'].shift(-1)
df.tshift(1)  # shift the time index by periods (deprecated since pandas 1.1; df.shift(1, freq=...) does the same)
# Same as above: diff shifts the data and subtracts it from the original, i.e. df - df.shift(n)
df['growth'] = df['GDP'].diff(-1)
# Retention data: divide each row by its maximum, which is usually the total pool
df.apply(lambda x: x / x.max(), axis=1)
# For the row labeled by 'name', take the value from the column named in 'best' (lookup is deprecated since pandas 1.2)
df['value'] = df.lookup(df['name'], df['best'])
s.where(s > 1, 10)  # keep values that meet the condition, replace the rest with 10 (NaN if omitted)
s.mask(s > 0)  # replace values that meet the condition with NaN, keep the rest
# Add 1 to every value (likewise for the other arithmetic operations)
df + 1  # same as df.add(1)
# pipe: chain function calls, f(df) == df.pipe(f)
def gb(df, by):
    result = df.copy()
    result = result.groupby(by).sum()
    return result
# Call it
df.pipe(gb, by='team')
# Rolling window; a time-based window such as '2s' means two seconds
df.rolling(2).sum()
# Expanding window: everything up to the current row, requiring at least 2 observations
df.expanding(2).sum()
# Clip values outside the bounds to the bounds
df.clip(-4, 6)
# Add column C as the sum of columns A and B
df['C'] = df.eval('A+B')
# Same effect
df.eval('C = A + B', inplace=True)
# Percentage change of a series
s.pct_change(periods=2)
# Quantile; can also give the midpoint of datetimes
df.quantile(.5)
# Rank; methods: average, min, max, first, dense (default: average)
s.rank()
# Explode: expand list-like values in a column into rows, repeating the other columns
df.explode('A')
# Map status codes to names
status = {0: 'not started', 1: 'running', 2: 'finished', 3: 'failed'}
df['taskStatus'] = df['taskStatus'].apply(status.get)
df.assign(amount=0)  # add a field
df.loc[('bar', 'two'), 'A']  # MultiIndex lookup
df.query('i0 == "b" & i1 == "b"')  # MultiIndex lookup, method 2
# All distinct values at a given level of a MultiIndex
df.index.get_level_values(2).unique()
# Drop zero decimals, 12.00 -> 12
df.astype('str').applymap(lambda x: x.replace('.00', ''))
# Insert data: add a 'double' column at position 3
df.insert(3, 'double', df['value'] * 2)
# Map categories
df['gender'] = df.gender.map({'male': '男', 'female': '女'})
# Add a column holding each row's sum
df['Col_sum'] = df.apply(lambda x: x.sum(), axis=1)
# Sum over selected columns
col_list = list(df)[2:]  # the leave-date columns
df['total_days'] = df[col_list].sum(axis=1)  # total days of leave
# Sum each column into a summary row
df.loc['col_sum'] = df.apply(lambda x: x.sum())
# Reorder rows to follow a given list
df.reindex(order_list)
# Select and order the given columns
df.reindex(['col_1', 'col_5'], axis='columns')
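A few of the filters and transforms above behave in ways that are worth checking on small data: isin needs a list, pd.cut intervals are closed on the right by default, and diff subtracts the previous value. The frame below is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cal", "Dee"],
    "team": ["A", "B", "C", "A"],
    "Q1": [90, 17, 60, 85],
})

# isin takes a list-like; wrap each comparison in parentheses before combining with &
picked = df[df.team.isin(["A", "B"]) & (df.Q1 > 80)]
ranked = picked.sort_values("Q1", ascending=False)

# Explicit edges give the intervals (0, 60] and (60, 100]
bands = pd.cut(df.Q1, [0, 60, 100], labels=["low", "high"])

# diff() is each value minus the previous one, i.e. s - s.shift(1)
change = df.Q1.diff()
```

Here the score 60 lands in "low" because the upper edge of each interval is included.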

10 Selecting Data

df[col]  # a column by name, returned as a Series
df[[col1, col2]]  # several columns, returned as a DataFrame
df.loc[df['team'] == 'B', ['name']]  # filter by condition, show only the name column
s.iloc[0]  # select by position
s.loc['index_one']  # select by index label
df.loc[0, 'A':'B']  # row 0, columns A through B
df.loc[2018:1990, 'primary_industry_value_added':'tertiary_industry_value_added']
df.loc[0, ['A', 'B']]  # df.loc[row labels, columns]
df.iloc[0, :]  # first row; iloc accepts integer positions only
df.iloc[0, 0]  # first element of the first column
dc.query('site_id > 8 and utype == "returning customer"').head()  # accepts and / or as well as & / |
# Iterating over rows
for idx, row in df.iterrows():
    row['id']
# Processing each element while iterating
df.loc[i, 'link'] = f'http://www./p/{slug}.html'
for i in df.Name:
    print(i)  # iterate over one column
# Iterate by column: (column name, column data as a Series)
for label, content in df.items():
    print(label, content)
# Iterate by row: each row, index included, as a tuple-like object; fields can be read with row[2]
for row in df.itertuples():
    print(row)
df.at[2018, 'population']  # a single element by row and column labels
df.iat[1, 2]  # a single element by row and column positions
s.nlargest(5).nsmallest(2)  # the 2 smallest of the 5 largest values
df.nlargest(3, ['population', 'GDP'])
df.take([0, 3])  # rows at the given positions
# Cut off rows outside the given range; supports date index labels
ds.truncate(before=2, after=4)
# Take one DataFrame column as a Series
df.iloc[:, 0]
float(str(val).rstrip('%'))  # percentage string to number
df.reset_index(inplace=True)  # reset the index
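The label-based and position-based accessors differ mainly in how slices end. A minimal sketch, using an invented frame indexed by year:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
    index=[2018, 2019, 2020],
)

by_label = df.loc[2019, "A":"B"]  # label slice includes both ends: A and B
by_position = df.iloc[1, 0:2]     # position slice excludes the end: columns 0 and 1
one_by_label = df.at[2020, "C"]   # single value by labels
one_by_position = df.iat[2, 2]    # single value by positions
```

Both slices return the same two values here, and at / iat are the scalar counterparts of loc / iloc.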

11 Data Processing: GroupBy and Pivot

df.groupby(col)  # GroupBy object grouped by col
df.groupby([col1, col2])  # GroupBy object grouped by several columns
df.groupby(col1)[col2].mean()  # mean of col2 within each col1 group
# Pivot table grouped by col1 with the maximum of col2 and col3
df.pivot_table(index=col1, values=[col2, col3], aggfunc=max)
# Same idea, with several keys and aggregations
df.pivot_table(index=['site_id', 'utype'], values=['uv_all', 'regist_num'], aggfunc=['max', 'mean'])
df.groupby(col1).agg(np.mean)  # mean of every column within each col1 group
# Unpivot: turn the other columns into rows
pd.melt(df, id_vars=['day'], var_name='city', value_name='temperature')
# A cross tabulation is a special pivot table that counts group frequencies
pd.crosstab(df.Nationality, df.Handedness)
# Sort after groupby and keep a fixed number of items per group
(
    df[(df.p_day >= '20190101')]
    .groupby(['p_day', 'name'])
    .agg({'uv': sum})
    .sort_values(['p_day', 'uv'], ascending=[False, False])
    .groupby(level=0).head(5)  # top 5 pages per day
    .unstack()
    .plot()
)
# Take the first row of each group (also max, min, last, and size for the count)
df.groupby('settlement_type').first()
# Group the detail rows and aggregate; options include 'max', 'mean', 'median',
# 'prod', 'sum', 'std', 'var', 'nunique' (the distinct count)
df1 = df.groupby(by='designer_id').agg({'settlement_amount': sum})
df.groupby(by=df.pf).ip.nunique()  # groupby distinct: distinct count per group
df.groupby(by=df.pf).ip.value_counts()  # distinct values and their counts per group
df.groupby('name').agg(['sum', 'median', 'count'])
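To make the shapes of these results concrete, the sketch below runs a grouped mean, a pivot table, and a melt on a four-row frame. All names and numbers are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "name": ["Ann", "Dee", "Ben", "Eve"],
    "Q1": [90, 85, 17, 60],
    "Q2": [70, 95, 40, 50],
})

# One value per team
means = df.groupby("team")["Q1"].mean()

# One row per team, one column per value field
best = df.pivot_table(index="team", values=["Q1", "Q2"], aggfunc="max")

# Wide to long: one row per (name, quarter) pair
long = pd.melt(df, id_vars=["name"], value_vars=["Q1", "Q2"],
               var_name="quarter", value_name="score")
```

The melted frame has eight rows (four names times two quarters), which is the layout most plotting and groupby code expects.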

12 Combining Data

# Concatenating rows
# Append the rows of df2 to the end of df1 (DataFrame.append was removed in pandas 2.0; pd.concat([df1, df2]) replaces it)
df1.append(df2)
# Stack several columns into one new column of a new table
ndf = (df['nominee_1']
       .append(df['nominee_2'], ignore_index=True)
       .append(df['nominee_3'], ignore_index=True))
ndf = pd.DataFrame(ndf, columns=['name'])
# Append the columns of df2 to the right of df1
pd.concat([df1, df2], axis=1)
# Combine the rows of several files
df1 = pd.read_csv('111.csv', sep='\t')
df2 = pd.read_csv('222.csv', sep='\t')
excel_list = [df1, df2]
# result = pd.concat(excel_list).fillna('')[:].astype('str')
result = pd.concat(excel_list)
result.to_excel('333.xlsx', index=False)
# Combine every Excel (or CSV) file in a directory
import glob
files = glob.glob('data/cs/*.xls')
dflist = []
for i in files:
    dflist.append(pd.read_excel(i, usecols=['ID', 'time', 'name']))
df = pd.concat(dflist)
# Combining by adding columns
# SQL-style join of df1's columns with df2's columns
df1.join(df2, on=col1, how='inner')
# Merge two tables on a key
df_all = pd.merge(df_sku, df_spu, how='left', left_on=df_sku['product_id'], right_on=df_spu['p.product_id'])
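As a sketch of the two most common combinations, the example below stacks rows with pd.concat and then performs a left merge on a shared key. The table and column names are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({"product_id": [1, 2], "sku": ["s1", "s2"]})
df2 = pd.DataFrame({"product_id": [3], "sku": ["s3"]})
spu = pd.DataFrame({"product_id": [1, 3], "title": ["pen", "cup"]})

# Row-wise stacking; ignore_index=True renumbers the rows 0..n-1
stacked = pd.concat([df1, df2], ignore_index=True)

# Left join on the shared key column; rows without a match get NaN
merged = pd.merge(stacked, spu, how="left", on="product_id")
print(merged)
```

pd.concat works on every pandas version, which makes it a safe choice over DataFrame.append for new code.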

13 Time Handling and Time Series

# Datetime index
df.index = pd.DatetimeIndex(df.index)
# Keep only the date part
df['date'] = df['time'].dt.date
# Parse a field as datetime
df['date'] = pd.to_datetime(df['time'])
# Convert to Beijing time
df['time'] = df['time'].dt.tz_convert('Asia/Shanghai')
# Format as a string; precision below one second is lost
df['time'] = df['time'].dt.strftime('%Y-%m-%d %H:%M:%S')
dc.index = pd.to_datetime(dc.index, format='%Y%m%d', errors='ignore')
# A time offset for use in arithmetic
pd.DateOffset(days=2)
# Current time
pd.Timestamp.now()
pd.to_datetime('today')
# Check whether a time falls in the current year / on the current day
pd.Timestamp.today().year == df.start_work.dt.year
df.time.astype('datetime64[ns]').dt.date == pd.to_datetime('today').date()
# Define a number of days
import datetime
days = lambda x: datetime.timedelta(days=x)
days(2)
# Same, using the pandas wrapper
pd.Timedelta(days=2)
# Unix timestamp
pd.to_datetime(ted.film_date, unit='ms')
# Resample by month (also Y, M, D, H, min, S) and sum
df.set_index('date').resample('M')['quantity'].sum()
df.set_index('date').groupby('name')['ext price'].resample('M').sum()
# Aggregate by day; the index is of datetime type
df.groupby(by=df.index.date).agg({'uu': 'count'})
# Aggregate by day of the week
df.groupby(by=df.index.weekday).uu.count()
# Aggregate by month
df.groupby(['name', pd.Grouper(key='date', freq='M')])['ext price'].sum()
# Aggregate by month
df.groupby(pd.Grouper(key='day', freq='1M')).sum()
# Sum ext price per year, with the year ending on the last day of December
df.groupby(['name', pd.Grouper(key='date', freq='A-DEC')])['ext price'].sum()
# Resample to the monthly mean
df['Close'].resample('M').mean()
# https://pandas./pandas-docs/stable/user_guide/timeseries.html#offset-aliases
# A date range restricted to business days
rng = pd.date_range(start='6/1/2016', end='6/30/2016', freq='B')
# Change the data frequency, filling new slots with the given method
df.asfreq('D', method='pad')
# Time zones
df.tz_convert('Europe/Berlin')
df.time.dt.tz_localize(tz='Asia/Shanghai')
# Convert to Beijing time
df['Time'] = df['Time'].dt.tz_localize('UTC').dt.tz_convert('Asia/Shanghai')
# List all time zones
from pytz import all_timezones
print(all_timezones)
# Duration: the interval between two times
df['duration'] = pd.to_datetime(df['end']) - pd.to_datetime(df['begin'])
# Compare against a specific time
df.Time.astype('datetime64[ns]') < pd.to_datetime('2019-12-11 20:00:00', format='%Y-%m-%d %H:%M:%S')
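The following sketch ties together monthly resampling, time-zone conversion, and durations on three invented rows. It uses the 'MS' (month start) alias rather than 'M' as a deliberate assumption, because newer pandas releases are renaming the month-end alias.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-05", "2022-01-20", "2022-02-03"]),
    "quantity": [2, 3, 5],
})

# 'MS' buckets by calendar month and labels each bucket with the month's first day
monthly = df.set_index("date").resample("MS")["quantity"].sum()

# Treat the naive timestamps as UTC, then convert to Beijing time (UTC+8)
local = df["date"].dt.tz_localize("UTC").dt.tz_convert("Asia/Shanghai")

# Subtracting two timestamps gives a Timedelta
span = df["date"].max() - df["date"].min()
```

tz_localize attaches a zone to naive timestamps without shifting them, while tz_convert shifts already-localized timestamps into another zone.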

14 Handy Notes

# Avoid scientific notation by converting everything to strings
df = pd.read_csv('111.csv', sep='\t').fillna('')[:].astype('str')
# Correlation with order volume, largest to smallest
dd.corr().total_order_num.sort_values(ascending=False)
# Parse strings that hold lists or JSON-like data
import ast
ast.literal_eval("[{'id': 7, 'name': 'Funny'}]")
# Series.apply applies a function to every element of a Series and returns a Series
ted.ratings.apply(str_to_list).head()
# A lambda is a shorter alternative
ted.ratings.apply(lambda x: ast.literal_eval(x))
# Shorter still: pass the function directly, without a lambda
ted.ratings.apply(ast.literal_eval)
# Use apply() on the index
df.index.to_series().apply()
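Here is a minimal, self-contained version of the literal_eval pattern. The Series below is a hypothetical stand-in for a column such as ted.ratings whose cells hold Python-literal text.

```python
import ast

import pandas as pd

# Hypothetical column: each cell is a string containing a Python list literal
ratings = pd.Series(["[{'id': 7, 'name': 'Funny'}]", "[]"])

# Each cell becomes a real list of dicts
parsed = ratings.apply(ast.literal_eval)
first_name = parsed.iloc[0][0]["name"]
print(first_name)
```

ast.literal_eval only evaluates literals, so it is a safer choice than eval for text that comes from a file or the web.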

15 Styled Display

# https:///styling-pandas.html
df['per_cost'] = df['per_cost'].map('{:,.2f}%'.format)  # show as a percentage
# Background color for cells greater than 0 in the given columns
df.style.applymap(lambda x: 'background-color: grey' if x > 0 else '', subset=pd.IndexSlice[:, ['B', 'C']])
# Background color for the maximum and minimum
df.style.highlight_max(color='lightgreen').highlight_min(color='#cd4f39')
df.style.format('{:.2%}', subset=pd.IndexSlice[:, ['B']])  # show a percent sign
# Per-column formats
format_dict = {'sum': '${0:,.0f}',
               'date': '{:%Y-%m}',
               'pct_of_total': '{:.2%}',
               'c': str.upper}
# Apply everything in one chain
(df.style.format(format_dict)  # several formats at once
    .hide_index()  # deprecated since pandas 1.4 in favor of .hide(axis='index')
    # Color depth shows magnitude in the given column; cmap is a matplotlib colormap
    .background_gradient(subset=['sum_num'], cmap='BuGn')
    # In-cell horizontal bar showing magnitude
    .bar(color='#FFA07A', vmin=100_000, subset=['sum'], align='zero')
    # In-cell horizontal bar showing magnitude
    .bar(color='lightgreen', vmin=0, subset=['pct_of_total'], align='zero')
    # Red for a decrease (below 0), green for an increase
    .bar(color=['#ffe4e4', '#bbf9ce'], vmin=0, vmax=1, subset=['growth_rate'], align='zero')
    # Give the styled table a caption
    .set_caption('2018 Sales Performance'))
# Background color for a whole row, by condition
def background_color(row):
    if row.pv_num >= 10000:
        return ['background-color: red'] * len(row)
    elif row.pv_num >= 100:
        return ['background-color: yellow'] * len(row)
    return [''] * len(row)
# Usage
df.style.apply(background_color, axis=1)

16 Histograms and Sparklines in Tables

import sparklines
import numpy as np
def sparkline_str(x):
    bins = np.histogram(x)[0]
    sl = ''.join(sparklines.sparklines(bins))
    return sl
sparkline_str.__name__ = 'sparkline'
# Draw the trend, rounded to two decimals
df.groupby('name')[['quantity', 'ext price']].agg(['mean', sparkline_str]).round(2)
# Sparkline images
# https://hugoworld.wordpress.com/2019/01/26/sparklines-in-jupyter-notebooks-ipython-and-pandas/
def sparkline(data, figsize=(4, 0.25), **kwargs):
    '''Creates a sparkline.'''
    # Turn off the max column width so the images won't be truncated
    pd.set_option('display.max_colwidth', None)
    # Turning off the max column will display all the data;
    # if gathering into sets / arrays we might want to restrict to a few items
    pd.set_option('display.max_seq_items', 3)
    # Monkey patch the DataFrame so the sparklines are displayed
    pd.DataFrame._repr_html_ = lambda self: self.to_html(escape=False)
    from matplotlib import pyplot as plt
    import base64
    from io import BytesIO
    data = list(data)
    *_, ax = plt.subplots(1, 1, figsize=figsize, **kwargs)
    ax.plot(data)
    ax.fill_between(range(len(data)), data, len(data) * [min(data)], alpha=0.1)
    ax.set_axis_off()
    img = BytesIO()
    plt.savefig(img)
    plt.close()
    return '<img src="data:image/png;base64, {}" />'.format(base64.b64encode(img.getvalue()).decode())
# Usage
df.groupby('name')[['quantity', 'ext price']].agg(['mean', sparkline])
df.apply(sparkline, axis=1)  # draws across a row only; transpose with .T for columns

17 Visualization

kind : str

'line' : line plot (default)
'bar' : vertical bar plot
'barh' : horizontal bar plot
'hist' : histogram
'box' : boxplot
'kde' : Kernel Density Estimation plot
'density' : same as 'kde'
'area' : area plot
'pie' : pie plot

Common usage:

df88.plot.bar(y='rate', figsize=(20, 10))  # figure size in inches
df_1[df_1.p_day > '2019-06-01'].plot.bar(x='p_day', y=['total_order_num', 'order_user'], figsize=(16, 6))  # bar chart
# One line per site showing each site's home_remain; stack means to pile up,
# so unstack means "do not pile up"
(df[(df.p_day >= '2019-05-1') & (df.utype == 'returning customer')]
    .groupby(['p_day', 'site_id'])['home_remain'].sum()
    .unstack().plot.line())
# Line chart with several lines; the x axis defaults to the index
dd.plot.line(x='p_day', y=['uv_all', 'home_remain'])
dd.loc['new visitor', 2].plot.scatter(x='order_user', y='paid_order_user')  # scatter plot
dd.plot.bar(color='blue')  # bar chart; barh is the horizontal version
sns.heatmap(dd.corr())  # visualize correlations
# Start the axis at 0; give a range with ylim=(0, 100); the x axis works the same way
s.plot.line(ylim=0)
# Line colors: https:///examples/color/named_colors.html
# Line styles: '-', '--', '-.', ':'
# Line markers: https:///api/markers_api.html
# grid=True shows grid lines, etc.: https:///api/_as_gen/matplotlib.pyplot.plot.html
s.plot.line(color='green', linestyle='-', marker='o')
# Draw two plots together
[df['quantity'].plot.kde(), df['quantity'].plot.hist()]
# Color the table cells by value
import seaborn as sns
cm = sns.light_palette('green', as_cmap=True)
df.style.background_gradient(cmap=cm, axis=1)
# Convert the data to a two-dimensional list
[i for i in zip([i.strftime('%Y-%m-%d') for i in s.index.to_list()], s.to_list())]
# Same usage as plot: https://hvplot./user_guide/Plotting.html
import hvplot.pandas
# Print the SQLite CREATE TABLE statement
print(pd.io.sql.get_schema(fdf, 'table_name'))

18 Jupyter Notebook Issues

# Plot configuration for Jupyter notebooks
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (15.0, 8.0)  # fixed display size
plt.rcParams['font.family'] = ['sans-serif']  # fixes Chinese text rendering
plt.rcParams['font.sans-serif'] = ['SimHei']  # fixes Chinese text rendering
# Show the value of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'  # the default is 'last_expr'
# Make the notebook page use the full width
from IPython.core.display import display, HTML
display(HTML('<style>.container { width:100% !important; }</style>'))
# White background: <style>#notebook_panel {background: #ffffff;}</style>
# Embed a page in a notebook
from IPython.display import IFrame
IFrame('https:///pdf/1406.2661.pdf', width=800, height=450)
# Markdown: one cell does not support several pasted images
# Fix for a file showing only one image when printed or opened:
# in /site-packages/notebook/static/notebook/js/main.min.js, at `var key`,
# lines 33502 and 33504
key = utils.uuid().slice(2,6)+encodeURIandParens(blob.name);
key = utils.uuid().slice(2,6)+Object.keys(that.attachments).length;
# https://github.com/ihnorton/notebook/commit/55687c2dc08817da587977cb6f19f8cc0103bab1
# Run a shell command: ! <command>
# Online visualization tool: https:///create

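19 Slideshow

Install the RISE library: pip install RISE

[Alt+r] start / exit the slideshow
"," (comma) hides the two large buttons on the left, "t" shows the slide overview, "/" blacks out the screen
Slide: a main page; switch with the left and right arrow keys.
Sub-Slide: a secondary page; switch with the up and down arrow keys. Shown full screen.
Fragment: hidden at first and revealed by the space bar or an arrow key, which gives a dynamic effect. Stays on the same page.
Skip: a cell that is not shown in the slideshow.
Notes: speaker notes, also not shown in the slideshow.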
