Source: Python学习交流

Suppose you are searching the web for raw data that a project needs. The bad news is that the data lives inside a web page, and there is no API for fetching it in raw form.

So now you have to spend 30 minutes writing a script to scrape the data (and it ends up taking 2 hours). It is not hard, but it is a waste of time. The Pandas library has a built-in method, read_html(), that extracts tabular data from HTML pages: https://pandas./

import pandas as pd
tables = pd.read_html('https://apps./sdfiredispatch/')
print(tables[0])
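It is that simple! Pandas finds every significant HTML table on the page and returns each one as a new DataFrame object. read_html() returns them as a list, which is why the code above indexes tables[0]. https://pandas./pandas-docs/stable/dsintro.html#dataframe

Next, tell Pandas that row 0 of the table holds the column headers, and ask it to convert the text-based dates into datetime objects:
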
import pandas as pd
calls_df, = pd.read_html('http://apps./sdfiredispatch/', header=0, parse_dates=['Call Date'])
print(calls_df)
This gives:

Call Date            Call Type  Street       Cross Streets              Unit
2017-06-02 17:27:58  Medical    HIGHLAND AV  WIGHTMAN ST/UNIVERSITY AV  E17
2017-06-02 17:27:58  Medical    HIGHLAND AV  WIGHTMAN ST/UNIVERSITY AV  M34
2017-06-02 17:23:51  Medical    EMERSON ST   LOCUST ST/EVERGREEN ST     E22
2017-06-02 17:23:51  Medical    EMERSON ST   LOCUST ST/EVERGREEN ST     M47
2017-06-02 17:23:15  Medical    MARAUDER WY  BARON LN/FROBISHER ST      E38
2017-06-02 17:23:15  Medical    MARAUDER WY  BARON LN/FROBISHER ST      M41

If the data is not yet available as JSON records, that takes just one more line of code:

import pandas as pd
calls_df, = pd.read_html('http://apps./sdfiredispatch/', header=0, parse_dates=['Call Date'])
print(calls_df.to_json(orient='records', date_format='iso'))
Running this code gives you clean JSON output (with proper ISO 8601 date formatting, too):

[
  {
    "Call Date": "2017-06-02T17:34:00.000Z",
    "Call Type": "Medical",
    "Street": "ROSECRANS ST",
    "Cross Streets": "HANCOCK ST/ALLEY",
    "Unit": "M21"
  },
  {
    "Call Date": "2017-06-02T17:34:00.000Z",
    "Call Type": "Medical",
    "Street": "ROSECRANS ST",
    "Cross Streets": "HANCOCK ST/ALLEY",
    "Unit": "T20"
  },
  {
    "Call Date": "2017-06-02T17:30:34.000Z",
    "Call Type": "Medical",
    "Street": "SPORTS ARENA BL",
    "Cross Streets": "CAM DEL RIO WEST/EAST DR",
    "Unit": "E20"
  }
  // etc...
]

You can even save the data to a CSV or XLS file:

import pandas as pd
calls_df, = pd.read_html('http://apps./sdfiredispatch/', header=0, parse_dates=['Call Date'])
calls_df.to_csv('calls.csv', index=False)
Run it, then double-click calls.csv to open it in your spreadsheet application.

Of course, Pandas also makes it simple to filter, sort, or process the data:

>>> calls_df.describe()
                  Call Date Call Type      Street           Cross Streets Unit
count                    69        69          69                      64   69
unique                   29         2          29                      27   60
top     2017-06-02 16:59:50   Medical  CHANNEL WY  LA SALLE ST/WESTERN ST   E1
freq                      5        66           5                       5    2
first   2017-06-02 16:36:46       NaN         NaN                     NaN  NaN
last    2017-06-02 17:41:30       NaN         NaN                     NaN  NaN
>>> calls_df.groupby('Call Type').count()
                       Call Date  Street  Cross Streets  Unit
Call Type
Medical                       66      66             61    66
Traffic Accident (L1)          3       3              3     3
>>> calls_df['Unit'].unique()
array(['E46', 'MR33', 'T40', 'E201', 'M6', 'E34', 'M34', 'E29', 'M30', 'M43', 'M21', 'T20', 'E20', 'M20', 'E26', 'M32', 'SQ55', 'E1', 'M26', 'BLS4', 'E17', 'E22', 'M47', 'E38', 'M41', 'E5', 'M19', 'E28', 'M1', 'E42', 'M42', 'E23', 'MR9', 'PD', 'LCCNOT', 'M52', 'E45', 'M12', 'E40', 'MR40', 'M45', 'T1', 'M23', 'E14', 'M2', 'E39', 'M25', 'E8', 'M17', 'E4', 'M22', 'M37', 'E7', 'M31', 'E9', 'M39', 'SQ56', 'E10', 'M44', 'M11'], dtype=object)
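
The session above covers describe(), groupby(), and unique(). As a minimal sketch of the filtering and sorting that the article also mentions, the example below builds a small stand-in for calls_df by hand from four rows of the sample output, so it runs without fetching the page. It keeps the rows whose Unit code starts with "E" and sorts them by Call Date. The names e_units and e_units_by_time are illustrative assumptions and do not appear in the original article.

```python
import pandas as pd

# Hypothetical stand-in for calls_df: four rows copied from the sample
# output earlier in the article, so no network access is needed.
calls_df = pd.DataFrame(
    {
        "Call Date": pd.to_datetime(
            [
                "2017-06-02 17:27:58",
                "2017-06-02 17:27:58",
                "2017-06-02 17:23:51",
                "2017-06-02 17:23:15",
            ]
        ),
        "Call Type": ["Medical"] * 4,
        "Street": ["HIGHLAND AV", "HIGHLAND AV", "EMERSON ST", "MARAUDER WY"],
        "Cross Streets": [
            "WIGHTMAN ST/UNIVERSITY AV",
            "WIGHTMAN ST/UNIVERSITY AV",
            "LOCUST ST/EVERGREEN ST",
            "BARON LN/FROBISHER ST",
        ],
        "Unit": ["E17", "M34", "E22", "E38"],
    }
)

# Filter: a boolean mask keeps only rows whose unit code starts with "E".
e_units = calls_df[calls_df["Unit"].str.startswith("E")]

# Sort: order the remaining rows from the earliest call to the latest.
e_units_by_time = e_units.sort_values("Call Date")

print(e_units_by_time[["Call Date", "Street", "Unit"]])
```

With a DataFrame obtained from read_html(), the same two expressions should apply unchanged, provided the column names match those in the page's table.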