The Ultimate Way to Scrape Web Page Data with Python!

萬皇之皇 2019-01-25


Source / Python學習交流 (Python Learning Exchange)

Suppose you are searching online for the raw data a project needs, but the bad news is that the data lives inside a web page and there is no API for fetching it.

So now you have to waste 30 minutes writing a script to get the data (and end up spending 2 hours).

It is not hard, but it is a waste of time.

The Pandas library has a built-in function named read_html() that extracts tabular data from HTML pages:

https://pandas./

import pandas as pd

tables = pd.read_html('https://apps./sdfiredispatch/')

print(tables[0])

It's that simple! Pandas finds all the significant HTML tables on the page and returns them as a list, with each table as a new DataFrame object.

https://pandas./pandas-docs/stable/dsintro.html#dataframe
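Because read_html() returns a list, it helps to see what that list looks like when a page holds more than one table. The following is a minimal sketch on a made-up, inlined two-table page (the HTML, its column names, and the variable names are all hypothetical), and it assumes an HTML parser that pandas can use, such as lxml, is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page, inlined so the sketch runs offline.
HTML = """
<html><body>
<table>
  <tr><th>Unit</th><th>Call Type</th></tr>
  <tr><td>E17</td><td>Medical</td></tr>
  <tr><td>M34</td><td>Medical</td></tr>
</table>
<table>
  <tr><th>Station</th><th>Crew</th></tr>
  <tr><td>A</td><td>4</td></tr>
</table>
</body></html>
"""

# read_html returns a list with one DataFrame per table it finds.
tables = pd.read_html(StringIO(HTML))
print(len(tables))
print(tables[0])

# match keeps only the tables whose text matches the given string or regex.
station_tables = pd.read_html(StringIO(HTML), match="Station")
print(station_tables[0])
```

On a real page, indexing the list (tables[0]) or narrowing it with match is usually the first step before any further processing.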

Tell it that row 0 of the table holds the column headers, and ask it to convert the text-based dates into datetime objects:

import pandas as pd

calls_df, = pd.read_html('http://apps./sdfiredispatch/', header=0, parse_dates=['Call Date'])

print(calls_df)

This gives:

            Call Date        Call Type              Street                             Cross Streets    Unit
  2017-06-02 17:27:58          Medical         HIGHLAND AV                 WIGHTMAN ST/UNIVERSITY AV     E17
  2017-06-02 17:27:58          Medical         HIGHLAND AV                 WIGHTMAN ST/UNIVERSITY AV     M34
  2017-06-02 17:23:51          Medical          EMERSON ST                    LOCUST ST/EVERGREEN ST     E22
  2017-06-02 17:23:51          Medical          EMERSON ST                    LOCUST ST/EVERGREEN ST     M47
  2017-06-02 17:23:15          Medical         MARAUDER WY                     BARON LN/FROBISHER ST     E38
  2017-06-02 17:23:15          Medical         MARAUDER WY                     BARON LN/FROBISHER ST     M41
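To see what header=0 changes, the sketch below reads a small made-up table whose first row is ordinary cells rather than header cells. The HTML and the name raw_df are hypothetical, and the date conversion uses pd.to_datetime() after loading, which is an alternative to passing parse_dates rather than the approach shown above. It again assumes a parser such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical table with no <th> cells: the header is just the first row.
HTML = """
<table>
  <tr><td>Call Date</td><td>Call Type</td><td>Unit</td></tr>
  <tr><td>2017-06-02 17:27:58</td><td>Medical</td><td>E17</td></tr>
  <tr><td>2017-06-02 17:23:51</td><td>Medical</td><td>E22</td></tr>
</table>
"""

# Without header=0 the titles are treated as a data row.
raw_df, = pd.read_html(StringIO(HTML))
print(raw_df.shape)

# header=0 promotes the first row to column labels.
calls_df, = pd.read_html(StringIO(HTML), header=0)

# Alternative to parse_dates: convert the text column after loading.
calls_df["Call Date"] = pd.to_datetime(calls_df["Call Date"])
print(calls_df.dtypes)
```

The trailing comma in `calls_df, = ...` unpacks a one-element list, so it raises an error if the page ever contains a different number of tables.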

It takes just one more line of code to make the data available as JSON records.

import pandas as pd

calls_df, = pd.read_html('http://apps./sdfiredispatch/', header=0, parse_dates=['Call Date'])

print(calls_df.to_json(orient='records', date_format='iso'))

Run the code above and you get nice JSON output (even with proper ISO 8601 date formatting):

[
  {
    "Call Date": "2017-06-02T17:34:00.000Z",
    "Call Type": "Medical",
    "Street": "ROSECRANS ST",
    "Cross Streets": "HANCOCK ST/ALLEY",
    "Unit": "M21"
  },
  {
    "Call Date": "2017-06-02T17:34:00.000Z",
    "Call Type": "Medical",
    "Street": "ROSECRANS ST",
    "Cross Streets": "HANCOCK ST/ALLEY",
    "Unit": "T20"
  },
  {
    "Call Date": "2017-06-02T17:30:34.000Z",
    "Call Type": "Medical",
    "Street": "SPORTS ARENA BL",
    "Cross Streets": "CAM DEL RIO WEST/EAST DR",
    "Unit": "E20"
  }
  // etc...
]
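The shape of the records output can be checked without hitting the network. This is a small sketch on a hand-built DataFrame (the rows are made up to resemble the dispatch table, not fetched from the page) that converts to JSON and parses the result back with the standard json module. The exact timestamp suffix may differ between pandas versions.

```python
import json

import pandas as pd

# Hypothetical rows shaped like the dispatch table.
calls_df = pd.DataFrame({
    "Call Date": pd.to_datetime(["2017-06-02 17:34:00", "2017-06-02 17:30:34"]),
    "Call Type": ["Medical", "Medical"],
    "Unit": ["M21", "E20"],
})

# orient="records" emits one JSON object per row;
# date_format="iso" writes timestamps as ISO 8601 strings.
json_text = calls_df.to_json(orient="records", date_format="iso")

# Parse the text back to inspect the structure: a list of dicts.
records = json.loads(json_text)
print(records[0]["Unit"], records[0]["Call Date"])
```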

You can even save the data to a CSV or XLS file:

import pandas as pd

calls_df, = pd.read_html('http://apps./sdfiredispatch/', header=0, parse_dates=['Call Date'])

calls_df.to_csv('calls.csv', index=False)

Run it and double-click calls.csv to open it in a spreadsheet.
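One thing to keep in mind is that CSV stores dates as plain text. The sketch below, again on a hand-built DataFrame with made-up rows, writes the file to a temporary directory and reads it back, parsing the date column a second time. The temporary directory and the name restored_df are only there to keep the example self-contained.

```python
import os
import tempfile

import pandas as pd

# Hypothetical rows shaped like the dispatch table.
calls_df = pd.DataFrame({
    "Call Date": pd.to_datetime(["2017-06-02 17:34:00", "2017-06-02 17:30:34"]),
    "Call Type": ["Medical", "Medical"],
    "Unit": ["M21", "E20"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "calls.csv")
    # index=False leaves the row labels (0, 1, ...) out of the file.
    calls_df.to_csv(csv_path, index=False)
    # Dates are stored as text in CSV, so parse them again when reading.
    restored_df = pd.read_csv(csv_path, parse_dates=["Call Date"])

print(restored_df.dtypes)
```

For an Excel file, DataFrame.to_excel() works along the same lines, but it needs a separate Excel writer package such as openpyxl to be installed.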

Of course, Pandas also makes it easy to filter, sort, or otherwise process the data:

>>> calls_df.describe()

              Call Date Call Type      Street           Cross Streets Unit
count                    69        69          69                      64   69
unique                   29         2          29                      27   60
top     2017-06-02 16:59:50   Medical  CHANNEL WY  LA SALLE ST/WESTERN ST   E1
freq                      5        66           5                       5    2
first   2017-06-02 16:36:46       NaN         NaN                     NaN  NaN
last    2017-06-02 17:41:30       NaN         NaN                     NaN  NaN

>>> calls_df.groupby('Call Type').count()

                      Call Date  Street  Cross Streets  Unit
Call Type
Medical                       66      66             61    66
Traffic Accident (L1)          3       3              3     3

>>> calls_df['Unit'].unique()

array(['E46', 'MR33', 'T40', 'E201', 'M6', 'E34', 'M34', 'E29', 'M30',
      'M43', 'M21', 'T20', 'E20', 'M20', 'E26', 'M32', 'SQ55', 'E1',
      'M26', 'BLS4', 'E17', 'E22', 'M47', 'E38', 'M41', 'E5', 'M19',
      'E28', 'M1', 'E42', 'M42', 'E23', 'MR9', 'PD', 'LCCNOT', 'M52',
      'E45', 'M12', 'E40', 'MR40', 'M45', 'T1', 'M23', 'E14', 'M2', 'E39',
      'M25', 'E8', 'M17', 'E4', 'M22', 'M37', 'E7', 'M31', 'E9', 'M39',
      'SQ56', 'E10', 'M44', 'M11'], dtype=object)
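The session above summarises and groups the data. Filtering and sorting follow the same pattern, as in this minimal sketch on a hand-built DataFrame (the three rows and the variable names are made up for illustration):

```python
import pandas as pd

# Hypothetical rows shaped like the dispatch table.
calls_df = pd.DataFrame({
    "Call Date": pd.to_datetime([
        "2017-06-02 17:27:58",
        "2017-06-02 17:23:51",
        "2017-06-02 17:41:30",
    ]),
    "Call Type": ["Medical", "Traffic Accident (L1)", "Medical"],
    "Unit": ["E17", "E22", "M21"],
})

# Filter: keep only the medical calls with a boolean mask.
medical_df = calls_df[calls_df["Call Type"] == "Medical"]

# Sort: most recent call first.
latest_first = calls_df.sort_values("Call Date", ascending=False)

# Count: number of calls per call type.
type_counts = calls_df["Call Type"].value_counts()

print(medical_df)
print(latest_first["Unit"].tolist())
print(type_counts)
```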
