Python Plugin Ramblings (4) ---- BeautifulSoup, a Web Page Parsing Tool for Python

liluvu 2012-09-27

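Ahem, 米趣網, the core Chinese-language MeeGo site, has a new blog post.
The previous post introduced PyQuery; this one turns to BeautifulSoup. Beautiful Soup is a third-party web page parsing library for Python (it is not part of the standard library), and its name literally means "beautiful soup". Heh, and sometimes it really does turn a messy page into something beautiful.
First, a short introduction:
Beautiful Soup is an HTML/XML parser for Python, designed for quick-turnaround projects such as screen scraping. Three features make it powerful: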

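Beautiful Soup won't choke if you give it broken markup. It yields a parse tree that makes approximately as much sense as your original document, which is usually good enough to collect the data you need.
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to write a custom parser for each application.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings unless the document doesn't declare one and Beautiful Soup can't autodetect it; in that case you only have to specify the original encoding yourself.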
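To make the first and third points a bit more concrete, here is a minimal sketch. It assumes the current Beautiful Soup 4 package (`bs4`, installed as `beautifulsoup4`) on Python 3 rather than the BeautifulSoup 3 module used in the rest of this post, and the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Broken markup: neither <p> nor <b> is ever closed.
broken = "<p>Unclosed <b>bold text"
soup = BeautifulSoup(broken, "html.parser")
print(soup.b.string)  # the text is still reachable through the tree

# Bytes in a legacy encoding go in; the tree holds Unicode strings.
raw = "<p>caf\u00e9</p>".encode("latin-1")
soup2 = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")
text = soup2.p.string               # a Unicode str
utf8_bytes = soup2.encode("utf-8")  # serialized back out as UTF-8 bytes
print(text, utf8_bytes)
```

Here the encoding is passed explicitly through `from_encoding`; when the document declares its encoding, or detection succeeds, that argument can usually be left out.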

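Beautiful Soup parses whatever you give it and does the tree traversal for you. You can tell it "find all the links", or "find all the links whose class is externalLink", or "find all the links whose URL matches foo.com", or even "find the table heading that has bold text, then give me that text".
Valuable data locked up in poorly designed websites is suddenly within reach. Work that would have taken hours can be done in minutes with Beautiful Soup.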
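Those four requests translate fairly directly into code. The sketch below assumes Beautiful Soup 4 (`bs4`) on Python 3; the HTML snippet and the variable names are invented purely for illustration.

```python
import re
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Hypothetical page fragment, made up for this example.
html = """
<a href="http://foo.com/a" class="externalLink">A</a>
<a href="/local">B</a>
<table><tr><th><b>Name</b></th><th>Plain</th></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# "Find all the links"
all_links = soup.find_all("a")

# "Find all the links whose class is externalLink"
external = soup.find_all("a", class_="externalLink")

# "Find all the links whose URL matches foo.com"
foo_links = soup.find_all("a", href=re.compile(r"foo\.com"))

# "Find the table heading that has bold text, then give me that text"
bold_heading = next(th.b.get_text() for th in soup.find_all("th") if th.b)

print(len(all_links), len(external), foo_links[0]["href"], bold_heading)
```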
Now for a quick start.
First, import the package:

from BeautifulSoup import BeautifulSoup # For processing HTML
from BeautifulSoup import BeautifulStoneSoup # For processing XML
import BeautifulSoup # To get everything


The following code demonstrates the basic use of Beautiful Soup. You can copy and paste it and run it yourself.

from BeautifulSoup import BeautifulSoup
import re
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

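The listing above is written for BeautifulSoup 3, which only runs on Python 2 (note the `print` statement). On Python 3 the library is distributed as Beautiful Soup 4 and imported from `bs4`. A rough equivalent might look like the sketch below; the explicit `"html.parser"` argument and the fully closed `<p>` tags are choices made for this example, since different parsers can repair unclosed tags differently.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Same page as in the quick start, but with every <p> closed explicitly.
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>',
       '</body></html>']

# Naming the parser keeps the result the same from machine to machine.
soup = BeautifulSoup(''.join(doc), "html.parser")

pretty = soup.prettify()  # one tag or string per line, indented by depth
print(pretty)
```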

Here are some ways to navigate the parsed document:

soup.contents[0].name
# u'html'
soup.contents[0].contents[0].name
# u'head'
head = soup.contents[0].contents[0]
head.parent.name
# u'html'
head.next
# <title>Page title</title>
head.nextSibling.name
# u'body'
head.nextSibling.contents[0]
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
head.nextSibling.contents[0].nextSibling
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>

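In Beautiful Soup 4 the same navigation works, but the camelCase attributes have snake_case names: `nextSibling` becomes `next_sibling`, and `next` becomes `next_element`. A small sketch, assuming `bs4` on Python 3 and a simplified version of the sample page:

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# No whitespace between tags, so .contents holds only tags here.
html = ('<html><head><title>Page title</title></head>'
        '<body><p id="firstpara">One</p><p id="secondpara">Two</p></body></html>')
soup = BeautifulSoup(html, "html.parser")

head = soup.contents[0].contents[0]    # <html> -> <head>
parent_name = head.parent.name         # back up to <html>
first_inside = head.next_element       # next node in document order: <title>
body = head.next_sibling               # next node at the same level: <body>
first_para = body.contents[0]
second_para = first_para.next_sibling

print(head.name, parent_name, first_inside.name, body.name,
      first_para["id"], second_para["id"])
```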

Next, here are several ways to search the document for particular tags, or for tags with particular attributes:

titleTag = soup.html.head.title
titleTag
# <title>Page title</title>
titleTag.string
# u'Page title'
len(soup('p'))
# 2
soup.findAll('p', align="center")
# [<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>]
soup.find('p', align="center")
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
soup('p', align="center")[0]['id']
# u'firstpara'
soup.find('p', align=re.compile('^b.*'))['id']
# u'secondpara'
soup.find('p').b.string
# u'one'
soup('p')[1].b.string
# u'two'

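For readers on Beautiful Soup 4, the searches above carry over almost unchanged; the main difference is that `findAll` is spelled `find_all`. The sketch below assumes `bs4` on Python 3 and rebuilds the sample page with closed `<p>` tags so it runs on its own.

```python
import re
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

html = ('<html><head><title>Page title</title></head><body>'
        '<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
        '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
        '</body></html>')
soup = BeautifulSoup(html, "html.parser")

title_text = soup.html.head.title.string   # walk down by tag name
para_count = len(soup("p"))                # calling soup(...) is shorthand for find_all
centered_id = soup.find_all("p", align="center")[0]["id"]    # keyword = attribute filter
b_align_id = soup.find("p", align=re.compile("^b"))["id"]    # regex on the attribute value
first_bold = soup.find("p").b.string
second_bold = soup("p")[1].b.string

print(title_text, para_count, centered_id, b_align_id, first_bold, second_bold)
```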

Of course, the document is also easy to modify:

titleTag['id'] = 'theTitle'
titleTag.contents[0].replaceWith("New title")
soup.html.head
# <head><title id="theTitle">New title</title></head>
soup.p.extract()
soup.prettify()
# <html>
#  <head>
#   <title id="theTitle">
#    New title
#   </title>
#  </head>
#  <body>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>
soup.p.replaceWith(soup.b)
# <html>
#  <head>
#   <title id="theTitle">
#    New title
#   </title>
#  </head>
#  <body>
#   <b>
#    two
#   </b>
#  </body>
# </html>
soup.body.insert(0, "This page used to have ")
soup.body.insert(2, " &lt;p&gt; tags!")
soup.body
# <body>This page used to have <b>two</b> &lt;p&gt; tags!</body>

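The same edits can be sketched with Beautiful Soup 4, where `replaceWith` is spelled `replace_with`. This assumes `bs4` on Python 3 and a self-contained copy of the sample page. The final string is passed with a raw `<p>`, on the understanding that Beautiful Soup 4 escapes the angle brackets itself when the tree is printed.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

html = ('<html><head><title>Page title</title></head><body>'
        '<p id="firstpara">This is paragraph <b>one</b>.</p>'
        '<p id="secondpara">This is paragraph <b>two</b>.</p>'
        '</body></html>')
soup = BeautifulSoup(html, "html.parser")

# Set an attribute and swap out the title text.
title_tag = soup.title
title_tag["id"] = "theTitle"
title_tag.string.replace_with("New title")

soup.p.extract()               # drop the first paragraph from the tree
soup.p.replace_with(soup.b)    # replace the remaining paragraph with its own <b>

# Insert plain strings around the <b> that is left in <body>.
soup.body.insert(0, "This page used to have ")
soup.body.insert(2, " <p> tags!")

body_text = soup.body.get_text()
print(soup.head)
print(soup.body)
```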

Finally, for everything else, see the Beautiful Soup documentation. I hope this helps.
