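Beautiful Soup Chinese Tutorial

Beautiful Soup is a Python module for processing HTML/XML, and it is quite powerful. I recently read through its help documentation carefully and finally understood some of it. I plan to study it properly and, along the way, collect some notes on Beautiful Soup usage on this wiki, since that documentation really is not very good.

Official Beautiful Soup page: http://www./software/BeautifulSoup/

Downloading and installing BeautifulSoup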
Download address:

Installation is very simple. BeautifulSoup consists of a single file; copy that file into your working directory and it is ready to use.

    from BeautifulSoup import BeautifulSoup       # For processing HTML
    from BeautifulSoup import BeautifulStoneSoup  # For processing XML
    import BeautifulSoup                          # To get everything

Creating a BeautifulSoup object

A BeautifulSoup object can be created from a piece of HTML text. The following code creates one:

    import re  # needed for the regular-expression searches further down
    from BeautifulSoup import BeautifulSoup

    doc = ['<html><head><title>PythonClub.org</title></head>',
           '<body><p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.',
           '<p id="secondpara" align="blah">This is paragraph <b>two</b> of .',
           '</html>']
    soup = BeautifulSoup(''.join(doc))

Finding specific elements in the HTML

BeautifulSoup lets you reach a given HTML element directly with ".".

Searching by HTML tag: finding the html title

Use soup.html.head.title to get the title element, then read its name and its string value.

    >>> soup.html.head.title
    <title>PythonClub.org</title>
    >>> soup.html.head.title.name
    u'title'
    >>> soup.html.head.title.string
    u'PythonClub.org'

You can also jump straight to the element with soup.title:

    >>> soup.title
    <title>PythonClub.org</title>

Searching by HTML content: finding the whole tag that contains a given string

The example below finds the text nodes containing "para", then uses .parent to get the enclosing tag and .contents to list that tag's children:

    >>> soup.findAll(text=re.compile("para"))
    [u'This is paragraph ', u'This is paragraph ']
    >>> soup.findAll(text=re.compile("para"))[0].parent
    <p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>
    >>> soup.findAll(text=re.compile("para"))[0].parent.contents
    [u'This is paragraph ', <b>one</b>, u' of ptyhonclub.org.']

Searching by attribute (for example the id used in CSS)

Both calls below select every tag whose id ends in "para". The keyword form and the attrs dictionary form are equivalent.

    soup.findAll(id=re.compile("para$"))
    # [<p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>,
    #  <p id="secondpara" align="blah">This is paragraph <b>two</b> of .</p>]

    soup.findAll(attrs={'id' : re.compile("para$")})
    # [<p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>,
    #  <p id="secondpara" align="blah">This is paragraph <b>two</b> of .</p>]

Understanding BeautifulSoup in depth
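To make the attribute search with a regular expression less opaque, here is a minimal sketch that reproduces the selection rule using only Python 3's standard library `html.parser`. It is not how BeautifulSoup is implemented, and it returns (tag, value) pairs rather than tag objects. The names `AttrMatcher`, `find_by_attr`, and `DOC` are hypothetical and introduced only for this illustration, and the sample HTML closes its `<p>` tags explicitly and shortens the paragraph text. The point it shows is that a pattern such as `re.compile("para$")` is applied as a search against each attribute value, so it matches any id that ends in "para".

```python
import re
from html.parser import HTMLParser

# Assumption: a simplified copy of the tutorial's sample, with <p> closed explicitly.
DOC = (
    '<html><head><title>PythonClub.org</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)


class AttrMatcher(HTMLParser):
    """Hypothetical helper: record start tags whose attribute matches a regex."""

    def __init__(self, attr_name, pattern):
        super().__init__()
        self.attr_name = attr_name
        self.pattern = pattern
        self.matches = []  # collected (tag name, attribute value) pairs

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name != self.attr_name or value is None:
                continue
            # search() matches anywhere in the value; "$" anchors it to the end.
            if self.pattern.search(value):
                self.matches.append((tag, value))


def find_by_attr(html, attr_name, pattern):
    """Return (tag, value) pairs for tags whose attr_name value matches pattern."""
    parser = AttrMatcher(attr_name, pattern)
    parser.feed(html)
    parser.close()
    return parser.matches


if __name__ == "__main__":
    print(find_by_attr(DOC, "id", re.compile("para$")))
    # prints [('p', 'firstpara'), ('p', 'secondpara')]
```

Changing the pattern to `re.compile("^para")` returns an empty list here, because neither id starts with "para"; that is the practical difference the `$` anchor makes in the findAll calls above.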