Level 2: BeautifulSoup

zlq百科全書 2021-05-30

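1. What BeautifulSoup is

BeautifulSoup parses web pages and extracts data from them:

(1) Parsing: it translates the HTML source returned by the server into a form we can work with;

(2) Extracting: it picks the data we need out of everything else on the page.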

2. How to use BeautifulSoup

2-1. Installing BeautifulSoup

Windows: pip install beautifulsoup4

Mac: pip3 install beautifulsoup4

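2-2. Parsing data with BeautifulSoup

bs_object = BeautifulSoup(text_to_parse, 'parser')

The call takes two arguments:

① Argument 0 is the text to parse. In this tutorial it is always a string, although BeautifulSoup also accepts an open file handle.

② Argument 1 names the parser. We use html.parser, which ships with Python's standard library; it is not the only parser available.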
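To make the two arguments concrete, here is a minimal sketch that parses a short inline string instead of a downloaded page. The markup and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a page's HTML source.
markup = "<p>Hello, <b>soup</b></p>"

# Argument 0: the text to parse. Argument 1: the parser name.
soup = BeautifulSoup(markup, "html.parser")

print(isinstance(soup, BeautifulSoup))  # True
print(soup.p.text)                      # Hello, soup
```

Because nothing is downloaded, this is a convenient way to check that the install works before adding requests to the mix.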

import requests
from bs4 import BeautifulSoup  # import the BeautifulSoup class

res = requests.get('https://localprod./python-manuscript/crawler-html/spider-men5.0.html')
html = res.text  # response body as a string
soup = BeautifulSoup(html, 'html.parser')  # parse the page into a BeautifulSoup object

2-3. Extracting data with BeautifulSoup

2-3-1. find() and find_all()

find() and find_all() are two methods of the BeautifulSoup object. They match HTML tags and attributes and pull the matching elements out of the BeautifulSoup object:

① find() returns only the first element that matches;

import requests
from bs4 import BeautifulSoup

url = 'https://localprod./python-manuscript/crawler-html/spder-men0.0.html'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
item = soup.find('div')  # find() returns the first <div> element; store it in item
print(item)  # print item
# Result: <div>大家好，我是一個塊</div>

② find_all() returns every element that matches.

import requests
from bs4 import BeautifulSoup

url = 'https://localprod./python-manuscript/crawler-html/spder-men0.0.html'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
items = soup.find_all('div')  # find_all() returns every matching element; store them in items
print(items)  # print items
# Result: [<div>大家好，我是一個塊</div>, <div>我也是一個塊</div>, <div>我還是一個塊</div>]
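The difference between the two methods can also be checked offline. The sketch below uses a made-up three-block page (not the course page) and additionally shows what each method returns when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical page with three <div> blocks.
html = """
<div>first block</div>
<div>second block</div>
<div>third block</div>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("div")      # a single Tag: the first match
every = soup.find_all("div")  # a list-like result holding all matches

print(first.text)                # first block
print(len(every))                # 3
print([d.text for d in every])   # ['first block', 'second block', 'third block']

# When nothing matches, find() gives None and find_all() gives an empty result.
missing = soup.find("table")
none_found = soup.find_all("table")
print(missing, len(none_found))  # None 0
```

The None case matters in practice: calling .text on the result of a find() that matched nothing raises an AttributeError.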

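Note:

The arguments to find() and find_all() can be a tag, an attribute, or both together; which to use depends on what we want to extract from the page.

(1) The keyword inside the parentheses is class_, with a trailing underscore, to keep it distinct from Python's reserved word class and avoid a syntax error. Besides class, other attributes such as style can also be used for matching;

(2) If one argument is enough to pin down the target, search with just that one. If the target can only be located when the tag and the attribute both match, pass both.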
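The following sketch contrasts the options from the note: matching by tag only, by attribute only, and by both. The HTML and its class and style values are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tags share the class "books".
html = """
<div class="books" style="color:red">A</div>
<div class="notes">B</div>
<p class="books">C</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.find_all("div")                   # tag only
by_class = soup.find_all(class_="books")        # attribute only
by_both = soup.find_all("div", class_="books")  # tag and attribute
by_style = soup.find_all(style="color:red")     # a different attribute

print([t.text for t in by_tag])    # ['A', 'B']
print([t.text for t in by_class])  # ['A', 'C']
print([t.text for t in by_both])   # ['A']
print([t.text for t in by_style])  # ['A']
```

Here neither the tag nor the class alone isolates block A, which is the situation where passing both arguments pays off.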

import requests  # import the requests library
from bs4 import BeautifulSoup  # import the BeautifulSoup class

res = requests.get('https://localprod./python-manuscript/crawler-html/spider-men5.0.html')  # returns a Response object, assigned to res
html = res.text  # the body of the Response as a string
soup = BeautifulSoup(html, 'html.parser')  # parse the page into a BeautifulSoup object
items = soup.find_all(class_='books')  # extract the elements whose class attribute is 'books'
print(items)  # print items

2-3-2. Tag objects

import requests  # import the requests library
from bs4 import BeautifulSoup  # import the BeautifulSoup class

res = requests.get('https://localprod./python-manuscript/crawler-html/spider-men5.0.html')  # returns a Response object, assigned to res
html = res.text  # the body of the Response as a string
soup = BeautifulSoup(html, 'html.parser')  # parse the page into a BeautifulSoup object
items = soup.find_all(class_='books')  # extract the elements whose class attribute is 'books'

for item in items:  # each item in the list is a Tag object
    kind = item.find('h2')  # within this element, match the <h2> tag
    title = item.find(class_='title')  # within this element, match class='title'
    brief = item.find(class_='info')  # within this element, match class='info'
    # print the book's category, title, link, and blurb
    print(kind.text, '\n', title.text, '\n', title['href'], '\n', brief.text)
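The loop above relies on three things a Tag object offers: a nested find(), the .text attribute, and dictionary-style access to HTML attributes. A minimal offline sketch, assuming a made-up element shaped like one entry of the course page (the real page's markup may differ):

```python
from bs4 import BeautifulSoup

# Hypothetical element; class names mirror those used in the loop above.
html = (
    '<div class="books">'
    '<h2>Science fiction</h2>'
    '<a class="title" href="https://example.com/book">Example Book</a>'
    '<p class="info">A short blurb.</p>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

item = soup.find(class_="books")   # a Tag object
title = item.find(class_="title")  # Tag.find() searches inside this Tag only

print(title.name)           # a
print(title.text)           # Example Book
print(title["href"])        # https://example.com/book
print(title.get("target"))  # None
```

title['href'] raises a KeyError when the attribute is absent, so .get() is the safer choice for attributes that may be missing.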

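3. How the objects change

Object flow: Response object -> string -> BS object. From the BS object there are two paths:

① BS object -> Tag object (via find());

② BS object -> list -> Tag object (via find_all(), then iterating over the result).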
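One way to see this flow is to print the type at each step. The sketch below starts from a string (standing in for res.text, so no network request is needed) and uses made-up markup.

```python
from bs4 import BeautifulSoup

# Hypothetical string standing in for res.text.
html = "<ul><li>one</li><li>two</li></ul>"

soup = BeautifulSoup(html, "html.parser")  # string -> BS object
single = soup.find("li")                   # BS object -> Tag
many = soup.find_all("li")                 # BS object -> list-like result
from_list = many[0]                        # list -> Tag

for label, obj in [("html", html), ("soup", soup), ("single", single),
                   ("many", many), ("from_list", from_list)]:
    print(label, type(obj).__name__)
```

The result of find_all() reports its type as ResultSet, a list subclass provided by bs4, which is why it can be indexed and looped over like an ordinary list.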
