I. Preliminary Preparation

1. Scrapy Overview

Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs covering data mining, information processing, and archiving historical data. Although it was originally designed for scraping and data collection, it is just as applicable to fetching data returned by APIs and to general-purpose web crawling. Simply put, compared with an ordinary hand-written crawler, Scrapy is "organized and disciplined", which makes it far easier to build large-scale scraping projects. Scrapy's architecture consists of the following components, explained one by one:

- Engine: controls the data flow between all the other components.
- Scheduler: queues requests received from the Engine and feeds them back when asked.
- Downloader: fetches web pages and hands the responses back to the Engine.
- Spiders: user-written classes that parse responses and extract items or follow-up requests.
- Item Pipeline: processes (cleans, validates, stores) the items extracted by the Spiders.
- Downloader/Spider Middlewares: hooks that sit between the Engine and the Downloader or Spiders.
With the components introduced above, Scrapy's workflow can be described as follows: the Engine takes the initial requests from the Spider and hands them to the Scheduler; the Scheduler returns the next request to the Engine, which sends it through the Downloader Middlewares to the Downloader; the Downloader fetches the page and passes the response back to the Engine, which forwards it to the Spider; the Spider parses the response, yielding items (sent on to the Item Pipeline) and new requests (sent back to the Scheduler); the cycle repeats until no requests remain.
2. Installing and Configuring Scrapy

Next, install Scrapy. Scrapy already supports Python 3; the environment used here is Windows 10 with Anaconda 3, and installation was tested without problems. First install Scrapy via pip:

pip install scrapy

Then open a Python shell and import it; if no error is raised, the installation has most likely succeeded.
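As a quick sanity check, the following minimal Python sketch (it assumes only a successful pip install) imports Scrapy and prints its version:

import scrapy

# If the import succeeds and a version string is printed, Scrapy is installed correctly
print(scrapy.__version__)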
3. A First Scrapy Test

Next we run a quick test with a small example that crawls Baidu. In cmd, cd into any directory and run:

scrapy startproject littletest

Then switch into the project directory and register the target site with the genspider command (a hedged example of the exact invocation is shown below):
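The original post does not show the exact genspider invocation; assuming the spider is named baidu and targets www.baidu.com, it would look like this:

cd littletest
scrapy genspider baidu www.baidu.com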
Then enter the directory and inspect it; the directory structure is as follows:
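The original screenshot is not reproduced in this extract; a freshly generated Scrapy project follows this standard layout (the spider filename baidu.py assumes the genspider command above):

littletest/
    scrapy.cfg            # deployment configuration
    littletest/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            baidu.py      # generated by genspider (assumed name)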
At the same time, open settings.py and set the ROBOTSTXT_OBEY option to False, i.e. do not obey robots.txt; otherwise many sites cannot be fetched properly.

ROBOTSTXT_OBEY = False

Finally, start the Scrapy crawler from the command line:
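The launch command is not shown in the original; assuming the spider generated above is named baidu, it would be:

scrapy crawl baidu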
The result is shown below: a status code of 200 and a received byte count greater than 0 indicate a successful crawl.

4. Installing and Configuring MongoDB

MongoDB is one of the most popular NoSQL databases today; it stores data in BSON, a JSON-like format. For downloading, installing, and configuring it, connecting it to Python through the pymongo library, and installing and using the excellent Compass visualization tool, see the author's earlier blog posts.

II. QQ Music Crawler in Practice

1. Page Analysis

Open the QQ Music site and click the singer tab (link: https://y.qq.com/portal/singer_list.html), then open DevTools, filter by XHR, and watch the requests: the JSON returned under the musicu.fcg entry contains the singer information. We therefore open that entry's Headers to obtain the request URL. Clicking through to the next pages and comparing three of them (URLs below) reveals that the sin parameter changes, following the formula 80*(n-1), where n is the page number (a short Python sketch of this rule follows the example URLs). For reasons of space, parsing the JSON itself is not explained again; see the earlier posts in this series.

https://u.y.qq.com/cgi-bin/musicu.fcg?-=getUCGI9874589974344781&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0&data=%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A0%7D%2C%22singerList%22%3A%7B%22module%22%3A%22Music.SingerListServer%22%2C%22method%22%3A%22get_singer_list%22%2C%22param%22%3A%7B%22area%22%3A-100%2C%22sex%22%3A-100%2C%22genre%22%3A-100%2C%22index%22%3A-100%2C%22sin%22%3A0%2C%22cur_page%22%3A1%7D%7D%7D
https://u.y.qq.com/cgi-bin/musicu.fcg?-=getUCGI8205866038561849&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0&data=%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A0%7D%2C%22singerList%22%3A%7B%22module%22%3A%22Music.SingerListServer%22%2C%22method%22%3A%22get_singer_list%22%2C%22param%22%3A%7B%22area%22%3A-100%2C%22sex%22%3A-100%2C%22genre%22%3A-100%2C%22index%22%3A-100%2C%22sin%22%3A80%2C%22cur_page%22%3A2%7D%7D%7D
https://u.y.qq.com/cgi-bin/musicu.fcg?-=getUCGI8189152987042585&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0&data=%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A0%7D%2C%22singerList%22%3A%7B%22module%22%3A%22Music.SingerListServer%22%2C%22method%22%3A%22get_singer_list%22%2C%22param%22%3A%7B%22area%22%3A-100%2C%22sex%22%3A-100%2C%22genre%22%3A-100%2C%22index%22%3A-100%2C%22sin%22%3A160%2C%22cur_page%22%3A3%7D%7D%7D

Following the same approach, we obtain the song download URL, the song list URL, the lyric list URL, the song comment URL, and so on, and configure the paging parameters for each; the sketch below shows how such a request URL can be assembled:
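As an illustration of the paging rule, here is a minimal Python sketch that builds the singer-list request URL for page n. The parameter names come from the URLs above; get_singer_list_url is a hypothetical helper name, not from the original code:

import json
from urllib.parse import quote

def get_singer_list_url(n):
    # sin advances by 80 per page: 0, 80, 160, ... i.e. 80*(n-1)
    data = {
        "comm": {"ct": 24, "cv": 0},
        "singerList": {
            "module": "Music.SingerListServer",
            "method": "get_singer_list",
            "param": {"area": -100, "sex": -100, "genre": -100,
                      "index": -100, "sin": 80 * (n - 1), "cur_page": n}
        }
    }
    return ("https://u.y.qq.com/cgi-bin/musicu.fcg?g_tk=5381&loginUin=0"
            "&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8"
            "&notice=0&platform=yqq.json&needNewCode=0&data="
            + quote(json.dumps(data, separators=(',', ':'))))

print(get_singer_list_url(2))  # page 2 -> sin=80, cur_page=2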
Next we create the Scrapy crawler project. First switch to a personal working directory and start the project:

scrapy startproject musicspyder
cd musicspyder
scrapy genspider qqmusic y.qq.com

2. Writing the Spider (qqmusic.py)

Now we flesh out the Scrapy components one by one, starting with the main spider, qqmusic.py. In the generated class we define the spider name, allowed domains, crawl URLs, and related variables, and create methods that parse the user information, song information, lyric information, comment information, and URL information (a hedged skeleton is sketched below):
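The full spider is not reproduced in this extract; below is a minimal sketch of what qqmusic.py could look like under the structure just described. The method names (parse_user, parse_song, parse_lrc, parse_comment) and bodies are assumptions, and get_singer_list_url is the hypothetical helper from the earlier sketch, assumed to be defined or imported alongside:

import json
import scrapy
from musicspyder.items import QqMusicItem

class QqmusicSpider(scrapy.Spider):
    name = 'qqmusic'
    allowed_domains = ['y.qq.com', 'u.y.qq.com']

    def start_requests(self):
        # request each singer-list page; sin advances by 80 per page (see above)
        for n in range(1, 4):
            yield scrapy.Request(get_singer_list_url(n), callback=self.parse_user)

    def parse_user(self, response):
        # pull singer entries out of the musicu.fcg JSON, fill an item per singer,
        # then yield follow-up requests for songs, lyrics, and comments
        data = json.loads(response.text)
        item = QqMusicItem()
        ...

    def parse_song(self, response):
        # parse the song name and download URL into the item
        ...

    def parse_lrc(self, response):
        # parse the lyric JSON into item['lrc']
        ...

    def parse_comment(self, response):
        # parse hot comments into item['comment']
        ...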
3. Writing items.py

Next we write items.py. In the QqMusicItem class we create the MongoDB collection name, the id field, the singer name field, the song name field, the song URL field, the lyric field, the comment field, and related variables:

import scrapy
from scrapy import Field

class QqMusicItem(scrapy.Item):
    # MongoDB collection name
    collection = 'singer'
    # id field
    id = Field()
    # singer name field
    singer_name = Field()
    # song name field
    song_name = Field()
    # song URL field
    song_url = Field()
    # lyric field
    lrc = Field()
    # comment field
    comment = Field()

4. Writing pipelines.py

Next we write pipelines.py, adding an IrcText class that parses and processes the lyrics (a hedged sketch follows):
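The original pipeline code is not included in this extract; the sketch below shows one plausible shape for it. Only the IrcText class name comes from the text; the lyric-cleaning rule, the MongoPipeline class, and the MONGO_URI/MONGO_DB setting names are assumptions:

import pymongo

class IrcText(object):
    # clean up the raw lyric string before storage
    def process_item(self, item, spider):
        if item.get('lrc'):
            # e.g. unescape newlines and trim whitespace; the exact rule is assumed
            item['lrc'] = item['lrc'].replace('\\n', '\n').strip()
        return item

class MongoPipeline(object):
    # store finished items in MongoDB using connection settings from settings.py
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(mongo_uri=crawler.settings.get('MONGO_URI'),
                   mongo_db=crawler.settings.get('MONGO_DB'))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # upsert by id into the collection named on the item class
        self.db[item.collection].update_one(
            {'id': item.get('id')}, {'$set': dict(item)}, upsert=True)
        return item

    def close_spider(self, spider):
        self.client.close()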
5. Writing middlewares.py

Next comes the middlewares.py code, where we define a custom my_useragent class that uses the random library to pick a browser User-Agent at random:

import random
from scrapy import signals

# Default spider middleware generated by Scrapy (standard template bodies)
class MusicspyderSpiderMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        return None

    def process_spider_output(self, response, result, spider):
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        pass

    def process_start_requests(self, start_requests, spider):
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

# Custom middleware that attaches a random User-Agent to every request to deter blocking
class my_useragent(object):
    def process_request(self, request, spider):
        user_agent_list = ['...', '...']  # fill in real User-Agent strings
        user_agent = random.choice(user_agent_list)
        # set the standard User-Agent header on the outgoing request
        request.headers['User-Agent'] = user_agent

6. Writing settings.py

Finally we write settings.py, configuring variables such as the number of pages to crawl, the number of songs per singer, and the MongoDB address and database, and also setting the crawler to ignore robots.txt and enabling the downloader middleware and item pipeline (a hedged sketch follows):
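The original settings code is not reproduced here; a minimal sketch consistent with the description might look like this (MAX_PAGE, SONG_NUM, MONGO_URI, and MONGO_DB are assumed variable names, and MongoPipeline comes from the pipeline sketch above):

# crawl-size knobs read by the spider (names assumed)
MAX_PAGE = 3        # number of singer-list pages to crawl
SONG_NUM = 10       # number of songs to fetch per singer

# MongoDB connection settings consumed by the pipeline
MONGO_URI = 'localhost'
MONGO_DB = 'musicspyder'

# ignore robots.txt so the API endpoints can be fetched
ROBOTSTXT_OBEY = False

# enable the custom User-Agent middleware and the item pipelines
DOWNLOADER_MIDDLEWARES = {
    'musicspyder.middlewares.my_useragent': 544,
}
ITEM_PIPELINES = {
    'musicspyder.pipelines.IrcText': 300,
    'musicspyder.pipelines.MongoPipeline': 400,
}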
Once all of the Scrapy components above are defined, we can launch the qqmusic crawler from the command line:

scrapy crawl qqmusic

Afterwards, open MongoDB to inspect the crawl results and retrieve the corresponding singer and song information.

III. Series Wrap-up

This concludes the walkthrough of crawling QQ Music with the Scrapy framework, and with it the Python web crawler data collection in practice series. Overall, crawling is detail-oriented work: success requires mastering a fixed set of patterns while hunting for the telltale regularities in a site's data, and it should always be done within reason, so as not to place a heavy load on the target server or to let the effort outweigh the return. The complete code can be obtained by sending the private message "QQ音樂(lè)" to the author's Toutiao account. The background covered earlier in the series can be found in the posts below:

Everything a crawler developer needs to know, in one post! (Python Web Crawler in Practice series)
A deep dive into Python crawler libraries, so data is never a worry again
How simple is a Python crawler? Scraping the Douban Movie Top 250, hands-on!
Python web crawler parsing libraries made clear, with multiple worked examples
Who says Tonghuashun is hard to crawl? Scraping dynamic finance pages with Python
Who says JD products are hard to crawl? Building an e-commerce site crawler in Python
Python crawling in practice: capturing the Toutiao app with Fiddler, code included

References:
https://blog.csdn.net/qq_1290259791/article/details/82263014
https://www.jianshu.com/p/cecb29c04cd2