

Python Distributed Crawling Isn't Hard at All! A Scrapy + MongoDB QQ Music Crawler in Practice

 yaohbsg 2020-07-25

With the previous seven chapters behind us, you should now have a fairly complete picture of web crawling. The series covered four cases: static page crawling, dynamic Ajax page crawling, Selenium browser-simulation crawling, and Fiddler-based crawling of the Toutiao app, which together cover most of the standard crawling patterns. This article digs one level deeper and uses the Scrapy framework to build a distributed crawler for QQ Music, bringing us a step closer to search-engine technology.


I. Preparation

1. Scrapy Architecture Overview

Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs for data mining, information processing, or archiving historical data. Although it was originally designed for scraping and data collection, it is equally suited to fetching data returned by APIs and to general-purpose web crawling. Simply put, compared with an ad-hoc crawler, Scrapy is "organized and disciplined", which makes it far easier to build large-scale scraping projects.

The figure below shows the Scrapy architecture. Its components, one by one:

  • Engine: the core of the framework; it triggers events and drives the data flow between all the other components. (Provided by Scrapy)
  • Spiders: the crawler programs proper; they define the crawl logic and the page-parsing rules, and are mainly responsible for parsing responses and producing items and new requests. (You write these)
  • Scheduler: the task scheduler; it accepts requests from the engine, enqueues them, and hands them back when the engine asks for the next one. (Provided by Scrapy)
  • Downloader: downloads page content and returns it, via the engine, to the spiders for processing. (Provided by Scrapy)
  • ItemPipeline: the item pipelines; they process the data the spiders extract, mainly cleaning, validating, and storing it in a database. (You write these)
  • Downloader Middlewares: hooks that sit between the Engine and the Downloader and process the Requests and Responses passing through. (Provided by Scrapy, customizable)
  • Spider Middlewares: hooks that process the responses going into spiders and the results and new requests coming out; customized in middlewares.py. (Provided by Scrapy)
[Figure: Scrapy architecture diagram]

With the components introduced, the data flow through Scrapy works as follows (a minimal spider illustrating the yield mechanics follows the list):

  1. A Spider uses yield to send a request to the Engine
  2. The Engine forwards the request, unchanged, to the Scheduler
  3. The Engine takes the next request from the Scheduler and sends it through the Downloader Middleware to the Downloader
  4. The Downloader fetches the response and passes it back through the Middleware to the Engine
  5. The Engine hands it to the Spider, whose parse() method parses the response
  6. The Spider returns the parsed items and/or new requests to the Engine
  7. The Engine sends items to the ItemPipeline and requests to the Scheduler
  8. The program stops only when the Scheduler holds no more requests
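
To make the yield mechanics in steps 1, 5, and 6 concrete, here is a minimal, self-contained spider sketch. The target URL and selectors are placeholders for illustration only, not part of the QQ Music project:

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['https://example.com']  # placeholder target

    def parse(self, response):
        # step 5: the Engine delivers the response here for parsing
        for href in response.css('a::attr(href)').getall():
            # steps 1/6: yielded Requests go back to the Engine, then the Scheduler
            yield response.follow(href, callback=self.parse)
        # steps 6/7: yielded items are routed to the ItemPipeline
        yield {'title': response.css('title::text').get()}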

2. Installing and Configuring Scrapy

Now let's install Scrapy. Scrapy supports Python 3; the environment here is Windows 10 with Anaconda 3, and in testing the installation ran without problems. First install Scrapy with pip:

pip install scrapy

Then open a Python shell and import it; if no error appears, that is a first sign the installation succeeded.

import scrapy
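
You can also confirm the installed version from the command line; the exact number reported will depend on your environment:

scrapy version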

3. A First Scrapy Test

Next we test the setup with a small Baidu crawler example. In cmd, use cd to switch to any directory, then run:

scrapy startproject littletest

Then switch into the project directory and add a spider for the target site with the genspider command:

cd littletest
scrapy genspider baidu www.baidu.com
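
genspider writes a spider stub into littletest/spiders/baidu.py. With a stock Scrapy install the generated template looks roughly like this (it can vary slightly between Scrapy versions):

import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        pass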

Looking inside the directory, the generated structure is as follows:

  • scrapy.cfg # deployment configuration for the project
  • littletest # the project module
  • items.py # defines the data structures to be scraped
  • middlewares.py # defines the crawl middlewares
  • pipelines.py # defines the item pipelines (data processing)
  • settings.py # settings file holding configuration and storage variables
  • spiders # folder containing the Spiders

We also need to open settings.py and change the ROBOTSTXT_OBEY option to False, i.e. do not obey robots.txt; otherwise many sites cannot be fetched properly.

ROBOTSTXT_OBEY = False

Finally, start the Scrapy crawler from the command line:

scrapy crawl baidu

The result is shown below; a status code of 200 together with a nonzero number of received bytes indicates the crawl succeeded!

[Figure: crawl log showing HTTP 200 responses and bytes received]

4. Installing and Configuring MongoDB

MongoDB is one of the most popular NoSQL databases today; it stores data as BSON (a binary, JSON-like format). For downloading, installing, and configuring MongoDB, connecting to it from Python via pymongo, and installing and using the excellent Compass visualization tool, see the author's blog.
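
As a quick sanity check that pymongo can reach a local MongoDB instance, a minimal sketch, assuming the default localhost:27017 address that the settings.py below also uses:

import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017/')
print(client.server_info()['version'])  # prints the server version if the connection works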


II. QQ Music Crawler in Practice

1. Page Analysis

Open the QQ Music site and click the Singers tab (link: https://y.qq.com/portal/singer_list.html), then open the DevTools panel, filter for XHR requests, and watch the items: the JSON returned by the musicu.fcg entry contains the singer information.

Drilling into that entry's Headers gives us the request URL. Clicking through to the next pages and comparing the first three URLs (below) shows that only the sin parameter changes, following the rule sin = 80 * (n - 1), where n is the page number. For reasons of space, parsing the JSON itself is not explained again here; see the earlier articles. (A two-line check of this paging rule follows the URLs below.)

https://u.y.qq.com/cgi-bin/musicu.fcg?-=getUCGI9874589974344781&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8?ice=0&platform=yqq.json&needNewCode=0&data=%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A0%7D%2C%22singerList%22%3A%7B%22module%22%3A%22Music.SingerListServer%22%2C%22method%22%3A%22get_singer_list%22%2C%22param%22%3A%7B%22area%22%3A-100%2C%22sex%22%3A-100%2C%22genre%22%3A-100%2C%22index%22%3A-100%2C%22sin%22%3A0%2C%22cur_page%22%3A1%7D%7D%7D

https://u.y.qq.com/cgi-bin/musicu.fcg?-=getUCGI8205866038561849&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8?ice=0&platform=yqq.json&needNewCode=0&data=%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A0%7D%2C%22singerList%22%3A%7B%22module%22%3A%22Music.SingerListServer%22%2C%22method%22%3A%22get_singer_list%22%2C%22param%22%3A%7B%22area%22%3A-100%2C%22sex%22%3A-100%2C%22genre%22%3A-100%2C%22index%22%3A-100%2C%22sin%22%3A80%2C%22cur_page%22%3A2%7D%7D%7D

https://u.y.qq.com/cgi-bin/musicu.fcg?-=getUCGI8189152987042585&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8?ice=0&platform=yqq.json&needNewCode=0&data=%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A0%7D%2C%22singerList%22%3A%7B%22module%22%3A%22Music.SingerListServer%22%2C%22method%22%3A%22get_singer_list%22%2C%22param%22%3A%7B%22area%22%3A-100%2C%22sex%22%3A-100%2C%22genre%22%3A-100%2C%22index%22%3A-100%2C%22sin%22%3A160%2C%22cur_page%22%3A3%7D%7D%7D
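
The paging rule is easy to verify with a couple of lines; this sketch simply prints the sin and cur_page values for the first three pages, matching the three URLs above:

# generate the sin/cur_page pairs for the first few singer-list pages
for page in range(1, 4):
    sin = 80 * (page - 1)
    print(f'page {page}: sin={sin}, cur_page={page}')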

Proceeding the same way, we obtain the song download URL, the song list URL, the lyric URL, and the comment URL, and parameterize each of them for paging:

# singer list URL
start_urls = ['https://u.y.qq.com/cgi-bin/musicu.fcg?data=%7B%22singerList%22%3A%7B%22module%22%3A%22Music.SingerListServer'
              '%22%2C%22method%22%3A%22get_singer_list%22%2C%22param%22%3A%7B%22area%22%3A-100%2C%22sex%22%3A-100%2C%22genr'
              'e%22%3A-100%2C%22index%22%3A-100%2C%22sin%22%3A{num}%2C%22cur_page%22%3A{id}%7D%7D%7D']
# song download URL
song_down = ('https://c.y.qq.com/base/fcgi-bin/fcg_music_express_mobile3.fcg?&jsonpCallback=MusicJsonCallback&ci'
             'd=205361747&songmid={songmid}&filename=C400{songmid}.m4a&guid=9082027038')
# song list URL
song_url = 'https://c.y.qq.com/v8/fcg-bin/fcg_v8_singer_track_cp.fcg?singermid={singer_mid}&order=listen&num={sum}'
# lyric URL
lrc_url = 'https://c.y.qq.com/lyric/fcgi-bin/fcg_query_lyric.fcg?nobase64=1&musicid={musicid}'
# comment URL
discuss_url = ('https://c.y.qq.com/base/fcgi-bin/fcg_global_comment_h5.fcg?cid=205360772&reqtype=2&biztype=1&topid='
               '{song_id}&cmd=8&pagenum=0&pagesize=25')
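
These templates are filled in with str.format before each request is issued. For example, with a made-up songmid (the real IDs come from the parsed singer and song JSON):

url = song_down.format(songmid='0000demo0000')  # '0000demo0000' is a placeholder, not a real songmid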

With that done, we can set up the Scrapy crawler project. First switch to your own working directory and start the project:

scrapy startproject musicspyder
cd musicspyder
scrapy genspider qqmusic y.qq.com

2. Writing the Spider (qqmusic.py)

Now we complete the Scrapy components one by one, starting with the main crawler program, qqmusic.py. In the spider class, define the crawler name, allowed domains, crawl URLs, and related variables, and create methods for parsing user info, song info, lyric info, comment info, and URL info (a hedged sketch of two of these methods follows the skeleton):

import json
import scrapy
from scrapy import Request
from musicspyder.items import QqMusicItem

class MusicSpider(scrapy.Spider):
  name = 'qqmusic'
  allowed_domains = ['y.qq.com']
  start_urls = ['...']
  song_down = '...'
  song_url = '...'
  lrc_url = '...'
  discuss_url = '...'

  # generate the initial requests, reading the page count from settings
  def start_requests(self): ...
  # parse user (singer) info
  def parse_user(self, response): ...
  # parse song info
  def parse_song(self, response): ...
  # parse lyric info
  def parse_lrc(self, response): ...
  # parse comment info
  def parse_comment(self, response): ...
  # parse url info
  def parse_url(self, response): ...
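
The method bodies are left to the full source (see the note at the end of the article). As a hedged illustration of the overall shape only, start_requests and parse_user might look roughly like the following; the JSON path into the singer list is an assumption about the response layout, not the author's verbatim code:

# Inside MusicSpider -- a sketch; the JSON field paths are assumptions.
def start_requests(self):
    max_page = self.settings.get('MAX_PAGE')
    for page in range(1, max_page + 1):
        # apply the sin = 80 * (n - 1) paging rule derived above
        url = self.start_urls[0].format(num=80 * (page - 1), id=page)
        yield Request(url, callback=self.parse_user)

def parse_user(self, response):
    data = json.loads(response.text)
    # assumed path to the singer array in the returned JSON
    for singer in data['singerList']['data']['singerlist']:
        item = QqMusicItem()
        item['singer_name'] = singer.get('singer_name')
        yield item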

3. Writing items.py

Next we write items.py. In the QqMusicItem class, define the MongoDB collection name and the id, singer name, song name, song URL, lyric, and comment fields:

import scrapy
from scrapy import Field

class QqMusicItem(scrapy.Item):
    # mongodb collection
    collection = 'singer'
    id = Field()
    # singer name field
    singer_name = Field()
    # song name field
    song_name = Field()
    # song URL field
    song_url = Field()
    # lyric field
    lrc = Field()
    # comment field
    comment = Field()
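
Items behave like dicts: in the spider callbacks they are populated field by field and yielded into the pipeline. A small usage sketch (the values are placeholders):

item = QqMusicItem()
item['singer_name'] = 'some singer'   # placeholder value
item['song_name'] = 'some song'       # placeholder value
# inside a spider callback you would then: yield item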

4. Writing pipelines.py

Then we write pipelines.py, adding an lrcText class that parses and cleans the lyrics:

import json
import pymongo
import re
from scrapy.exceptions import DropItem
from musicspyder.items import QqMusicItem

# default pipeline class
class QqMusicPipeline(object):
    def process_item(self, item, spider):
        return item

# extra pipeline class that parses and cleans the lyrics
class lrcText(object):
  # regex-match and extract the lyric text
  def process_item(self, item, spider): ...

# store items in a MongoDB database
class MongoPipline(object):
  # constructor
  def __init__(self, mongo_url, mongo_db): ...
  # read the Mongo URL and database name from settings.py
  @classmethod
  def from_crawler(cls, crawler): ...
  # store each item
  def process_item(self, item, spider): ...
  # close the MongoDB connection
  def close_spider(self, spider): ...
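
The MongoDB pipeline follows the standard Scrapy pattern, so the elided methods can be reconstructed with reasonable confidence; this sketch relies on the MONGO_URL and MONGO_DB settings defined below and the collection attribute on QqMusicItem, but it is not the author's verbatim code:

class MongoPipline(object):
    def __init__(self, mongo_url, mongo_db):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # pull the connection settings out of settings.py
        return cls(mongo_url=crawler.settings.get('MONGO_URL'),
                   mongo_db=crawler.settings.get('MONGO_DB'))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # write each item into the collection named on the item class
        self.db[item.collection].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()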

5. Writing middlewares.py

Next comes middlewares.py, where we define a custom my_useragent class that uses the random library to pick a browser User-Agent at random:

import random
from scrapy import signals

# default spider middleware generated by Scrapy
class MusicspyderSpiderMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler): ...
    def process_spider_input(self, response, spider): ...
    def process_spider_output(self, response, result, spider): ...
    def process_spider_exception(self, response, exception, spider): ...
    def process_start_requests(self, start_requests, spider): ...
    def spider_opened(self, spider): ...

# downloader middleware that sets a random User-Agent to deter blocking
class my_useragent(object):
    def process_request(self, request, spider):
        user_agent_list = ['...', '...']
        user_agent = random.choice(user_agent_list)
        request.headers['User-Agent'] = user_agent  # the HTTP header name is 'User-Agent'
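
The User-Agent strings themselves are elided above; to make the middleware runnable, the list just needs a few real browser strings, for example (ordinary sample desktop UAs, not taken from the original code):

user_agent_list = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/14.1 Safari/605.1.15',
]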

6. Writing settings.py

Finally, settings.py: configure the number of pages to crawl, the number of songs to fetch per singer, the MongoDB URL and database, and the other variables; set the crawler not to obey robots.txt; and enable the downloader middleware and the item pipelines:

# project configuration
BOT_NAME = 'musicspyder'
SPIDER_MODULES = ['musicspyder.spiders']
NEWSPIDER_MODULE = 'musicspyder.spiders'
MAX_PAGE = 3    # number of singer-list pages to crawl
SONGER_NUM = 1  # number of songs to fetch per singer
MONGO_URL = 'mongodb://localhost:27017/'
MONGO_DB = 'music'  # Mongo database name
# do not obey robots.txt
ROBOTSTXT_OBEY = False
# enable the downloader middleware
DOWNLOADER_MIDDLEWARES = {
    # 'musicspyder.middlewares.QqMusicDownloaderMiddleware': 543,
    'musicspyder.middlewares.my_useragent': 544,
}
# enable the MongoDB storage pipelines
ITEM_PIPELINES = {
    # 'musicspyder.pipelines.QqMusicPipeline': 300,
    'musicspyder.pipelines.lrcText': 300,
    'musicspyder.pipelines.MongoPipline': 302,
}

With all of the Scrapy components above defined, we can start the qqmusic crawler from the command line:

scrapy crawl qqmusic

Then open MongoDB and inspect the crawl results to see the corresponding singer and song information:

[Figure: crawled singer and song documents stored in MongoDB]
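
The stored documents can also be inspected programmatically; a quick hedged check with pymongo, assuming the MONGO_URL/MONGO_DB settings above and the 'singer' collection named in items.py:

import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['music']
print(db['singer'].count_documents({}))  # number of stored documents
print(db['singer'].find_one())           # peek at one document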

III. Series Wrap-up

This completes the Scrapy walkthrough for QQ Music, and with it the Python web crawling and data collection series. Overall, crawling is painstaking work: success requires mastering the standard patterns and diligently hunting for the regularities hidden in a site's network traffic. It also pays to stay within your means, so as not to put heavy load on the target's servers or let the effort outweigh the return. The complete code can be obtained by sending the private message 'QQ音樂(lè)' to the author's Toutiao account. The background material from earlier in the series is linked below:

The background knowledge every crawler needs: this one article is enough! Python web crawling in practice series

One article to master Python's crawling libraries in depth! Never worry about data again

How simple is Python crawling? Scraping the Douban Movie Top 250 in one article!

Python web crawling parsing libraries made clear, with multiple worked examples

Who says Tonghuashun is hard to crawl? Learn to scrape dynamic finance pages with Python!

Who says JD.com products are hard to crawl? Build an e-commerce site crawler with Python!

Python web crawling in practice: capturing the Toutiao app with Fiddler, code included!

References:

https://blog.csdn.net/qq_1290259791/article/details/82263014

https://www.jianshu.com/p/cecb29c04cd2

