Hello everyone. We have already covered Requests-based crawlers at length. Today we will walk through the detailed configuration of each Scrapy component, to make it easier to follow the Scrapy crawler case studies coming later. Scrapy is a crawler framework written in pure Python; its main strengths are simplicity, ease of use, and high extensibility. This post does not dwell on Scrapy basics; instead it focuses on that extensibility and explains how to configure each major component. It is not truly exhaustive, but it should cover most people's needs :). For more detail, read the official documentation carefully. As usual, keep the Scrapy data-flow diagram at hand for review and reference. Now to the main topic; the concrete examples use a Douban spider.

To create a project and a spider:

scrapy startproject <Project_name>
scrapy genspider <spider_name> <domains>

If you want the convenient site-wide crawling skeleton (CrawlSpider), use:

scrapy genspider -t crawl <spider_name> <domains>
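For orientation, the generated project layout looks roughly like this (module names follow the Douban example used below; your project name may differ):

Douban/
    scrapy.cfg            # deployment configuration
    Douban/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            douban.py     # the spider created by genspider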
Let's start with the core component, spider.py. Without further ado, here is the code; see the comments.

import scrapy  # basic imports; anyone with Python fundamentals will recognise these
import json
# Import the item because we need persistence; you can also import it via the package path
from ..items import DoubanItem
class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    # Request headers set for this single spider
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
        }
    }

    # Often you do not need to override this method; do so if you want to
    # customise the start URLs or set headers for individual requests
    def start_requests(self):
        page = 18
        base_url = 'https://xxxx'
        for i in range(page):
            url = base_url.format(i * 20)
            req = scrapy.Request(url=url, callback=self.parse)
            # Add headers to a single request; later requests can be set the same way
            # req.headers['User-Agent'] = ''
            yield req

    # Nothing special to explain: regular page parsing that hands requests on to
    # the next callback (look at the data-flow diagram and it will make sense)
    def parse(self, response):
        json_str = response.body.decode('utf-8')
        res_dict = json.loads(json_str)
        for i in res_dict['subjects']:
            url = i['url']
            yield scrapy.Request(url=url, callback=self.parse_detailed_page)

    # A Scrapy response can be parsed with XPath directly; this is basic stuff
    def parse_detailed_page(self, response):
        title = response.xpath('//h1/span[1]/text()').extract_first()
        year = response.xpath('//h1/span[2]/text()').extract()[0]
        image = response.xpath('//img[@rel="v:image"]/@src').extract_first()
        item = DoubanItem()
        item['title'] = title
        item['year'] = year
        item['image'] = image
        # Downloading images requires the ImagesPipeline, with matching
        # configuration in settings.py and pipelines.py
        item['image_urls'] = [image]
        yield item
For a site-wide crawl, the beginning of the spiders module differs slightly: a CrawlSpider declares link-extraction rules instead of writing the traversal by hand (a fuller skeleton is sketched just below).

rules = (
    Rule(LinkExtractor(allow=r'http:///digimon/.*/index.html'),
         callback='parse_item', follow=False),
)
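For context, a minimal CrawlSpider skeleton might look roughly like this; the class name, domain, and start URL are illustrative placeholders, only the allow pattern comes from the rule above:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DigimonSpider(CrawlSpider):
    name = 'digimon'
    allowed_domains = ['example.com']               # placeholder domain
    start_urls = ['http://example.com/digimon/']    # placeholder entry page

    rules = (
        Rule(LinkExtractor(allow=r'/digimon/.*/index.html'),
             callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # Extract fields here; unlike scrapy.Spider, do not override parse()
        yield {'url': response.url}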
The key is the follow setting; whether you have reached the intended depth and pages is up to you to judge. One more point: request headers can be set in three places, and where you set them determines their scope:
- in settings.py, the widest scope, affecting every spider in the project;
- as a class attribute of the spider (custom_settings), affecting all requests of that spider;
- on an individual request, affecting only that request.
The three scopes run from global, to a single spider, to a single request. If they coexist, the headers set on the individual request take the highest priority! Next, items.py:

import scrapy
class DoubanItem(scrapy.Item):
    title = scrapy.Field()
    year = scrapy.Field()
    image = scrapy.Field()
    # The ImagesPipeline used for downloading images also needs this field
    image_urls = scrapy.Field()

    # For persistence I use MySQL; not expanded on here
    def get_insert_sql_and_data(self):
        # CREATE TABLE douban(
        #     id int not null auto_increment primary key,
        #     title text, `year` int, image text
        # ) ENGINE=INNODB DEFAULT CHARSET=UTF8mb4;
        # `year` is a reserved word, so it must be wrapped in backticks
        insert_sql = 'INSERT INTO douban(title,`year`,image) ' \
                     'VALUES(%s,%s,%s)'
        data = (self['title'], self['year'], self['image'])
        return (insert_sql, data)
Middlewares are where things get flexible. Many people never need them, but in practice they matter a lot when configuring proxies. For ordinary needs there is no reason to touch the SpiderMiddleware; the changes go into the DownloaderMiddleware.

# Signals: this term matters a lot for custom Scrapy extensions
from scrapy import signals
# A locally written class (code further below); it can sit on top of your own
# IP pool, or on a paid proxy service (as in my case)
from proxyhelper import Proxyhelper
# Multiple threads operating on one object need a lock:
# instantiate it, then acquire and release around each access
from twisted.internet.defer import DeferredLock
class DoubanSpiderMiddleware(object):
    # The spider middleware is left unconfigured
    pass
class DoubanDownloaderMiddleware(object):
    def __init__(self):
        # Instantiate the proxy helper and the lock
        self.helper = Proxyhelper()
        self.lock = DeferredLock()
    @classmethod
    def from_crawler(cls, crawler):
        # Unchanged
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s
    def process_request(self, request, spider):
        # Triggered when a request in the data flow reaches the downloader middleware.
        # Note: Scrapy's built-in HttpProxyMiddleware reads the lowercase 'proxy' meta key
        self.lock.acquire()
        request.meta['proxy'] = self.helper.get_proxy()
        self.lock.release()
        return None
    def process_response(self, request, response, spider):
        # Inspect the response; if it is not acceptable, switch the proxy and retry
        if response.status != 200:
            self.lock.acquire()
            self.helper.update_proxy(request.meta['proxy'])
            self.lock.release()
            return request
        return response
    def process_exception(self, request, exception, spider):
        self.lock.acquire()
        self.helper.update_proxy(request.meta['proxy'])
        self.lock.release()
        return request
    def spider_opened(self, spider):
        # Unchanged
        spider.logger.info('Spider opened: %s' % spider.name)
The Proxyhelper class mentioned above lives in its own module (proxyhelper.py):

import requests
class Proxyhelper(object):
    def __init__(self):
        self.proxy = self._get_proxy_from_xxx()
    def get_proxy(self):
        return self.proxy
    def update_proxy(self, proxy):
        if proxy == self.proxy:
            print('Updating a proxy')
            self.proxy = self._get_proxy_from_xxx()
    def _get_proxy_from_xxx(self):
        url = ''  # fill in the proxy API URL here; ideally it returns one IP per call
        response = requests.get(url)
        return 'http://' + response.text.strip()
Now pipelines.py:

# Local MySQL persistence helper; write your own as needed (a sketch follows after the pipeline code)
from mysqlhelper import Mysqlhelper
# Import ImagesPipeline so it can be subclassed and customised
from scrapy.pipelines.images import ImagesPipeline
import hashlib
from scrapy.utils.python import to_bytes
from scrapy.http import Request
class DoubanImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        request_lst = []
        for x in item.get(self.images_urls_field, []):
            req = Request(x)
            req.meta['movie_name'] = item['title']  # carry the movie title along
            request_lst.append(req)
        return request_lst

    # Overridden to rename the downloaded image
    def file_path(self, request, response=None, info=None):
        # Default behaviour hashes the URL; kept here for reference
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        return 'full/%s.jpg' % (request.meta['movie_name'])  # use the movie name instead
# Nothing special; part of the work was already done in items.py, keeping the
# responsibilities of pipelines and items separate
class DoubanPipeline(object):
    def __init__(self):
        self.mysqlhelper = Mysqlhelper()
    def process_item(self, item, spider):
        if 'get_insert_sql_and_data' in dir(item):
            (insert_sql, data) = item.get_insert_sql_and_data()
            self.mysqlhelper.execute_sql(insert_sql, data)
        return item
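The Mysqlhelper used above is the author's own helper and its code is not shown. Purely as a sketch of what it might look like, assuming pymysql and the execute_sql(sql, data) interface used in process_item (the connection parameters are placeholders):

import pymysql

class Mysqlhelper(object):
    def __init__(self):
        # Placeholder connection parameters; replace with your own
        self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                                    password='', database='douban', charset='utf8mb4')

    def execute_sql(self, sql, data):
        # Run one parameterised insert and commit
        with self.conn.cursor() as cursor:
            cursor.execute(sql, data)
        self.conn.commit()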
settings.py is an absolutely crucial component; the notes are in the code comments.

# Crawler (bot) name
BOT_NAME = 'Douban'
SPIDER_MODULES = ['Douban.spiders']
NEWSPIDER_MODULE = 'Douban.spiders'
# Client request headers
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Douban (+http://www.)'
# Obey robots.txt rules
# The robots exclusion protocol
ROBOTSTXT_OBEY = False
# Number of concurrent requests
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32
# Download delay
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# Per-domain and per-IP concurrency, which further constrain the setting above
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default) #COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
# Used to monitor the running crawler
#TELNETCONSOLE_ENABLED = False
# TELNETCONSOLE_ENABLED = True
# TELNETCONSOLE_HOST = '127.0.0.1'
# TELNETCONSOLE_PORT = [6023,]
# Usage: open cmd -> telnet 127.0.0.1 6023 -> est()
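As a rough idea of what a telnet monitoring session looks like (commands from the Scrapy telnet console, output omitted; newer Scrapy versions may also prompt for the telnet username and password):

telnet 127.0.0.1 6023
>>> est()            # print a report of the engine status
>>> engine.pause()   # pause the crawl
>>> engine.unpause() # resume
>>> engine.stop()    # stop the crawl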
# Override the default request headers:
# Default request headers, effective for every spider in the project
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
#     'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
# }
# Spider middlewares
# SPIDER_MIDDLEWARES = {
#     # 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
#     'Douban.middlewares.DoubanSpiderMiddleware': 543,
# }
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# Downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    # 560 is used because the built-in downloader middlewares are split into many
    # sub-stages; their priority numbers determine the order in which the request
    # and response flows reach each middleware. See the official docs for details.
    'Douban.middlewares.DoubanDownloaderMiddleware': 560,
}
# Per-request download timeout (the actual Scrapy setting name is DOWNLOAD_TIMEOUT)
DOWNLOAD_TIMEOUT = 10
# Depth limit
# DEPTH_LIMIT = 1
# Custom extensions
EXTENSIONS = {
    'Douban.extends.MyExtension': 500,
}
# Item pipelines
ITEM_PIPELINES = {
    # 'scrapy.pipelines.images.ImagesPipeline': 1,  # the stock image downloader would be registered here
    'Douban.pipelines.DoubanImagesPipeline': 300,
}
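Note that as written only the image pipeline is registered; if you also want the MySQL persistence pipeline from pipelines.py to run, it has to be added here as well, roughly like this (350 is an arbitrary number, it just orders the pipelines):

ITEM_PIPELINES = {
    'Douban.pipelines.DoubanImagesPipeline': 300,
    'Douban.pipelines.DoubanPipeline': 350,  # MySQL persistence
}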
# Automatic, algorithm-driven throttling
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# HTTP caching, rarely used here
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# ImagesPipeline configuration: where downloaded images are stored; enable as needed
IMAGES_STORE = 'download'
Custom extensions: configuring this component is best done with some understanding of signals, i.e. which signals Scrapy fires at which points while running, which in turn comes back to a solid grasp of the data flow. In the code I use a class of my own whose essence is to push a 喵提醒 (Miao reminder) notification at particular moments (喵提醒, are you paying me for the plug?). You can of course also use logging or other features to strengthen the extension, reacting to the different moments at which each signal fires. The extension file has to be created by yourself; given the EXTENSIONS entry above, it lives at Douban/extends.py (the original post showed its location in a screenshot).
from scrapy import signals
from message import Message
class MyExtension(object):
    def __init__(self, value):
        self.value = value
    @classmethod
    def from_crawler(cls, crawler):
        val = crawler.settings.getint('MMMM')
        ext = cls(val)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
return ext
    def spider_opened(self, spider):
        print('spider running')
    def spider_closed(self, spider):
        message = Message('spider運(yùn)行結(jié)束')
        message.push()
        print('spider closed')
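The Message class comes from the author's own message module and its code is not shown here. Purely as an illustration, a hypothetical version that pushes the text to a webhook-style notification URL might look like this (the URL and parameter name are placeholders, not the real 喵提醒 API):

import requests

class Message(object):
    def __init__(self, text):
        self.text = text

    def push(self):
        # Placeholder notification endpoint; replace with your own service
        url = 'https://example.com/notify'
        requests.post(url, data={'text': self.text})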
Finally, a word on runnings.py: it is simply a way of running the scrapy command line from within Python.

from scrapy.cmdline import execute

execute('scrapy crawl douban'.split())
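An alternative, if you prefer not to go through the command line at all, is Scrapy's CrawlerProcess API; a minimal sketch:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project settings and run the 'douban' spider in-process
process = CrawlerProcess(get_project_settings())
process.crawl('douban')
process.start()  # blocks until the crawl is finished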
That covers the Scrapy component configuration needed for most basic requirements; refer back to it if anything is still unfamiliar. Later we will publish some hands-on Scrapy crawler case studies.