Scraping IP Proxies with Python to Build Your Own Proxy Pool (Source Code Included)

隴原秋風 2021-08-20


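Preface

The text and images in this article come from the internet and are intended for learning and discussion only, with no commercial purpose. If there is any problem, please contact us promptly so that we can deal with it.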

Basic development environment

Python 3.6
PyCharm

1. Define the goal

Collect proxy IP addresses and ports, then test whether each proxy actually works.
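As a minimal sketch of the first half of that goal, the helper below turns a scraped IP and port into the dictionary format that `requests` accepts for its `proxies` argument. The function name `build_proxies` and the validation rules are assumptions made for illustration; they are not part of the original script.

```python
import ipaddress


def build_proxies(ip, port):
    """Return a requests-style proxies dict, or None if ip/port look invalid.

    Hypothetical helper: the original script concatenates the strings directly.
    """
    try:
        ip = ip.strip()
        ipaddress.IPv4Address(ip)   # raises ValueError for malformed addresses
        port_num = int(port)        # raises ValueError/TypeError for bad ports
    except (ValueError, AttributeError, TypeError):
        return None
    if not 0 < port_num < 65536:
        return None
    # Assumes a plain HTTP proxy, so both keys use the http:// scheme
    address = f"http://{ip}:{port_num}"
    return {"http": address, "https": address}
```

Rows with a missing or malformed value come back as `None`, so they can be skipped before any network check is attempted.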

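2. Analyze the site's data

The site is a static web page, so the data can be fetched directly from the HTML response.

The data can be extracted with re, xpath, or CSS selectors, and any of them is fairly simple to use. The main reason for scraping proxy IPs is that when a crawler fetches data frequently, some sites are quick to ban its IP.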
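To illustrate the `re` option, here is a small sketch that pulls IP and port pairs out of table rows using only the standard library. The HTML fragment is a made-up sample that assumes the IP sits in the first cell and the port in the second, as in the xpath version used in the full code; the real page may be laid out differently.

```python
import re

# Hypothetical sample of the table markup; not copied from the real site
SAMPLE_HTML = """
<table class="table table-bordered table-striped">
  <tbody>
    <tr><td>10.0.0.1</td><td>8080</td><td>HTTP</td></tr>
    <tr><td>10.0.0.2</td><td>3128</td><td>HTTP</td></tr>
  </tbody>
</table>
"""

# First <td> holds a dotted IPv4 address, second <td> holds the port
ROW_PATTERN = re.compile(
    r"<tr>\s*<td>\s*(\d{1,3}(?:\.\d{1,3}){3})\s*</td>\s*<td>\s*(\d{1,5})\s*</td>"
)


def extract_proxies(html):
    """Return a list of 'ip:port' strings found in the table rows."""
    return [f"{ip}:{port}" for ip, port in ROW_PATTERN.findall(html)]


print(extract_proxies(SAMPLE_HTML))  # ['10.0.0.1:8080', '10.0.0.2:3128']
```

A regular expression like this breaks as soon as the cells gain attributes or nested tags, which is one reason a selector-based parser tends to be the more robust choice.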

Plenty of sites list free IP proxies, but most of those proxies do not actually work.

Full code

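import requests  # third-party module: HTTP client
import parsel  # third-party module: HTML parsing
import time  # standard library: used to pause between requests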


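def check_ip(proxies_list):
    """Check which proxies are usable and return them as a list."""

    use_proxy = []
    for ip in proxies_list:
        try:
            response = requests.get(url='https://www.baidu.com', proxies=ip, timeout=2)
        except Exception as e:
            # Timeouts, refused connections, and proxy errors all count as a failed check
            print('Current proxy: ', ip, 'request failed, check not passed!!!', e)
        else:
            # Only a 200 response counts as a pass
            if response.status_code == 200:
                use_proxy.append(ip)
                print('Current proxy: ', ip, 'check passed')
            else:
                print('Current proxy: ', ip, 'returned status', response.status_code, ', check not passed!!!')

    return use_proxy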


proxy_list = []

for page in range(1, 11):
    time.sleep(0.5)
    print(f'==================Scraping page {page}================')
    # 1. Determine the URL that holds the data (check whether the page is static or dynamic)
    url = f'http://www./?stype=1&page={page}'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}

    # 2. Send the request
    response = requests.get(url=url, headers=headers)
    html_data = response.text  # str
    # print(html_data)

    # 3. Parse the data
    # 3.1 Wrap the HTML text in a Selector object
    selector = parsel.Selector(html_data)
    # 3.2 Extract the table rows
    trs = selector.xpath('//table[@class="table table-bordered table-striped"]/tbody/tr')  # tr

    # Structure of a proxy entry:
    #     proxies_dict = {
    #         "http": "http://" + "ip:port",
    #         "https": "http://" + "ip:port",
    #     }

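    for tr in trs:
        ip_num = tr.xpath('./td[1]/text()').get()
        ip_port = tr.xpath('./td[2]/text()').get()
        # print(ip_num, ip_port)

        # .get() returns None when a cell is missing, so skip incomplete rows
        if not ip_num or not ip_port:
            continue

        ip_proxy = ip_num.strip() + ':' + ip_port.strip()
        # print(ip_proxy)

        # Both keys use the http:// scheme, matching the proxy structure
        # described in the comment before this loop
        proxies_dict = {
            'http': "http://" + ip_proxy,
            'https': "http://" + ip_proxy
        }

        # 4. Save the data
        proxy_list.append(proxies_dict)
        print('Saved:', proxies_dict)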

print(proxy_list)
print('Number of proxies collected: ', len(proxy_list))

print('============================Checking proxies===================================')
can_use = check_ip(proxy_list)
print('Usable proxies:', can_use)
print('Number of usable proxies:', len(can_use))

Out of 100 scraped IP proxies, only one turned out to be usable, which shows that paid proxies are still the better option.
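Because the check in the full code tests proxies one at a time with a 2-second timeout each, a list of mostly dead proxies can take minutes to get through. One possible variation, sketched below, runs the checks in a thread pool. The names `check_all`, `is_usable`, and `fake_check` are hypothetical and not part of the original script; `fake_check` stands in for a real network test so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(proxies_list, is_usable, max_workers=20):
    """Return the proxies for which is_usable(proxy) is truthy, in input order.

    is_usable is expected to return True/False and not raise.
    """
    if not proxies_list:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input list
        results = list(pool.map(is_usable, proxies_list))
    return [proxy for proxy, ok in zip(proxies_list, results) if ok]


def fake_check(proxy):
    """Stand-in for a real request through the proxy (assumption for the demo)."""
    return proxy["http"].endswith(":8080")


candidates = [
    {"http": "http://10.0.0.1:8080", "https": "http://10.0.0.1:8080"},
    {"http": "http://10.0.0.2:3128", "https": "http://10.0.0.2:3128"},
]
print(check_all(candidates, fake_check))
```

In real use, `is_usable` would wrap a `requests.get(..., proxies=proxy, timeout=2)` call in a `try`/`except` and return `False` on any failure, so that one bad proxy cannot abort the whole batch.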
