Python code running a bit slow? Speed it up with multithreading

python芊, published 2023-01-08 from Hunan

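Preface

Hi everyone~ Qianqian here.

Time for another round of Python~

When we scrape data, the script can sometimes run very slowly.

So how do we fix that?

Here is an approach that uses multithreading.

We'll use scraping second-hand housing listings as the example.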

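Environment:

Python 3.8
PyCharm

Modules:

requests: HTTP request module (third-party)
parsel: data parsing module (third-party)
re: regular expressions (built-in)
csv: CSV reading and writing (built-in)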

I. Implementation steps:

1. Send the request: mimic a browser and request the url
2. Get the data: read the response returned by the server
   Developer tools: response
3. Parse the data: extract the content we want
   Extract: basic listing information
4. Save the data: write it to a spreadsheet (CSV) file
5. Scrape multiple pages
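
The steps above can be sketched as a small skeleton. This is only an illustration under stated assumptions: `fetch_page`, `parse_listings`, `save_rows`, and `scrape` are hypothetical names, the fetch step is a stub that returns canned HTML so the sketch runs offline, and the parsing uses a simple regular expression rather than the parsel selectors used later in the article.

```python
import csv
import io
import re


def fetch_page(page):
    """Steps 1-2 (stub): stands in for the HTTP request and returns HTML text."""
    return (
        f'<li><a class="title">Listing {page}-A</a><span class="price">100</span></li>'
        f'<li><a class="title">Listing {page}-B</a><span class="price">200</span></li>'
    )


def parse_listings(html):
    """Step 3: pull (title, price) pairs out of the HTML text."""
    pattern = r'<a class="title">(.*?)</a><span class="price">(\d+)</span>'
    return [{'title': t, 'price': p} for t, p in re.findall(pattern, html)]


def save_rows(writer, rows):
    """Step 4: write the parsed rows through the CSV writer."""
    for row in rows:
        writer.writerow(row)


def scrape(pages, out):
    """Step 5: repeat fetch, parse, and save for every page."""
    writer = csv.DictWriter(out, fieldnames=['title', 'price'])
    writer.writeheader()
    total = 0
    for page in pages:
        rows = parse_listings(fetch_page(page))
        save_rows(writer, rows)
        total += len(rows)
    return total


# An in-memory buffer replaces the real CSV file in this sketch
buffer = io.StringIO()
count = scrape(range(1, 4), buffer)
print(count)
print(buffer.getvalue().splitlines()[0])
```

Keeping each step in its own function also makes it easier to hand the per-page work to a thread pool later.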

II. Code

Basic version

Import modules

# HTTP request module --> third-party, install with: pip install requests
import requests
# Data parsing module --> third-party, install with: pip install parsel
import parsel
# csv module (built-in)
import csv
# time module (built-in)
import time

time_1 = time.time()

Create the file object

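f = open('data.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    '標(biāo)題',  # title (the comma after this entry was missing)
    '小區(qū)',  # community
    '總價',  # total price
    '單價',  # unit price
    '戶型',  # layout
    '面積',  # area
    '朝向',  # orientation
    '裝修',  # renovation
    '樓層',  # floor
    '建筑日期',  # build date
    '建筑類型',  # building type
    '詳情頁',  # detail page URL
])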

Write the header row

csv_writer.writeheader()
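
A detail worth checking in any `fieldnames` list: Python joins adjacent string literals, so a missing comma between two names silently merges them into one column name, and `DictWriter.writerow` then raises `ValueError` for the keys it no longer recognises. A minimal sketch with made-up column names (`title`, `area`, and `price` are illustrative only), writing to an in-memory buffer instead of `data.csv`:

```python
import csv
import io

# Missing comma after 'title': the two literals are concatenated.
broken_fields = [
    'title'
    'area',
    'price',
]
fixed_fields = ['title', 'area', 'price']

row = {'title': 'Listing A', 'area': 'Downtown', 'price': '100'}

# With the merged field name, 'title' and 'area' are unknown keys
broken_writer = csv.DictWriter(io.StringIO(), fieldnames=broken_fields)
try:
    broken_writer.writerow(row)
    error = None
except ValueError as exc:
    error = str(exc)

# With the comma in place, the row is written normally
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fixed_fields)
writer.writeheader()
writer.writerow(row)

print(broken_fields)
print(error is not None)
print(buffer.getvalue())
```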

"""

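Send the request: mimic a browser and request the url

Disguise: build the request headers as a dict of complete key-value pairs
headers can be copied directly from the browser's developer tools
<Response [200]> is the response object
Status code 200 means the request succeeded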

"""

for page in range(1, 101):
    try:
        print(f'==================Scraping page {page}==================')

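Request URL

(The URL cannot be shown in this post; the original used an image in its place. Set url to the listing-page address for the current page yourself.)

Disguise as a browser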

        headers = {
            # User-Agent identifies the browser to the server
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36'
        }

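Send the request

        # Pass headers by keyword; the second positional argument of requests.get is params
        response = requests.get(url=url, headers=headers)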

"""

Get the data: read the response returned by the server
Developer tools: response
Get the page source

response.text returns the response body as a string (the HTML source)
response.json() returns the response as a dict; the body must be valid JSON

Parse the data: extract the content we want
Extract: basic listing information
Parsing methods:

xpath
re (regular expressions)
css
json handling

CSS selectors: extract data according to tag attributes

1. Find which tag holds the data

"""

        # Note: link is only assigned in the multithreaded version; define it before this loop
        html_data = requests.get(link).text
        select = parsel.Selector(html_data)

Convert the HTML string from response.text into a parseable object

        selector = parsel.Selector(response.text)

First extraction: get every li tag that holds listing data

        # The leading dot selects by class name
        lis = selector.css('.sellListContent li')

Loop over the list and pull the elements out one by one

        for li in lis:
            title = li.css('.title a::text').get()  # title
            href = li.css('.title a::attr(href)').get()  # detail page URL
            totalPrice = li.css('.totalPrice span::text').get()  # total price
            unitPrice = li.css('.unitPrice span::text').get()  # unit price
            string = select.css('.comments div:nth-child(7) .comment_text::text').get()

join merges a list into one string

            area = '-'.join(li.css('.info .flood .positionInfo a::text').getall())  # community
            houseInfo = li.css('.info .address .houseInfo::text').get()

split breaks a string into a list

            houseType = houseInfo.split(' | ')[0]  # layout
            houseArea = houseInfo.split(' | ')[1]  # area
            orientation = houseInfo.split(' | ')[2]  # orientation
            renovation = houseInfo.split(' | ')[3]  # renovation
            floor = houseInfo.split(' | ')[4]  # floor

Check how many elements houseInfo.split(' | ') returns; if there are 6, the listing has no build date

            if len(houseInfo.split(' | ')) == 6:
                date = ''
            else:
                date = houseInfo.split(' | ')[5]
            buildingType = houseInfo.split(' | ')[-1]  # building type
            dit = {
                '標(biāo)題': title,
                '小區(qū)': area,
                '總價': totalPrice,
                '單價': unitPrice,
                '戶型': houseType,
                '面積': houseArea,
                '朝向': orientation,
                '裝修': renovation,
                '樓層': floor,
                '建筑日期': date,
                '建筑類型': buildingType,
                '詳情頁': href,
            }
            csv_writer.writerow(dit)
            print(string)
    except Exception as e:
        # Report the failure instead of silently hiding it
        print(f'Failed to scrape page {page}: {e}')
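
The `houseInfo` handling in the loop above calls `split(' | ')` once per field. One possible tidy-up, shown only as a sketch, is to split once and name the parts. `parse_house_info` is a hypothetical helper, and the two sample strings are invented to match the 6-field and 7-field shapes the article describes; they are not real scraped data.

```python
def parse_house_info(house_info):
    """Split a ' | '-separated houseInfo string once and name the fields."""
    parts = house_info.split(' | ')
    if len(parts) < 6:
        raise ValueError(f'expected at least 6 fields, got {len(parts)}')
    return {
        'houseType': parts[0],
        'houseArea': parts[1],
        'orientation': parts[2],
        'renovation': parts[3],
        'floor': parts[4],
        # With exactly 6 fields there is no build date, as in the loop above
        'date': '' if len(parts) == 6 else parts[5],
        'buildingType': parts[-1],
    }


# Invented sample strings, one with a build date and one without
with_date = parse_house_info(
    '3 rooms | 89.5 sqm | South | Refurbished | Mid floor | 2010 | Slab'
)
without_date = parse_house_info(
    '3 rooms | 89.5 sqm | South | Refurbished | Mid floor | Slab'
)

print(with_date['date'], with_date['buildingType'])
print(repr(without_date['date']), without_date['buildingType'])
```

Raising `ValueError` for short strings is a design choice of this sketch; the original loop would instead hit an `IndexError` on such input.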

Multithreaded version

import requests
import parsel
import re
import csv
# Thread pool module
import concurrent.futures
import time

def get_response(html_url):
    """
    Send the request.
    :param html_url: URL to request
    :return: the response object
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response


def get_content(html_url):
    """
    Fetch one page and extract its listings.
    :param html_url: URL of the listing page
    :return: list of dicts, one per listing
    """
    response = get_response(html_url)
    html_data = get_response(link).text
    selector = parsel.Selector(response.text)
    select = parsel.Selector(html_data)
    lis = selector.css('.sellListContent li')
    content_list = []
    for li in lis:

        title = li.css('.title a::text').get()  # title
        area = '-'.join(li.css('.positionInfo a::text').getall())  # community
        Price = li.css('.totalPrice span::text').get()  # total price
        Price_1 = li.css('.unitPrice span::text').get().replace('元/平', '')  # unit price
        houseInfo = li.css('.houseInfo::text').get()  # combined info string
        HouseType = houseInfo.split(' | ')[0]  # layout
        HouseArea = houseInfo.split(' | ')[1].replace('平米', '')  # area
        direction = houseInfo.split(' | ')[2].replace(' ', '')  # orientation
        renovation = houseInfo.split(' | ')[3]  # renovation
        floor_info = houseInfo.split(' | ')[4]
        floor = floor_info[:3]  # floor
        # Raw string so that \d is not treated as a string escape
        floor_num = re.findall(r'(\d+)層', floor_info)[0]  # number of floors
        BuildingType = houseInfo.split(' | ')[-1]  # building type
        string = select.css('.comments div:nth-child(7) .comment_text::text').get()
        href = li.css('.title a::attr(href)').get()  # detail page URL
        if len(houseInfo.split(' | ')) == 6:
            date = 'None'
        else:
            date = houseInfo.split(' | ')[5].replace('年建', '')  # build date
        print(string)
        dit = {
            '標(biāo)題': title,
            '內(nèi)容': string,
            '小區(qū)': area,
            '總價': Price,
            '單價': Price_1,
            '戶型': HouseType,
            '面積': HouseArea,
            '朝向': direction,
            '裝修': renovation,
            '樓層': floor,
            '層數(shù)': floor_num,
            '建筑日期': date,
            '建筑類型': BuildingType,
            '詳情頁': href,
        }
        content_list.append(dit)
    return content_list


def main(page):
    """
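    Main function: scrape one page and write its rows to the CSV file.
    :param page: page number
    :return:
    """
    print(f'===============Scraping page {page}===============')

    # Note: url is not defined in this listing (the address was omitted from the post); build it from page first
    content_list = get_content(html_url=url)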
    for content in content_list:
        csv_writer.writerow(content)

if __name__ == '__main__':
    time_1 = time.time()
    link = 'http:// *******.com/article/149'
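    # Create the output file
    f = open('data多線程.csv', mode='a', encoding='utf-8', newline='')
    csv_writer = csv.DictWriter(f, fieldnames=[
        '標(biāo)題',  # title
        '內(nèi)容',  # content
        '小區(qū)',  # community
        '總價',  # total price
        '單價',  # unit price
        '戶型',  # layout
        '面積',  # area
        '朝向',  # orientation
        '裝修',  # renovation
        '樓層',  # floor
        '層數(shù)',  # number of floors
        '建筑日期',  # build date
        '建筑類型',  # building type
        '詳情頁',  # detail page URL
    ])
    csv_writer.writeheader()

    # Thread pool executor; max_workers is the maximum number of threads
    exe = concurrent.futures.ThreadPoolExecutor(max_workers=10)
    for page in range(1, 11):
        exe.submit(main, page)
    # Wait for all submitted tasks to finish, then close the file
    exe.shutdown()
    f.close()
    time_2 = time.time()
    use_time = int(time_2 - time_1)
    # Total time: 9
    print('Total time (s):', use_time)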
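
Two things are easy to miss in a thread-pool version like the one above: several worker threads share one `csv_writer`, and an exception raised inside a submitted function stays hidden unless the future's result is read. The sketch below is one way to handle both, under stated assumptions: `fake_fetch` is a stand-in that sleeps instead of making a real request, and `write_lock` and `scrape_page` are illustrative names that are not part of the article's code.

```python
import concurrent.futures
import csv
import io
import threading
import time

write_lock = threading.Lock()
buffer = io.StringIO()  # in-memory stand-in for the CSV file
writer = csv.DictWriter(buffer, fieldnames=['page', 'item'])
writer.writeheader()


def fake_fetch(page):
    """Stand-in for a network request: sleep, then return two rows."""
    time.sleep(0.2)
    return [{'page': page, 'item': i} for i in range(2)]


def scrape_page(page):
    rows = fake_fetch(page)
    # Only one thread writes at a time, so rows are never interleaved
    with write_lock:
        for row in rows:
            writer.writerow(row)
    return len(rows)


start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as exe:
    futures = [exe.submit(scrape_page, page) for page in range(1, 11)]
    # result() re-raises any exception from the worker thread
    total = sum(f.result() for f in futures)
elapsed = time.time() - start

print(total)
print(len(buffer.getvalue().splitlines()))
```

Because the waiting happens in parallel, the ten simulated requests finish in roughly the time of one, which is where the speed-up for I/O-bound scraping comes from.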

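Closing words

Thanks for reading~ that's the end of this flight.

I hope this article helped and that you picked up something new~

The stars in hiding are working hard to shine too, so keep going (let's keep at it together).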
