[Python][Web Scraping] Scraping NetEase, Tencent, Sina, and Sohu News to Local Files

阿甘ch1wn8cyc3 2019-06-19


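This experiment scrapes the stories linked from the front pages of several news sites and saves them locally. For each story it collects the title, time, source, comment count, and body text.
Tools: Python 3.6, Google Chrome
Scraping process: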

I. Install the libraries: urllib, requests, BeautifulSoup

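1. urllib: urllib is Python's built-in HTTP request library (it ships with Python, so there is nothing to install). It lets Python request a web page and read its contents.
The main function used: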

    data = urllib.request.urlopen(qurl).read()
    # qurl is the page URL; the call returns the page content as bytes in data
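In practice it helps to set a timeout and to decode the returned bytes explicitly. The sketch below is one way to wrap the call; the helper name fetch_text, the 10-second timeout, and the User-Agent string are illustrative assumptions, not part of the original experiment.

```python
import urllib.request


def fetch_text(url, encoding="utf-8", timeout=10):
    """Download a page and decode it, dropping bytes that fail to decode."""
    # Hypothetical User-Agent value; some sites may reject urllib's default.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        raw = resp.read()
    return raw.decode(encoding, "ignore")


# A data: URL keeps this demo offline; a real run would pass a news page URL.
print(fetch_text("data:text/plain,hello"))
```

The "ignore" argument mirrors the decode("gbk", "ignore") calls used later in the article: undecodable bytes are dropped instead of raising an error.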

2. requests: requests is a simple, easy-to-use HTTP library for Python, and it is much more concise than urllib. I used both libraries in this experiment; they serve a similar purpose.

    data = requests.get(url).text

3. BeautifulSoup
Once the two libraries above have fetched a page, we need to pull out the parts we want, and this is where BeautifulSoup comes in. BeautifulSoup parses the document so we can extract the news title, body text, and so on.
4. re
Python's regular expression library; regular expressions need no introduction.

II. Scrape the news front page to get the links of all the stories to scrape

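The front page only carries headlines; the details of each story live on a separate page reached through the headline link. So the first step is to save all the story links from the front page to a txt file. Code first, explanation after.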

def getQQurl():  # collect every news link on the Tencent News front page
    url = "http://news.qq.com/"

    urldata = requests.get(url).text

    soup = BeautifulSoup(urldata, 'lxml')

    news_titles = soup.select("div.text > em.f14 > a.linkto")

    # create a TXT file that stores every link found on the front page
    with open("D:/news/QQ鏈接.txt", "w+") as fo:
        # walk the returned list and write one link per line
        for n in news_titles:
            link = n.get("href")
            if link:  # skip anchors without an href
                fo.write(link + "\n")

The first two lines of the function were explained above, so here is a look at the third and fourth.

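soup = BeautifulSoup(urldata, 'lxml')  # parse the fetched page, using lxml as the parser

news_titles = soup.select("div.text > em.f14 > a.linkto")
Inspecting the page source shows that most of the front-page news links are a.linkto anchors nested under div.text > em.f14 (the original post showed a screenshot of this part of the source).
So soup.select() picks out every element matching div.text > em.f14 > a.linkto and returns them as a list. get("href") then pulls the link out of each element, and the links are written to the TXT file.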

Front-page links usually sit under different tags depending on the section of the site, so the selection rule differs too. To collect stories from several sections, write several rules.
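A rough sketch of the several-rules idea follows. Only the first selector comes from this article; the second selector, the sample HTML, and the helper name collect_links are made up for illustration.

```python
from bs4 import BeautifulSoup

# The first rule is the one used in this article; the second is a
# hypothetical rule standing in for another section of the page.
LINK_RULES = [
    "div.text > em.f14 > a.linkto",
    "ul.side-list > li > a",
]


def collect_links(html, rules=LINK_RULES):
    """Return the href of every <a> matched by any rule, without duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for rule in rules:
        for a in soup.select(rule):
            href = a.get("href")
            if href and href not in links:
                links.append(href)
    return links


# Made-up markup that exercises both rules and one repeated link.
sample = """
<div class="text"><em class="f14">
  <a class="linkto" href="http://example.com/a">A</a></em></div>
<ul class="side-list">
  <li><a href="http://example.com/b">B</a></li>
  <li><a href="http://example.com/a">A again</a></li>
</ul>
"""
print(collect_links(sample))
```

Keeping the rules in a list means a new section only needs one more selector string rather than another copy of the loop.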

III. Scrape the news data for each link in the link file

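Once all the links are in a file, what remains is to loop over them and extract each story's data in much the same way the links were obtained in step II.
Inspecting the source of a story page shows that the title is in the title tag and the body text is in p tags, so we can select them with
    content = soup.select('p')  # select the body text
    title = soup.select('title')  # select the title
The time, source, and similar fields can be selected the same way.
Each selection comes back as a list, so the items are written to the file one by one. The complete code is below.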

def getqqtext():
    with open("D:/news/QQ鏈接.txt", "r") as qqf:
        qqurl = qqf.readlines()  # read the file into a list of links

    # walk the list, request each page, and pick out the article fields
    for i, qurl in enumerate(qqurl):
        try:
            # strip the trailing newline that readlines() keeps on each link
            data = urllib.request.urlopen(qurl.strip()).read()
            data2 = data.decode("gbk", "ignore")

            # select() locates elements in the parsed document and returns a list
            soup = BeautifulSoup(data2, "html.parser")

            content = soup.select('p')  # body text
            title = soup.select('title')  # title
            time = soup.select('div.a_Info > span.a_time')
            author = soup.select('div.a_Info > span.a_source')

            # write the page to a local file; pages without a time are skipped
            # labels: 時間 = time, 評論數 = comment count, 來源 = source
            if len(time) != 0:
                with open("D:/news/新聞/騰訊" + str(i) + ".txt", "w+") as fo:
                    if len(title) != 0:
                        fo.write("      " + title[0].get_text().strip() + "\n")
                    fo.write("時間：" + time[0].get_text().strip() + "\n")
                    fo.write("評論數: 0" + "\n")
                    if len(author) != 0:
                        fo.write("來源：" + author[0].get_text() + "\n" + "\n")

                    for p in content:
                        con = p.get_text().strip()
                        if len(con) != 0:
                            fo.write("\n" + con)

        except Exception as err:
            print(err)
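One detail worth handling when the link file is read back: readlines() returns each line with its trailing newline still attached, and the same link can appear more than once on a front page. A small helper along these lines could clean the list before any page is requested; the name read_links and the demo file are assumptions for illustration.

```python
import os
import tempfile


def read_links(path):
    """Read one URL per line, dropping newlines, blank lines, and repeats."""
    links = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            url = line.strip()
            if url and url not in links:
                links.append(url)
    return links


# Demo with a throwaway file standing in for the saved link list.
demo_path = os.path.join(tempfile.mkdtemp(), "links.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("http://example.com/a\n\nhttp://example.com/b\nhttp://example.com/a\n")
print(read_links(demo_path))
```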

IV. Special cases on the other sites

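NetEase News has a news ranking page, and I scraped that page directly. It groups stories by category and has rankings by follow-up posts and by comments. The page source is interesting to analyze, and you can try scraping the follow-up and comment counts as well. The code is further down.
Sina's comment counts are loaded dynamically, so they do not appear in the page source. I used Chrome's developer tools to trace the dynamic requests (online tutorials cover the method) and found the URL that serves the counts. The response appears to come from a PHP script and is not ordinary HTML, so BeautifulSoup could not extract anything from it; I used re instead to pull out top_num (the popularity count) and the links. Note that the links in this response are not in the standard form: the slashes are escaped, as in http:\/\/ent.sina.com.cn\/m\/v…., so they have to be converted first. I will not go into detail here; see the code.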
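The extraction and the link conversion can be sketched as follows. The two regular expressions are the ones used in the complete code; the sample string is made up to resemble the fields described above (the real response has more fields and may differ in detail), and the helper name parse_top_list is hypothetical.

```python
import re

# Made-up sample with escaped slashes, shaped after the "url" and "top_num"
# fields mentioned in the article. It is not a real Sina response.
sample = (
    'var comment_all_data = {"data":['
    '{"url":"http:\\/\\/example.com\\/m\\/a.shtml","top_num":"120","id":"1"},'
    '{"url":"http:\\/\\/example.com\\/c\\/b.shtml","top_num":"95","id":"2"}]};'
)


def parse_top_list(text):
    """Return (url, top_num) pairs with the escaped slashes restored."""
    urls = re.findall(r'"url":"(.+?)",', text)
    nums = re.findall(r'"top_num":"(.+?)",', text)
    # "\/" in the payload is just an escaped "/", so a plain replace is enough
    return [(u.replace("\\/", "/"), n) for u, n in zip(urls, nums)]


print(parse_top_list(sample))
```

Pairing each link with its count at parse time avoids relying on two separate files staying in the same order.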

V. Summary

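So the whole process comes down to roughly three steps, and it applies to the other sites too. The key is to analyze the page source: different pages keep different data in different places, and with the right rule soup.select() handles each case. Selection rules for some popular sites are also published online and are worth consulting.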
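Since only the rules change from site to site, they can be kept in one table. The selectors and encodings below are copied from the functions in this article; the table layout and the lookup helper rules_for are an illustrative sketch rather than how the original code is organized.

```python
from urllib.parse import urlparse

# Per-site decoding and selection rules, taken from the article's functions.
SITE_RULES = {
    "qq.com": {
        "encoding": "gbk",
        "time": "div.a_Info > span.a_time",
        "source": "div.a_Info > span.a_source",
    },
    "163.com": {
        "encoding": "gbk",
        "time": "div.post_time_source",
        "source": "div.post_time_source > a.ne_article_source",
    },
    "sohu.com": {
        "encoding": "utf-8",
        "time": "div.article-info > span.time",
        "source": "div.date-source > span.original-link",
    },
    "sina.com.cn": {
        "encoding": "utf-8",
        "time": "div.date-source > span.date",
        "source": "div.date-source > a.source",
    },
}


def rules_for(url):
    """Pick the rule set whose domain matches the URL's host, or None."""
    host = urlparse(url).hostname or ""
    for domain, rules in SITE_RULES.items():
        if host == domain or host.endswith("." + domain):
            return rules
    return None


print(rules_for("http://news.qq.com/")["time"])
```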

VI. Complete code

The code runs as is, but you need to adjust the paths yourself. It only needs two folders: D:/news and D:/news/新聞
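If you would rather not create the folders by hand, something like the following could do it. The helper name ensure_news_dirs is an assumption, and the demo uses a temporary directory instead of D:/news.

```python
import os
import tempfile


def ensure_news_dirs(root):
    """Create root and root/新聞 if they are missing; return the inner path."""
    news_dir = os.path.join(root, "新聞")
    os.makedirs(news_dir, exist_ok=True)  # no error if it already exists
    return news_dir


# Demo in a temporary location; the article itself uses D:/news.
demo_root = os.path.join(tempfile.mkdtemp(), "news")
print(ensure_news_dirs(demo_root))
```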

import json
import os
import requests
from bs4 import BeautifulSoup
import urllib.request
import re
import io
import sys
from urllib.parse import quote
import codecs
sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='utf8')

# Purpose: fetch NetEase news
def get163news():

    url = "http://news.163.com/rank/"  # URL of the NetEase news ranking page
    wbdata = requests.get(url).text        # request the page and take its text
    soup = BeautifulSoup(wbdata, 'lxml')    # create a BeautifulSoup object to parse it
    news_titles = soup.select("td  a")      # select() locates the matching elements and returns a list
    comment = soup.select("td.cBlue")  # fetching works the same way for the other sites, so it is not repeated there

    # loop over the link list and write title, time, source, comment count, and body to txt files
    # the ranking table is read in blocks: 29 links per block, with block starts 60 apart

    start = 3
    i = 30
    while start <= 270:
        for n in range(start, start + 29):
            try:
                link = news_titles[n].get("href")
                neteasedata = urllib.request.urlopen(link).read()
                neteasedata2 = neteasedata.decode("gbk", "ignore")

                soup = BeautifulSoup(neteasedata2, "html.parser")

                content = soup.select('p')
                title = soup.select('title')
                time = soup.select('div.post_time_source')
                author = soup.select('div.post_time_source > a.ne_article_source')

                if len(time) != 0:
                    with open("D:/news/新聞/網易" + str(i) + ".txt", "w+") as fo:
                        if len(title) != 0:
                            fo.write("      " + title[0].get_text().strip() + "\n")
                        fo.write("時間：" + time[0].get_text().strip() + "\n")
                        fo.write("評論數: " + comment[i].get_text() + "\n")
                        if len(author) != 0:
                            fo.write(author[0].get_text() + '\n')

                        # the first two <p> elements are skipped
                        for m in range(2, len(content)):
                            try:
                                con = content[m].get_text().strip()
                                if len(con) != 0:
                                    fo.write("\n" + con)

                            except Exception as err:
                                print(err)
            except Exception as err:
                print(err)

            i += 1
        start += 60
        i = start

# Purpose: collect every news link on the Tencent News front page
def getQQurl():
    url = "http://news.qq.com/"

    wbdata = requests.get(url).text

    soup = BeautifulSoup(wbdata, 'lxml')

    news_titles = soup.select("div.text > em.f14 > a.linkto")

    # create a TXT file that stores every link found on the front page
    with open("D:/news/QQ鏈接.txt", "w+") as fo:
        # walk the returned list and write one link per line
        for n in news_titles:
            link = n.get("href")
            if link:  # skip anchors without an href
                fo.write(link + "\n")


# Purpose: fetch each saved link in turn and save the article text locally
def getqqtext():
    with open("D:/news/QQ鏈接.txt", "r") as qqf:
        qqurl = qqf.readlines()  # read the file into a list of links

    # walk the list, request each page, and pick out the article fields
    for i, qurl in enumerate(qqurl):
        try:
            # strip the trailing newline that readlines() keeps on each link
            data = urllib.request.urlopen(qurl.strip()).read()
            data2 = data.decode("gbk", "ignore")

            # select() locates elements in the parsed document and returns a list
            soup = BeautifulSoup(data2, "html.parser")

            content = soup.select('p')  # body text
            title = soup.select('title')  # title
            time = soup.select('div.a_Info > span.a_time')
            author = soup.select('div.a_Info > span.a_source')

            # write the page to a local file; nothing is written without a title
            with open("D:/news/新聞/騰訊" + str(i) + ".txt", "w+") as fo:
                if len(title) != 0:
                    fo.write("      " + title[0].get_text().strip() + "\n")
                    if len(time) != 0:
                        fo.write("時間：" + time[0].get_text().strip() + "\n")
                    if len(author) != 0:
                        fo.write("來源：" + author[0].get_text() + "\n" + "\n")

                    for p in content:
                        con = p.get_text().strip()
                        if len(con) != 0:
                            fo.write("\n" + con)

        except Exception as err:
            print(err)

# Purpose: collect every news link on the Sohu News front page
def getsohuurl():
    url = "http://news.sohu.com/"
    wbdata = requests.get(url).text
    soup = BeautifulSoup(wbdata, 'lxml')

    news_titles = soup.select("div.list16 > ul > li > a")

    with open("D:/news/sohu鏈接.txt", "w+") as fo:
        for n in news_titles:
            link = n.get("href")
            if link:  # skip anchors without an href
                fo.write(link + "\n")


# Purpose: fetch each saved Sohu link in turn and save the article text locally
def getsohutext():
    with open("D:/news/sohu鏈接.txt", "r") as sohuf:
        sohuurl = sohuf.readlines()

    for i, sohuu in enumerate(sohuurl):
        try:
            # strip the trailing newline that readlines() keeps on each link
            sohudata = urllib.request.urlopen(sohuu.strip()).read()
            sohudata2 = sohudata.decode("utf-8", "ignore")

            soup = BeautifulSoup(sohudata2, "html.parser")

            content = soup.select('p')
            title = soup.select('title')
            time = soup.select('div.article-info > span.time')
            author = soup.select('div.date-source > span.original-link')

            # pages without a time are skipped
            if len(time) != 0:
                with open("D:/news/新聞/搜狐" + str(i) + ".txt", "w+") as fo:
                    if len(title) != 0:
                        fo.write("      " + title[0].get_text().strip() + "\n")
                    fo.write("時間：" + time[0].get_text().strip() + "\n")
                    fo.write("評論數: 0" + "\n" + "\n")
                    if len(author) != 0:
                        fo.write(author[0].get_text() + '\n')

                    for p in content:
                        con = p.get_text().strip()
                        if len(con) != 0:
                            fo.write("\n" + con)

        except Exception as err:
            print(err)

# Purpose: collect the Sina news links and their top_num counts
def getsinaurl():
    url = ['http://top.news.sina.com.cn/ws/GetTopDataList.php?top_type=day&top_cat=qbpdpl&top_time=20180715&top_show_num=100&top_order=DESC&js_var=comment_all_data',
           'http://top.news.sina.com.cn/ws/GetTopDataList.php?top_type=day&top_cat=www_www_all_suda_suda&top_time=20180715&top_show_num=100&top_order=DESC&js_var=all_1_data01',
           'http://top.collection.sina.com.cn/ws/GetTopDataList.php?top_type=day&top_cat=wbrmzf_qz&top_time=20180715&top_show_num=10&top_order=DESC&js_var=wbrmzf_qz_1_data&call_back=showContent',
           'http://top.news.sina.com.cn/ws/GetTopDataList.php?top_type=day&top_cat=total_slide_suda&top_time=20180715&top_show_num=100&top_order=DESC&js_var=slide_image_1_data',
           'http://top.news.sina.com.cn/ws/GetTopDataList.php?top_type=day&top_cat=wbrmzfgwxw&top_time=20180715&top_show_num=10&top_order=DESC&js_var=wbrmzfgwxw_1_data&call_back=showContent',
           'http://top.news.sina.com.cn/ws/GetTopDataList.php?top_type=day&top_cat=news_china_suda&top_time=20180715&top_show_num=20&top_order=DESC&js_var=news_',
           'http://top.news.sina.com.cn/ws/GetTopDataList.php?top_type=day&top_cat=gnxwpl&top_time=20180715&top_show_num=20&top_order=DESC&js_var=news_']
    with open("D:/news/sina鏈接1.txt", "w+") as furl, \
            open("D:/news/sinacom.txt", "w+") as fcom:
        for u in url:
            try:
                # search the response text directly with regular expressions
                text = requests.get(u).text
                allurl = re.findall('"url":"(.+?)",', text)
                topnum = re.findall('"top_num":"(.+?)",', text)
                print(len(allurl))
                print(len(topnum))

                for n in allurl:
                    furl.write(n + "\n")
                for n in topnum:
                    fcom.write(n + "\n")

            except Exception as err:
                print(err)

# Purpose: fetch each saved Sina link in turn and save the article text locally
def getsinanews():
    with open("D:/news/sina鏈接1.txt", "r") as sinaf1:
        sinaurl = sinaf1.readlines()
    with open("D:/news/sinacom.txt", "r") as sinaf2:
        sinacom = sinaf2.readlines()

    for i, surl in enumerate(sinaurl):
        try:
            # saved links have escaped slashes (\/) and a trailing newline
            realurl = surl.strip().replace('\\/', '/')
            sinadata = urllib.request.urlopen(realurl).read()
            sinadata2 = sinadata.decode("utf-8", "ignore")

            soup = BeautifulSoup(sinadata2, "html.parser")

            content = soup.select('p')
            title = soup.select('title')
            time = soup.select('div.date-source > span.date')
            author = soup.select('div.date-source > a.source')

            # pages without a time are skipped
            if len(time) != 0:
                with open("D:/news/新聞/新浪" + str(i) + ".txt", "w+") as fo:
                    if len(title) != 0:
                        fo.write("      " + title[0].get_text().strip() + "\n")
                        fo.write("時間：" + time[0].get_text().strip() + "\n")
                        # sinacom[i] still ends with its own newline
                        fo.write("評論數: " + sinacom[i])
                    if len(author) != 0:
                        fo.write(author[0].get_text() + '\n')

                    for p in content:
                        con = p.get_text().strip()
                        if len(con) != 0:
                            fo.write("\n" + con)
        except Exception as err:
            print(err)


def main():
    get163news()
    getQQurl()
    getqqtext()
    getsinaurl()
    getsinanews()
    getsohuurl()
    getsohutext()

main()

PS: I am a programming beginner who is just getting started, so please bear with me. You are welcome to follow me on Weibo: 努力學習的小譙同學
