[Original] Batch-Downloading Sam Altman's Personal Blog Pages with AI

AIGC部落 2025-01-15 Published in Guangdong


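Sam Altman's personal blog: https://blog.samaltman.com/

First, find the pagination pattern. The listing pages are numbered through the page query parameter, and the last one is: https://blog.samaltman.com/?page=12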
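The pagination pattern can be turned into a list of listing-page URLs before any scraping starts. The following is a minimal sketch; the names `BASE_URL`, `LAST_PAGE`, and `build_page_urls` are illustrative, and the value 12 is simply the last page number observed above, so it may need updating as the blog grows.

```python
# URL template for the listing pages; {} is replaced by the page number.
BASE_URL = "https://blog.samaltman.com/?page={}"
# Assumption: 12 was the last listing page when the article was written.
LAST_PAGE = 12


def build_page_urls(last_page=LAST_PAGE):
    """Return the listing-page URLs for pages 1 through last_page."""
    return [BASE_URL.format(number) for number in range(1, last_page + 1)]


page_urls = build_page_urls()
print(f"{len(page_urls)} listing pages")
print(page_urls[0])
print(page_urls[-1])
```

Building the list up front makes it easy to check the page range by eye before sending any requests.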

Enter the following prompt in DeepSeek:

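You are a Python programming expert. Write a web-scraping script that performs the following steps:

Open the page https://blog.samaltman.com/?page={pagenumber}, where the parameter {pagenumber} takes the values 1 to 12;

Locate every article element with class="post" on the page, then locate the a element inside it and extract its text content as the page title;

Extract its href attribute value as the page URL;

Download the page and save it to the folder F:\Sam Altman in HTML format, using the page title extracted in the previous step as the file name; all images in the page must be saved as they appear in the original;

Notes: print progress information to the screen at every step.
File names may contain illegal characters (for example, the question mark ?), which prevents the file from being saved. The Windows file system does not allow certain special characters in file names, such as <, >, :, ", /, \, |, ?, *. To solve this, clean the file name before saving by removing or replacing these illegal characters; a regular expression can do this.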
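The file-name cleanup requested in the prompt can be tried on its own before running a full crawl. The sketch below is one possible version, not the code DeepSeek produced: the helper name `sanitize_windows_filename`, the `max_length` default, and the `"untitled"` fallback are all assumptions made for illustration. Beyond the characters listed in the prompt, it also handles control characters, trailing dots and spaces, and reserved device names such as CON and NUL, which Windows likewise does not accept as file names.

```python
import re

# Characters Windows rejects in file names, plus ASCII control characters.
_ILLEGAL = re.compile(r'[<>:"/\\|?*\x00-\x1f]')
# Device names that Windows reserves, with or without an extension.
_RESERVED = {
    "CON", "PRN", "AUX", "NUL",
    *(f"COM{i}" for i in range(1, 10)),
    *(f"LPT{i}" for i in range(1, 10)),
}


def sanitize_windows_filename(title, replacement="_", max_length=150):
    """Return a version of title that is safe to use as a Windows file name."""
    name = _ILLEGAL.sub(replacement, title)
    # Trailing dots and spaces are dropped by Windows, so remove them up front.
    name = name.strip().rstrip(". ")
    # Truncate long titles (max_length is an arbitrary, conservative choice).
    name = name[:max_length].rstrip(". ")
    if not name:
        name = "untitled"
    # Prefix reserved device names so they become ordinary file names.
    if name.split(".")[0].upper() in _RESERVED:
        name = replacement + name
    return name


# Made-up titles, used only to exercise the helper.
print(sanitize_windows_filename("Is this a title?"))
print(sanitize_windows_filename('AI: "the" question / answer'))
```

Replacing illegal characters rather than deleting them keeps titles readable and makes it less likely that two different titles collapse into the same file name.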

DeepSeek returned the following source code:

import os
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Base URL template and page range
base_url = "https://blog.samaltman.com/?page={}"
pages = range(1, 13)

# Folder where pages and images are saved
save_folder = r"F:\Sam Altman"

# Make sure the save folder exists
if not os.path.exists(save_folder):
    os.makedirs(save_folder)

# Clean a file name
def clean_filename(filename):
    # Replace characters that are illegal on Windows with underscores
    cleaned_filename = re.sub(r'[<>:"/\\|?*]', '_', filename)
    return cleaned_filename

# Loop over every listing page
for page_number in pages:
    print(f"Processing page {page_number}")
    url = base_url.format(page_number)

    # Request the listing page
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to retrieve page {page_number}")
        continue

    # Parse the HTML
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find every article element with class="post"
    posts = soup.find_all('article', class_='post')

    # Loop over each post
    for post in posts:
        # Find the first a element that has an href; its text is the title
        title_element = post.find('a', href=True)
        if title_element:
            title = title_element.text.strip()
            post_url = urljoin(base_url, title_element['href'])
            print(f"Found post: {title} with URL: {post_url}")

            # Clean the file name
            cleaned_title = clean_filename(title)

            # Download the post page
            post_response = requests.get(post_url)
            if post_response.status_code != 200:
                print(f"Failed to retrieve post: {title}")
                continue

            # Save the post page to a file
            post_html_path = os.path.join(save_folder, f"{cleaned_title}.html")
            with open(post_html_path, 'w', encoding='utf-8') as file:
                file.write(post_response.text)
            print(f"Saved post: {title} to {post_html_path}")

            # Parse the post page
            post_soup = BeautifulSoup(post_response.text, 'html.parser')

            # Find every img element that has a src attribute
            images = post_soup.find_all('img', src=True)

            # Download each image
            for img in images:
                img_url = urljoin(post_url, img['src'])
                # Skip inline data: URIs and other non-HTTP sources
                if not img_url.startswith(('http://', 'https://')):
                    continue
                img_response = requests.get(img_url)
                if img_response.status_code != 200:
                    print(f"Failed to retrieve image: {img_url}")
                    continue

                # Save the image; clean the name in case the URL has a query string
                img_name = clean_filename(os.path.basename(img_url))
                img_path = os.path.join(save_folder, img_name)
                with open(img_path, 'wb') as img_file:
                    img_file.write(img_response.content)
                print(f"Saved image: {img_name} to {img_path}")

print("All pages processed.")

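Script notes:

Page traversal: The script iterates over page numbers 1 to 12, builds the URL of each listing page, and sends a request.

Post extraction: On each page, the script locates every article element with class="post" and takes the text of the a element inside it as the title and its href attribute as the post URL.

Post download: The script downloads the HTML of each post and saves it to the specified folder, using the cleaned title as the file name.

Image download: For each post page, the script finds all img elements, downloads the images, and saves them to the same folder.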
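Image URLs often carry a query string or percent-encoded characters, so the last segment of the URL is not always a usable file name. The sketch below shows one way to derive a local name from an img src value using only the standard library. The helper name `image_filename_from_url`, the `"image"` fallback, and the example URLs are hypothetical and are not taken from the blog.

```python
import posixpath
import re
from urllib.parse import unquote, urljoin, urlsplit


def image_filename_from_url(page_url, src, fallback="image"):
    """Derive a local file name for an img src found on page_url."""
    # Resolve relative and protocol-relative sources against the page URL.
    absolute = urljoin(page_url, src)
    # Keep only the path, so "?w=500" or "#fragment" never reach the file name.
    path = unquote(urlsplit(absolute).path)
    name = posixpath.basename(path)
    # Replace characters that Windows does not allow in file names.
    name = re.sub(r'[<>:"/\\|?*]', "_", name)
    return name or fallback


# Hypothetical URLs, used only to exercise the helper.
page = "https://blog.samaltman.com/some-post"
print(image_filename_from_url(page, "/images/chart.png?w=500"))
print(image_filename_from_url(page, "//cdn.example.com/a%20b.jpg"))
print(image_filename_from_url(page, "https://example.com/"))
```

Different posts can still use the same image name, so saving each post's images into its own subfolder is a reasonable extension if overwriting becomes a problem.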

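Notes:

Make sure the requests and beautifulsoup4 libraries are installed. If they are not, install them with the following command:

pip install requests beautifulsoup4

Because network requests can fail, the script checks the status code of every response and skips any page, post, or image that does not return 200, so a failed download does not stop the rest of the crawl.

File names and paths are built with the os.path module, so the paths are assembled correctly.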
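The status-code checks in the script cover error responses, but `requests.get` can also raise an exception when a connection fails or times out. A small retry wrapper is one way to handle that case. The sketch below is an illustration under stated assumptions: `call_with_retries` is a hypothetical helper, the defaults of 3 attempts and a 1-second pause are arbitrary, and the demo uses a simulated failure rather than a real network call.

```python
import time


def call_with_retries(func, attempts=3, delay_seconds=1.0, exceptions=(Exception,)):
    """Call func() up to `attempts` times, pausing between failed tries.

    Returns the result of func(), or None if every attempt raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as error:
            print(f"Attempt {attempt}/{attempts} failed: {error}")
            if attempt < attempts:
                time.sleep(delay_seconds)
    return None


# Demo with a function that fails twice and then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


result = call_with_retries(flaky, attempts=3, delay_seconds=0)
print(result)
```

In the scraper, a call such as `requests.get(url)` could be wrapped as `call_with_retries(lambda: requests.get(url, timeout=10), exceptions=(requests.RequestException,))`, with a `None` result treated the same way as a non-200 response.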

This script should meet your needs, and it prints progress information to the screen at every step.
