
Web Scraping and Data Cleaning with BeautifulSoup and Pandas

Updated: 2025-02-18 15:54:45  Author: 站大爺IP


In data analysis and machine learning projects, acquiring, cleaning, and processing data are critical steps. This article walks through a practical example: scraping web page data with Python's Beautiful Soup library, then cleaning and processing it with Pandas. The example is suitable for beginners and can also help more experienced readers get up to speed with both tools quickly.

1. Preparation

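Before you begin, make sure the requests, beautifulsoup4, and pandas libraries are installed in your Python environment. The Excel export later in this article also uses the openpyxl engine, so it is included here. You can install them with the following command:

pip install requests beautifulsoup4 pandas openpyxl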

We also need a web page to scrape as an example. For simplicity, we use a page from a public news site.

2. Scraping the Web Page

First, we fetch the page's HTML with the requests library. Then we parse the HTML with Beautiful Soup and extract the data we are interested in.

import requests
from bs4 import BeautifulSoup

# Target page URL
url = 'https://example.com/news'  # replace with the actual URL

# Send an HTTP request to fetch the page; the timeout keeps the call from hanging indefinitely
response = requests.get(url, timeout=10)
response.raise_for_status()  # raise an exception on a 4xx/5xx response

# Parse the HTML with Beautiful Soup
soup = BeautifulSoup(response.text, 'html.parser')

Suppose we want to extract each article's title, publish time, and body text. Inspecting the page's HTML structure shows that each of these sits inside a specific HTML tag.

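# Extract each article's title, publish time, and body text
articles = []
for article in soup.select('.news-article'):  # assumes every article has class="news-article"
    title_tag = article.select_one('h2.title')
    time_tag = article.select_one('.publish-time')
    content_tag = article.select_one('.content')
    # select_one returns None when nothing matches, so check before reading the text
    articles.append({
        'title': title_tag.get_text(strip=True) if title_tag is not None else None,
        'publish_time': time_tag.get_text(strip=True) if time_tag is not None else None,
        'content': content_tag.get_text(strip=True) if content_tag is not None else None
    })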
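
To try the extraction loop without requesting a live site, the sketch below parses a small inline HTML snippet. The markup, the class names (reused from the assumptions above), and the helper text_or_none are illustrative only and are not taken from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mirrors the class names assumed in this article
SAMPLE_HTML = """
<div class="news-article">
  <h2 class="title"> First headline </h2>
  <span class="publish-time">2025-02-18 09:30:00</span>
  <div class="content">Body of the first article.</div>
</div>
<div class="news-article">
  <h2 class="title">Second headline</h2>
  <div class="content">This article has no publish time.</div>
</div>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    tag = parent.select_one(selector)
    return tag.get_text(strip=True) if tag is not None else None


def extract_articles(html):
    """Collect title, publish time, and content for every .news-article block."""
    soup = BeautifulSoup(html, 'html.parser')
    return [
        {
            'title': text_or_none(block, 'h2.title'),
            'publish_time': text_or_none(block, '.publish-time'),
            'content': text_or_none(block, '.content'),
        }
        for block in soup.select('.news-article')
    ]


articles = extract_articles(SAMPLE_HTML)
for item in articles:
    print(item)
```

The second block has no publish-time element, so its publish_time comes back as None instead of raising an AttributeError. Such rows can then be dropped or filled in during cleaning.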

3. Data Cleaning

Scraped data usually contains unwanted material such as extra whitespace, leftover HTML tags, and special characters. We need to clean it up.

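import pandas as pd

# Convert the data to a DataFrame
df = pd.DataFrame(articles)

# Print the first few rows to inspect the data
print(df.head())

# Cleaning steps
# 1. Strip leading and trailing whitespace (already done during extraction)
# 2. Replace special characters (here, newlines become spaces)
df['content'] = df['content'].str.replace('\n', ' ', regex=False)

# 3. Drop invalid rows (a missing or empty title or content counts as invalid)
#    dropna removes only missing values, so empty strings are filtered separately
df = df.dropna(subset=['title', 'content'])
df = df[(df['title'] != '') & (df['content'] != '')]

# 4. Normalize the time format (assuming publish times look like "YYYY-MM-DD HH:MM:SS")
# We assume the publish time is already a consistently formatted string; to convert it, use pd.to_datetime()
# df['publish_time'] = pd.to_datetime(df['publish_time'], format='%Y-%m-%d %H:%M:%S')

# Print the cleaned data to inspect it
print(df.head())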
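
The cleaning code above handles only newlines. As a rough sketch of how the other kinds of noise mentioned earlier (leftover HTML tags, repeated whitespace) might be handled, the example below runs on made-up rows. The sample data and the regular expressions are assumptions for illustration. The tag-stripping pattern is deliberately crude, so where possible it is safer to get clean text from Beautiful Soup at extraction time.

```python
import pandas as pd

# Hypothetical raw rows with the kinds of noise described above
raw = pd.DataFrame({
    'title': ['  Headline A ', 'Headline B', ''],
    'content': ['Line one.\nLine  two.', '<p>Leftover&nbsp;tag</p>', 'Orphan body'],
})

cleaned = raw.copy()
for column in ['title', 'content']:
    cleaned[column] = (
        cleaned[column]
        .str.replace(r'<[^>]+>', ' ', regex=True)   # drop leftover HTML tags
        .str.replace('&nbsp;', ' ', regex=False)    # decode one common entity
        .str.replace(r'\s+', ' ', regex=True)       # collapse whitespace runs
        .str.strip()
    )

# Treat empty strings as invalid rows, since dropna() only removes missing values
cleaned = cleaned[(cleaned['title'] != '') & (cleaned['content'] != '')]
cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```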

4. Data Processing

After cleaning, we may need some further processing, such as data conversion, merging, or grouping.

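# Processing steps
# 1. Extract the date part of the publish time (if needed)
#    This requires publish_time to have been converted with pd.to_datetime() first
# df['publish_date'] = df['publish_time'].dt.date

# 2. Count the number of articles per publish date (if needed)
# daily_counts = df['publish_date'].value_counts().reset_index()
# daily_counts.columns = ['publish_date', 'count']
# print(daily_counts)

# 3. Filter articles by keyword (for example, keep only those containing "疫情", "epidemic")
keyword = '疫情'
# regex=False treats the keyword as literal text instead of a regular expression
filtered_df = df[df['content'].str.contains(keyword, na=False, case=False, regex=False)]

# Print the filtered data to inspect it
print(filtered_df.head())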
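
The date-related steps are commented out above because they depend on publish_time being a real datetime column. A minimal runnable sketch of that path follows, using made-up rows. The sample data and the choice of errors='coerce' are assumptions for illustration.

```python
import pandas as pd

# Hypothetical cleaned rows; the timestamps follow the format assumed above
sample_df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'publish_time': [
        '2025-02-17 08:00:00',
        '2025-02-17 21:15:00',
        '2025-02-18 10:30:00',
        'not a timestamp',
    ],
})

# errors='coerce' turns unparseable values into NaT instead of raising
sample_df['publish_time'] = pd.to_datetime(
    sample_df['publish_time'], format='%Y-%m-%d %H:%M:%S', errors='coerce'
)
sample_df = sample_df.dropna(subset=['publish_time'])

# Group by calendar date and count the articles published on each day
sample_df['publish_date'] = sample_df['publish_time'].dt.date
daily_counts = (
    sample_df.groupby('publish_date')
    .size()
    .reset_index(name='count')
)
print(daily_counts)
```

Here the row with the malformed timestamp is dropped before grouping. Whether to drop such rows or repair them depends on the data.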

5. Saving the Data

Once the data is processed, we may want to save it to a file for later use. Pandas offers several ways to do so, such as writing CSV or Excel files.

# Save the data as a CSV file
csv_file_path = 'cleaned_news_data.csv'
df.to_csv(csv_file_path, index=False, encoding='utf-8-sig')

# Save the data as an Excel file (requires the openpyxl package)
excel_file_path = 'cleaned_news_data.xlsx'
df.to_excel(excel_file_path, index=False, engine='openpyxl')
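
To show what encoding='utf-8-sig' does, the sketch below writes a small frame to a temporary directory, checks the first bytes of the file, and reads it back. The sample rows and the temporary path are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row frame with non-ASCII text
sample_df = pd.DataFrame({
    'title': ['疫情通報', 'Weather update'],
    'content': ['示例正文', 'Sunny with light wind'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / 'cleaned_news_data.csv'
    sample_df.to_csv(csv_path, index=False, encoding='utf-8-sig')

    # utf-8-sig prefixes the file with a byte order mark (EF BB BF)
    has_bom = csv_path.read_bytes().startswith(b'\xef\xbb\xbf')

    # Reading with the same encoding strips the mark again
    restored = pd.read_csv(csv_path, encoding='utf-8-sig')

print(has_bom)
print(restored)
```

Spreadsheet tools such as Excel commonly use the byte order mark to recognize a CSV file as UTF-8, which is the usual reason for choosing utf-8-sig over plain utf-8 when the data contains non-ASCII text.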

6. Complete Code Example

For convenience, the complete code is shown below. Be sure to replace the url variable with a real page URL and to adjust the Beautiful Soup selectors to match the actual HTML structure.

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Target page URL (replace with the actual URL)
url = 'https://example.com/news'

# Send an HTTP request to fetch the page; the timeout keeps the call from hanging indefinitely
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the HTML with Beautiful Soup
soup = BeautifulSoup(response.text, 'html.parser')

# Extract each article's title, publish time, and body text
articles = []
for article in soup.select('.news-article'):  # assumes every article has class="news-article"
    title_tag = article.select_one('h2.title')
    time_tag = article.select_one('.publish-time')
    content_tag = article.select_one('.content')
    # select_one returns None when nothing matches, so check before reading the text
    articles.append({
        'title': title_tag.get_text(strip=True) if title_tag is not None else None,
        'publish_time': time_tag.get_text(strip=True) if time_tag is not None else None,
        'content': content_tag.get_text(strip=True) if content_tag is not None else None
    })

# Convert the data to a DataFrame
df = pd.DataFrame(articles)

# Cleaning steps: replace newlines, then drop rows with a missing or empty title or content
df['content'] = df['content'].str.replace('\n', ' ', regex=False)
df = df.dropna(subset=['title', 'content'])
df = df[(df['title'] != '') & (df['content'] != '')]

# Processing step (example: filter articles by keyword)
keyword = '疫情'
filtered_df = df[df['content'].str.contains(keyword, na=False, case=False, regex=False)]

# Save the data as CSV and Excel files (the Excel export requires openpyxl)
csv_file_path = 'cleaned_news_data.csv'
excel_file_path = 'cleaned_news_data.xlsx'
df.to_csv(csv_file_path, index=False, encoding='utf-8-sig')
df.to_excel(excel_file_path, index=False, engine='openpyxl')

# Print the filtered data to inspect it
print(filtered_df.head())

7. Summary

In this article, we used Beautiful Soup to scrape web page data and Pandas to clean and process it. Together, the two libraries make working with web data considerably more efficient. In a real project, you will likely need to adapt the code to the specific page structure and data requirements. We hope this practical example helps you get comfortable with both tools and put them to use in your own data analysis and machine learning projects.
