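Automatically detecting the encoding of HTML documents fetched with requests in Python

Updated: 2024-11-18 11:12:46 Author: Humbunklung

This article explains in detail how to automatically detect the encoding of an HTML document fetched with requests in Python. The sample code is explained step by step; read on if you are interested.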

Using the chardet library to automatically detect the encoding of HTML fetched with requests

The garbled-text problem when fetching a page with requests and BeautifulSoup

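Combining requests with BeautifulSoup makes it easy to extract data from web pages. However, when the encoding that requests assumes for a response differs from the encoding the page was actually written in, response.text comes back garbled. requests takes the encoding from the charset in the Content-Type header; if a text response declares no charset, it falls back to ISO-8859-1.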
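
The mismatch can be reproduced without any network access. The following is a minimal sketch using only the standard library; the sample text and the choice of GBK as the page's real encoding are assumptions made for illustration, not details taken from the page used in this article.

```python
# Illustrative sample text standing in for a page body (an assumption).
original = "股票行情"

# The server sends bytes in the page's real encoding (assumed to be GBK here).
raw_bytes = original.encode("gbk")

# Decoding with the wrong codec does not raise; it silently yields mojibake.
garbled = raw_bytes.decode("iso-8859-1")

# Decoding with the right codec recovers the text.
recovered = raw_bytes.decode("gbk")

print(repr(garbled))
print(recovered)
```

Because ISO-8859-1 maps every byte to some character, the wrong decode never fails, which is why the problem shows up as garbled output rather than as an exception.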

Take the following code as an example; the HTML it retrieves is garbled:

import requests
from bs4 import BeautifulSoup

# Target URL
url = 'https://finance.sina.com.cn/realstock/company/sh600050/nc.shtml'

# Send an HTTP GET request
response = requests.get(url)

# Check whether the request succeeded
if response.status_code == 200:
    # Get the page content, decoded with the encoding requests assumed
    html_content = response.text

    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # The ID to look for
    target_id = 'hqDetails'

    # Find the tag with that ID
    element = soup.find(id=target_id)

    if element:
        # Get the HTML content of that tag
        element_html = str(element)
        print(f"HTML of the element with ID {target_id}:\n{element_html}\n")

        # Find all table elements under that tag
        tables = element.find_all('table')

        if tables:
            for i, table in enumerate(tables):
                print(f"HTML of table {i+1}:\n{table}\n")
        else:
            print(f"No table elements under the tag with ID {target_id}")
    else:
        print(f"No tag found with ID {target_id}")
else:
    print(f"Request failed, status code: {response.status_code}")

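We can solve this by specifying the encoding manually. For example, after the response.status_code == 200 check and before reading response.text, set response.encoding = 'utf-8'. Alternatively, pass the raw bytes to BeautifulSoup together with the encoding: soup = BeautifulSoup(response.content, 'html.parser', from_encoding='utf-8'). Note that from_encoding only takes effect for bytes input; BeautifulSoup ignores it when the markup is an already-decoded string such as response.text.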
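
The two manual approaches can be compared offline. This is a small sketch, assuming a made-up GBK-encoded fragment in place of response.content; the fragment, its id and the encoding are hypothetical and chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page body, encoded as GBK to stand in for response.content.
raw_bytes = '<div id="title">中文标题</div>'.encode("gbk")

# Option 1: decode the bytes yourself (the equivalent of reading response.text
# after setting response.encoding), then parse the resulting string.
soup_from_text = BeautifulSoup(raw_bytes.decode("gbk"), "html.parser")

# Option 2: hand BeautifulSoup the bytes and name the encoding explicitly.
soup_from_bytes = BeautifulSoup(raw_bytes, "html.parser", from_encoding="gbk")

title_1 = soup_from_text.find(id="title").get_text()
title_2 = soup_from_bytes.find(id="title").get_text()
print(title_1, title_2)
```

Both routes yield the same text; the difference is only in who performs the decoding.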

However, when the encoding of the pages we fetch is not known in advance, is there a better way to have it detected automatically? This is where the chardet encoding-detection library comes in handy.

Detecting the encoding automatically with the chardet library

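chardet is a library for automatically detecting character encodings. It examines the raw bytes of the response and returns its best guess at the encoding together with a confidence score, so the result does not depend on the server declaring the charset correctly.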

Installing chardet

pip install chardet

Example usage

import requests
from bs4 import BeautifulSoup
import chardet

# Target URL
url = 'https://finance.sina.com.cn/realstock/company/sh600050/nc.shtml'

# Send an HTTP GET request
response = requests.get(url)

# Check whether the request succeeded
if response.status_code == 200:
    # Detect the character encoding from the raw response bytes
    detected_encoding = chardet.detect(response.content)['encoding']

    # Set the response encoding before reading response.text
    response.encoding = detected_encoding

    # Get the page content, now decoded with the detected encoding
    html_content = response.text

    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # The ID to look for
    target_id = 'hqDetails'

    # Find the tag with that ID
    element = soup.find(id=target_id)

    if element:
        # Get the HTML content of that tag
        element_html = str(element)
        print(f"HTML of the element with ID {target_id}:\n{element_html}\n")

        # Find all table elements under that tag
        tables = element.find_all('table')

        if tables:
            for i, table in enumerate(tables):
                print(f"HTML of table {i+1}:\n{table}\n")
        else:
            print(f"No table elements under the tag with ID {target_id}")
    else:
        print(f"No tag found with ID {target_id}")
else:
    print(f"Request failed, status code: {response.status_code}")
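
chardet's result is a guess, and for empty or ambiguous input the reported encoding can be None. One way to make the detection step more defensive is sketched below. The helper name detect_encoding, the utf-8 default and the 0.5 confidence threshold are assumptions introduced for illustration; they are not part of chardet or of the example above.

```python
import chardet


def detect_encoding(raw: bytes, default: str = "utf-8",
                    min_confidence: float = 0.5) -> str:
    """Return chardet's guess, or `default` when the guess is missing or weak.

    `default` and `min_confidence` are illustrative choices, not values
    recommended by chardet.
    """
    result = chardet.detect(raw)
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return default
    return encoding


# Made-up sample bytes standing in for response.content.
sample = ("自动检测网页编码的示例文本。" * 20).encode("utf-8")
print(detect_encoding(sample))
print(detect_encoding(b""))
```

In the example above, this helper could replace the direct dictionary lookup, for instance response.encoding = detect_encoding(response.content).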