
Garbled Output When Parsing with XPath in a Python Crawler: Problem and Fix

Updated: 2024-05-24 15:32:27  Author: 平人的進步日常

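This article covers the problem of garbled output when parsing with XPath in a Python crawler and how to fix it. I hope it is a useful reference; if anything is wrong or incomplete, corrections are welcome.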

Garbled XPath output in a Python crawler

Set the encoding after the request:

import requests

resp = requests.get(url, headers=headers)
resp.encoding = 'GBK'  # set this before reading resp.text
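
Hard-coding 'GBK' only works when the page really is GBK. As a minimal sketch, assuming the page declares its charset in a <meta> tag, the declared encoding can be read from the raw bytes before setting resp.encoding. The helper name sniff_charset and the sample page are hypothetical, for illustration only; requests also offers resp.apparent_encoding, which guesses the encoding from the content.

```python
import codecs
import re

# Hypothetical helper: look for charset=... near the start of the page bytes.
_CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)', re.IGNORECASE)

def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset declared in the page, or `default` if none is found."""
    match = _CHARSET_RE.search(raw[:2048])
    if not match:
        return default
    name = match.group(1).decode("ascii")
    try:
        codecs.lookup(name)  # make sure Python knows this codec
    except LookupError:
        return default
    return name

# Made-up sample page, encoded as GBK the way a server might send it.
sample = '<html><head><meta charset="gbk"></head><body>中文</body></html>'.encode("gbk")
encoding = sniff_charset(sample)
print(encoding)                 # the declared charset
print(sample.decode(encoding))  # decodes cleanly with the declared charset
```

With requests, the same value could be assigned as resp.encoding = sniff_charset(resp.content) before reading resp.text.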

Fixing â?? garbage when parsing HTML with XPath in Python (the HTML shows &#123;)

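I hit another pitfall while scraping a page: the output kept showing garbled characters like â, and inspecting the HTML revealed sequences of the form &#number;.

The existing answers online about "fixing &# in Python strings" did not really solve it, so I kept digging and, after some effort, got it working.

It also deepened my understanding of ISO and Unicode encodings, so I am sharing it here.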

Problem

When parsing HTML with Python's lxml, the results returned by text() contained garbage such as â:

How the original page displays: [screenshot]

Scraping code:

import requests
from lxml import etree

url = "xxx"

response = requests.request("GET", url)

html = etree.HTML(response.text)

# Call text() directly in the XPath expression
description = html.xpath('//div[@class="xxx"]/div/div//text()')
# Print each text node as-is
for desc in description:
    print(desc)

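Cause

Needless to say, this is an encoding problem: the page's bytes are UTF-8, but the text was decoded with a different codec. Here is how to track it down and fix it.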
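
To see how this kind of garbage arises, here is a minimal offline sketch: UTF-8 bytes decoded with the wrong codec (ISO-8859-1) turn into â-style text, and reversing the wrong step restores them. The sample string is made up and no network request is involved.

```python
# A made-up sample string standing in for the page text.
original = "中文 “quoted”"

# What the server sends: UTF-8 bytes.
raw = original.encode("utf-8")

# What happens when those bytes are decoded with the wrong codec.
garbled = raw.decode("iso-8859-1")
print(repr(garbled))

# Undo the wrong decode, then decode correctly.
restored = garbled.encode("iso-8859-1").decode("utf-8")
print(restored)
```

ISO-8859-1 maps every byte to exactly one character, which is why the wrong decode never fails and can always be reversed.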

Diagnosis and fix

First, check which encoding was used for the returned response:

response = requests.request("GET", url)
# Once the response arrives, check which encoding requests chose for it
print(response.encoding)

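The result is ISO-8859-1, the fallback that requests uses when a text response's Content-Type header carries no charset.

So encode the text back to bytes with that encoding, then decode the bytes as UTF-8: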

html = etree.HTML(response.text)

description = html.xpath('//div[@class="xxx"]/div/div//text()')

for desc in description:
    # print(desc)
    # Per the result above, encode back with ISO-8859-1, then decode as UTF-8
    print(desc.encode("ISO-8859-1").decode("utf-8"))

The result: [screenshot of the re-decoded output]

Complete code:

import requests
from lxml import etree

url = "xxx"

response = requests.request("GET", url)
print(response.encoding)

html = etree.HTML(response.text)

description = html.xpath('//div[@class="xxx"]/div/div//text()')

for desc in description:
    # Encode back with ISO-8859-1, then decode as UTF-8
    print(desc.encode("ISO-8859-1").decode("utf-8"))
    # print(desc)
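
One caveat: this round trip raises an exception when a string was decoded correctly in the first place (it then contains characters outside ISO-8859-1) or when the resulting bytes are not valid UTF-8. A small defensive wrapper, sketched here under the assumption that such strings should simply be left untouched, avoids that. The name repair_mojibake is hypothetical.

```python
def repair_mojibake(text: str) -> str:
    """Undo a UTF-8-read-as-ISO-8859-1 mix-up; return `text` unchanged if it does not apply."""
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

garbled = "中文".encode("utf-8").decode("iso-8859-1")
print(repair_mojibake(garbled))  # repaired
print(repair_mojibake("中文"))    # already correct, returned as-is
print(repair_mojibake("café"))   # not valid UTF-8 once re-encoded, returned as-is
```

This is a heuristic: a genuine ISO-8859-1 string whose bytes happen to form valid UTF-8 would also be rewritten, so it is safest to apply it only to responses known to be affected.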

Summary

Some solutions online use HTMLParser, carried over from Python 2, and others use Python 3's html package; neither worked well here.
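
As a sketch of why the html package alone can fall short in this situation: html.unescape turns &#number; references into characters, but if those numbers are the individual UTF-8 bytes of a character, the result is still garbled and needs the same ISO-8859-1 round trip. The sample strings are made up for illustration.

```python
import html

# Numeric references that are real code points decode directly.
print(html.unescape("&#20013;&#25991;"))

# Here each reference is one UTF-8 byte of the character 中 (E4 B8 AD).
byte_refs = "&#228;&#184;&#173;"
unescaped = html.unescape(byte_refs)
print(repr(unescaped))  # still garbled

# Re-encoding with ISO-8859-1 and decoding as UTF-8 finishes the job.
print(unescaped.encode("iso-8859-1").decode("utf-8"))
```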

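Another approach changes what is passed to the parser: hand lxml the raw response bytes instead of the decoded text, like this: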

import requests
from lxml import etree

url = "xxx"

response = requests.request("GET", url)

# html = etree.HTML(response.text)
html = etree.HTML(response.content)  # pass the raw bytes instead of decoded text

# Call text() directly in the XPath expression
description = html.xpath('//div[@class="xxx"]/div/div//text()')
# Print each text node as-is
for desc in description:
    print(desc)

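This also parses correctly.

The above is based on my personal experience. I hope it serves as a useful reference, and I hope you will continue to support 腳本之家.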
