Fixing XPath Expressions That Fail to Locate the tbody Tag, with Examples

Updated: 2023-09-13 09:28:48  Author: ponponon

This article explains why an XPath expression can fail to locate the tbody tag and how to fix it, with examples, for readers who run into the same problem.

Introduction

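If you scrape with selenium, tbody is always there; if you scrape with requests, it may not be.

When the HTML has no tbody, the browser adds one automatically.

So when you scrape with requests, analyze the HTML tree; when you scrape with selenium, analyze the render tree.

The HTML tree is the HTML shown in the Network tab of the browser's developer tools; the render tree is what the Elements tab shows.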

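My earlier explanation was not quite right, so here is an update to set the record straight.

XPath fails to locate tbody because the markup a page is written in and the tree the browser shows can differ: tbody is not a required tag.

In Chrome's Elements tab, tbody is always present.

The page as its developer wrote it, however, does not necessarily contain tbody.

Whether or not the returned HTML has tbody, Chrome's Elements tab will show one: if it is already there, Chrome leaves it alone; if it is missing, Chrome adds it.

So when you fetch a page with selenium, include tbody in the XPath, because what selenium returns always contains tbody (it returns the content of Chrome's Elements tab).

When you fetch a page with requests, check for yourself whether the source HTML really contains a tbody tag (you can see it in Chrome's Network tab).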
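One way to do that check in code rather than by eye is sketched below. It uses only Python's standard-library html.parser, which reports the tags as they are written and does not add missing ones. The class name TbodyDetector, the function count_tables_and_tbodies, and the sample string are hypothetical names introduced for illustration; in practice the input would be the text returned by requests.

```python
from html.parser import HTMLParser


class TbodyDetector(HTMLParser):
    """Counts <table> and <tbody> start tags in raw, unrendered HTML."""

    def __init__(self):
        super().__init__()
        self.tables = 0
        self.tbodies = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        if tag == "table":
            self.tables += 1
        elif tag == "tbody":
            self.tbodies += 1


def count_tables_and_tbodies(raw_html):
    """Return (number of <table> tags, number of <tbody> tags)."""
    detector = TbodyDetector()
    detector.feed(raw_html)
    detector.close()
    return detector.tables, detector.tbodies


# Hypothetical sample standing in for response.text from requests.
sample = "<table><tr><td>no tbody here</td></tr></table>"
print(count_tables_and_tbodies(sample))  # (1, 0)
```

If the second number is zero, the rule used with requests should leave out the tbody step.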

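Summary: the HTML returned by the server does not necessarily contain a tbody tag (it depends on whether the site's front-end developer wrote one), but the HTML rendered by Chrome always does (if the server's response lacks it, the browser adds it for you).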

The original article follows.
Written on 2019-10-29.

Test library: lxml. Test link: http://www.sxchxx.com/index-13-1075-1.html

Discovering the problem

I like parsing web pages with XPath, but the result is often an empty list.

1.1 etree.HTML

from lxml import etree
import requests

url = 'http://www.sxchxx.com/index-13-1075-1.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36',
}
# Fetch the HTML as the server returns it, without browser rendering.
response = requests.get(url, headers=headers)
parser = etree.HTMLParser(encoding="utf-8")
html = etree.HTML(response.text, parser=parser)
# The rule has no tbody step.
result = html.xpath('//*[@id="large_mid"]/table[2]/tr[3]/td/p//text()')
print(result)

Parsing the page with the code above returns the article text.

Note, however, that the rule contains no tbody step. On the contrary, adding tbody turns the result into an empty list:

html.xpath('//*[@id="large_mid"]/table[2]/tbody/tr[3]/td/p//text()')  # this returns an empty list
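The behaviour can be reproduced without the network. The sketch below parses two small inline tables (made up for illustration, not taken from the test page) with etree.HTML and runs both forms of the rule against each; the helper name match_counts is hypothetical. It relies on lxml keeping the table structure as written rather than inserting tbody the way a browser does.

```python
from lxml import etree

# Hypothetical markup: the same table written without and with tbody.
without_tbody = "<table><tr><td>a</td></tr></table>"
with_tbody = "<table><tbody><tr><td>a</td></tr></tbody></table>"


def match_counts(markup):
    """Return (matches for the rule without tbody, matches with tbody)."""
    tree = etree.HTML(markup)
    direct = tree.xpath("//table/tr/td/text()")
    via_tbody = tree.xpath("//table/tbody/tr/td/text()")
    return len(direct), len(via_tbody)


print(match_counts(without_tbody))  # (1, 0)
print(match_counts(with_tbody))     # (0, 1)
```

Each rule matches only the markup that is written the same way, which is why the choice depends on the HTML being parsed.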

1.2 etree.parse

With etree.parse, the result is exactly the opposite of etree.HTML.

from lxml import etree

# test.html is a copy of the page saved to disk.
parser = etree.HTMLParser(encoding="utf-8")
html = etree.parse('test.html', parser=parser)

# This time the rule includes the tbody step.
content = html.xpath('//*[@id="large_mid"]/table[2]/tbody/tr[3]/td/p//text()')

print(content)

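Save the page as test.html and load it with etree.parse: now the rule needs the tbody step to return the expected result, and without tbody it returns an empty list. As the update at the top explains, the difference comes from the HTML being parsed rather than from the function: test.html evidently contains tbody, while the response fetched with requests does not.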
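If the same rule has to work on both kinds of input, one option is an XPath union that covers the path with and without the tbody step. The helper below is a sketch that uses plain string handling; the name tbody_optional is hypothetical, and it only handles plain /tbody steps that carry no predicate.

```python
import re

# Matches a "/tbody" step that is followed by another step or ends the path.
_TBODY_STEP = re.compile(r"/tbody(?=/|$)")


def tbody_optional(xpath):
    """Return an XPath union matching the path with and without /tbody steps."""
    stripped = _TBODY_STEP.sub("", xpath)
    if stripped == xpath:
        # No plain /tbody step found, so there is nothing to make optional.
        return xpath
    return f"{xpath} | {stripped}"


rule = '//*[@id="large_mid"]/table[2]/tbody/tr[3]/td/p//text()'
print(tbody_optional(rule))
```

The returned expression can be passed to html.xpath(...) in the usual way; `|` is the standard XPath union operator.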

1.3 Code comparison

from lxml import etree
import requests

# Case 1: parse the saved copy of the page; the rule needs tbody.
parser = etree.HTMLParser(encoding="utf-8")
html = etree.parse('test.html', parser=parser)
content = html.xpath('//*[@id="large_mid"]/table[2]/tbody/tr[3]/td/p//text()')
print(content)
print('---------------- separator -------------------')
# Case 2: parse the response fetched with requests; the rule must omit tbody.
url = 'http://www.sxchxx.com/index-13-1075-1.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36',
}
response = requests.get(url, headers=headers)
parser = etree.HTMLParser(encoding="utf-8")
html = etree.HTML(response.text, parser=parser)
content = html.xpath('//*[@id="large_mid"]/table[2]/tr[3]/td/p//text()')
print(content)
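Another way to make the two inputs comparable, sketched here under the assumption that dropping the tbody wrapper is acceptable for the page in question, is to remove tbody from the parsed tree with lxml's etree.strip_tags, which deletes the named tags but keeps their children. The inline tables, the RULE constant, and the function parse_without_tbody are hypothetical and stand in for the saved file and the fetched response.

```python
from lxml import etree


def parse_without_tbody(markup):
    """Parse HTML and unwrap every <tbody>, keeping its rows in place."""
    tree = etree.HTML(markup)
    # strip_tags removes the tbody elements but leaves their children.
    etree.strip_tags(tree, "tbody")
    return tree


# One rule without a tbody step, applied to both kinds of input.
RULE = "//table/tr[2]/td/text()"

raw = "<table><tr><td>r1</td></tr><tr><td>r2</td></tr></table>"
rendered = (
    "<table><tbody><tr><td>r1</td></tr><tr><td>r2</td></tr></tbody></table>"
)

print(parse_without_tbody(raw).xpath(RULE))       # ['r2']
print(parse_without_tbody(rendered).xpath(RULE))  # ['r2']
```

If a table has several tbody sections, stripping them merges their rows, so positional steps such as tr[3] may then point at a different row.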