A Summary of Parsing XML with BeautifulSoup4
Beautiful Soup is a Python library for extracting data from HTML or XML files. Working on top of your favorite parser, it provides idiomatic ways of navigating, searching, and modifying the parse tree.
English documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
Getting Started
The example below uses an HTML snippet based on a passage from "Alice in Wonderland" to walk through the basics of parsing a page with BeautifulSoup:
```python
from bs4 import BeautifulSoup

# A story fragment from "Alice in Wonderland"
html_doc = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

# Build the parse tree
soup = BeautifulSoup(html_doc, "html.parser")

# Pretty-print the document
# print(soup.prettify())

# The first <title> tag
soup.title
# <title>The Dormouse's story</title>

# Its tag name
soup.title.name
# title

# Its text content
soup.title.string
# The Dormouse's story

# The name of its parent tag
soup.title.parent.name
# head

# The first <p> tag
soup.p
# <p class="title"><b>The Dormouse's story</b></p>

# Its class attribute
soup.p['class']
# ['title']

# The first <a> tag
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# All <a> tags
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# The href attribute of every <a> tag
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

# The <a> tag with id="link3"
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

# The text content of the whole tree
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
# ...
```
Parsers
Besides the HTML parser in Python's standard library, Beautiful Soup also supports several third-party parsers; one of them is lxml.
The table below lists the main parsers along with their advantages and disadvantages:
| Parser | Typical usage | Advantages | Disadvantages |
|---|---|---|---|
| Python standard library | `BeautifulSoup(markup, "html.parser")` | Built into Python; decent speed; lenient with bad markup | Much less lenient in versions before Python 2.7.3 and 3.2.2 |
| lxml HTML parser | `BeautifulSoup(markup, "lxml")` | Very fast; lenient | Requires a C library |
| lxml XML parser | `BeautifulSoup(markup, ["lxml", "xml"])` or `BeautifulSoup(markup, "xml")` | Very fast; the only parser that supports XML | Requires a C library |
| html5lib | `BeautifulSoup(markup, "html5lib")` | Most lenient; parses pages the same way a browser does; creates valid HTML5 | Very slow; external Python dependency |
lxml is the recommended parser because it is much faster. If you are running a Python version earlier than 2.7.3 or 3.2.2, installing lxml or html5lib is essential, because the HTML parser built into the standard library of those versions is not reliable enough.
Note: if an HTML or XML document is not well-formed, different parsers may return different results for it.
Differences Between Parsers
Beautiful Soup presents the same interface for every parser, but the parsers themselves differ: the same document parsed with different parsers can produce differently structured trees. The biggest differences are between the HTML parsers and the XML parser. Here is a fragment parsed as HTML:
```python
html_soup = BeautifulSoup("<a><b/></a>", "lxml")
print(html_soup)
# <html><body><a><b></b></a></body></html>
```
Because the empty tag <b/> is not valid HTML, the parser turns it into a <b></b> tag pair.
The same document parsed as XML looks like this (parsing XML requires the lxml library). Note that the empty tag <b/> is preserved, an XML declaration is prepended to the document, and the document is not wrapped inside an <html> tag:
```python
xml_soup = BeautifulSoup("<a><b/></a>", "xml")
print(xml_soup)
# <?xml version="1.0" encoding="utf-8"?>
# <a><b/></a>
```
HTML parsers also differ from one another. Given a well-formed HTML document, there is no difference between them other than speed; each returns the correct document tree.
But if the document is not well-formed, different parsers may return different results. In the following example, lxml parses an invalid fragment and simply drops the dangling </p> tag:
```python
soup = BeautifulSoup("<a></p>", "lxml")
print(soup)
# <html><body><a></a></body></html>
```
Parsing the same document with html5lib gives a different result:
```python
soup = BeautifulSoup("<a></p>", "html5lib")
print(soup)
# <html><head></head><body><a><p></p></a></body></html>
```
Instead of dropping the </p> tag, html5lib pairs it with an opening <p> tag, and it also adds an empty <head> tag to the tree.
Python's built-in parser gives yet another result:
```python
soup = BeautifulSoup("<a></p>", "html.parser")
print(soup)
# <a></a>
```
Like lxml, the built-in parser ignores the </p> tag. Unlike html5lib, it makes no attempt to build a standards-compliant document or to wrap the fragment in a <body> tag; unlike lxml, it does not even add an <html> tag.
Since the fragment "<a></p>" is invalid, every one of these treatments counts as "correct". html5lib implements part of the HTML5 standard, so it comes closest, but all of the resulting trees are legitimate.
Which parser you use can therefore change what your code produces. If you distribute code that uses BeautifulSoup, it is best to state which parser it was written against, to spare everyone unnecessary trouble.
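One way to make that advice concrete in shared code is to name the parser explicitly and fail loudly when it is missing. The loop below is a minimal sketch; which third-party parsers are actually installed depends on the environment:

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<a></p>"

# Naming the parser makes results reproducible on any machine; relying on
# the default silently picks whichever "best" parser happens to be installed.
for features in ("html.parser", "lxml", "html5lib"):
    try:
        print(features, "->", BeautifulSoup(markup, features))
    except FeatureNotFound:
        print(features, "-> not installed")
```

On a machine with all three parsers installed, this prints the three different trees shown in the examples above, side by side.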
Creating a Document Object
To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open file handle:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("index.html"))
soup = BeautifulSoup("<html>data</html>")
```
First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:
```python
soup = BeautifulSoup("Sacr&eacute; bleu!", "lxml")
print(soup)
# <html><body><p>Sacré bleu!</p></body></html>
```
然后,Beautiful Soup選擇最合適的解析器來(lái)解析這段文檔,如果手動(dòng)指定解析器那么Beautiful Soup會(huì)選擇指定的解析器來(lái)解析文檔。
Kinds of Objects
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. Every node is a Python object, and there are only four kinds: Tag, NavigableString, BeautifulSoup, and Comment.
Tag
A Tag object corresponds to a tag in the original XML or HTML document:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")

# The first <b> tag
tag = soup.b

# Its type
type(tag)
# <class 'bs4.element.Tag'>

# The tag's name
tag.name
# b

# Rename the tag
tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>

# Read the class attribute
tag['class']
# ['boldest']

# Change the class attribute
tag['class'] = 'verybold'

# Read it back
tag.get('class')
# verybold

# Add an id attribute
tag['id'] = 'title'
tag
# <blockquote class="verybold" id="title">Extremely bold</blockquote>

# All of the tag's attributes
tag.attrs
# {'class': ['verybold'], 'id': 'title'}

# Delete the id attribute
del tag['id']
tag
# <blockquote class="verybold">Extremely bold</blockquote>
```
NavigableString
Strings are usually contained inside tags. Beautiful Soup uses the NavigableString class to wrap a tag's text:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")

# The first <b> tag
tag = soup.b

# Its text content
tag.string
# Extremely bold

# The type of that text
type(tag.string)
# <class 'bs4.element.NavigableString'>
```
BeautifulSoup
The BeautifulSoup object represents the parsed document as a whole. Most of the time you can treat it as a Tag object: it supports most of the methods described in the sections on navigating and searching the tree.
Since the BeautifulSoup object does not correspond to an actual HTML or XML tag, it has no name and no attributes. But it is sometimes useful to look at its .name, so it has been given the special value "[document]":
```python
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
soup.name
# [document]
```
Comments and Special Strings
Tag, NavigableString, and BeautifulSoup cover almost everything you will see in an HTML or XML file, but there are a few leftover cases. The one you are most likely to run into is the comment:
```python
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>
```
A Comment object is just a special type of NavigableString:
```python
comment
# Hey, buddy. Want to buy a used parser?
```
But when it appears as part of an HTML document, a Comment is rendered with special formatting:
```python
print(soup.b.prettify())
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>
```
Beautiful Soup defines several other types that may show up in an XML document: CData, ProcessingInstruction, Declaration, and Doctype. Like Comment, these classes are all subclasses of NavigableString that add something extra to the string. Here is an example that replaces the comment with a CDATA block:
```python
from bs4 import CData

cdata = CData("A CDATA block")
comment.replace_with(cdata)
print(soup.b.prettify())
# <b>
#  <![CDATA[A CDATA block]]>
# </b>
```
Child Nodes
A tag may contain any number of strings or other tags; these are the tag's children. Beautiful Soup provides many attributes for navigating and iterating over a tag's children.
Note: Beautiful Soup strings do not support these attributes, because a string cannot have children.
Continuing with the "Alice in Wonderland" document from above:
```python
from bs4 import BeautifulSoup

# A story fragment from "Alice in Wonderland"
html_doc = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

# Build the parse tree
soup = BeautifulSoup(html_doc, "html.parser")

# Dotted attribute access returns the first tag with that name
soup.body.p.b
# <b>The Dormouse's story</b>

# All <a> tags
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# .contents returns a tag's children as a list
soup.head.contents
# [<title>The Dormouse's story</title>]

# .children is a generator over a tag's direct children
for child in soup.head.children:
    print(child)
# <title>The Dormouse's story</title>

# .descendants is a generator over all of a tag's descendants
for descendant in soup.head.descendants:
    print(descendant)
# <title>The Dormouse's story</title>
# The Dormouse's story

# .string returns a tag's single NavigableString child
soup.head.title.string
# The Dormouse's story

# .string also recurses through a tag's single child tag
soup.head.string
# The Dormouse's story

# .strings yields every string in the tree
for string in soup.strings:
    print(repr(string))

# .stripped_strings does the same, with extra whitespace stripped
for string in soup.stripped_strings:
    print(repr(string))
```
Note: if a tag contains more than one child, .string has no way to decide which child's content to return, so it returns None.
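A minimal sketch of that None case, using a made-up two-child fragment; .strings still yields every string in the subtree:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i></p>", "html.parser")

# The <p> tag has two children, so .string cannot pick one and returns None
print(soup.p.string)
# None

# .strings still yields every string in the subtree
print(list(soup.p.strings))
# ['one', 'two']
```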
Parent Nodes
Every tag and every string has a parent. Using the "Alice in Wonderland" document again:
```python
from bs4 import BeautifulSoup

# A story fragment from "Alice in Wonderland"
html_doc = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

# Build the parse tree
soup = BeautifulSoup(html_doc, "html.parser")

# .parent returns the parent of the <title> tag
soup.title.parent
# <head><title>The Dormouse's story</title></head>

# The parent of the string inside <title> is the <title> tag itself
soup.title.string.parent
# <title>The Dormouse's story</title>

# The parent of the top-level <html> tag is the BeautifulSoup object
type(soup.html.parent)
# <class 'bs4.BeautifulSoup'>

# The BeautifulSoup object's .parent is None
soup.parent

# .parents iterates over all of an element's ancestors
for parent in soup.a.parents:
    print(parent.name)
# p
# body
# html
# [document]
```
Sibling Nodes
To demonstrate how BeautifulSoup navigates between sibling nodes, the example document needs some changes: the newlines, strings, and tags between the <a> tags are removed so that they sit directly next to each other. Here is the example:
```python
from bs4 import BeautifulSoup

# A "Schindler's List" snippet with no whitespace between the <a> tags
html_doc = """
<html>
<body>
<p class="title"><b>Schindler's List</b></p>
<p class="names"><a id="name1">Oskar Schindler</a><a id="name2">Itzhak Stern</a><a id="name3">Helen Hirsch</a></p>
</body>
</html>
"""

# Build the parse tree
soup = BeautifulSoup(html_doc, "html.parser")

# The <a> tag with id="name2"
name2 = soup.find("a", {"id": "name2"})
# <a id="name2">Itzhak Stern</a>

# Its previous sibling
name1 = name2.previous_sibling
# <a id="name1">Oskar Schindler</a>

# Its next sibling
name3 = name2.next_sibling
# <a id="name3">Helen Hirsch</a>

name1.previous_sibling
# None

name3.next_sibling
# None

# .next_siblings iterates over the siblings after a node
for sibling in soup.find("a", {"id": "name1"}).next_siblings:
    print(repr(sibling))
# <a id="name2">Itzhak Stern</a>
# <a id="name3">Helen Hirsch</a>

# .previous_siblings iterates over the siblings before a node
for sibling in soup.find("a", {"id": "name3"}).previous_siblings:
    print(repr(sibling))
# <a id="name2">Itzhak Stern</a>
# <a id="name1">Oskar Schindler</a>
```
Note: strings, characters, and newlines between tags also count as nodes.
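A small sketch of that pitfall, with a made-up two-paragraph fragment: when a newline sits between two tags, .next_sibling returns the newline string, not the next tag:

```python
from bs4 import BeautifulSoup

# The newline between the two <p> tags is itself a NavigableString node
soup = BeautifulSoup("<body><p>one</p>\n<p>two</p></body>", "html.parser")

# .next_sibling of the first <p> is the newline, not the second <p>
print(repr(soup.p.next_sibling))
# '\n'

# The tag you probably wanted is one more step over
print(soup.p.next_sibling.next_sibling)
# <p>two</p>
```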
Going Back and Forth
Continuing with the HTML document from the "Sibling Nodes" section above:
```python
from bs4 import BeautifulSoup

# A "Schindler's List" snippet with no whitespace between the <a> tags
html_doc = """
<html>
<body>
<p class="title"><b>Schindler's List</b></p>
<p class="names"><a id="name1">Oskar Schindler</a><a id="name2">Itzhak Stern</a><a id="name3">Helen Hirsch</a></p>
</body>
</html>
"""

# Build the parse tree
soup = BeautifulSoup(html_doc, "html.parser")

# The <a> tag with id="name2"
name2 = soup.find("a", {"id": "name2"})
# <a id="name2">Itzhak Stern</a>

# The element parsed immediately before it
name2.previous_element
# Oskar Schindler

# Two elements back
name2.previous_element.previous_element
# <a id="name1">Oskar Schindler</a>

# The element parsed immediately after it
name2.next_element
# Itzhak Stern

# Two elements forward
name2.next_element.next_element
# <a id="name3">Helen Hirsch</a>

# .next_elements iterates over everything parsed after a node
for element in soup.find("a", {"id": "name1"}).next_elements:
    print(repr(element))
# 'Oskar Schindler'
# <a id="name2">Itzhak Stern</a>
# 'Itzhak Stern'
# <a id="name3">Helen Hirsch</a>
# 'Helen Hirsch'
# '\n'
# '\n'
# '\n'

# .previous_elements iterates over everything parsed before a node
for element in soup.find("a", {"id": "name1"}).previous_elements:
    print(repr(element))
# <p class="names"><a id="name1">Oskar Schindler</a><a id="name2">Itzhak Stern</a><a id="name3">Helen Hirsch</a></p>
# '\n'
# "Schindler's List"
# <b>Schindler's List</b>
# <p class="title"><b>Schindler's List</b></p>
# '\n'
# <body>
# <p class="title"><b>Schindler's List</b></p>
# <p class="names"><a id="name1">Oskar Schindler</a><a id="name2">Itzhak Stern</a><a id="name3">Helen Hirsch</a></p>
# </body>
# '\n'
# <html>
# <body>
# <p class="title"><b>Schindler's List</b></p>
# <p class="names"><a id="name1">Oskar Schindler</a><a id="name2">Itzhak Stern</a><a id="name3">Helen Hirsch</a></p>
# </body>
# </html>
# '\n'
```
Searching the Tree
```
find_all(name, attrs, recursive, text, **kwargs)
find(name, attrs, recursive, text, **kwargs)
find_parents(name, attrs, recursive, text, **kwargs)
find_parent(name, attrs, recursive, text, **kwargs)
find_next_siblings(name, attrs, recursive, text, **kwargs)
find_next_sibling(name, attrs, recursive, text, **kwargs)
find_previous_siblings(name, attrs, recursive, text, **kwargs)
find_previous_sibling(name, attrs, recursive, text, **kwargs)
find_all_next(name, attrs, recursive, text, **kwargs)
find_next(name, attrs, recursive, text, **kwargs)
find_all_previous(name, attrs, recursive, text, **kwargs)
find_previous(name, attrs, recursive, text, **kwargs)
```
Beautiful Soup defines many search methods; the examples below focus on the usage of find_all().
```python
from bs4 import BeautifulSoup
import re

# A story fragment from "Alice in Wonderland"
html_doc = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

# Build the parse tree
soup = BeautifulSoup(html_doc, "html.parser")

# Pass a string to match tags by name (<b> here)
soup.find_all('b')
# [<b>The Dormouse's story</b>]

# Pass two strings: <p> tags with class="title"
soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

# Tags with id="link2"
soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

# Tags whose href matches "elsie" and whose id is "link1"
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

# The attrs dict works too
print(soup.find_all(attrs={"id": "link1"}))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

# <a> tags with class="sister" (class_ avoids Python's reserved word)
soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

# Tags whose class attribute is six characters long
soup.find_all(class_=has_six_characters)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Match strings instead of tags
soup.find_all(text=["Tillie", "Elsie", "Lacie"])

# Only the first two <a> tags
soup.find_all("a", limit=2)

# Direct children only, no recursive descent
soup.html.find_all("title", recursive=False)

# Filter with a CSS selector
soup.select("head > title")

# Pass a regular expression to match tag names starting with "b"
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

# Tag names containing the letter "t"
for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title

# Pass a list to match several tag names (<a> and <b>)
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Pass True to match every tag (but none of the strings)
for tag in soup.find_all(True):
    print(tag.name)

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

# Pass a function: tags with a class attribute but no id attribute
soup.find_all(has_class_but_no_id)
```
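The other find_* methods listed at the top of this section take the same arguments as find_all() but search different parts of the tree. A brief sketch, using a stripped-down version of the same story fragment (the minimal markup here is made up for illustration):

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="story"><a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a></p></body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

link2 = soup.find(id="link2")

# find_parent searches up through the ancestors
print(link2.find_parent("p")["class"])
# ['story']

# find_next_sibling / find_previous_sibling search along the siblings
print(link2.find_next_sibling("a"))
# <a id="link3">Tillie</a>
print(link2.find_previous_sibling("a"))
# <a id="link1">Elsie</a>

# find_next searches forward in document order; string=True matches any string
print(link2.find_next(string=True))
# Lacie
```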
This concludes the summary of parsing XML with BeautifulSoup4. For more detail, see the Beautiful Soup documentation linked at the top of this article.