A Summary of Parsing XML with BeautifulSoup4
Beautiful Soup is a Python library for extracting data from HTML or XML files. Working on top of your favorite parser, it provides idiomatic ways of navigating, searching, and modifying the parse tree.
English documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
Getting Started
The example below uses an HTML snippet based on a passage from "Alice in Wonderland" to walk through the basics of parsing a page with BeautifulSoup:
```python
from bs4 import BeautifulSoup

# A story fragment from "Alice in Wonderland"
html_doc = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

# Build the parse tree
soup = BeautifulSoup(html_doc, "html.parser")

# Pretty-print the document
# print(soup.prettify())

# The first <title> tag
soup.title
# <title>The Dormouse's story</title>

# Its tag name
soup.title.name
# title

# Its text content
soup.title.string
# The Dormouse's story

# The name of its parent tag
soup.title.parent.name
# head

# The first <p> tag
soup.p
# <p class="title"><b>The Dormouse's story</b></p>

# Its class attribute
soup.p['class']
# ['title']

# The first <a> tag
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# All <a> tags
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# The href attribute of every <a> tag
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

# The <a> tag with id="link3"
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

# The text content of the whole tree
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
# ...
```
Parsers
Besides the HTML parser in Python's standard library, Beautiful Soup also supports several third-party parsers; one of them is lxml.
The table below lists the main parsers along with their advantages and disadvantages:
| Parser | Typical usage | Advantages | Disadvantages |
|---|---|---|---|
| Python standard library | `BeautifulSoup(markup, "html.parser")` | Built into Python; decent speed; lenient with bad markup | Much less lenient in versions before Python 2.7.3 and 3.2.2 |
| lxml HTML parser | `BeautifulSoup(markup, "lxml")` | Very fast; lenient | Requires a C library |
| lxml XML parser | `BeautifulSoup(markup, ["lxml", "xml"])` or `BeautifulSoup(markup, "xml")` | Very fast; the only parser that supports XML | Requires a C library |
| html5lib | `BeautifulSoup(markup, "html5lib")` | Most lenient; parses pages the same way a browser does; creates valid HTML5 | Very slow; external Python dependency |
lxml is the recommended parser because it is much faster. If you are running a Python version earlier than 2.7.3 or 3.2.2, installing lxml or html5lib is essential, because the HTML parser built into the standard library of those versions is not reliable enough.
Note: if an HTML or XML document is not well-formed, different parsers may return different results for it.
Differences Between Parsers
Beautiful Soup presents the same interface for every parser, but the parsers themselves differ: the same document parsed with different parsers can produce differently structured trees. The biggest differences are between the HTML parsers and the XML parser. Here is a fragment parsed as HTML:
```python
html_soup = BeautifulSoup("<a><b/></a>", "lxml")
print(html_soup)
# <html><body><a><b></b></a></body></html>
```
Because the empty tag <b/> is not valid HTML, the parser turns it into a <b></b> tag pair.
The same document parsed as XML looks like this (parsing XML requires the lxml library). Note that the empty tag <b/> is preserved, an XML declaration is prepended to the document, and the document is not wrapped inside an <html> tag:
```python
xml_soup = BeautifulSoup("<a><b/></a>", "xml")
print(xml_soup)
# <?xml version="1.0" encoding="utf-8"?>
# <a><b/></a>
```
HTML parsers also differ from one another. Given a well-formed HTML document, there is no difference between them other than speed; each returns the correct document tree.
But if the document is not well-formed, different parsers may return different results. In the following example, lxml parses an invalid fragment and simply drops the dangling </p> tag:
```python
soup = BeautifulSoup("<a></p>", "lxml")
print(soup)
# <html><body><a></a></body></html>
```
Parsing the same document with html5lib gives a different result:
```python
soup = BeautifulSoup("<a></p>", "html5lib")
print(soup)
# <html><head></head><body><a><p></p></a></body></html>
```
Instead of dropping the </p> tag, html5lib pairs it with an opening <p> tag, and it also adds an empty <head> tag to the tree.
Python's built-in parser gives yet another result:
```python
soup = BeautifulSoup("<a></p>", "html.parser")
print(soup)
# <a></a>
```
Like lxml, the built-in parser ignores the </p> tag. Unlike html5lib, it makes no attempt to build a standards-compliant document or to wrap the fragment in a <body> tag; unlike lxml, it does not even add an <html> tag.
Since the fragment "<a></p>" is invalid, every one of these treatments counts as "correct". html5lib implements part of the HTML5 standard, so it comes closest, but all of the resulting trees are legitimate.
Which parser you use can therefore change what your code produces. If you distribute code that uses BeautifulSoup, it is best to state which parser it was written against, to spare everyone unnecessary trouble.
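One way to make that advice concrete in shared code is to name the parser explicitly and fail loudly when it is missing. The loop below is a minimal sketch; which third-party parsers are actually installed depends on the environment:

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<a></p>"

# Naming the parser makes results reproducible on any machine; relying on
# the default silently picks whichever "best" parser happens to be installed.
for features in ("html.parser", "lxml", "html5lib"):
    try:
        print(features, "->", BeautifulSoup(markup, features))
    except FeatureNotFound:
        print(features, "-> not installed")
```

On a machine with all three parsers installed, this prints the three different trees shown in the examples above, side by side.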
Creating a Document Object
To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open file handle:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("index.html"))
soup = BeautifulSoup("<html>data</html>")
```
First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:
```python
soup = BeautifulSoup("Sacr&eacute; bleu!", "lxml")
print(soup)
# <html><body><p>Sacré bleu!</p></body></html>
```
然后,Beautiful Soup選擇最合適的解析器來(lái)解析這段文檔,如果手動(dòng)指定解析器那么Beautiful Soup會(huì)選擇指定的解析器來(lái)解析文檔。
Kinds of Objects
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. Every node is a Python object, and there are only four kinds: Tag, NavigableString, BeautifulSoup, and Comment.
Tag
A Tag object corresponds to a tag in the original XML or HTML document:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")

# The first <b> tag
tag = soup.b

# Its type
type(tag)
# <class 'bs4.element.Tag'>

# The tag's name
tag.name
# b

# Rename the tag
tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>

# Read the class attribute
tag['class']
# ['boldest']

# Change the class attribute
tag['class'] = 'verybold'

# Read it back
tag.get('class')
# verybold

# Add an id attribute
tag['id'] = 'title'
tag
# <blockquote class="verybold" id="title">Extremely bold</blockquote>

# All of the tag's attributes
tag.attrs
# {'class': ['verybold'], 'id': 'title'}

# Delete the id attribute
del tag['id']
tag
# <blockquote class="verybold">Extremely bold</blockquote>
```
NavigableString
Strings are usually contained inside tags. Beautiful Soup uses the NavigableString class to wrap a tag's text:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")

# The first <b> tag
tag = soup.b

# Its text content
tag.string
# Extremely bold

# The type of that text
type(tag.string)
# <class 'bs4.element.NavigableString'>
```
BeautifulSoup
The BeautifulSoup object represents the parsed document as a whole. Most of the time you can treat it as a Tag object: it supports most of the methods described in the sections on navigating and searching the tree.
Since the BeautifulSoup object does not correspond to an actual HTML or XML tag, it has no name and no attributes. But it is sometimes useful to look at its .name, so it has been given the special value "[document]":
```python
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
soup.name
# [document]
```
Comments and Special Strings
Tag, NavigableString, and BeautifulSoup cover almost everything you will see in an HTML or XML file, but there are a few leftover cases. The one you are most likely to run into is the comment:
```python
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>
```
A Comment object is just a special type of NavigableString:
```python
comment
# Hey, buddy. Want to buy a used parser?
```
But when it appears as part of an HTML document, a Comment is rendered with special formatting:
```python
print(soup.b.prettify())
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>
```
Beautiful Soup defines several other types that may show up in an XML document: CData, ProcessingInstruction, Declaration, and Doctype. Like Comment, these classes are all subclasses of NavigableString that add something extra to the string. Here is an example that replaces the comment with a CDATA block:
```python
from bs4 import CData

cdata = CData("A CDATA block")
comment.replace_with(cdata)
print(soup.b.prettify())
# <b>
#  <![CDATA[A CDATA block]]>
# </b>
```
Child Nodes
A tag may contain any number of strings or other tags; these are the tag's children. Beautiful Soup provides many attributes for navigating and iterating over a tag's children.
Note: Beautiful Soup strings do not support these attributes, because a string cannot have children.
Continuing with the "Alice in Wonderland" document from above:
```python
from bs4 import BeautifulSoup

# A story fragment from "Alice in Wonderland"
html_doc = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

# Build the parse tree
soup = BeautifulSoup(html_doc, "html.parser")

# Dotted attribute access returns the first tag with that name
soup.body.p.b
# <b>The Dormouse's story</b>

# All <a> tags
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# .contents returns a tag's children as a list
soup.head.contents
# [<title>The Dormouse's story</title>]

# .children is a generator over a tag's direct children
for child in soup.head.children:
    print(child)
# <title>The Dormouse's story</title>

# .descendants is a generator over all of a tag's descendants
for descendant in soup.head.descendants:
    print(descendant)
# <title>The Dormouse's story</title>
# The Dormouse's story

# .string returns a tag's single NavigableString child
soup.head.title.string
# The Dormouse's story

# .string also recurses through a tag's single child tag
soup.head.string
# The Dormouse's story

# .strings yields every string in the tree
for string in soup.strings:
    print(repr(string))

# .stripped_strings does the same, with extra whitespace stripped
for string in soup.stripped_strings:
    print(repr(string))
```
Note: if a tag contains more than one child, .string has no way to decide which child's content to return, so it returns None.
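A minimal sketch of that None case, using a made-up two-child fragment; .strings still yields every string in the subtree:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i></p>", "html.parser")

# The <p> tag has two children, so .string cannot pick one and returns None
print(soup.p.string)
# None

# .strings still yields every string in the subtree
print(list(soup.p.strings))
# ['one', 'two']
```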
Parent Nodes
Every tag and every string has a parent. Using the "Alice in Wonderland" document again:
```python
from bs4 import BeautifulSoup

# A story fragment from "Alice in Wonderland"
html_doc = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

# Build the parse tree
soup = BeautifulSoup(html_doc, "html.parser")

# .parent returns the parent of the <title> tag
soup.title.parent
# <head><title>The Dormouse's story</title></head>

# The parent of the string inside <title> is the <title> tag itself
soup.title.string.parent
# <title>The Dormouse's story</title>

# The parent of the top-level <html> tag is the BeautifulSoup object
type(soup.html.parent)
# <class 'bs4.BeautifulSoup'>

# The BeautifulSoup object's .parent is None
soup.parent

# .parents iterates over all of an element's ancestors
for parent in soup.a.parents:
    print(parent.name)
# p
# body
# html
# [document]
```
Sibling Nodes
To demonstrate how BeautifulSoup navigates between sibling nodes, the example document needs some changes: the newlines, strings, and tags between the <a> tags are removed so that they sit directly next to each other. Here is the example:
```python
from bs4 import BeautifulSoup

# A "Schindler's List" snippet with no whitespace between the <a> tags
html_doc = """
<html>
<body>
<p class="title"><b>Schindler's List</b></p>
<p class="names"><a id="name1">Oskar Schindler</a><a id="name2">Itzhak Stern</a><a id="name3">Helen Hirsch</a></p>
</body>
</html>
"""

# Build the parse tree
soup = BeautifulSoup(html_doc, "html.parser")

# The <a> tag with id="name2"
name2 = soup.find("a", {"id": "name2"})
# <a id="name2">Itzhak Stern</a>

# Its previous sibling
name1 = name2.previous_sibling
# <a id="name1">Oskar Schindler</a>

# Its next sibling
name3 = name2.next_sibling
# <a id="name3">Helen Hirsch</a>

name1.previous_sibling
# None

name3.next_sibling
# None

# .next_siblings iterates over the siblings after a node
for sibling in soup.find("a", {"id": "name1"}).next_siblings:
    print(repr(sibling))
# <a id="name2">Itzhak Stern</a>
# <a id="name3">Helen Hirsch</a>

# .previous_siblings iterates over the siblings before a node
for sibling in soup.find("a", {"id": "name3"}).previous_siblings:
    print(repr(sibling))
# <a id="name2">Itzhak Stern</a>
# <a id="name1">Oskar Schindler</a>
```
Note: strings, characters, and newlines between tags also count as nodes.
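A small sketch of that pitfall, with a made-up two-paragraph fragment: when a newline sits between two tags, .next_sibling returns the newline string, not the next tag:

```python
from bs4 import BeautifulSoup

# The newline between the two <p> tags is itself a NavigableString node
soup = BeautifulSoup("<body><p>one</p>\n<p>two</p></body>", "html.parser")

# .next_sibling of the first <p> is the newline, not the second <p>
print(repr(soup.p.next_sibling))
# '\n'

# The tag you probably wanted is one more step over
print(soup.p.next_sibling.next_sibling)
# <p>two</p>
```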
Going Back and Forth
Continuing with the HTML document from the "Sibling Nodes" section above:
```python
from bs4 import BeautifulSoup

# A "Schindler's List" snippet with no whitespace between the <a> tags
html_doc = """
<html>
<body>
<p class="title"><b>Schindler's List</b></p>
<p class="names"><a id="name1">Oskar Schindler</a><a id="name2">Itzhak Stern</a><a id="name3">Helen Hirsch</a></p>
</body>
</html>
"""

# Build the parse tree
soup = BeautifulSoup(html_doc, "html.parser")

# The <a> tag with id="name2"
name2 = soup.find("a", {"id": "name2"})
# <a id="name2">Itzhak Stern</a>

# The element parsed immediately before it
name2.previous_element
# Oskar Schindler

# Two elements back
name2.previous_element.previous_element
# <a id="name1">Oskar Schindler</a>

# The element parsed immediately after it
name2.next_element
# Itzhak Stern

# Two elements forward
name2.next_element.next_element
# <a id="name3">Helen Hirsch</a>

# .next_elements iterates over everything parsed after a node
for element in soup.find("a", {"id": "name1"}).next_elements:
    print(repr(element))
# 'Oskar Schindler'
# <a id="name2">Itzhak Stern</a>
# 'Itzhak Stern'
# <a id="name3">Helen Hirsch</a>
# 'Helen Hirsch'
# '\n'
# '\n'
# '\n'

# .previous_elements iterates over everything parsed before a node
for element in soup.find("a", {"id": "name1"}).previous_elements:
    print(repr(element))
# <p class="names"><a id="name1">Oskar Schindler</a><a id="name2">Itzhak Stern</a><a id="name3">Helen Hirsch</a></p>
# '\n'
# "Schindler's List"
# <b>Schindler's List</b>
# <p class="title"><b>Schindler's List</b></p>
# '\n'
# <body>
# <p class="title"><b>Schindler's List</b></p>
# <p class="names"><a id="name1">Oskar Schindler</a><a id="name2">Itzhak Stern</a><a id="name3">Helen Hirsch</a></p>
# </body>
# '\n'
# <html>
# <body>
# <p class="title"><b>Schindler's List</b></p>
# <p class="names"><a id="name1">Oskar Schindler</a><a id="name2">Itzhak Stern</a><a id="name3">Helen Hirsch</a></p>
# </body>
# </html>
# '\n'
```
Searching the Tree
```
find_all(name, attrs, recursive, text, **kwargs)
find(name, attrs, recursive, text, **kwargs)
find_parents(name, attrs, recursive, text, **kwargs)
find_parent(name, attrs, recursive, text, **kwargs)
find_next_siblings(name, attrs, recursive, text, **kwargs)
find_next_sibling(name, attrs, recursive, text, **kwargs)
find_previous_siblings(name, attrs, recursive, text, **kwargs)
find_previous_sibling(name, attrs, recursive, text, **kwargs)
find_all_next(name, attrs, recursive, text, **kwargs)
find_next(name, attrs, recursive, text, **kwargs)
find_all_previous(name, attrs, recursive, text, **kwargs)
find_previous(name, attrs, recursive, text, **kwargs)
```
Beautiful Soup defines many search methods; the examples below focus on the usage of find_all().
```python
from bs4 import BeautifulSoup
import re

# A story fragment from "Alice in Wonderland"
html_doc = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

# Build the parse tree
soup = BeautifulSoup(html_doc, "html.parser")

# Pass a string to match tags by name (<b> here)
soup.find_all('b')
# [<b>The Dormouse's story</b>]

# Pass two strings: <p> tags with class="title"
soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

# Tags with id="link2"
soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

# Tags whose href matches "elsie" and whose id is "link1"
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

# The attrs dict works too
print(soup.find_all(attrs={"id": "link1"}))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

# <a> tags with class="sister" (class_ avoids Python's reserved word)
soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

# Tags whose class attribute is six characters long
soup.find_all(class_=has_six_characters)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Match strings instead of tags
soup.find_all(text=["Tillie", "Elsie", "Lacie"])

# Only the first two <a> tags
soup.find_all("a", limit=2)

# Direct children only, no recursive descent
soup.html.find_all("title", recursive=False)

# Filter with a CSS selector
soup.select("head > title")

# Pass a regular expression to match tag names starting with "b"
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

# Tag names containing the letter "t"
for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title

# Pass a list to match several tag names (<a> and <b>)
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Pass True to match every tag (but none of the strings)
for tag in soup.find_all(True):
    print(tag.name)

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

# Pass a function: tags with a class attribute but no id attribute
soup.find_all(has_class_but_no_id)
```
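The other find_* methods listed at the top of this section take the same arguments as find_all() but search different parts of the tree. A brief sketch, using a stripped-down version of the same story fragment (the minimal markup here is made up for illustration):

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="story"><a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a></p></body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

link2 = soup.find(id="link2")

# find_parent searches up through the ancestors
print(link2.find_parent("p")["class"])
# ['story']

# find_next_sibling / find_previous_sibling search along the siblings
print(link2.find_next_sibling("a"))
# <a id="link3">Tillie</a>
print(link2.find_previous_sibling("a"))
# <a id="link1">Elsie</a>

# find_next searches forward in document order; string=True matches any string
print(link2.find_next(string=True))
# Lacie
```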
This concludes the summary of parsing XML with BeautifulSoup4. For more detail, see the Beautiful Soup documentation linked at the top of this article.