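Python Web Scraping: A Detailed Guide to the BeautifulSoup select Method
This article explains in detail how to use the select method of BeautifulSoup when writing a Python scraper. The details are as follows: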
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a rel="external nofollow" class="sister" id="link1"><!-- Elsie --></a>,
<a rel="external nofollow" class="sister" id="link2">Lacie</a> and
<a rel="external nofollow" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Parse the sample with Python's built-in parser so that `soup` is available below.
soup = BeautifulSoup(html, "html.parser")
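When writing CSS, a tag name is written with no prefix, a class name is prefixed with a dot, and an id name is prefixed with #. Elements can be filtered here in a similar way using soup.select(), which returns a list.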
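As a minimal sketch of that list-like return value, the snippet below runs select() on a small document. The markup and the variable names (demo_html, demo_soup, notes) are made up for illustration and are not part of the article's sample; Python's built-in html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
demo_html = '<div><p class="note" id="first">one</p><p class="note">two</p></div>'
demo_soup = BeautifulSoup(demo_html, "html.parser")

notes = demo_soup.select(".note")      # every element with class "note"
missing = demo_soup.select("#absent")  # no match gives an empty list, not None

note_texts = [tag.get_text() for tag in notes]
first_id = notes[0]["id"]              # items are Tag objects; index them like a list
print(len(notes), note_texts, first_id, missing)
```

Because the result behaves as a list, the usual idioms (len(), indexing, iteration, truthiness checks) apply directly.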
(1) Find by tag name
print(soup.select('title'))
# [<title>The Dormouse's story</title>]

print(soup.select('a'))
# [<a class="sister" id="link1" rel="external nofollow"><!-- Elsie --></a>, <a class="sister" id="link2" rel="external nofollow">Lacie</a>, <a class="sister" id="link3" rel="external nofollow">Tillie</a>]

print(soup.select('b'))
# [<b>The Dormouse's story</b>]
(2) Find by class name
print(soup.select('.sister'))
# [<a class="sister" id="link1" rel="external nofollow"><!-- Elsie --></a>, <a class="sister" id="link2" rel="external nofollow">Lacie</a>, <a class="sister" id="link3" rel="external nofollow">Tillie</a>]
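The class lookup above can be narrowed further. The following sketch, on hypothetical markup invented for this example, contrasts a bare class selector with one tied to a tag name and one that requires two classes on the same element.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the "active" class is invented for this example.
demo_html = (
    '<p><a class="sister">a</a>'
    '<a class="sister active">b</a>'
    '<span class="sister">c</span></p>'
)
demo_soup = BeautifulSoup(demo_html, "html.parser")

any_sister = demo_soup.select(".sister")   # any tag carrying the class
a_sister = demo_soup.select("a.sister")    # only <a> tags carrying the class
both = demo_soup.select(".sister.active")  # both classes on the same tag

both_texts = [tag.get_text() for tag in both]
print(len(any_sister), len(a_sister), both_texts)
```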
(3) Find by id
print(soup.select('#link1'))
# [<a class="sister" id="link1" rel="external nofollow"><!-- Elsie --></a>]
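Since an id normally identifies a single element, it can be convenient to get the element itself instead of a one-item list. Beautiful Soup provides select_one() for this; the sketch below uses hypothetical markup and ids.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
demo_html = '<ul><li id="a1" class="item">A</li><li id="a2" class="item">B</li></ul>'
demo_soup = BeautifulSoup(demo_html, "html.parser")

as_list = demo_soup.select("#a2")        # a list with one Tag
as_tag = demo_soup.select_one("#a2")     # the Tag itself
no_match = demo_soup.select_one("#zzz")  # None when nothing matches

print(len(as_list), as_tag.get_text(), no_match)
```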
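(4) Combined lookup
Combined lookup follows the same rules as combining tag names, class names, and id names when writing a CSS file. For example, to find the element whose id is link1 inside a p tag, separate the two parts with a space.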
print(soup.select('p #link1'))
# [<a class="sister" id="link1" rel="external nofollow"><!-- Elsie --></a>]
Direct child tag lookup
print(soup.select("head > title"))
# [<title>The Dormouse's story</title>]
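The difference between the space (any descendant) and > (direct child only) is easiest to see side by side. This is a small sketch on hypothetical markup in which one link sits directly under a div and another is nested one level deeper.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link directly under <div>, one nested inside <p>.
demo_html = '<div id="box"><a id="direct">x</a><p><a id="nested">y</a></p></div>'
demo_soup = BeautifulSoup(demo_html, "html.parser")

descendant_ids = [a["id"] for a in demo_soup.select("#box a")]  # any depth
child_ids = [a["id"] for a in demo_soup.select("#box > a")]     # direct children only

print(descendant_ids)
print(child_ids)
```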
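(5) Attribute lookup
A lookup can also include an attribute condition, which must be enclosed in square brackets. Note that the attribute and the tag belong to the same node, so there must be no space between them; otherwise nothing will match.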
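To illustrate why the space matters, the sketch below runs the same attribute condition with and without a space. The markup and the data-kind attribute are invented for this example. With a space, the selector asks for a descendant of an a tag that has the attribute, which is a different query.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the data-kind attribute is invented for this example.
demo_html = '<p><a data-kind="ext" id="k1">one</a><a id="k2">two</a></p>'
demo_soup = BeautifulSoup(demo_html, "html.parser")

# No space: the <a> tag itself must carry the attribute.
same_node = demo_soup.select('a[data-kind="ext"]')
# With a space: looks for an element inside an <a> tag that carries the attribute.
with_space = demo_soup.select('a [data-kind="ext"]')

same_node_ids = [a["id"] for a in same_node]
print(same_node_ids, with_space)
```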
print(soup.select('a[rel="external nofollow"]'))
# All three links in the sample carry this rel value, so all three match:
# [<a class="sister" id="link1" rel="external nofollow"><!-- Elsie --></a>, <a class="sister" id="link2" rel="external nofollow">Lacie</a>, <a class="sister" id="link3" rel="external nofollow">Tillie</a>]
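Likewise, attribute conditions can be combined with the lookups above: parts that belong to different nodes are separated by a space, and parts that belong to the same node are written without a space.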
print(soup.select('p a[rel="external nofollow"]'))
# [<a class="sister" id="link1" rel="external nofollow"><!-- Elsie --></a>, <a class="sister" id="link2" rel="external nofollow">Lacie</a>, <a class="sister" id="link3" rel="external nofollow">Tillie</a>]
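Attribute conditions are not limited to exact matches. As a hedged sketch, the example below combines a tag, a class, and an attribute condition, and also tries the standard CSS prefix (^=), suffix ($=), and substring (*=) operators. The markup and the URLs are hypothetical and chosen only so that each selector picks out a different link.

```python
from bs4 import BeautifulSoup

# Hypothetical markup and URLs, invented for this example.
demo_html = (
    '<p class="story">'
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
    '<a class="other" href="https://example.org/doc.pdf" id="link3">Doc</a>'
    '</p>'
)
demo_soup = BeautifulSoup(demo_html, "html.parser")

def ids(selector):
    """Return the id of every element matched by the selector."""
    return [tag["id"] for tag in demo_soup.select(selector)]

exact = ids('p.story a[href="http://example.com/elsie"]')  # exact value
prefix = ids('a[href^="http://example.com/"]')             # value starts with
suffix = ids('a[href$=".pdf"]')                            # value ends with
contains = ids('a.sister[href*="lacie"]')                  # value contains

print(exact, prefix, suffix, contains)
```

Reading an attribute from a matched Tag with tag["id"] or tag["href"] is usually the next step in a scraper once the right elements have been selected.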
That is all for this article. We hope it helps with your learning, and we hope you will continue to support 腳本之家.