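Scraping Douban Books with selenium + PhantomJS
This article shares working code for scraping Douban Books (book.douban.com) with selenium and PhantomJS, for your reference. The details are as follows.
The goal is to collect information on every Python-related book.
In testing, plain request calls that carried a 'User-Agent' header and a 'data' payload could not retrieve the information: some fields came back empty, which raised errors partway through and made it impossible to fetch the full data set. This suggests that Douban Books has fairly strict anti-scraping measures. Testing with selenium, which simulates a real browser request, showed that the data can be retrieved that way.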
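Empty fields are the usual reason an extraction step fails: a query that matches nothing returns an empty list, and indexing it with [0] raises IndexError. The following is a minimal sketch of a guard for that case, not part of the original script. The helper name first_or_default is hypothetical, and the lists stand in for real XPath results.

```python
def first_or_default(results, default=''):
    """Return the first item of a query result list, or `default` when it is empty."""
    return results[0] if results else default


# Simulated query results: one populated field and one missing field
rating = first_or_default(['9.1'])
missing = first_or_default([])

print(repr(rating))
print(repr(missing))
```

With a helper like this, each field can be read in one line instead of a separate if/else per field.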
# Import the required modules
from selenium import webdriver
import time
from lxml import etree
import pymysql
import re
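# Fetch one result page with a headless browser and pass it to the parser
def my_browers(url, page):
    # Create the browser object
    # (webdriver.PhantomJS was removed in Selenium 4, so this needs an older Selenium release)
    browers = webdriver.PhantomJS(executable_path=r'd:\Desktop\pythonjs\phantomjs-2.1.1-windows\bin\phantomjs.exe')
    try:
        # Request the page through the browser
        browers.get(url)
        # Pause for two seconds: a lower request rate takes longer but is safer
        time.sleep(2)
        # Get the rendered page source
        html = browers.page_source
    finally:
        # Shut down the PhantomJS process so one is not left running per page
        browers.quit()
    # Call the page-parsing function
    parse_html(html)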
# Parse the page
def parse_html(html):
    # Build an lxml element tree that supports XPath queries
    html = etree.HTML(html)
    # Get the list of all book entries
    books = html.xpath('//div[contains(@class,"sc-bZQynM")]')
    # Loop over each book and pull out the fields we want
    for book in books:
        # Dictionary that holds the data for one book
        book_dict = {}
        # Cover image. The leading "." keeps the query relative to this book;
        # a bare "//" searches the whole page and returns the first book's data every time
        pic = book.xpath('.//img/@src')
        if pic:
            book_dict['pic'] = pic[0]
        else:
            book_dict['pic'] = ''
        # Title
        book_name = book.xpath('.//div[@class="title"]/a/text()')
        if book_name:
            book_name = book_name[0]
            # Remove quotes from the title: a quote inside the title
            # breaks an SQL statement that is built by string formatting
            book_name = re.sub(r'["\']', '', book_name)
            # Remove trailing backslashes: a backslash at the end of the title
            # would escape the closing quote of the SQL string
            book_name = book_name.rstrip('\\')
            book_dict['book_name'] = book_name
        else:
            book_dict['book_name'] = ''
        # Link to the book's detail page
        book_url = book.xpath('.//div[@class="title"]/a/@href')
        if book_url:
            book_dict['book_url'] = book_url[0]
        else:
            book_dict['book_url'] = ''
        # Rating
        score_book = book.xpath('.//span[@class="rating_nums"]/text()')
        if score_book:
            book_dict['score_book'] = score_book[0]
        else:
            book_dict['score_book'] = ''
        # Publisher and other publication details
        book_detail = book.xpath('.//div[@class="meta abstract"]/text()')
        if book_detail:
            # Remove single quotes from the details for the same reason as above
            book_detail = re.sub(r"'", '', book_detail[0])
            book_dict['book_detail'] = book_detail
        else:
            book_dict['book_detail'] = ''
        print(book_dict)
        # Call the database function
        insert_mysql(book_dict)
# Insert one book into the database
def insert_mysql(book_dict):
    # Connect to the database (keyword arguments; newer PyMySQL releases reject positional ones)
    conn = pymysql.connect(host='localhost', user='root', password='root', database='test', charset='utf8')
    try:
        # Create a cursor for running SQL statements
        with conn.cursor() as cursor:
            # %s placeholders let the driver escape each value,
            # so quotes or backslashes in the data cannot break the statement
            sql = ("insert into python_book (pic,book_name,book_url,score,book_detail) "
                   "VALUES (%s,%s,%s,%s,%s)")
            values = (book_dict['pic'], book_dict['book_name'], book_dict['book_url'],
                      book_dict['score_book'], book_dict['book_detail'])
            # Execute and commit
            cursor.execute(sql, values)
        conn.commit()
    finally:
        # Close the connection so one is not leaked per inserted book
        conn.close()
if __name__ == '__main__':
    for i in range(0, 199):
        print('=================Downloading page {}========================'.format(i + 1))
        # Each result page holds 15 books, so the "start" parameter advances in steps of 15
        page = i * 15
        base_url = 'https://book.douban.com/subject_search?search_text=python&cat=1001&start={}'.format(page)
        my_browers(base_url, page)
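Quotes and backslashes in a title can only break an INSERT when the values are spliced into the SQL text itself. The following is a minimal sketch of passing the values as parameters instead. It uses the standard-library sqlite3 module with an in-memory database as a stand-in for the MySQL table, which is an assumption made so the example runs anywhere. The helper name insert_book and the TEXT column types are also hypothetical. With PyMySQL the placeholder is %s rather than ?.

```python
import sqlite3


def insert_book(conn, book):
    """Insert one book; the values travel as parameters, never as SQL text."""
    conn.execute(
        "INSERT INTO python_book (pic, book_name, book_url, score, book_detail) "
        "VALUES (?, ?, ?, ?, ?)",
        (book['pic'], book['book_name'], book['book_url'],
         book['score_book'], book['book_detail']),
    )
    conn.commit()


# In-memory database standing in for the MySQL table (assumed schema)
conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE python_book "
    "(pic TEXT, book_name TEXT, book_url TEXT, score TEXT, book_detail TEXT)"
)

# A title containing both kinds of quote and a trailing backslash
tricky = {
    'pic': '',
    'book_name': 'It\'s "Python"\\',
    'book_url': '',
    'score_book': '9.0',
    'book_detail': '',
}
insert_book(conn, tricky)

stored = conn.execute("SELECT book_name FROM python_book").fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM python_book").fetchone()[0]
print(stored == tricky['book_name'])
```

Because the driver handles the escaping, the title is stored unchanged and no characters have to be stripped before the insert.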
That is all for this article. We hope it helps with your studies, and we hope you will continue to support 腳本之家.