Python Scrapy in Practice: Crawling the Gushiwen Classical Poetry Site (gushiwen.cn)
Requirements
Using Python and the Scrapy framework, crawl poem data from gushiwen.cn: each poem's title, author, dynasty, body text, and translation. The crawl proceeds page by page, 4 pages in total. The first page's URL is https://www.gushiwen.cn/default_1.aspx.
1. Creating the Scrapy project
First, create the Scrapy project and the spider.
In the target directory, create a project named prose:
scrapy startproject prose
Enter the project directory, then create a spider named gs whose allowed crawl domain is gushiwen.cn:
cd prose
scrapy genspider gs gushiwen.cn
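After these two commands, the generated project typically has the following layout (the standard Scrapy template; minor details vary between Scrapy versions). The files edited in the rest of this article are settings.py, items.py, pipelines.py, and spiders/gs.py:

prose/
    scrapy.cfg
    prose/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            gs.py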
2. Global configuration: settings.py
Edit the configuration file settings.py as follows:
① Do not obey the robots.txt rules
② Set the download delay to 1 second
③ Add default request headers and enable the item pipeline
④ Set the log level: LOG_LEVEL = "WARNING"
The full file:
# Scrapy settings for prose project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'prose'

SPIDER_MODULES = ['prose.spiders']
NEWSPIDER_MODULE = 'prose.spiders'

LOG_LEVEL = "WARNING"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'prose (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'prose.middlewares.ProseSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'prose.middlewares.ProseDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'prose.pipelines.ProsePipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
3. The spider: gs.py
The first step is page analysis, which will not be described in detail here.
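If you want to reproduce that analysis, Scrapy's interactive shell is a convenient way to verify the XPath expressions used in the spider below. This is only a quick sketch of that check, and it assumes the page markup has not changed since the article was written:

scrapy shell "https://www.gushiwen.cn/default_1.aspx"
# inside the shell, `response` is the downloaded page:
>>> divs = response.xpath('//div[@class="left"]/div[@class="sons"]')    # one div per poem
>>> divs[0].xpath('.//b/text()').get()                                  # title
>>> divs[0].xpath('.//p[@class="source"]/a/text()').getall()            # [author, dynasty]
>>> response.xpath('//a[@id="amore"]/@href').get()                      # link to the next page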
This code is the core part of the project and the part we actually need to write.
First, change the start URL from the site's home page to the first page of the listing we want to crawl.
Requirements recap: we want each poem's title, author, dynasty, body text, and translation, crawled page by page.
The title, author, dynasty, body text, and translation all sit inside the same <div> block for a given poem.
To demonstrate two different ways of working:
the four fields title, author, dynasty, and body text are extracted directly in the parse callback, inside a for loop wrapped in a try…except block so that indexing an empty result does not raise an error;
the translation is obtained separately: an extra parse_detail function is defined and passed to scrapy.Request() as the callback for the detail-page request.
For pagination, the idea is as follows: after all the required data on a page has been extracted (that is, after one full pass of the loop), take the link to the next page from the current page and check whether it is empty. If it is not empty, the link was found, so issue another scrapy.Request() with that link and parse as the callback again. If it is empty, this was the last page and the crawl ends there.
The full code:
import scrapy

from prose.items import ProseItem


class GsSpider(scrapy.Spider):
    name = 'gs'
    allowed_domains = ['gushiwen.cn']
    start_urls = ['https://www.gushiwen.cn/default_1.aspx']

    # Parse the list page
    def parse(self, response):
        # Each div with class="sons" corresponds to one poem
        div_list = response.xpath('//div[@class="left"]/div[@class="sons"]')
        for div in div_list:
            try:
                # Poem title
                title = div.xpath('.//b/text()').get()
                # Author and dynasty
                source = div.xpath('.//p[@class="source"]/a/text()').getall()
                # Author
                author = source[0]
                # Dynasty
                dynasty = source[1]
                content_list = div.xpath('.//div[@class="contson"]//text()').getall()
                content_plus = ''.join(content_list).strip()
                # URL of the poem's detail page
                detail_url = div.xpath('.//p/a/@href').get()
                item = ProseItem(title=title, author=author, dynasty=dynasty,
                                 content_plus=content_plus, detail_url=detail_url)
                # print(item)
                yield scrapy.Request(
                    url=detail_url,
                    callback=self.parse_detail,
                    meta={'prose_item': item}
                )
            except:
                pass
        next_url = response.xpath('//a[@id="amore"]/@href').get()
        if next_url:
            print(next_url)
            yield scrapy.Request(
                url=next_url,
                callback=self.parse
            )

    # Parse the detail page (translation)
    def parse_detail(self, response):
        item = response.meta.get('prose_item')
        translation = response.xpath('//div[@class="sons"]/div[@class="contyishang"]/p//text()').getall()
        item['translation'] = ''.join(translation).strip()
        # print(item)
        yield item
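One caveat: the code above passes detail_url and next_url to scrapy.Request() exactly as extracted, which only works if those hrefs are absolute URLs. If they turn out to be relative paths on the live site, response.follow() resolves them against the current page and is otherwise a drop-in substitute. A minimal sketch of that variant (not part of the original code):

# inside parse(), if detail_url / next_url may be relative paths:
yield response.follow(detail_url, callback=self.parse_detail, meta={'prose_item': item})
# ...
if next_url:
    yield response.follow(next_url, callback=self.parse)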
4. Data structure: items.py
Here we define the ProseItem class so it can be imported by the spider above. (Note that the spider imports this module; if the import cannot be resolved, you may need to mark the appropriate folder as the project's root/source directory in your IDE.)
import scrapy


class ProseItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Title
    title = scrapy.Field()
    # Author
    author = scrapy.Field()
    # Dynasty
    dynasty = scrapy.Field()
    # Poem body
    content_plus = scrapy.Field()
    # URL of the detail page
    detail_url = scrapy.Field()
    # Translation
    translation = scrapy.Field()
5. The pipeline: pipelines.py
The pipeline is where the data-storage step is written.
from itemadapter import ItemAdapter
import json


class ProsePipeline:
    def __init__(self):
        self.f = open('gs.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Convert the item to a dict first, then to a JSON string
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.f.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        self.f.close()
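As a side note, for a plain JSON-lines dump like this one, Scrapy's built-in feed exports can replace the hand-written pipeline entirely. This optional alternative (not used in the article) is configured in settings.py:

# settings.py: optional alternative to ProsePipeline, using Scrapy's feed exports
FEEDS = {
    'gs.jl': {'format': 'jsonlines', 'encoding': 'utf8'},
}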
6. Running the crawl: start.py
Define a small script that runs the crawl command.
from scrapy import cmdline

cmdline.execute('scrapy crawl gs'.split())
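Run it with python start.py from the project root (the directory containing scrapy.cfg); the effect is the same as typing scrapy crawl gs on the command line there.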
Running the program produces the following result: the data we need is saved in a text file named gs.txt, one JSON string per line.
This concludes the detailed walkthrough of crawling the Gushiwen site with Python Scrapy. For more material on crawling gushiwen.cn with Python Scrapy, see the other related articles on 腳本之家.