Python Scrapy in Practice: Crawling the Gushiwen Classical Poetry Site (gushiwen.cn)
Requirements
Using Python and the Scrapy framework, crawl poem data from gushiwen.cn: each poem's title, author, dynasty, body text, and translation. The crawl proceeds page by page, 4 pages in total. The first page's URL is https://www.gushiwen.cn/default_1.aspx.
1. Creating the Scrapy project
First, create the Scrapy project and the spider.
In the target directory, create a project named prose:
scrapy startproject prose
Enter the project directory, then create a spider named gs whose allowed crawl domain is gushiwen.cn:
cd prose
scrapy genspider gs gushiwen.cn
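After these two commands, the generated project typically has the following layout (the standard Scrapy template; minor details vary between Scrapy versions). The files edited in the rest of this article are settings.py, items.py, pipelines.py, and spiders/gs.py:

prose/
    scrapy.cfg
    prose/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            gs.py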
2. Global configuration: settings.py
Edit the configuration file settings.py as follows:
① Do not obey the robots.txt rules
② Set the download delay to 1 second
③ Add default request headers and enable the item pipeline
④ Set the log level: LOG_LEVEL = "WARNING"
The full file:
# Scrapy settings for prose project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'prose'

SPIDER_MODULES = ['prose.spiders']
NEWSPIDER_MODULE = 'prose.spiders'

LOG_LEVEL = "WARNING"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'prose (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'prose.middlewares.ProseSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'prose.middlewares.ProseDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'prose.pipelines.ProsePipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
3. The spider: gs.py
The first step is page analysis, which will not be described in detail here.
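If you want to reproduce that analysis, Scrapy's interactive shell is a convenient way to verify the XPath expressions used in the spider below. This is only a quick sketch of that check, and it assumes the page markup has not changed since the article was written:

scrapy shell "https://www.gushiwen.cn/default_1.aspx"
# inside the shell, `response` is the downloaded page:
>>> divs = response.xpath('//div[@class="left"]/div[@class="sons"]')    # one div per poem
>>> divs[0].xpath('.//b/text()').get()                                  # title
>>> divs[0].xpath('.//p[@class="source"]/a/text()').getall()            # [author, dynasty]
>>> response.xpath('//a[@id="amore"]/@href').get()                      # link to the next page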
This code is the core part of the project and the part we actually need to write.
First, change the start URL from the site's home page to the first page of the listing we want to crawl.
Requirements recap: we want each poem's title, author, dynasty, body text, and translation, crawled page by page.
The title, author, dynasty, body text, and translation all sit inside the same <div> block for a given poem.
To demonstrate two different ways of working:
the four fields title, author, dynasty, and body text are extracted directly in the parse callback, inside a for loop wrapped in a try…except block so that indexing an empty result does not raise an error;
the translation is obtained separately: an extra parse_detail function is defined and passed to scrapy.Request() as the callback for the detail-page request.
For pagination, the idea is as follows: after all the required data on a page has been extracted (that is, after one full pass of the loop), take the link to the next page from the current page and check whether it is empty. If it is not empty, the link was found, so issue another scrapy.Request() with that link and parse as the callback again. If it is empty, this was the last page and the crawl ends there.
The full code:
import scrapy

from prose.items import ProseItem


class GsSpider(scrapy.Spider):
    name = 'gs'
    allowed_domains = ['gushiwen.cn']
    start_urls = ['https://www.gushiwen.cn/default_1.aspx']

    # Parse the list page
    def parse(self, response):
        # Each div with class="sons" corresponds to one poem
        div_list = response.xpath('//div[@class="left"]/div[@class="sons"]')
        for div in div_list:
            try:
                # Poem title
                title = div.xpath('.//b/text()').get()
                # Author and dynasty
                source = div.xpath('.//p[@class="source"]/a/text()').getall()
                # Author
                author = source[0]
                # Dynasty
                dynasty = source[1]
                content_list = div.xpath('.//div[@class="contson"]//text()').getall()
                content_plus = ''.join(content_list).strip()
                # URL of the poem's detail page
                detail_url = div.xpath('.//p/a/@href').get()
                item = ProseItem(title=title, author=author, dynasty=dynasty,
                                 content_plus=content_plus, detail_url=detail_url)
                # print(item)
                yield scrapy.Request(
                    url=detail_url,
                    callback=self.parse_detail,
                    meta={'prose_item': item}
                )
            except:
                pass
        next_url = response.xpath('//a[@id="amore"]/@href').get()
        if next_url:
            print(next_url)
            yield scrapy.Request(
                url=next_url,
                callback=self.parse
            )

    # Parse the detail page (translation)
    def parse_detail(self, response):
        item = response.meta.get('prose_item')
        translation = response.xpath('//div[@class="sons"]/div[@class="contyishang"]/p//text()').getall()
        item['translation'] = ''.join(translation).strip()
        # print(item)
        yield item
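One caveat: the code above passes detail_url and next_url to scrapy.Request() exactly as extracted, which only works if those hrefs are absolute URLs. If they turn out to be relative paths on the live site, response.follow() resolves them against the current page and is otherwise a drop-in substitute. A minimal sketch of that variant (not part of the original code):

# inside parse(), if detail_url / next_url may be relative paths:
yield response.follow(detail_url, callback=self.parse_detail, meta={'prose_item': item})
# ...
if next_url:
    yield response.follow(next_url, callback=self.parse)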
4. Data structure: items.py
Here we define the ProseItem class so it can be imported by the spider above. (Note that the spider imports this module; if the import cannot be resolved, you may need to mark the appropriate folder as the project's root/source directory in your IDE.)
import scrapy


class ProseItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Title
    title = scrapy.Field()
    # Author
    author = scrapy.Field()
    # Dynasty
    dynasty = scrapy.Field()
    # Poem body
    content_plus = scrapy.Field()
    # URL of the detail page
    detail_url = scrapy.Field()
    # Translation
    translation = scrapy.Field()
5. The pipeline: pipelines.py
The pipeline is where the data-storage step is written.
from itemadapter import ItemAdapter
import json


class ProsePipeline:
    def __init__(self):
        self.f = open('gs.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Convert the item to a dict first, then to a JSON string
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.f.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        self.f.close()
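As a side note, for a plain JSON-lines dump like this one, Scrapy's built-in feed exports can replace the hand-written pipeline entirely. This optional alternative (not used in the article) is configured in settings.py:

# settings.py: optional alternative to ProsePipeline, using Scrapy's feed exports
FEEDS = {
    'gs.jl': {'format': 'jsonlines', 'encoding': 'utf8'},
}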
6. Running the crawl: start.py
Define a small script that runs the crawl command.
from scrapy import cmdline

cmdline.execute('scrapy crawl gs'.split())
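Run it with python start.py from the project root (the directory containing scrapy.cfg); the effect is the same as typing scrapy crawl gs on the command line there.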
Running the program produces the following result: the data we need is saved in a text file named gs.txt, one JSON string per line.
This concludes the detailed walkthrough of crawling the Gushiwen site with Python Scrapy. For more material on crawling gushiwen.cn with Python Scrapy, see the other related articles on 腳本之家.