
Using Item Pipeline in the Scrapy Framework

Updated: 2022-08-15 11:37:59  Author: 卑微小鐘

This article introduces the use of Item Pipeline in the Scrapy framework, with detailed example code that may serve as a useful reference for study or work.

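Introduction to Item Pipeline

The main responsibility of an Item Pipeline is to process the Items that spiders extract from web pages; its main tasks are cleaning, validating, and storing the data.
Once a page has been parsed by a spider, the resulting Items are sent to the Item Pipeline, where several components process them in a defined order.
Each Item Pipeline component is a Python class that implements a simple method.
A component receives an Item, runs its method on it, and decides whether the Item continues to the next pipeline stage or is dropped with no further processing.

When it is called: after an Item has been collected in a Spider, it is passed to the Item Pipeline, where the components process it in a set order.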

Typical uses:

Cleaning HTML data
Validating scraped data (checking that the item contains certain fields)
Checking for duplicates (and dropping them)
Saving the scraped results to a database
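
As an illustration of the duplicate check listed above, here is a minimal sketch. The class name DuplicatesPipeline and the 'id' field are assumptions made for this example, not part of the original article; the fallback DropItem class exists only so the sketch also runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch also runs where Scrapy is not installed.
    class DropItem(Exception):
        pass


class DuplicatesPipeline:
    """Drop items whose 'id' field (hypothetical) has already been seen."""

    def __init__(self):
        self.seen_ids = set()

    def process_item(self, item, spider):
        item_id = item["id"]
        if item_id in self.seen_ids:
            # Raising DropItem stops this item from reaching later pipelines.
            raise DropItem(f"duplicate item: {item_id!r}")
        self.seen_ids.add(item_id)
        return item
```

The set lives on the pipeline instance, so in this sketch duplicates are only detected within a single run.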

1. Writing your own Pipeline class

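A pipeline class must implement process_item; the other methods described below are optional:

process_item(self, item, spider)

Scrapy calls this method on every item pipeline component. It must return a dict containing the data, or an Item (or any subclass) object, or raise a DropItem exception; dropped items are not processed by later pipeline components.

Parameters:

item (Item object or a dict) – the scraped item
spider (Spider object) – the spider that scraped the item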
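
The return-or-drop contract of process_item can be sketched as follows. The 'price' field and the validation rules are assumptions chosen for illustration; the fallback DropItem class is there only so the sketch runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch also runs where Scrapy is not installed.
    class DropItem(Exception):
        pass


class PricePipeline:
    """Validate and normalise a hypothetical 'price' field."""

    def process_item(self, item, spider):
        raw = item.get("price")
        if raw is None or str(raw).strip() == "":
            raise DropItem("missing price")
        try:
            # Clean the value: strip whitespace and convert to a number.
            item["price"] = float(str(raw).strip())
        except ValueError:
            raise DropItem(f"invalid price: {raw!r}")
        # Returning the item hands it to the next pipeline component.
        return item
```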

open_spider(self, spider)

Called when the spider is opened. Parameters: spider (Spider object) – the spider that was opened

from_crawler(cls, crawler)

If present, this class method is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline. The Crawler object provides access to all Scrapy core components, such as settings and signals; it is the way for a pipeline to access them and hook its functionality into Scrapy.

close_spider(self, spider)

Called when the spider is closed. Parameters: spider (Spider object) – the spider that was closed
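
The order in which these methods are called can be illustrated with a small stand-in driver. The run_pipeline helper below is hypothetical and only imitates the call order described above; it is not Scrapy's actual engine.

```python
class CountingPipeline:
    """Record lifecycle events and count the items that pass through."""

    def open_spider(self, spider):
        # Set up per-run state when the spider starts.
        self.count = 0
        self.events = ["open"]

    def process_item(self, item, spider):
        self.count += 1
        self.events.append("item")
        return item

    def close_spider(self, spider):
        # Release resources or report totals when the spider finishes.
        self.events.append("close")


def run_pipeline(pipeline, items, spider=None):
    """Hypothetical driver: open once, process each item, close once."""
    pipeline.open_spider(spider)
    try:
        return [pipeline.process_item(item, spider) for item in items]
    finally:
        pipeline.close_spider(spider)


pipeline = CountingPipeline()
run_pipeline(pipeline, [{"a": 1}, {"a": 2}])
print(pipeline.events)
```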

2. Enabling an Item Pipeline component

To enable an Item Pipeline component, add its class to the ITEM_PIPELINES setting, as in the following example:

ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}

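The integer value assigned to each class determines the order in which they run: items pass through the pipelines from the lowest number to the highest. By convention these numbers are defined in the 0-1000 range.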
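
To make the ordering rule concrete, the sketch below sorts a settings dict by its integer values. The execution_order helper is hypothetical and only mirrors the rule stated above; it says nothing about how Scrapy implements it internally.

```python
# Declared out of order on purpose: dict order does not matter, the numbers do.
ITEM_PIPELINES = {
    "myproject.pipelines.JsonWriterPipeline": 800,
    "myproject.pipelines.PricePipeline": 300,
}


def execution_order(pipelines):
    """Return the pipeline paths sorted from lowest to highest number."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


print(execution_order(ITEM_PIPELINES))
```

Here PricePipeline (300) sees each item before JsonWriterPipeline (800), so anything it drops is never written to the file.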

Writing items to a JSON file

The following pipeline stores all scraped items in a single items.json file, one item per line, serialized as JSON:

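import json
class JsonWriterPipeline(object):
    def __init__(self):
        # Text mode: json.dumps() returns str, which a binary ('wb') file rejects on Python 3
        self.file = open('items.json', 'w')
    def process_item(self, item, spider):
        # One JSON object per line; ensure_ascii=False keeps non-ASCII text readable
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item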

An improved version opens the file with an explicit UTF-8 encoding and closes it when the spider finishes:

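import json
import codecs
class JsonWriterPipeline(object):
    def __init__(self):
        # Explicit UTF-8 so non-ASCII text is written the same way on every platform
        self.file = codecs.open('items.json', 'w', encoding='utf-8')
    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item
    def close_spider(self, spider):
        # Scrapy calls close_spider automatically; a method named spider_closed
        # would only run if it were connected to the spider_closed signal
        self.file.close()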
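
The one-object-per-line format is easy to read back. The following sketch writes a sample item and loads it again; the helper names, the sample values, and the temporary path are assumptions for illustration only.

```python
import json
import os
import tempfile


def write_json_lines(path, items):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(dict(item), ensure_ascii=False) + "\n")


def read_json_lines(path):
    """Parse each non-empty line back into a dict."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Sample data using field names from the spider in this article.
items = [{"name": "工程师", "workLocation": "深圳"}]
path = os.path.join(tempfile.mkdtemp(), "items.json")
write_json_lines(path, items)
print(read_json_lines(path) == items)
```

Because ensure_ascii=False is used, the file contains the original characters rather than \uXXXX escapes, which is why the explicit UTF-8 encoding matters.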

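With the file opened as UTF-8, the fields no longer need to be encoded in the spider; remove the .encode('utf-8') calls from assignments such as the following: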

item = RecruitItem()
item['name']=name.encode('utf-8')
item['detailLink']=detailLink.encode('utf-8')
item['catalog']=catalog.encode('utf-8')
item['recruitNumber']=recruitNumber.encode('utf-8')
item['workLocation']=workLocation.encode('utf-8')
item['publishTime']=publishTime.encode('utf-8')

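Writing items to MongoDB

from_crawler(cls, crawler)

If present, this class method is called to create the pipeline instance. It must return a new instance of the pipeline. The crawler gives access to all Scrapy core components, including the settings and the signal manager; for pipelines it is the way to reach both.

In this example we use pymongo to write Items to MongoDB. The MongoDB address and database name are specified in the Scrapy settings.py configuration file.

The main purpose of this example is to show how to use the from_crawler() method.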

import pymongo
class MongoPipeline(object):
    collection_name = 'scrapy_items'
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection details from the project settings
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )
    def open_spider(self, spider):
        # Connect once per spider run
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
    def close_spider(self, spider):
        self.client.close()
    def process_item(self, item, spider):
        # insert_one() replaces Collection.insert(), which PyMongo 4 removed
        self.db[self.collection_name].insert_one(dict(item))
        return item
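
The from_crawler() pattern can be tried without a MongoDB server. In the sketch below, SettingsAwarePipeline, the stand-in crawler object, and the URI value are all assumptions for illustration; a plain dict is used in place of crawler.settings only because it offers the same get(key, default) call.

```python
from types import SimpleNamespace


class SettingsAwarePipeline:
    """Hypothetical pipeline that only stores the settings it is given."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Same lookups as MongoPipeline: the second get() falls back to 'items'.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "items"),
        )


# Stand-in for a crawler whose settings.py defines only MONGO_URI (assumed value).
fake_crawler = SimpleNamespace(settings={"MONGO_URI": "mongodb://localhost:27017"})
pipeline = SettingsAwarePipeline.from_crawler(fake_crawler)
print(pipeline.mongo_uri, pipeline.mongo_db)
```

Since MONGO_DATABASE is not set in the stand-in settings, the pipeline ends up with the default database name 'items'.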
