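Avoiding Duplicate Crawling in Python with a Custom Scrapy Middleware
Updated: 2015-04-07 16:36:56 Author: pythoner
This article describes how to avoid re-crawling pages in Python by writing a custom Scrapy middleware, and walks through an example implementation. Readers who need this technique may find it a useful reference.
This article shows, by example, how to write a custom Scrapy middleware in Python that avoids crawling the same item pages twice. It is shared here for reference. The details are as follows: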
from scrapy import log
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.request import request_fingerprint

from myproject.items import MyItem


class IgnoreVisitedItems(object):
    """Middleware to ignore re-visiting item pages if they were already visited
    before. The requests to be filtered by have a meta['filter_visited'] flag
    enabled and optionally define an id to use for identifying them, which
    defaults the request fingerprint, although you'd want to use the item id,
    if you already have it beforehand to make it more robust.
    """

    FILTER_VISITED = 'filter_visited'
    VISITED_ID = 'visited_id'
    CONTEXT_KEY = 'visited_ids'

    def process_spider_output(self, response, result, spider):
        context = getattr(spider, 'context', {})
        visited_ids = context.setdefault(self.CONTEXT_KEY, {})
        ret = []
        for x in result:
            visited = False
            if isinstance(x, Request):
                if self.FILTER_VISITED in x.meta:
                    visit_id = self._visited_id(x)
                    if visit_id in visited_ids:
                        log.msg("Ignoring already visited: %s" % x.url,
                                level=log.INFO, spider=spider)
                        visited = True
            elif isinstance(x, BaseItem):
                visit_id = self._visited_id(response.request)
                if visit_id:
                    visited_ids[visit_id] = True
                    x['visit_id'] = visit_id
                    x['visit_status'] = 'new'
            if visited:
                ret.append(MyItem(visit_id=visit_id, visit_status='old'))
            else:
                ret.append(x)
        return ret

    def _visited_id(self, request):
        return request.meta.get(self.VISITED_ID) or request_fingerprint(request)
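To make the filtering logic easier to follow, here is a minimal stand-alone sketch of the same idea that runs without Scrapy. Everything in it is an assumption made for illustration: `FakeRequest`, `visited_id`, and `filter_output` are hypothetical stand-ins, plain dicts play the role of items, and a SHA-1 hash of the URL is used as a crude substitute for `request_fingerprint`. It is not the middleware itself, only an approximation of its control flow.

```python
import hashlib


class FakeRequest:
    """Hypothetical stand-in for scrapy.http.Request: a URL plus a meta dict."""

    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta or {}


def visited_id(request):
    # Prefer an explicit id from meta; otherwise hash the URL as a crude
    # substitute for Scrapy's request_fingerprint() (an assumption).
    explicit = request.meta.get("visited_id")
    return explicit or hashlib.sha1(request.url.encode("utf-8")).hexdigest()


def filter_output(response_request, results, visited_ids):
    """Approximate process_spider_output: replace flagged requests that were
    already seen, and record the id of the page that produced each item."""
    kept = []
    for x in results:
        if isinstance(x, FakeRequest):
            if "filter_visited" in x.meta and visited_id(x) in visited_ids:
                # Like the middleware, emit a placeholder marked 'old'
                # instead of the request.
                kept.append({"visit_id": visited_id(x), "visit_status": "old"})
            else:
                kept.append(x)
        elif isinstance(x, dict):
            # An item was scraped: remember the id of the request behind it.
            vid = visited_id(response_request)
            visited_ids[vid] = True
            kept.append(dict(x, visit_id=vid, visit_status="new"))
        else:
            kept.append(x)
    return kept


visited = {}  # plays the role of spider.context['visited_ids']
meta = {"filter_visited": True, "visited_id": "item-1"}

# First pass: the item page is scraped, so its id is recorded.
page = FakeRequest("http://example.com/item/1", meta=meta)
first = filter_output(page, [{"name": "widget"}], visited)

# Second pass: a listing page yields a request for the same item again.
listing = FakeRequest("http://example.com/list")
again = FakeRequest("http://example.com/item/1", meta=meta)
second = filter_output(listing, [again], visited)

print(first[0]["visit_status"], second[0]["visit_status"])  # new old
```

One consequence of the original code is worth noting: it reads the id store with `getattr(spider, 'context', {})`, so if the spider defines no `context` attribute, a fresh dict is created on every call and nothing is remembered. The spider therefore presumably needs a persistent `context` dict, and the class has to be enabled through the project's `SPIDER_MIDDLEWARES` setting for `process_spider_output` to be called.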
I hope this article is helpful for your Python programming.