scrapy-redis distributed crawler: pitfalls from a real-world run
I. Installing Redis
The installation was done on a CentOS server, so there were quite a few hurdles along the way.
1. Install the required dependencies first
Start by checking whether a C compiler is available. Why does that matter? Because Redis is built from source, a C toolchain is a prerequisite, like water is to a fish.
```
rpm -q gcc
```
If it prints a version number, gcc is already installed; otherwise run the yum command below to install it.
2. Build Redis
Download the Redis release you want (the 3.0.6 below is just the version number, substitute whichever release you prefer), extract it, and build:
```
yum install gcc-c++
cd /usr/local/redis
wget http://download.redis.io/releases/redis-3.0.6.tar.gz
tar zxvf redis-3.0.6.tar.gz
cd redis-3.0.6    # enter the extracted source directory before building
make && make install
```
What? You say there is no redis directory? Create it with mkdir!!!
Be sure to cd into that directory before downloading and building; once make install finishes, the redis-server and redis-cli commands are available system-wide.
```
redis-server
redis-cli
```
Did you start the service and find yourself stuck? Right after installing you can start the server, but it runs in the foreground: you cannot quit it and you never get a shell prompt back. Don't panic, it simply is not configured yet; take it step by step.
Remember the redis directory you created a moment ago? Go into it, find redis.conf, and edit that configuration file.
Find the following three settings and change them:
- First, comment out the bind directive (e.g. #bind 127.0.0.1). If you leave it in, only the local machine can connect, and you presumably did not set this up just for yourself. Commenting it out means any IP can reach the database. Worried about security? You are on a rented server anyway, so the provider's security-group rules still limit who can get in (look up "security group" if the term is new to you).
- Turn off protected mode (protected-mode no), so that remote clients can actually read and write the database.
- Enable daemon mode (daemonize yes), so the server runs in the background instead of holding your terminal hostage.
Save and quit, then restart Redis. That's it, Redis is configured. You can also set a password (requirepass), but I was too lazy to bother.
At this point the database is up and configured.
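As a quick sanity check you can connect from another machine and write a key. This is a minimal sketch using the redis-py client; the host IP is a placeholder for your server's address:
```
import redis

# Placeholder address: replace with your server's public IP.
r = redis.Redis(host="1.2.3.4", port=6379, db=0)

r.set("ping_test", "pong")    # write a key remotely
print(r.get("ping_test"))     # prints b'pong' if the remote connection works
```
If this hangs or refuses the connection, recheck the bind/protected-mode settings above and your security-group rules.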
II. Problems with the Scrapy framework
1. AttributeError: 'TaocheSpider' object has no attribute 'make_requests_from_url'
Cause:
Newer Scrapy releases dropped make_requests_from_url from the Spider class, but scrapy-redis still calls it when turning a URL pulled from Redis into a request, so the two libraries end up out of step.
Solution:
Re-implement the method yourself on the spider class named in the traceback, i.e. add:
```
def make_requests_from_url(self, url):
    return scrapy.Request(url, dont_filter=True)
```
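For context, the override sits directly on the spider class (a sketch only; the full spider is listed in section III):
```
import scrapy
from scrapy_redis.spiders import RedisCrawlSpider


class TaocheSpider(RedisCrawlSpider):
    name = 'taoche'
    redis_key = 'taoche'

    # Restores the hook that older scrapy-redis versions still call when
    # turning a URL popped from Redis into a Request.
    def make_requests_from_url(self, url):
        return scrapy.Request(url, dont_filter=True)
```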
2. ValueError: unsupported format character ':' (0x3a) at index 9
Problem:
I enabled the Redis item pipeline to save scraped data into Redis, but every write failed with this error.
Cause:
I had overridden the item-key setting in settings.py, but I wrote the format string incorrectly, and for a long time I was convinced the bug was in the library source.
```
# Key under which items are stored
REDIS_ITEMS_KEY = '%(spider):items'    # wrong: the placeholder is missing its trailing 's'
# should be: REDIS_ITEMS_KEY = '%(spider)s:items'
```
The scrapy-redis source builds the key like this:
```
return self.key % {"spider": spider.name}
```
What a trap. Since '%(spider):items' has no conversion character after the placeholder, the %-formatting chokes on the ':' at index 9 and raises the ValueError above. I almost rewrote the whole Scrapy framework over this...
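The error is easy to reproduce in a plain Python shell, with no Scrapy involved at all:
```
# Outside Scrapy entirely, the same %-formatting rules apply:
print('%(spider)s:items' % {"spider": "taoche"})   # -> taoche:items
print('%(spider):items' % {"spider": "taoche"})    # ValueError: unsupported format character ':' (0x3a) at index 9
```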
Note! If you are sure your main code is fine, the problem is almost certainly in the configuration file: wrong capitalization, a mistyped setting name, and so on.
III. The corrected Scrapy source code
1. The items.py file
```
import scrapy


class MyspiderItem(scrapy.Item):
    # define the fields for your item here like:
    lazyimg = scrapy.Field()
    title = scrapy.Field()
    resisted_data = scrapy.Field()
    mileage = scrapy.Field()
    city = scrapy.Field()
    price = scrapy.Field()
    sail_price = scrapy.Field()
```
2. The settings.py file
```
# Scrapy settings for myspider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'myspider'

SPIDER_MODULES = ['myspider.spiders']
NEWSPIDER_MODULE = 'myspider.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Obey robots.txt rules
# LOG_LEVEL = "WARNING"

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'myspider.middlewares.MyspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'myspider.middlewares.MyspiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'
LOG_LEVEL = 'WARNING'
LOG_FILE = './log.log'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Item pipeline: scrapy-redis already ships the one we need
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 400
}

# Redis connection
REDIS_HOST = ''      # Redis server address (the loopback address of the VM in my case)
REDIS_PORT = 6379    # port VirtualBox forwards to Redis; left blank in the original, 6379 is the Redis default

# Dedup filter class: request fingerprints are kept in a Redis set, so deduplication persists
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
# Use the scrapy-redis scheduler
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
# Scheduler persistence: set True to keep the request queue and fingerprint set in Redis when the crawl ends
SCHEDULER_PERSIST = True
# Maximum idle time, so the spider is not shut down in the middle of a distributed crawl.
# Only takes effect when the queue class is SpiderQueue or SpiderStack;
# it also lets a freshly started spider block for a while (the queue is empty at start-up).
SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
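Once items start flowing, RedisPipeline serializes each one to JSON and pushes it onto the list named by REDIS_ITEMS_KEY, i.e. taoche:items for this spider. A quick way to peek at the stored data (a sketch with redis-py; host and port are placeholders that must match the settings above):
```
import json
import redis

r = redis.Redis(host="1.2.3.4", port=6379)   # placeholder: use your REDIS_HOST / REDIS_PORT
for raw in r.lrange("taoche:items", 0, 4):   # first few stored items
    print(json.loads(raw))
```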
3. The taoche.py file
```
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_redis.spiders import RedisCrawlSpider
from ..items import MyspiderItem
import logging

log = logging.getLogger(__name__)


class TaocheSpider(RedisCrawlSpider):
    name = 'taoche'
    # allowed_domains = ['taoche.com']  # no domain restriction
    # start_urls = ['http://taoche.com/']  # start URLs are fetched from redis (the shared scheduler) instead
    redis_key = 'taoche'  # read start URLs from the redis list stored under the key 'taoche'

    rules = (
        # LinkExtractor: extracts url addresses that match the regular expression
        # callback: the response for each extracted url is handed to this function
        # follow: whether pages fetched this way are run through the rules again to extract more urls
        Rule(LinkExtractor(allow=r'/\?page=\d+?'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print("開(kāi)始解析數(shù)據(jù)")
        car_list = response.xpath('//div[@id="container_base"]/ul/li')
        for car in car_list:
            lazyimg = car.xpath('./div[1]/div/a/img/@src').extract_first()
            title = car.xpath('./div[2]/a/span/text()').extract_first()
            resisted_data = car.xpath('./div[2]/p/i[1]/text()').extract_first()
            mileage = car.xpath('./div[2]/p/i[2]/text()').extract_first()
            city = car.xpath('./div[2]/p/i[3]/text()').extract_first()
            city = city.replace('\n', '')
            city = city.strip()
            price = car.xpath('./div[2]/div[1]/i[1]/text()').extract_first()
            sail_price = car.xpath('./div[2]/div[1]/i[2]/text()').extract_first()

            item = MyspiderItem()
            item['lazyimg'] = lazyimg
            item['title'] = title
            item['resisted_data'] = resisted_data
            item['mileage'] = mileage
            item['city'] = city
            item['price'] = price
            item['sail_price'] = sail_price
            log.warning(item)
            # scrapy.Request(url=function, dont_filter=True)
            yield item
```
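One step the post itself never shows: a RedisCrawlSpider just sits and waits until a start URL is pushed under its redis_key. A minimal way to seed it (a sketch using redis-py; host and port are placeholders that must match settings.py, and the URL is the one that used to live in start_urls):
```
import redis

r = redis.Redis(host="1.2.3.4", port=6379)   # placeholder: same host/port as REDIS_HOST / REDIS_PORT
# Push a start URL onto the 'taoche' list that redis_key points at.
r.lpush("taoche", "http://taoche.com/")
```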
4. The remaining files
- No middleware was needed, so middlewares.py was left alone.
- The pipeline comes from scrapy_redis, so there was nothing to write there either.
Summary
That is all for this write-up of scrapy-redis distributed-crawler pitfalls. For more on the topic, search 腳本之家 for earlier related articles, and please keep supporting 腳本之家!