
Python crawler in practice: using Scrapy to crawl Douban images

Updated: 2021-06-02 11:29:24   Author: 濯君
After writing many crawlers with Python's urllib and BeautifulSoup, I decided to try the well-known Python crawler framework Scrapy. This post walks through how to use Scrapy to download a Douban celebrity's photos; readers who need this can use it as a reference.

Using Scrapy to crawl all personal photos of a Douban film star

Taking Monica Bellucci (莫妮卡·貝魯奇) as the example.


1. First, open a terminal in the directory where the project should live and run scrapy startproject banciyuan to create the Scrapy project.

The generated project structure is as follows.

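This is the standard skeleton that scrapy startproject creates; the comments note what each file is for:

banciyuan/
    scrapy.cfg            # deployment configuration
    banciyuan/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py   # spiders go in this package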

2. To make it easy to run the Scrapy project from PyCharm, create a main.py:

from scrapy import cmdline

# Equivalent to running `scrapy crawl banciyuan` in a terminal at the project root
cmdline.execute("scrapy crawl banciyuan".split())

Then open Edit Configurations in PyCharm and add a run configuration that points at main.py. Once that is set up, the Scrapy project can be started simply by running main.py.

3. Analyze the HTML of the photo list page and create the corresponding spider.


from scrapy import Spider
import scrapy

from banciyuan.items import BanciyuanItem


class BanciyuanSpider(Spider):
    name = 'banciyuan'
    allowed_domains = ['movie.douban.com']
    start_urls = ["https://movie.douban.com/celebrity/1025156/photos/"]
    url = "https://movie.douban.com/celebrity/1025156/photos/"

    def parse(self, response):
        # The last pagination link holds the total number of pages; each page lists 30 photos.
        num = response.xpath('//div[@class="paginator"]/a[last()]/text()').extract_first('')
        num = int(num) if num else 1
        for i in range(num):
            suffix = '?type=C&start=' + str(i * 30) + '&sortby=like&size=a&subtype=a'
            yield scrapy.Request(url=self.url + suffix, callback=self.get_page)

    def get_page(self, response):
        # Collect the detail-page link of every photo on the current list page.
        href_list = response.xpath('//div[@class="article"]//div[@class="cover"]/a/@href').extract()
        for href in href_list:
            yield scrapy.Request(url=href, callback=self.get_info)

    def get_info(self, response):
        # Extract the full-size image URL and the page title from the photo detail page.
        src = response.xpath(
            '//div[@class="article"]//div[@class="photo-show"]//div[@class="photo-wp"]/a[1]/img/@src').extract_first('')
        title = response.xpath('//div[@id="content"]/h1/text()').extract_first('')
        item = BanciyuanItem()
        item['title'] = title
        item['src'] = [src]
        yield item
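Before running the full crawl, the XPath expressions used above can be checked interactively with scrapy shell. A quick sketch (the actual output depends on the live page):

scrapy shell "https://movie.douban.com/celebrity/1025156/photos/"
>>> response.xpath('//div[@class="paginator"]/a[last()]/text()').extract_first('')
>>> response.xpath('//div[@class="article"]//div[@class="cover"]/a/@href').extract()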

4. items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class BanciyuanItem(scrapy.Item):
    # define the fields for your item here like:
    src = scrapy.Field()
    title = scrapy.Field()

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from scrapy.pipelines.images import ImagesPipeline
import scrapy

class BanciyuanPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Request the actual image file; pass the item along via meta so file_path can use it.
        yield scrapy.Request(url=item['src'][0], meta={'item': item})

    def file_path(self, request, response=None, info=None, *, item=None):
        # Save each image as <first word of the page title>/<original file name>,
        # relative to IMAGES_STORE.
        item = request.meta['item']
        image_name = item['src'][0].split('/')[-1]
        path = '%s/%s' % (item['title'].split(' ')[0], image_name)
        return path
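Note that Scrapy's ImagesPipeline depends on Pillow, so it must be installed (pip install Pillow) for the downloads to work. As a minimal sketch of the path logic above, using purely hypothetical values:

# Hypothetical values, only to illustrate how file_path builds the relative path
src = 'https://img1.doubanio.com/view/photo/l/public/p123456.jpg'  # hypothetical image URL
title = '莫妮卡·貝魯奇 Monica Bellucci'                               # hypothetical page title
image_name = src.split('/')[-1]                       # 'p123456.jpg'
path = '%s/%s' % (title.split(' ')[0], image_name)    # '莫妮卡·貝魯奇/p123456.jpg'
# Combined with IMAGES_STORE = './images' in settings.py below,
# the file ends up at ./images/莫妮卡·貝魯奇/p123456.jpg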

settings.py

# Scrapy settings for banciyuan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'banciyuan'

SPIDER_MODULES = ['banciyuan.spiders']
NEWSPIDER_MODULE = 'banciyuan.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36'


# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'banciyuan.middlewares.BanciyuanSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'banciyuan.middlewares.BanciyuanDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'banciyuan.pipelines.BanciyuanPipeline': 1,
}
IMAGES_STORE = './images'

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
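
With everything configured, start the crawl by running main.py (or scrapy crawl banciyuan from the project root).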

5. Crawl results

The downloaded photos end up under ./images/, in a sub-folder named after the first word of the page title.

Reference

Source code

This concludes this article on using Scrapy to crawl Douban images. For more on crawling Douban images with Scrapy, please search 腳本之家's earlier articles, and we hope you will continue to support 腳本之家!
