
A Detailed Guide to Asynchronous Data Storage in Python

Updated: 2019-03-19 14:14:10  Author: 我是李玉峰

This article explains in detail how to store data asynchronously in Python. It should be a useful reference for interested readers.

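When a crawler saves its data from Python, it can write to the database either synchronously or asynchronously. Synchronous writes are slow compared with the rate at which a crawler produces items, so the database can fall behind and some items may never be stored. Asynchronous storage runs crawling and database writes separately so that neither blocks the other, which keeps writes fast and helps ensure that every item is saved.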
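To make the decoupling concrete, here is a minimal sketch that uses only the standard library, not Scrapy or Twisted. A producer stands in for the crawler and hands items to a background writer thread through a queue, so the producer never waits on a slow write. The names `slow_write`, `writer`, and `crawl` are hypothetical, and `time.sleep` merely simulates database latency.

```python
import queue
import threading
import time


def slow_write(item, store):
    """Simulate a slow database insert (hypothetical stand-in)."""
    time.sleep(0.01)
    store.append(item)


def writer(q, store):
    """Drain the queue in the background until a None sentinel arrives."""
    while True:
        item = q.get()
        if item is None:
            break
        slow_write(item, store)


def crawl(n):
    """Produce n items without waiting for any individual write to finish."""
    q = queue.Queue()
    store = []
    t = threading.Thread(target=writer, args=(q, store))
    t.start()
    for i in range(n):
        # put() returns immediately; the writer thread does the slow part.
        q.put({"alt": f"image {i}", "src": f"/img/{i}.jpg"})
    q.put(None)  # tell the writer that no more items are coming
    t.join()     # wait for pending writes before returning
    return store


if __name__ == "__main__":
    saved = crawl(5)
    print(len(saved))
```

The final `join()` is the part that protects completeness: the program does not exit until every queued item has been written.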

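Asynchronous database storage breaks down roughly into the following steps:

1. In settings, configure the parameters needed to connect to MySQL (host address, user account, password, name of the database to operate on, character encoding, and so on).
2. Define a custom Pipeline and implement the from_settings function.
3. Use from twisted.enterprise import adbapi to bring in the connection pool module.
4. Use from pymysql import cursors to bring in the cursor module.
5. In from_settings, prepare the database connection parameters, create the db_pool connection pool, then create and return an instance of the current class, passing in db_pool.
6. Implement the initializer, and in it assign db_pool to an attribute of self.
7. Implement the process_item function.
7.1 query = self.db_pool.runInteraction(function object that performs the insert, arguments that function needs), keeping the returned result.
7.2 query.addErrback(error callback function, arguments that function needs) registers the callback that runs when executing the SQL fails.
8. Implement the function that performs the insert: prepare the SQL and execute it.
9. Implement the error callback, and handle the failed data further inside it.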

Below, we use the image site 天堂圖片網 (ivsky.com) as an example to get a rough feel for asynchronous storage:

1. Before storing anything, you can either create the database by hand (choosing the table name, field names, field types, and so on yourself) or create it from code.
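For the "create it from code" option, a minimal sketch might look like the following. The table name `ivs` and the columns `alt` and `src` come from the INSERT statement used later in this article; the column types and lengths are assumptions, and `create_table` is a hypothetical helper. It accepts any DB-API 2.0 connection, so the demo uses `sqlite3` purely to run without a MySQL server; with MySQL you would pass a `pymysql` connection instead.

```python
import sqlite3

# Assumed schema: only the table and column names come from the article.
CREATE_TABLE_SQL = (
    "CREATE TABLE IF NOT EXISTS ivs ("
    "alt VARCHAR(255), "
    "src VARCHAR(512)"
    ")"
)


def create_table(conn):
    """Create the ivs table through any DB-API 2.0 connection."""
    cursor = conn.cursor()
    cursor.execute(CREATE_TABLE_SQL)
    conn.commit()
    cursor.close()


if __name__ == "__main__":
    # sqlite3 stands in for MySQL so the sketch is runnable as-is.
    conn = sqlite3.connect(":memory:")
    create_table(conn)
    conn.close()
```

Because the statement uses `IF NOT EXISTS`, calling `create_table` more than once is harmless.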

2. Before the data can be stored, it has to be scraped first

import scrapy
from ..items import ImgItem
class IvskySpider(scrapy.Spider):
  name = 'ivsky'
  allowed_domains = ['ivsky.com']
  start_urls = ['http://www.ivsky.com/tupian/ziranfengguang/']
  def parse(self, response):
    imgs = response.xpath('//div[@class="il_img"]/a/img')
    for img in imgs:
      alt = img.xpath('@alt').extract_first('')
      src = img.xpath('@src').extract_first('')
      item = ImgItem()
      item['alt'] = alt
      item['src'] = src
 
      yield item

3. Define a custom item to hold the scraped data

import scrapy
 
class IvskySpiderItem(scrapy.Item):
  # define the fields for your item here like:
  # name = scrapy.Field()
  pass
 
class ImgItem(scrapy.Item):
 
  alt = scrapy.Field()
  src = scrapy.Field()

4. Next comes the configuration in settings, shown below (remember to change the robots.txt setting, ROBOTSTXT_OBEY, to False):

MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'root'
MYSQL_PW = '123456'
MYSQL_DB = 'ivskydb'
MYSQL_CHARSET = 'utf8'
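The step above mentions the robots setting, but the listing does not show it. In Scrapy that setting is `ROBOTSTXT_OBEY`, so the corresponding line in settings.py would presumably look like this:

```python
# settings.py (fragment)
# Tell Scrapy not to consult robots.txt before fetching pages.
ROBOTSTXT_OBEY = False
```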

5. Then define a custom pipeline and register it in settings (with a priority):

from twisted.enterprise import adbapi
from pymysql import cursors
 
class TwistedMysqlPipeline(object):
 
  # Scrapy calls this class method first to build the pipeline instance
  @classmethod
  def from_settings(cls, settings):
 
    # Prepare the parameters needed to connect to MySQL
    db_params = dict(
      host=settings['MYSQL_HOST'],
      user=settings['MYSQL_USER'],
      password=settings['MYSQL_PW'],
      db=settings['MYSQL_DB'],
      port=3306,
      use_unicode=True,
      charset=settings['MYSQL_CHARSET'],
      # Specify the cursor type to use
      cursorclass=cursors.DictCursor
    )
    # Create the connection pool object; it takes two arguments:
    # 1. the name of the third-party package used to talk to MySQL
    # 2. the parameters needed to connect to the database
    db_pool = adbapi.ConnectionPool('pymysql', **db_params)
 
    return cls(db_pool)
 
  def __init__(self, db_pool):
    # Store the connection pool object on self.db_pool
    
    self.db_pool = db_pool
 
  def process_item(self, item, spider):
 
    # Schedule the action that writes the item to the database.
    # runInteraction takes:
    # 1. the function that performs the operation
    # 2. the arguments that function needs
    query = self.db_pool.runInteraction(self.insert_item, item)
    # Callback invoked when executing the SQL raises an error
    query.addErrback(self.handle_error, item, spider)
 
    return item
 
  # Callback invoked when inserting the data fails
  def handle_error(self, failure, item, spider):
    print(failure)
    print(item)
 
  # Function that performs the insert
  def insert_item(self, cursor, item):
    # Build the SQL statement
    sql = "INSERT INTO ivs(alt, src) VALUES (%s, %s)"
    # Execute the SQL statement
    cursor.execute(sql, (item['alt'], item['src']))
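To illustrate what `runInteraction` and `addErrback` do for the pipeline, here is a rough standard-library imitation. It is not Twisted's implementation: `run_interaction` and `add_errback` are hypothetical helpers, and `sqlite3` with a thread pool stands in for `pymysql` and `adbapi.ConnectionPool`. The idea it mirrors is that the insert function runs in a worker thread as one transaction, committed on success and rolled back on failure, while the caller returns immediately and only failed items reach the error callback.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor


def run_interaction(executor, db_path, interaction, *args):
    """Run interaction(cursor, *args) in a worker thread as one transaction."""
    def task():
        conn = sqlite3.connect(db_path)
        try:
            result = interaction(conn.cursor(), *args)
            conn.commit()      # success: make the write permanent
            return result
        except Exception:
            conn.rollback()    # failure: undo any partial write
            raise
        finally:
            conn.close()
    return executor.submit(task)  # returns at once with a Future


def add_errback(future, errback, *args):
    """Call errback(exception, *args) only if the task failed."""
    def done(f):
        exc = f.exception()
        if exc is not None:
            errback(exc, *args)
    future.add_done_callback(done)


def insert_item(cursor, item):
    # sqlite3 uses ? placeholders; pymysql uses %s as in the pipeline above.
    cursor.execute("INSERT INTO ivs(alt, src) VALUES (?, ?)",
                   (item["alt"], item["src"]))


def demo():
    """Insert one valid and one broken item; return (row count, errors)."""
    errors = []

    def handle_error(exc, item):
        errors.append((type(exc).__name__, item))

    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "demo.db")
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE ivs (alt TEXT, src TEXT)")
        conn.commit()
        conn.close()

        items = [{"alt": "a", "src": "/a.jpg"}, {"alt": "no src"}]
        with ThreadPoolExecutor(max_workers=2) as executor:
            for item in items:
                future = run_interaction(executor, db_path, insert_item, item)
                add_errback(future, handle_error, item)

        conn = sqlite3.connect(db_path)
        rows = conn.execute("SELECT COUNT(*) FROM ivs").fetchone()[0]
        conn.close()
    return rows, errors


if __name__ == "__main__":
    print(demo())
```

The second item has no `src` key, so `insert_item` raises inside the worker thread and the error callback receives that item, which is the situation `handle_error` covers in the pipeline.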

6. Registering the pipeline in settings

ITEM_PIPELINES = {
  # 'ivsky_spider.pipelines.MysqlPipeline': 300,
  'ivsky_spider.pipelines.TwistedMysqlPipeline': 300,
}
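The number attached to each pipeline is its priority: Scrapy passes every item through the enabled pipelines in ascending order of these values, which are conventionally kept in the 0 to 1000 range. The sketch below only demonstrates that ordering rule. `CleanItemPipeline` is a made-up entry for illustration, and `execution_order` is a hypothetical helper, not part of Scrapy.

```python
# Hypothetical configuration: CleanItemPipeline does not exist in the article.
ITEM_PIPELINES = {
    'ivsky_spider.pipelines.TwistedMysqlPipeline': 300,
    'ivsky_spider.pipelines.CleanItemPipeline': 100,
}


def execution_order(pipelines):
    """Return pipeline paths in the order they run: lowest value first."""
    ordered = sorted(pipelines.items(), key=lambda pair: pair[1])
    return [path for path, _ in ordered]


if __name__ == "__main__":
    for path in execution_order(ITEM_PIPELINES):
        print(path)
```

With these values the cleaning pipeline would see each item before the MySQL pipeline writes it.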

That is all of the code.

That concludes this article. I hope it helps with your studies, and thank you for supporting 腳本之家.
