
Saving Scrapy Data to Excel and MySQL

Updated: 2023-02-28 15:42:28 Author: 就是搞笑

This article shows how to save data scraped with Scrapy to Excel and to MySQL, with example code for each approach.

1. Excel

Two approaches are covered: openpyxl and pandas.

1.1 openpyxl

from openpyxl import Workbook


class ExcelPipeline:
    def __init__(self):
        # Create the Excel workbook
        self.wb = Workbook()
        # Select the active (first) worksheet
        self.ws = self.wb.active
        # Write the header row
        self.ws.append(['title', 'link', 'country',
                        'author', 'translator', 'publisher',
                        'time', 'price', 'star', 'score',
                        'people', 'comment'
                        ])

    def process_item(self, item, spider):
        self.ws.append([
            item.get('title', ''),
            item.get('link', ''),
            item.get('country', ''),
            item.get('author', ''),
            item.get('translator', ''),
            item.get('publisher', ''),
            item.get('time', ''),
            item.get('price', ''),
            item.get('star', ''),
            item.get('score', ''),
            item.get('people', ''),
            item.get('comment', '')
        ])
        return item

    def close_spider(self, spider):
        self.wb.save('result.xlsx')

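1.1.1 Code explanation

ExcelPipeline is a plain Python class (Scrapy item pipelines do not need to inherit from a base class) that implements three methods: __init__(), process_item(), and close_spider().

In __init__():

It creates an Excel workbook, selects the first worksheet, and writes the header row.
This code could equally be placed in the open_spider method.

In process_item(), each item is written to the worksheet as one row.

The process_item method:

Does not overwrite data that has already been written; ws.append adds the new row after the last existing row.
Each call to process_item therefore appends exactly one row to the end of the sheet.

In close_spider(), the workbook is saved to result.xlsx.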

1.1.2 Note

process_item() uses item.get(key, default) rather than item[key]:

Some items may lack certain keys, and item[key] would then raise a KeyError.

If your data has already been cleaned so that every field is present, item[key] works just as well.

With item.get(key, default), a missing key yields the default, here the empty string ''.

A Scrapy item is usually a dict or a dict-like scrapy.Item, made up of key-value pairs where each pair represents one field. The get method is a convenient way to read a field from it.

get takes two arguments: key is the key to look up, and default is the value returned when the key does not exist.
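The behaviour of get can be checked with a small sketch. The item below is a plain dict with made-up values standing in for a scraped item; note that the default only applies to a missing key, not to a key whose value is None.

```python
# A plain dict standing in for a scraped item (field values are made up)
item = {'title': 'Example Book', 'price': None}

title = item.get('title', '')    # key exists: its value is returned
author = item.get('author', '')  # key missing: the default '' is returned
price = item.get('price', '')    # key exists with value None: None is returned, not ''

print(repr(title), repr(author), repr(price))
```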

1.2 pandas

import pandas as pd


class ExcelPipeline:
    def __init__(self):
        # Create an empty DataFrame with one column per field
        self.df = pd.DataFrame(columns=['title', 'link', 'country',
                                        'author', 'translator', 'publisher',
                                        'time', 'price', 'star', 'score',
                                        'people', 'comment'
                                        ])

    def process_item(self, item, spider):
        # Make sure every field is present, defaulting to ''
        item['title'] = item.get('title', '')
        item['link'] = item.get('link', '')
        item['country'] = item.get('country', '')
        item['author'] = item.get('author', '')
        item['translator'] = item.get('translator', '')
        item['publisher'] = item.get('publisher', '')
        item['time'] = item.get('time', '')
        item['price'] = item.get('price', '')
        item['star'] = item.get('star', '')
        item['score'] = item.get('score', '')
        item['people'] = item.get('people', '')
        item['comment'] = item.get('comment', '')
        # Convert the item to a Series so it can be added as one row
        series = pd.Series(dict(item))
        # DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
        self.df = pd.concat([self.df, series.to_frame().T], ignore_index=True)
        return item

    def close_spider(self, spider):
        # Save the DataFrame to an Excel file
        self.df.to_excel('result.xlsx', index=False)

1.2.1 Code explanation

The ExcelPipeline class defines three methods: __init__, process_item, and close_spider.

__init__ initializes the instance and creates the empty DataFrame.
process_item handles each scraped item and adds it to the DataFrame as a new row.
close_spider runs when the spider closes and saves the collected data to an Excel file.
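Growing a DataFrame one row at a time copies the data on every call. A common alternative, sketched below, is to buffer the rows in a plain list and build the DataFrame once when the spider closes. The class name ExcelListPipeline and the COLUMNS attribute are hypothetical, and pandas is imported inside close_spider only so that the buffering logic carries no hard dependency; a top-level import works equally well.

```python
class ExcelListPipeline:
    """Hypothetical variant: buffer rows in a list, build the DataFrame once."""

    COLUMNS = ['title', 'link', 'country', 'author', 'translator', 'publisher',
               'time', 'price', 'star', 'score', 'people', 'comment']

    def open_spider(self, spider):
        # Start with an empty buffer when the spider starts
        self.rows = []

    def process_item(self, item, spider):
        # One dict per row; missing fields default to ''
        self.rows.append({col: item.get(col, '') for col in self.COLUMNS})
        return item

    def close_spider(self, spider):
        # Build the DataFrame in one step and write it out
        import pandas as pd
        df = pd.DataFrame(self.rows, columns=self.COLUMNS)
        df.to_excel('result.xlsx', index=False)
```

Because the rows are ordinary dicts, this version needs no Series conversion and no pd.concat call per item.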

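1.2.2 Common errors

The code contains many lines of the form item['title'] = item.get('title', '').

You can leave them out, but then any field missing from an item ends up as NaN in the DataFrame instead of an empty string, because the row no longer has a value for every column. Note that get only fills in missing keys: a field explicitly set to None stays None.

Converting the dict to a Series

self.df is a DataFrame, while item is a dict (or dict-like) object, so the item is first converted to a Series and then added to the DataFrame. DataFrame.append was removed in pandas 2.0, so on current versions the row is added with pd.concat:

series = pd.Series(dict(item))
self.df = pd.concat([self.df, series.to_frame().T], ignore_index=True)

The error "only Series and DataFrame objs are valid" is raised by pd.concat when one of the objects passed to it is neither a Series nor a DataFrame, for example a raw dict.

The conversion above avoids this problem.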

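1.3 openpyxl vs. pandas

pandas and openpyxl are both capable Python libraries, and each has its strengths in different scenarios.

If you need to work with many Excel files and perform Excel-specific operations such as formatting or charts, openpyxl is probably the better fit: it focuses on reading, writing, and manipulating Excel files and offers finer control over them.
If the data is already in Python and needs analysis or processing, such as aggregation, pivot tables, grouping, cleaning, or visualization, pandas is probably the better fit, since it provides a rich set of data-processing tools and functions.

In short, both are good tools, and the choice depends on your requirements.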

2. MySQL

You can use a Python MySQL driver such as mysql-connector-python or pymysql. This section focuses on pymysql.

import pymysql.cursors


class MySQLPipeline:
    def __init__(self):
        # Connect to the MySQL database
        self.conn = pymysql.connect(
            host='localhost',
            port=3306,
            user='root',
            password='your_password',
            database='your_database',
            charset='utf8mb4',
            cursorclass=pymysql.cursors.DictCursor
        )
        # Create a cursor object
        self.cursor = self.conn.cursor()
        # Create the table if it does not exist yet
        self.create_table()

    def create_table(self):
        # SQL statement: create the table
        sql = '''CREATE TABLE IF NOT EXISTS `book` (
            `id` int(11) NOT NULL AUTO_INCREMENT,
            `title` varchar(255) NOT NULL,
            `link` varchar(255) NOT NULL,
            `country` varchar(255) NOT NULL,
            `author` varchar(255) NOT NULL,
            `translator` varchar(255) NOT NULL,
            `publisher` varchar(255) NOT NULL,
            `time` varchar(255) NOT NULL,
            `price` varchar(255) NOT NULL,
            `star` varchar(255) NOT NULL,
            `score` varchar(255) NOT NULL,
            `people` varchar(255) NOT NULL,
            `comment` varchar(255) NOT NULL,
            PRIMARY KEY (`id`)
        ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci'''
        # Execute the SQL statement
        self.cursor.execute(sql)
        # Commit the transaction
        self.conn.commit()

    def process_item(self, item, spider):
        # SQL statement: insert one row
        sql = '''INSERT INTO `book` (
                `title`, `link`, `country`,
                `author`, `translator`, `publisher`,
                `time`, `price`, `star`, `score`,
                `people`, `comment`
            ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'''
        # Execute the SQL statement; the driver escapes the parameter values
        self.cursor.execute(sql, (
            item['title'], item['link'], item['country'],
            item['author'], item['translator'], item['publisher'],
            item['time'], item['price'], item['star'], item['score'],
            item['people'], item['comment']
        ))
        # Commit the transaction
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Close the cursor
        self.cursor.close()
        # Close the database connection
        self.conn.close()

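2.1 Code explanation

We define a custom Scrapy pipeline named MySQLPipeline.

The __init__ method opens the MySQL connection using the connection settings written directly in the code.

It also calls create_table; if the table is guaranteed to exist already, that call is unnecessary.

If you would rather not hard-code the connection details in the pipeline, you can define MySQL-related variables in settings.py and read them from there.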
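One way to read such settings is Scrapy's from_crawler class method, which receives the crawler and can look values up in crawler.settings. The sketch below is only an illustration: the class name MySQLSettingsPipeline and the MYSQL_* setting names are assumptions, not names defined by Scrapy, and a SimpleNamespace holding a dict stands in for the real crawler so that the example runs on its own.

```python
from types import SimpleNamespace


class MySQLSettingsPipeline:
    """Hypothetical pipeline that reads its connection parameters from settings."""

    def __init__(self, db_params):
        self.db_params = db_params

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        # MYSQL_* are assumed setting names that you would define in settings.py
        return cls({
            'host': settings.get('MYSQL_HOST', 'localhost'),
            'port': int(settings.get('MYSQL_PORT', 3306)),
            'user': settings.get('MYSQL_USER', 'root'),
            'password': settings.get('MYSQL_PASSWORD', ''),
            'database': settings.get('MYSQL_DATABASE', ''),
            'charset': 'utf8mb4',
        })

    # open_spider would then call pymysql.connect(**self.db_params)


# Stand-in for the crawler Scrapy passes in; a dict mimics settings.get()
fake_crawler = SimpleNamespace(
    settings={'MYSQL_HOST': 'db.example.com', 'MYSQL_PORT': '3307'}
)
pipeline = MySQLSettingsPipeline.from_crawler(fake_crawler)
print(pipeline.db_params['host'], pipeline.db_params['port'])
```

Keeping the parameters in one dict means the connection call stays a single line, and credentials live in settings.py rather than in the pipeline code.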

The create_table method creates the book table.

The process_item method inserts each scraped item into the table.

The close_spider method closes the cursor and the connection.

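2.2 About pymysql

2.2.1 Cursor objects

To work with a database from Python, you first create a connection object and then create a cursor object from that connection.

The cursor is the main object for performing database operations: it sends queries to the database and retrieves the results.

pymysql provides several cursor classes, including Cursor, DictCursor, and SSCursor.

Cursor: the default cursor; rows are returned as tuples.
DictCursor: rows are returned as dicts keyed by column name.
SSCursor: an unbuffered (server-side) cursor, intended for large result sets.

Unlike the default cursor, SSCursor does not read the whole result set into memory; it fetches rows from the server as they are needed.

This keeps client memory use low for large queries. The trade-off is that the connection stays occupied until every row has been read, and the total row count is not known in advance.

Choosing the cursor type that fits the task makes the returned results easier to handle.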

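2.2.2 Cursor types in practice

The connection is created with this argument:

cursorclass=pymysql.cursors.DictCursor

It sets the type of the rows the cursor returns. The default is a tuple; with DictCursor each row is a dict, which is often more convenient to work with. For most cases the default cursor is sufficient.

The three cursor types differ mainly in how query results are accessed:

# conn is an open pymysql connection
cur = conn.cursor()
cur.execute('SELECT * FROM my_table')
result = cur.fetchone()  # fetch one record

# Default cursor: result is a tuple, e.g. (1, 'Alice')
print(result[0])  # value of the first column
# DictCursor: result is a dict, e.g. {'id': 1, 'name': 'Alice'}
print(result['id'])  # value of the column named id

# SSCursor (unbuffered): result is a tuple, as with the default cursor
print(result[0])  # value of the first column

When a query returns several rows, fetchall() returns a sequence of rows, each of which is a tuple or a dict depending on the cursor type (shown here as lists):

# Default cursor
[(1, 'John', 'Doe'), (2, 'Jane', 'Doe'), (3, 'Bob', 'Smith')]
# DictCursor
[{'id': 1, 'first_name': 'John', 'last_name': 'Doe'}, {'id': 2, 'first_name': 'Jane', 'last_name': 'Doe'}, {'id': 3, 'first_name': 'Bob', 'last_name': 'Smith'}]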
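The relationship between the two row shapes can be illustrated without a database. The sketch below pairs column names with tuple values to produce dict rows, which is conceptually what a dict cursor does for you; the column names and rows are the made-up sample data from above.

```python
# Column names and tuple rows as a default cursor might return them (made-up data)
columns = ('id', 'first_name', 'last_name')
tuple_rows = [(1, 'John', 'Doe'), (2, 'Jane', 'Doe'), (3, 'Bob', 'Smith')]

# Pair each column name with the value in the same position
dict_rows = [dict(zip(columns, row)) for row in tuple_rows]

print(tuple_rows[0][1])            # positional access
print(dict_rows[0]['first_name'])  # access by column name
```

Access by name keeps working if the column order of the query changes, whereas positional access depends on that order.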

3. Additional notes

Scrapy calls the pipeline's process_item method once for every item passed to the pipeline.

The open_spider and close_spider methods, by contrast, run only once each, when the spider starts and when it finishes. __init__ likewise runs only once, when the pipeline is instantiated at startup.

You can therefore open the database connection in open_spider, use it in process_item to store each item, and close it in close_spider.

The single connection is shared by every process_item call and closed when the spider ends, which avoids opening and closing a connection for each item and improves crawl performance.
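The lifecycle described above can be sketched as follows. To keep the example runnable without a MySQL server, sqlite3 from the standard library is used here as a stand-in for pymysql; the class name SQLitePipeline and the two-column table are assumptions made for illustration, and the calls at the bottom imitate what Scrapy would do.

```python
import sqlite3


class SQLitePipeline:
    """Stand-in for MySQLPipeline; sqlite3 replaces pymysql so this runs anywhere."""

    def open_spider(self, spider):
        # Runs once when the spider starts: open the shared connection
        self.conn = sqlite3.connect(':memory:')
        self.conn.execute('CREATE TABLE IF NOT EXISTS book (title TEXT, link TEXT)')

    def process_item(self, item, spider):
        # Runs once per item: reuse the connection opened above
        self.conn.execute('INSERT INTO book (title, link) VALUES (?, ?)',
                          (item['title'], item['link']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Runs once when the spider finishes: release the connection
        self.conn.close()


# Simulate the calls Scrapy would make, with spider=None as a placeholder
pipeline = SQLitePipeline()
pipeline.open_spider(None)
for i in range(3):
    pipeline.process_item({'title': f'Book {i}', 'link': f'https://example.com/{i}'}, None)
row_count = pipeline.conn.execute('SELECT COUNT(*) FROM book').fetchone()[0]
pipeline.close_spider(None)
print(row_count)
```

With pymysql the structure is the same: pymysql.connect goes in open_spider, and the cursor and connection are closed in close_spider.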
