
A Detailed Look at a Python Tool for Bulk Image Crawling from Search Engines

Updated: 2020-11-16 09:54:41 Author: aabbcccddd01

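This article introduces a Python tool for crawling images from search engines in bulk. It walks through installation and a complete example, and may be a useful reference for study or work.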

A Python image crawler package

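While working on some image classification tasks recently, I needed to crawl extra pictures from search engines to enlarge our training set. Doing AI really is hard 😭, you even have to know how to write crawlers. There are plenty of crawler tools written in Python out there, but after trying a number of them you find that they either cannot crawl multiple keywords, do not work at all, or get detected and blocked by the site after a short while. In the end I found icrawler, a very handy Python image crawling library. It installs directly with pip and gets the job done in a few lines of code.
Without further ado, here is the install command: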

pip install icrawler

Here is my crawler code:

import os

from icrawler.builtin import BaiduImageCrawler
from icrawler.builtin import BingImageCrawler
from icrawler.builtin import GoogleImageCrawler

# Keywords to crawl: smoking (two wordings), answering the phone,
# making a call, and using a phone, each paired with "pedestrian"
list_word = ['抽煙 行人', '吸煙 行人', '接電話 行人', '打電話 行人', '玩手機 行人']
for word in list_word:
    # Bing crawler
    # Save path: one folder per keyword (os.path.join keeps it portable)
    bing_storage = {'root_dir': os.path.join('bing', word)}
    # Top to bottom: parser threads, downloader threads,
    # and the save path set above
    bing_crawler = BingImageCrawler(parser_threads=2,
                                    downloader_threads=4,
                                    storage=bing_storage)
    # Start crawling: keyword plus the maximum number of images
    bing_crawler.crawl(keyword=word,
                       max_num=2000)

    # Baidu crawler
    # baidu_storage = {'root_dir': os.path.join('baidu', word)}
    # baidu_crawler = BaiduImageCrawler(parser_threads=2,
    #                                   downloader_threads=4,
    #                                   storage=baidu_storage)
    # baidu_crawler.crawl(keyword=word,
    #                     max_num=2000)

    # Google crawler
    # google_storage = {'root_dir': os.path.join('google', word)}
    # google_crawler = GoogleImageCrawler(parser_threads=4,
    #                                     downloader_threads=4,
    #                                     storage=google_storage)
    # google_crawler.crawl(keyword=word,
    #                      max_num=2000)

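This library supports multithreaded crawling across several search engines (Baidu, Bing, and Google), although the Google crawler needs a proxy in regions where Google is not directly reachable. The example shown here crawls Bing. The Baidu and Google code is included in the listing as well, just commented out, and you can of course enable all three at once. A Python crawler library like this is a real pleasure to use.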
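If you do want all three engines running, one way to avoid repeating the same block three times is to plan the work first and then loop over it. The following is only a minimal sketch under a few assumptions: `build_jobs`, `run_jobs`, the `ENGINES` labels, and the `base_dir` parameter are hypothetical names introduced for illustration, not part of icrawler, and the thread counts simply mirror the Bing example. The icrawler import is deferred into `run_jobs` so the planning step can be checked without starting any downloads.

```python
import os
from itertools import product

# Labels used by this sketch only; they are not icrawler identifiers.
ENGINES = ("bing", "baidu", "google")


def build_jobs(keywords, engines=ENGINES, base_dir="."):
    """Return one (engine, keyword, root_dir) tuple per engine/keyword pair."""
    return [
        (engine, word, os.path.join(base_dir, engine, word))
        for engine, word in product(engines, keywords)
    ]


def run_jobs(jobs, max_num=2000):
    """Crawl each job in turn, saving images under the job's root_dir."""
    # Imported here so build_jobs can be used without icrawler installed.
    from icrawler.builtin import (BaiduImageCrawler, BingImageCrawler,
                                  GoogleImageCrawler)

    crawler_classes = {
        "bing": BingImageCrawler,
        "baidu": BaiduImageCrawler,
        "google": GoogleImageCrawler,
    }
    for engine, word, root_dir in jobs:
        crawler = crawler_classes[engine](
            parser_threads=2,
            downloader_threads=4,
            storage={"root_dir": root_dir},
        )
        crawler.crawl(keyword=word, max_num=max_num)


jobs = build_jobs(["抽煙 行人", "吸煙 行人"])
for engine, word, root_dir in jobs:
    print(engine, word, root_dir)
# run_jobs(jobs)  # uncomment to start downloading (needs network access)
```

Printing the plan before calling `run_jobs` makes it easy to confirm that each engine and keyword gets its own folder, and dropping an engine is just a matter of passing a shorter `engines` tuple.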

