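Learning to Build an Automated Web Scraper with Python and Selenium

Updated: 2018-01-20 14:52:50  Author: Rock_Song

This article walks through worked examples of an automated web scraper built with Python and Selenium and explains the key points along the way, for anyone who needs a reference.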

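Straight to the point: Python with Selenium drives a browser automatically to scrape data from web pages. This covers clicking buttons, navigating between pages, typing into search boxes, storing the useful data from a page, assigning automatic ids in MongoDB, and so on.

1. First, a word on Selenium for Python. It is an automated testing tool that controls a browser and performs operations on web pages. In a crawler it pairs very well with BeautifulSoup, apart from a few particularly nasty verification pages on overseas sites. For image CAPTCHAs I have source code I wrote myself to crack them, with a success rate of about 85%.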

For details, ask in QQ group 607021567 (this is not an ad; the group shares a lot of Python resources, plus some big-data material such as Hadoop).

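2. BeautifulSoup needs no detailed introduction, so here is the link: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ (the official BeautifulSoup documentation)

3. On generating automatic ids in MongoDB. Every document stored in MongoDB has a fixed id (the _id field), but that id is hard for a human to read even though it is trivial for a machine. So whenever I store data, I also assign my own new id to each record.

To use MongoDB from Python you need to import the module: from pymongo import MongoClient, ASCENDING, DESCENDING. Making sure this module is installed is your responsibility!

Now for the program itself, going straight to the examples, one step at a time:

Import the modules: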

from selenium import webdriver
from bs4 import BeautifulSoup
import requests
from pymongo import MongoClient,ASCENDING, DESCENDING
import time
import re

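Every one of these modules has been explained before; re and requests in particular were covered earlier. All of them are core pieces, and none can be left out.

First, a small example: simulating a search on Taobao automatically (source code).

Before that, a word on Selenium's locator methods. The names below are the Selenium 3 helpers; they were deprecated in Selenium 4 and later removed in favor of find_element(By.<STRATEGY>, value):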

find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector
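
For readers on Selenium 4, where the find_element_by_* helpers listed above are no longer available, the correspondence can be sketched as follows. This is a minimal sketch and not part of the original article: LEGACY_LOCATORS and find_legacy are hypothetical names, and the strategy strings are the values of Selenium's By constants (By.ID is "id", By.XPATH is "xpath", and so on).

```python
# Hypothetical helper: translate a Selenium 3 style locator name into the
# strategy string that Selenium 4's find_element(by, value) expects.
LEGACY_LOCATORS = {
    "find_element_by_id": "id",
    "find_element_by_name": "name",
    "find_element_by_xpath": "xpath",
    "find_element_by_link_text": "link text",
    "find_element_by_partial_link_text": "partial link text",
    "find_element_by_tag_name": "tag name",
    "find_element_by_class_name": "class name",
    "find_element_by_css_selector": "css selector",
}


def find_legacy(driver, legacy_name, value):
    """Call driver.find_element() with the strategy matching a legacy name."""
    strategy = LEGACY_LOCATORS[legacy_name]
    return driver.find_element(strategy, value)
```

With a real driver, find_legacy(driver, "find_element_by_xpath", '//input[@class="search-combobox-input"]') should behave like driver.find_element(By.XPATH, '//input[@class="search-combobox-input"]').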

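Source code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import requests
from pymongo import MongoClient, ASCENDING, DESCENDING
import time
import re


def TaoBao():
    driver = None
    try:
        taobao_url = 'https://www.taobao.com/'
        driver = webdriver.Chrome()
        driver.get(taobao_url)
        time.sleep(5)  # a pause is usually needed here, otherwise the program may well be detected as a spider
        text = 'Strong Man'  # the text to type into the search box
        # send_keys() returns None, so it cannot be chained with .click()
        driver.find_element(By.XPATH, '//input[@class="search-combobox-input"]').send_keys(text)
        driver.find_element(By.XPATH, '//button[@class="btn-search tb-bg"]').click()
    except Exception as e:
        print(e)
    finally:
        if driver is not None:
            driver.quit()  # close the browser even if a step above failed


if __name__ == '__main__':
    TaoBao()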

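To see the effect, just copy the code and run it. I used only the xpath method because it is the most dependable. The XPath strings, shown in orange in the original post (unless I am color-blind), identify the elements being located, and you can find them in the page itself.

Next comes combining Selenium with BeautifulSoup. So far we have only opened the page and do not yet have its source. For that you need the page_source attribute, written as <variable name>.page_source on the driver, which gives you exactly what you are after.

ht = driver.page_source
# print(ht)  # you can print it out to take a look
soup = BeautifulSoup(ht, 'html.parser')

What follows is ordinary BeautifulSoup syntax for working with the structure of the data and collecting it; the previous article covers the scraping operations in detail.

Never mind! Here is the simplest example of locating and extracting an element: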

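soup = BeautifulSoup(ht, 'html.parser')
a = soup.find('table', id="ctl00_ContentMain_SearchResultsGrid_grid")
if a:  # this check is required: the page may not contain the element, and without it the program would stop
    pass  # process the table here

When matching on a class attribute, the keyword must be written class_ (class is a reserved word in Python). Be sure to remember that!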
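
The two points above (check that find() returned something, and use class_ for class attributes) can be sketched on a small inline page. This is a minimal sketch under stated assumptions: SAMPLE_HTML, the table id "results", the class names row-odd and row-even, and the function name extract_links are all made up for illustration and are not the page the author scraped.

```python
from bs4 import BeautifulSoup

# Made-up sample page standing in for driver.page_source.
SAMPLE_HTML = """
<table id="results">
  <tr class="row-odd"><td>1</td><td>A</td><td><a href="/item/1">first</a></td></tr>
  <tr class="row-even"><td>2</td><td>B</td><td><a href="/item/2">second</a></td></tr>
</table>
"""


def extract_links(page_html, table_id="results"):
    """Return the href of the link in the third cell of every result row."""
    soup = BeautifulSoup(page_html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:  # the page may not contain the table at all
        return []
    links = []
    # class_ (with the underscore) filters on the HTML class attribute;
    # passing a list matches rows carrying any of the listed classes.
    for row in table.find_all("tr", class_=["row-odd", "row-even"]):
        cells = row.find_all("td")
        if len(cells) < 3:
            continue
        anchor = cells[2].find("a")
        if anchor is not None and anchor.has_attr("href"):
            links.append(anchor["href"])
    return links
```

Calling extract_links on a page without the table returns an empty list instead of raising, which is the behavior the if check is meant to guarantee.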

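Ha! Now for MongoDB, where the details matter. First you need the module: from pymongo import MongoClient, ASCENDING, DESCENDING

MongoDB's own syntax still applies inside Python, so you need a database handle that is global, a global variable naming the collection, and a connection to the MongoDB server on your machine.

if __name__ == '__main__':
    # names assigned at module level are already global, so no `global` statement is needed
    table = 'mouser_product'  # collection name
    # server address, user name and password (db.authenticate() was removed in PyMongo 4)
    mconn = MongoClient("mongodb://localhost", username='test', password='test', authSource='test')
    db = mconn.test
    TaoBao()

With these defined, we need our own new id to track and label each record: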

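from pymongo import ReturnDocument

def getNewsn():
    # atomically add 1 to this collection's counter (created on first use) and return the new value
    doc = db.sn.find_one_and_update({"_id": table}, {"$inc": {'currentIdValue': 1}}, upsert=True, return_document=ReturnDocument.AFTER)
    return doc["currentIdValue"]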

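This method is generic, so you only need to remember the MongoDB syntax in it. Because it returns a value, it belongs inside a function (getNewsn in the full script below). There is no need to dwell on how it works; understanding the idea is enough. The focus remains on the process of storing data.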
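
The counter idea can be sketched on its own as below. This is a minimal sketch, not the author's exact code: next_sequence is a hypothetical name, counters is assumed to be a PyMongo collection such as db.sn, and it relies on Collection.find_one_and_update, which replaced find_and_modify in newer PyMongo releases.

```python
def next_sequence(counters, name):
    """Atomically increment the counter stored under `name` and return it.

    `counters` is assumed to be a PyMongo collection (for example db.sn).
    """
    doc = counters.find_one_and_update(
        {"_id": name},                    # one counter document per collection name
        {"$inc": {"currentIdValue": 1}},  # add 1; a missing field counts from 0
        upsert=True,                      # create the counter document on first use
        return_document=True,             # True is the value of pymongo.ReturnDocument.AFTER
    )
    return doc["currentIdValue"]
```

Because the increment and the read happen in a single server-side operation, two scrapers running at once should not receive the same id, which the separate update-then-find approach cannot promise.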

count = db[table].count_documents({'data': data})  # look the data up in the database
if count <= 0:  # check whether it is already stored
    ids = getNewsn()  # ids is our newly defined id, an incrementing id that starts at 1
    db[table].insert_one({"ids": ids, "data": data})

That is how our data goes straight into MongoDB. As for why MongoDB is so popular in big-data work: it is lightweight and fast!
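
The check-then-insert step can also be wrapped in a small function. This is a minimal sketch under stated assumptions: insert_if_new is a hypothetical name, collection is assumed to be a PyMongo collection, and next_id is any callable that returns the next incrementing id (for example the article's getNewsn).

```python
def insert_if_new(collection, url, next_id):
    """Insert {"sn": ..., "url": url} unless that url is already stored.

    Returns the new sn, or None when the url already exists.
    """
    # limit=1 lets the server stop counting at the first match
    if collection.count_documents({"url": url}, limit=1) > 0:
        return None
    sn = next_id()  # only consume an id when a record is actually inserted
    collection.insert_one({"sn": sn, "url": url})
    return sn
```

Note that the check and the insert are two separate operations, so this is only safe with a single writer; with several writers, a unique index on the url field would be one way to reject duplicates at the database level.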

Finally, a complete example:

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import requests
from pymongo import MongoClient, ASCENDING, DESCENDING, ReturnDocument
import time
import re


def parser():
    try:
        with open('sitemap.txt', 'r') as f:
            for line in f:
                sorturl = line.strip()
                driver = webdriver.Firefox()
                try:
                    driver.get(sorturl)
                    time.sleep(50)
                    ht = driver.page_source
                    # pageurl(ht)
                    soup = BeautifulSoup(ht, 'html.parser')
                    a = soup.find('a', class_="first-last")
                    if a:
                        pagenum = int(a.get_text().strip())
                        print(pagenum)
                        for page in range(1, pagenum):
                            element = driver.find_element(By.XPATH, '//a[@id="ctl00_ContentMain_PagerTop_%s"]' % page)
                            element.click()
                            # page_source belongs to the driver, not to the clicked element
                            html = driver.page_source
                            pageurl(html)
                            time.sleep(50)
                finally:
                    driver.quit()  # quit once per sitemap URL, after all of its pages are done
    except Exception as e:
        print(e)


def pageurl(ht):
    try:
        soup = BeautifulSoup(ht, 'html.parser')
        a = soup.find('table', id="ctl00_ContentMain_SearchResultsGrid_grid")
        if a:
            # odd rows first, then even rows, replacing the two duplicated loops
            for row_class in ("SearchResultsRowOdd", "SearchResultsRowEven"):
                for row in a.find_all('tr', class_=row_class):
                    td = row.find_all('td')
                    if len(td) > 2:
                        url = td[2].find('a')
                        if url:
                            # 'URL' is a placeholder for the site address, which the original leaves out
                            producturl = 'URL' + url['href']
                            print(producturl)
                            count = db[table].count_documents({"url": producturl})
                            if count <= 0:
                                sn = getNewsn()
                                db[table].insert_one({"sn": sn, "url": producturl})
                                print(str(sn) + ' inserted successfully')
                                time.sleep(3)
                            else:
                                print('exists url')
                            # time.sleep(5)
    except Exception as e:
        print(e)


def getNewsn():
    # atomically increment this collection's counter and return the new value
    doc = db.sn.find_one_and_update({"_id": table}, {"$inc": {'currentIdValue': 1}}, upsert=True, return_document=ReturnDocument.AFTER)
    return doc["currentIdValue"]


if __name__ == '__main__':
    table = 'mous_product'
    mconn = MongoClient("mongodb://localhost", username='test', password='test', authSource='test')
    db = mconn.test
    parser()

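This code grew out of my attempts to get past a foreign site's tedious CAPTCHA page, which honestly left me speechless. The cracking method is still a work in progress. This is the complete source code, unabridged and written entirely by hand!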
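
The full script builds each product URL by joining a site prefix to the href read from the table. A hedged alternative, assuming the hrefs may be relative or absolute, is urljoin from the standard library. BASE_URL and absolute_url are hypothetical names, and the example.com address is a placeholder, not the site the author scraped.

```python
from urllib.parse import urljoin

# Placeholder base address; substitute the address of the page being scraped.
BASE_URL = "https://example.com/catalog/"


def absolute_url(href, base=BASE_URL):
    """Resolve an href against the page it came from."""
    # urljoin handles root-relative, page-relative and already-absolute hrefs.
    return urljoin(base, href)
```

Unlike plain string concatenation, this leaves an already absolute href untouched and avoids doubled or missing slashes.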
