Python crawler: scraping Maoyan movie and Movie Heaven (電影天堂) data, with CSV and MySQL storage explained
Common string methods
# strip leading/trailing whitespace
' hello world '.strip()          # 'hello world'
# split on a given character
'hello world'.split(' ')         # ['hello', 'world']
# replace a substring
'hello world'.replace(' ', '#')  # 'hello#world'
The csv module
Purpose: save scraped data to a local CSV file
Usage flow
- Import the module
- Open the CSV file
- Initialize a writer object
- Write data (the argument is a list)
import csv

with open('test.csv', 'w') as f:
    writer = csv.writer(f)  # initialize the writer object
    # write a single row at a time
    writer.writerow(['超哥哥', 20])
    writer.writerow(['步驚云', 22])

with open('test.csv', 'a') as f:
    writer = csv.writer(f)
    # write several rows at once
    data_list = [('聶風(fēng)', 23), ('秦霜', 30)]
    writer.writerows(data_list)
On Windows, the csv module adds an extra blank line after each row by default; passing newline='' to open() fixes it:
with open('xxx.csv','w',newline='') as f:
Maoyan Top 100 scraping example
Determine the URL
Maoyan Movies - Rankings - Top 100 list
Target
Movie title, lead actors, release date
Steps
1. Check whether the page is loaded dynamically
Right-click - View page source - search for the target keywords (check whether they appear in the raw source)
2. Find the URL pattern
- Page 1: https://maoyan.com/board/4?offset=0
- Page 2: https://maoyan.com/board/4?offset=10
- Page n: offset=(n-1)*10, as the sketch below confirms
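The offset rule is easy to sanity-check in isolation; a minimal sketch (the URL template is taken from the pages above):

# build the first three page URLs from offset = (n-1)*10
base = 'https://maoyan.com/board/4?offset={}'
for n in range(1, 4):
    print(base.format((n - 1) * 10))
# https://maoyan.com/board/4?offset=0
# https://maoyan.com/board/4?offset=10
# https://maoyan.com/board/4?offset=20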
3. Regular expression
<div class="movie-item-info">.*?title="(.*?)".*?class="star">(.*?)</p>.*?releasetime">(.*?)</p>
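The re.S flag is what makes this pattern work: it lets .*? match across newlines, which the board page's HTML is full of. A quick demo against a trimmed-down, purely illustrative snippet of the page:

import re

html = '''<div class="movie-item-info">
<p class="name"><a title="霸王別姬">霸王別姬</a></p>
<p class="star">主演:張國(guó)榮,張豐毅,鞏俐</p>
<p class="releasetime">上映時(shí)間:1993-01-01</p></div>'''

pattern = re.compile(
    '<div class="movie-item-info">.*?title="(.*?)".*?class="star">(.*?)</p>.*?releasetime">(.*?)</p>',
    re.S)  # re.S: let . also match newlines
print(pattern.findall(html))
# [('霸王別姬', '主演:張國(guó)榮,張豐毅,鞏俐', '上映時(shí)間:1993-01-01')]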
4. Write the program skeleton, then flesh it out
- Print the program's execution time
- Random User-Agent (make sure every request uses a fresh random one)
- Clean the scraped data (strings) and pack it into a dict
- Full pipeline: fetch -> parse -> process
- Save the Maoyan data to a local maoyan.csv file
from urllib import request
import time
import re
import csv
import random


class MaoyanSpider(object):
    def __init__(self):
        self.page = 1  # page counter
        self.url = 'https://maoyan.com/board/4?offset={}'
        self.agent = [
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',
            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0',
            'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3)',
        ]

    # fetch
    def get_page(self, url):
        headers = {'User-Agent': random.choice(self.agent)}  # random User-Agent on every request
        req = request.Request(url=url, headers=headers)  # build the request object
        res = request.urlopen(req)  # send the request
        html = res.read().decode('utf-8')  # read the response body
        self.parse_page(html)  # hand straight off to the parser

    # parse
    def parse_page(self, html):
        pattern = re.compile(
            '<div class="movie-item-info">.*?title="(.*?)".*?class="star">(.*?)</p>.*?releasetime">(.*?)</p>',
            re.S)
        r_list = pattern.findall(html)
        # r_list: [('霸王別姬', '\n 主演:張國(guó)榮,張豐毅,鞏俐\n ', '上映時(shí)間:1993-01-01'), (...), (...)]
        self.write_page(r_list)  # write to the CSV file

    # # save and print each record as a dict (alternative version)
    # def write_page(self, r_list):
    #     one_film_dict = {}
    #     for rt in r_list:
    #         one_film_dict['name'] = rt[0].strip()
    #         one_film_dict['star'] = rt[1].strip()
    #         one_film_dict['time'] = rt[2].strip()[5:15]
    #         print(one_film_dict)

    # save to the CSV file with writerows() -- the recommended approach
    def write_page(self, r_list):
        # collect the rows; the final writerows() argument looks like [(), (), ()]
        film_list = []
        with open('maoyan.csv', 'a', newline='') as f:
            writer = csv.writer(f)
            for rt in r_list:
                # pack the cleaned fields into a tuple
                t = (rt[0], rt[1].strip(), rt[2].strip()[5:15])
                film_list.append(t)
            writer.writerows(film_list)

    def main(self):
        for offset in range(0, 31, 10):
            url = self.url.format(offset)
            self.get_page(url)
            time.sleep(random.randint(1, 3))
            print('Page %d scraped' % self.page)
            self.page += 1


if __name__ == '__main__':
    start = time.time()
    spider = MaoyanSpider()
    spider.main()
    end = time.time()
    print('Execution time: %.2f' % (end - start))
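To sanity-check the output, the file can be read back with csv.reader (a quick sketch; assumes maoyan.csv was produced by the run above):

import csv

with open('maoyan.csv', newline='') as f:
    for row in csv.reader(f):
        print(row)  # e.g. ['霸王別姬', '主演:張國(guó)榮,張豐毅,鞏俐', '1993-01-01']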
Persistent storage (MySQL)
Let's review the basics of the pymysql module
import pymysql

db = pymysql.connect('localhost', 'root', '123456', 'maoyandb', charset='utf8')
cursor = db.cursor()  # create a cursor object
# the second argument to execute() is a list that fills in the %s placeholders
cursor.execute('insert into film values(%s,%s,%s)', ['霸王別姬', '張國(guó)榮', '1993'])
db.commit()     # commit so the insert actually takes effect
cursor.close()  # close the cursor
db.close()      # close the connection
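The snippet assumes the maoyandb database and a film table already exist. If they don't, a minimal setup sketch might look like this (the column names and types are assumptions matching the three %s placeholders, not taken from the original):

import pymysql

db = pymysql.connect('localhost', 'root', '123456', charset='utf8')
cursor = db.cursor()
cursor.execute('create database if not exists maoyandb charset utf8')
cursor.execute('use maoyandb')
# assumed schema: three varchar columns for name, star, release time
cursor.execute('create table if not exists film(name varchar(100), star varchar(300), time varchar(50))')
db.commit()
cursor.close()
db.close()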
Let's review how executemany() works in pymysql
import pymysql

# database connection object
db = pymysql.connect('localhost', 'root', '123456', charset='utf8')
cursor = db.cursor()  # cursor object
ins_list = []  # one big list collecting every row
for i in range(2):
    name = input('Enter the name of student %d: ' % (i + 1))
    age = input('Enter the age of student %d: ' % (i + 1))
    ins_list.append([name, age])
ins = 'insert into t3 values(%s,%s)'  # the insert statement
cursor.executemany(ins, ins_list)  # insert many rows in one database round trip
db.commit()     # commit to the database
cursor.close()  # close the cursor
db.close()      # close the connection

# for comparison -- execute() takes one row, executemany() takes a list of rows:
ins = 'insert into maoyanfilm values(%s,%s,%s)'
cursor.execute(ins, ['霸王', '國(guó)榮', '1991'])
cursor.executemany(ins, [
    ['月光寶盒', '周星馳', '1993'],
    ['大圣娶親', '周星馳', '1993']])
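Because executemany() runs inside a single transaction, a failure partway through can be rolled back as a unit. A common pattern around the batch insert (a sketch reusing ins, ins_list, cursor, and db from above):

try:
    cursor.executemany(ins, ins_list)
    db.commit()    # make the whole batch permanent
except Exception as e:
    db.rollback()  # undo the partial batch on any error
    print('insert failed:', e)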
Exercise: store the movie info from the Maoyan example in a MySQL database (prefer executemany())
from urllib import request
import time
import re
import pymysql
import random


class MaoyanSpider(object):
    def __init__(self):
        self.page = 1  # page counter
        self.url = 'https://maoyan.com/board/4?offset={}'
        self.ua_list = [
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',
            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0',
            'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3)',
        ]
        # create the database connection and cursor objects
        self.db = pymysql.connect('localhost', 'root', '123456', 'maoyandb', charset='utf8')
        self.cursor = self.db.cursor()

    # fetch
    def get_page(self, url):
        # random User-Agent on every request
        headers = {'User-Agent': random.choice(self.ua_list)}
        req = request.Request(url=url, headers=headers)
        res = request.urlopen(req)
        html = res.read().decode('utf-8')
        self.parse_page(html)  # hand straight off to the parser

    # parse
    def parse_page(self, html):
        pattern = re.compile(
            '<div class="movie-item-info">.*?title="(.*?)".*?class="star">(.*?)</p>.*?releasetime">(.*?)</p>',
            re.S)
        # r_list: [('霸王別姬', '張國(guó)榮', '1993'), (), ()]
        r_list = pattern.findall(html)
        print(r_list)
        self.write_page(r_list)

    # store in MySQL with executemany([[], [], []])
    def write_page(self, r_list):
        film_list = []
        ins = 'insert into filmtab values(%s,%s,%s)'  # the insert statement
        # clean the data and collect it in film_list
        for rt in r_list:
            one_film = [rt[0], rt[1].strip(), rt[2].strip()[5:15]]
            film_list.append(one_film)
        # one database round trip stores a whole page of records
        self.cursor.executemany(ins, film_list)
        # commit to the database
        self.db.commit()

    def main(self):
        for offset in range(0, 31, 10):
            url = self.url.format(offset)
            self.get_page(url)
            time.sleep(random.randint(1, 3))
            print('Page %d scraped' % self.page)
            self.page += 1
        # disconnect from the database (after all pages are done)
        self.cursor.close()
        self.db.close()


if __name__ == '__main__':
    start = time.time()
    spider = MaoyanSpider()
    spider.main()
    end = time.time()
    print('Execution time: %.2f' % (end - start))
Let's run a couple of SQL queries against the results
1. Movies released more than 20 years ago, with name and release date (the time column stores 'YYYY-MM-DD' strings, which MySQL implicitly casts for the date comparison)
select name,time from filmtab where time<(now()-interval 20 year);
2. Movies released between 1990 and 2000, with name and release date
select name,time from filmtab where time>='1990-01-01' and time<='2000-12-31';
Let's review the MongoDB database
import pymongo

# 1. connection object
conn = pymongo.MongoClient(host='127.0.0.1', port=27017)
db = conn['maoyandb']               # 2. database object
myset = db['filmtab']               # 3. collection object
myset.insert_one({'name': '趙敏'})  # 4. insert a document
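pymongo also offers insert_many() for batch writes, the counterpart of pymysql's executemany(); a minimal sketch reusing the myset collection object from above:

# one round trip inserts several documents at once
myset.insert_many([
    {'name': '霸王別姬', 'star': '張國(guó)榮', 'time': '1993'},
    {'name': '月光寶盒', 'star': '周星馳', 'time': '1993'},
])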
Exercise: store the movie info from the Maoyan example in MongoDB
from urllib import request
import re
import time
import random
import pymongo


class MaoyanSpider(object):
    def __init__(self):
        self.url = 'https://maoyan.com/board/4?offset={}'
        self.num = 0  # counter
        # create the three objects: connection, database, collection
        self.conn = pymongo.MongoClient('localhost', 27017)
        self.db = self.conn['maoyandb']
        self.myset = self.db['filmset']
        self.ua_list = [
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',
            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0',
            'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3)',
        ]

    def get_html(self, url):
        headers = {'User-Agent': random.choice(self.ua_list)}
        req = request.Request(url=url, headers=headers)
        res = request.urlopen(req)
        html = res.read().decode('utf-8')
        # hand straight off to the parser
        self.parse_html(html)

    def parse_html(self, html):
        re_bds = r'<div class="movie-item-info">.*?title="(.*?)".*?class="star">(.*?)</p>.*?releasetime">(.*?)</p>'
        pattern = re.compile(re_bds, re.S)
        # film_list: [('霸王別姬', '張國(guó)榮', '1993'), ()]
        film_list = pattern.findall(html)
        # hand straight off to the writer
        self.write_html(film_list)

    # store in MongoDB
    def write_html(self, film_list):
        for film in film_list:
            film_dict = {
                'name': film[0].strip(),
                'star': film[1].strip(),
                'time': film[2].strip()[5:15]
            }
            # insert into the MongoDB collection
            self.myset.insert_one(film_dict)

    def main(self):
        for offset in range(0, 31, 10):
            url = self.url.format(offset)
            self.get_html(url)
            time.sleep(random.randint(1, 2))


if __name__ == '__main__':
    start = time.time()
    spider = MaoyanSpider()
    spider.main()
    end = time.time()
    print('Execution time: %.2f' % (end - start))
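One caveat: re-running this spider inserts duplicate documents, because insert_one() never checks what is already stored. If that matters, one option is an upsert keyed on the movie name; a sketch of what write_html() could do instead (not part of the original code):

# inside write_html(), replace insert_one() with an upsert:
self.myset.update_one(
    {'name': film_dict['name']},  # filter: match on the movie name
    {'$set': film_dict},          # overwrite the stored fields
    upsert=True)                  # insert if nothing matches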
Movie Heaven (電影天堂) example: scraping second-level pages
1. Check whether the page is static or dynamically loaded
Right-click - View page source
2. Determine the URL
Baidu search: 電影天堂 - 2019 new releases - More
3. Targets
*********Level-1 page***********
1. Movie title
2. Movie detail link
*********Level-2 page***********
1. Download link
4. Steps
Find the URL pattern
Page 1: https://www.dytt8.net/html/gndy/dyzz/list_23_1.html
Page 2: https://www.dytt8.net/html/gndy/dyzz/list_23_2.html
Page n: https://www.dytt8.net/html/gndy/dyzz/list_23_n.html
Write the regular expressions
1. Level-1 page regex (movie title, detail-page link)
<table width="100%".*?<td height="26">.*?<a href="(.*?)".*?>(.*?)</a>
2. Level-2 page regex
<td style="WORD-WRAP.*?>.*?>(.*?)</a>
Implementation
# decode('gbk', 'ignore') -- note the ignore argument
# mind the structure and readability (don't let any one function get bloated)
from urllib import request
import re
import time
import random
import pymysql
from useragents import *  # local module providing ua_list (a list of User-Agent strings)


class FilmSky(object):
    def __init__(self):
        self.url = 'https://www.dytt8.net/html/gndy/dyzz/list_23_{}.html'
        # create the connection and cursor objects
        self.db = pymysql.connect('127.0.0.1', 'root', '123456', 'maoyandb', charset='utf8')
        self.cursor = self.db.cursor()

    # fetch html (both page levels need requests, so one shared function)
    def get_page(self, url):
        req = request.Request(url=url, headers={'User-Agent': random.choice(ua_list)})
        res = request.urlopen(req)
        # the page source declares gb2312, not utf-8;
        # the 'ignore' argument skips any bytes that cannot be decoded
        html = res.read().decode('gbk', 'ignore')
        return html

    # parse and extract (get title and download link in one pass)
    # html is the level-1 page response
    def parse_page(self, html):
        # 1. parse the level-1 page first (movie title and detail link)
        pattern = re.compile(
            '<table width="100%".*?<td height="26">.*?<a href="(.*?)".*?>(.*?)</a>',
            re.S)
        # film_list: [('detail link', 'title'), ()]
        film_list = pattern.findall(html)
        # e.g. [('/html/gndy/dyzz/20190806/58956.html', '2019年驚悚動(dòng)作《報(bào)仇雪恨/血債血償》BD中英雙字幕'), (), ()]
        ins = 'insert into filmsky values(%s,%s)'
        for film in film_list:
            film_name = film[1]
            film_link = 'https://www.dytt8.net' + film[0]
            # 2. with the detail link in hand, fetch the level-2 page and extract the download link
            download_link = self.parse_two_html(film_link)
            # store the title and the download link extracted from the level-2 page
            self.cursor.execute(ins, [film_name, download_link])
            self.db.commit()
            # print for a quick check
            d = {'title': film_name, 'download link': download_link}
            print(d)
            # {'title': '2019年驚悚動(dòng)作《報(bào)仇雪恨/血債血償》BD中英雙字幕', 'download link': 'ftp://ygdy8:ygdy8@yg90.dydytt.net:8590/陽(yáng)光電影www.ygdy8.com.報(bào)仇雪恨.BD.720p.中英雙字幕.mkv'}

    # parse the level-2 page and extract the download link
    def parse_two_html(self, film_link):
        two_html = self.get_page(film_link)
        pattern = re.compile('<td style="WORD-WRAP.*?>.*?>(.*?)</a>', re.S)
        download_link = pattern.findall(two_html)[0]
        return download_link

    # main
    def main(self):
        for page in range(1, 11):
            url = self.url.format(page)
            html = self.get_page(url)
            self.parse_page(html)
            time.sleep(random.randint(1, 3))
            print('Page %d done' % page)


if __name__ == '__main__':
    start = time.time()
    spider = FilmSky()
    spider.main()
    end = time.time()
    print('Execution time: %.2f' % (end - start))
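One fragile spot: parse_two_html() indexes findall(...)[0] directly, so any detail page where the regex finds nothing raises an IndexError and kills the run. A defensive variant (a sketch using the same regex as above):

def parse_two_html(self, film_link):
    two_html = self.get_page(film_link)
    pattern = re.compile('<td style="WORD-WRAP.*?>.*?>(.*?)</a>', re.S)
    link_list = pattern.findall(two_html)
    # some detail pages may not match; return an empty string instead of crashing
    return link_list[0] if link_list else ''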
That's all for this article. Hopefully it helps with your learning, and thanks for supporting 腳本之家.