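Scraping Videos with Python (Really Just a Treat): A Walkthrough
A light rain was falling outside, and as a single programmer browsing around I stumbled on something good: the first highlighted answer to the Zhihu question "What do you use Python for?".
I went and took a look, and the video addresses are all in plain text. Right, let's get started.
To download a file as a stream, just set stream=True on the request in the requests library; the requests documentation covers this.
First, try it on a single video URL: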
import requests


def download_file(url, path):
    # stream=True keeps the body from being loaded into memory all at once.
    with requests.get(url, stream=True) as r:
        chunk_size = 1024
        content_size = int(r.headers['content-length'])
        print('Download started')
        with open(path, "wb") as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)


if __name__ == '__main__':
    url = 'it is in the original post...'
    path = 'save it wherever you like'
    download_file(url, path)
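And I got hit with this right away:
AttributeError: __exit__
So even documentation can lie!
It looks like the response object does not implement the __exit__ method that a context manager needs. Since the only goal is to make sure r is closed at the end so that its connection goes back to the pool, closing from contextlib will do: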
from contextlib import closing

import requests


def download_file(url, path):
    # closing() calls r.close() when the block exits, even on an error.
    with closing(requests.get(url, stream=True)) as r:
        chunk_size = 1024
        content_size = int(r.headers['content-length'])
        print('Download started')
        with open(path, "wb") as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
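To see what closing does here, a minimal sketch with a stand-in object may help. FakeResponse and use_response are hypothetical names, not part of requests; the stand-in only has the close() method that closing relies on.

```python
from contextlib import closing


class FakeResponse(object):
    """Hypothetical stand-in for a response: it has close() but no __exit__."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def use_response():
    resp = FakeResponse()
    # closing() turns any object with a close() method into a context
    # manager and calls close() when the with block exits, even on error.
    with closing(resp) as r:
        inside = r.closed  # still open inside the block
    return inside, resp.closed


if __name__ == '__main__':
    print(use_response())
```

The object is open inside the block and closed right after it, which is all the download code needs from r.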
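The program now runs, but as I stare at the file its size does not seem to change. How much has actually finished? Better to get each downloaded piece written to disk promptly, which also saves a bit of memory, right: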
import os
from contextlib import closing

import requests


def download_file(url, path):
    with closing(requests.get(url, stream=True)) as r:
        chunk_size = 1024
        content_size = int(r.headers['content-length'])
        print('Download started')
        with open(path, "wb") as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                f.flush()             # push Python's buffer to the OS
                os.fsync(f.fileno())  # ask the OS to write it to disk
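The effect of flush() and os.fsync() can be observed without a network connection. The sketch below is only an illustration: the function name and the 1024-byte payload are made up, and the size seen before the flush depends on the interpreter's buffer size (for a small write it is typically 0).

```python
import os
import tempfile


def sizes_before_and_after_flush(payload=b'x' * 1024):
    """Return the file size seen on disk before and after flushing a write."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, 'wb') as f:
            f.write(payload)                 # lands in Python's buffer first
            before = os.path.getsize(path)
            f.flush()                        # push Python's buffer to the OS
            os.fsync(f.fileno())             # ask the OS to write it to disk
            after = os.path.getsize(path)
        return before, after
    finally:
        os.remove(path)


if __name__ == '__main__':
    print(sizes_before_and_after_flush())
```

Calling fsync after every chunk makes progress visible in the file size, at the cost of many small disk writes.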
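Now the file grows at a speed visible to the naked eye, and I really feel for my hard disk. Let's go back to letting the data reach the disk at the end and simply keep a count in the program: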
from contextlib import closing

import requests


def download_file(url, path):
    with closing(requests.get(url, stream=True)) as r:
        chunk_size = 1024
        content_size = int(r.headers['content-length'])
        print('Download started')
        with open(path, "wb") as f:
            n = 1  # number of chunks written so far
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                # Estimate progress from the chunk count.
                loaded = n * chunk_size / content_size
                print('Downloaded {0:%}'.format(loaded))
                n += 1
The result is much easier to read:
Downloaded 2.579129%
Downloaded 2.581255%
Downloaded 2.583382%
Downloaded 2.585508%
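Counting chunks slightly overstates progress at the end, because the final chunk is usually shorter than chunk_size. A sketch that sums the bytes actually received instead is shown below; progress_fractions and the fake chunks are hypothetical and stand in for r.iter_content().

```python
def progress_fractions(chunks, content_size):
    """Yield the fraction downloaded after each chunk, from bytes received."""
    loaded = 0
    for chunk in chunks:
        loaded += len(chunk)
        yield loaded / content_size


if __name__ == '__main__':
    # Made-up data standing in for a 2500-byte download.
    fake_chunks = [b'a' * 1024, b'b' * 1024, b'c' * 452]
    for fraction in progress_fractions(fake_chunks, 2500):
        print('Downloaded {0:%}'.format(fraction))
```

With this bookkeeping the last reported value is exactly 100%, whatever the size of the final chunk.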
With ambitions as lofty as mine, how could I be satisfied with just that? Let's write a class to go with it:
import time
from contextlib import closing

import requests


def download_file(url, path):
    with closing(requests.get(url, stream=True)) as r:
        chunk_size = 1024 * 10
        content_size = int(r.headers['content-length'])
        print('Download started')
        with open(path, "wb") as f:
            p = ProgressData(size=content_size, unit='KB', block=chunk_size)
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                p.output()


class ProgressData(object):

    def __init__(self, block, size, unit, file_name=''):
        self.file_name = file_name
        self.block = block / 1000.0  # chunk size in KB
        self.size = size / 1000.0    # total size in KB
        self.unit = unit
        self.count = 0
        self.start = time.time()

    def output(self):
        self.end = time.time()
        self.count += 1
        # Speed is measured over the time taken by the latest chunk.
        elapsed = self.end - self.start
        speed = self.block / elapsed if elapsed > 0 else 0
        self.start = time.time()
        loaded = self.count * self.block
        progress = round(loaded / self.size, 4)
        if loaded >= self.size:
            print('%s download complete\r\n' % self.file_name)
        else:
            print('{0}Progress {1:.2f}{2}/{3:.2f}{4} {5:.2%} Speed {6:.2f}{7}/s'.format(
                self.file_name, loaded, self.unit,
                self.size, self.unit, progress, speed, self.unit))
            # Draw the remaining fraction as a right-aligned bar of slashes.
            print('%50s' % ('/' * int((1 - progress) * 50)))
Run it:
Download started
Progress 10.24KB/120174.05KB 0.01% Speed 4.75KB/s
 /////////////////////////////////////////////////
Progress 20.48KB/120174.05KB 0.02% Speed 32.93KB/s
 /////////////////////////////////////////////////
That looks much nicer.
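The bar printed by ProgressData draws the remaining fraction, so it shrinks as the download proceeds. A variant that fills up instead could look like the sketch below; render_bar is a hypothetical helper, not part of the class above.

```python
def render_bar(progress, width=50):
    """Return a text bar for a progress value between 0.0 and 1.0."""
    progress = min(max(progress, 0.0), 1.0)  # clamp out-of-range values
    filled = int(progress * width)
    return '[' + '#' * filled + '-' * (width - filled) + ']'


if __name__ == '__main__':
    for value in (0.0, 0.25, 0.5, 1.0):
        print(render_bar(value, width=20))
```

Clamping keeps the bar well formed even if the estimated progress briefly exceeds 100% on the last chunk.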
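The next step is to download with several threads at once: the main thread produces URLs and puts them on a queue, and the download threads take URLs from it: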
import hashlib
import os
import queue
import threading
import time
from contextlib import closing

import requests


def download_file(url, path):
    with closing(requests.get(url, stream=True)) as r:
        chunk_size = 1024 * 10
        content_size = int(r.headers['content-length'])
        # Skip files that already exist at full size.
        if os.path.exists(path) and os.path.getsize(path) >= content_size:
            print('Already downloaded')
            return
        print('Download started')
        with open(path, "wb") as f:
            p = ProgressData(size=content_size, unit='KB', block=chunk_size, file_name=path)
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                p.output()


class ProgressData(object):

    def __init__(self, block, size, unit, file_name=''):
        self.file_name = file_name
        self.block = block / 1000.0  # chunk size in KB
        self.size = size / 1000.0    # total size in KB
        self.unit = unit
        self.count = 0
        self.start = time.time()

    def output(self):
        self.end = time.time()
        self.count += 1
        elapsed = self.end - self.start
        speed = self.block / elapsed if elapsed > 0 else 0
        self.start = time.time()
        loaded = self.count * self.block
        progress = round(loaded / self.size, 4)
        if loaded >= self.size:
            print('%s download complete\r\n' % self.file_name)
        else:
            print('{0}Progress {1:.2f}{2}/{3:.2f}{4} {5:.2%} Speed {6:.2f}{7}/s'.format(
                self.file_name, loaded, self.unit,
                self.size, self.unit, progress, speed, self.unit))
            print('%50s' % ('/' * int((1 - progress) * 50)))


url_queue = queue.Queue()
WORKER_COUNT = 4


def run():
    while True:
        url = url_queue.get(timeout=100)
        if url is None:
            # None is the stop signal.
            print('All done')
            break
        # Name the file after the MD5 of its URL.
        h = hashlib.md5()
        h.update(url.encode('utf-8'))
        name = h.hexdigest()
        path = 'e:/download/' + name + '.mp4'
        download_file(url, path)


def get_url():
    # Put real URLs on the queue here, then one None per worker
    # so that every thread receives a stop signal.
    for _ in range(WORKER_COUNT):
        url_queue.put(None)


if __name__ == '__main__':
    get_url()
    threads = []
    for i in range(WORKER_COUNT):
        t = threading.Thread(target=run)
        t.daemon = True
        t.start()
        threads.append(t)
    # Wait for the workers; daemon threads are killed when the main thread exits.
    for t in threads:
        t.join()
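I added a check that skips files which are already downloaded. As for how to keep producing URLs in an endless stream, I will leave that for you to explore. Take care of yourselves!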
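The queue hand-off can be tried without touching the network. The sketch below is one possible arrangement under stated assumptions: run_workers, file_name_for, fake_download and the example.com URLs are all hypothetical, and the handler only records file names instead of downloading anything. It sends one None per worker so that every thread stops, and joins the threads so the program waits for them.

```python
import hashlib
import queue
import threading


def file_name_for(url):
    """Derive a stable file name from the URL using its MD5 digest."""
    return hashlib.md5(url.encode('utf-8')).hexdigest() + '.mp4'


def run_workers(urls, handle, worker_count=4):
    """Feed urls to worker threads through a queue and wait for them."""
    tasks = queue.Queue()

    def worker():
        while True:
            url = tasks.get()
            if url is None:  # stop signal
                break
            handle(url)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)  # one stop signal per worker
    for t in threads:
        t.join()


if __name__ == '__main__':
    done = []
    lock = threading.Lock()

    def fake_download(url):
        # Placeholder for a real download: just record the target name.
        with lock:
            done.append(file_name_for(url))

    run_workers(['http://example.com/a', 'http://example.com/b'], fake_download)
    print(sorted(done))
```

Swapping fake_download for a function that calls download_file would turn the sketch into a real downloader.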