
What to do when Python's requests cannot fetch a web page's source

Updated: 2022-07-08 09:47:14  Author: henanlion

Crawlers often need to extract link information from a page's source. This article describes a workaround for cases where Python's requests cannot retrieve that source, with example code for each step.

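While scraping http://skell.sketchengine.eu recently, I found that requests could not retrieve the full content of the page. So I used selenium to open the page in a simulated browser first, took the page source from the browser, and parsed it with BeautifulSoup to extract the example sentences. To keep the loop working from one word to the next, the loop body calls refresh(): when the browser is given a new URL, the refresh makes it update the page content. The script also pauses for 2 seconds after each refresh so the content has time to load, which lowers the chance of coming back with nothing. To reduce the risk of being blocked, we also send a Chrome User-Agent. The code is as follows: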

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import re
import time

# Location of the ChromeDriver executable
path = Service("D:\\MyDrivers\\chromedriver.exe")
# Run Chrome without showing a browser window
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
# Send a regular desktop Chrome User-Agent string
chrome_options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36')

# Create the Chrome instance
driver = webdriver.Chrome(service=path, options=chrome_options)
lst = ["happy", "help", "evening", "great", "think", "adapt"]

for word in lst:
    url = "https://skell.sketchengine.eu/#result?lang=en&query=" + word + "&f=concordance"
    driver.get(url)
    # Refresh so the page loads the data for the new URL
    driver.refresh()
    time.sleep(2)
    # page_source holds the rendered page source
    resp = driver.page_source
    # Parse the source
    soup = BeautifulSoup(resp, "html.parser")
    table = soup.find_all("td")
    with open("eps.txt", 'a+', encoding='utf-8') as f:
        # Header line: the word followed by 的例子 ("examples of")
        f.write(f"\n{word}的例子\n")
        for i in table[0:6]:
            text = i.text
            # Collapse runs of whitespace into a single space
            new = re.sub(r"\s+", " ", text)
            # Write to the text file, starting a new line before each number
            f.write(re.sub(r"^(\d+\.)", r"\n\1", new))
# Shut down the browser and the driver process
driver.quit()
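One likely reason requests comes up short on this site can be seen in the URL itself. In the address built inside the loop, everything after `#` is a URL fragment, and HTTP clients do not send fragments to the server. The sketch below only splits the URL with the standard library; that the page then fills in the sentences with JavaScript is an assumption, not something confirmed here.

```python
from urllib.parse import urlsplit

# Same URL pattern as in the script, with "happy" as the query word.
url = "https://skell.sketchengine.eu/#result?lang=en&query=happy&f=concordance"
parts = urlsplit(url)

# What an HTTP client actually requests from the server: scheme, host and path.
server_side = f"{parts.scheme}://{parts.netloc}{parts.path}"
# What stays on the client; the "?" here is part of the fragment, not a query string.
client_side = parts.fragment

print(server_side)
print(client_side)
print(repr(parts.query))
```

Because only the fragment changes from one word to the next, a browser does not have to reload the page for each new address, which may be why the script calls refresh() after get().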

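1. To speed things up, Chrome runs without a visible window, which is configured through chrome.options.

2. Finally, re regular expressions tidy up the formatting.

3. We slice table[0:6] to get the first three sentences (six cells for three sentences, so each sentence apparently spans two td cells). Each group in the output is headed by the word followed by 的例子 ("examples of"). The final result looks like this: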

happy的例子
1. This happy mood lasted roughly until last autumn.
2. The lodging was neither convenient nor happy .
3. One big happy family "fighting communism".
help的例子
1. Applying hot moist towels may help relieve discomfort.
2. The intense light helps reproduce colors more effectively.
3. My survival route are self help books.
evening的例子
1. The evening feast costs another $10.
2. My evening hunt was pretty flat overall.
3. The area nightclubs were active during evenings .
great的例子
1. The three countries represented here are three great democracies.
2. Our three different tour guides were great .
3. Your receptionist "crew" is great !
think的例子
1. I said yes immediately without thinking everything through.
2. This book was shocking yet thought provoking.
3. He thought "disgusting" was more appropriate.
adapt的例子
1. The novel has been adapted several times.
2. There are many ways plants can adapt .
3. They must adapt quickly to changing deadlines.
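The formatting of the output above can be reproduced without a browser. The sketch below applies the script's two re.sub calls to made-up cell texts. The `cells` list and the `format_cells` helper are hypothetical, and the layout of one number cell followed by one sentence cell is an assumption about the page, not something the article states.

```python
import re

# Hypothetical <td> texts: a number cell followed by a sentence cell.
cells = [
    "1.",
    "\n   This happy mood lasted roughly\n   until last autumn.\n",
    "2.",
    "\n The lodging was neither convenient nor happy .\n",
]

def format_cells(cells):
    """Apply the same two substitutions the script uses before writing."""
    pieces = []
    for text in cells:
        # Collapse every run of whitespace (including newlines) to one space.
        flat = re.sub(r"\s+", " ", text)
        # Start a new line before a cell that begins with "<digits>.".
        pieces.append(re.sub(r"^(\d+\.)", r"\n\1", flat))
    return "".join(pieces)

result = format_cells(cells)
print(result)
```

Note that the second pattern is anchored with `^`, so it only fires when a cell starts with the number. A cell with leading whitespace would need a strip() first.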

Addendum: after some optimization, the script scrapes example sentences more conveniently. The code is as follows:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import os
import re
import time

# Location of the ChromeDriver executable
path = Service("D:\\MyDrivers\\chromedriver.exe")
# Run Chrome without showing a browser window
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
# Send a regular desktop Chrome User-Agent string
chrome_options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36')

def get_wordlist():
    # Read one word per line from wordlist.txt, skipping blank lines
    wordlist = []
    with open("wordlist.txt", 'r', encoding='utf-8') as f:
        for line in f:
            word = line.strip()
            if word:
                wordlist.append(word)
    return wordlist

def main(lst):
    # Create the Chrome instance
    driver = webdriver.Chrome(service=path, options=chrome_options)
    try:
        for word in lst:
            url = "https://skell.sketchengine.eu/#result?lang=en&query=" + word + "&f=concordance"
            driver.get(url)
            # Refresh so the page loads the data for the new URL
            driver.refresh()
            time.sleep(2)
            # page_source holds the rendered page source
            resp = driver.page_source
            # Parse the source
            soup = BeautifulSoup(resp, "html.parser")
            table = soup.find_all("td")
            # Header and sentences go to the same file that is opened at the end
            with open("examples.txt", 'a+', encoding='utf-8') as f:
                f.write(f"\n{word}的例子\n")
                for i in table[0:6]:
                    text = i.text
                    # Collapse runs of whitespace into a single space
                    new = re.sub(r"\s+", " ", text)
                    f.write(new)
                    # Alternative: break the line before each numbered sentence
                    # f.write(re.sub(r"(\.\s)(\d+\.)", r"\1\n\2", new))
    finally:
        # Shut down the browser even if a page fails
        driver.quit()

if __name__ == "__main__":
    lst = get_wordlist()
    main(lst)
    # Open the result in the default application (os.startfile is Windows-only)
    os.startfile("examples.txt")
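Both scripts rely on a fixed two-second pause after the refresh, which can be too short on a slow connection and wasteful on a fast one. A possible alternative is to poll until the cells actually appear. Selenium provides WebDriverWait for this purpose; the sketch below shows the same idea in plain Python so it runs without a browser. `wait_until` and `fake_cells` are hypothetical names introduced only for illustration.

```python
import time

def wait_until(predicate, timeout=10.0, interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)

# Stand-in for a page whose cells only show up on the third check.
calls = {"n": 0}

def fake_cells():
    calls["n"] += 1
    return ["1.", "An example sentence."] if calls["n"] >= 3 else []

cells = wait_until(fake_cells, timeout=2.0, interval=0.01)
print(cells)
```

In the scraper, the predicate could be something like `lambda: BeautifulSoup(driver.page_source, "html.parser").find_all("td")`, replacing the `time.sleep(2)` call. Whether the td cells are a reliable signal that the sentences have finished loading is an assumption that would need checking against the live page.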
