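Fetching and Processing Amazon Review Data with Python

Updated: 2022-02-19 09:21:37 Author: CorGi_8456

This article shows how to fetch and process Amazon review data with Python. Reviews give a direct picture of whether a product is worth buying, and the star ratings can also be collected and used as a rating weight.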

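1. Analyzing Amazon's review requests

First, open the Network tab of the browser's developer tools, click Clear to empty the log, and load the page once:

Under the Doc filter you will find a GET request whose response already contains the review data we want.

That response does not hold all of the reviews, though. Scroll down the page and you will find a pagination button:

Click it to request the next page. A new request appears under the Fetch/XHR filter, while the Doc filter shows no new GET request. So the full set of reviews is served through XHR requests.

Copy the URL and payload of this POST request. The payload contains the parameters that control paging, so this is the real review request.

The response is raw, unprocessed data. Within it, the parts containing data-hook=\"review\" (the quotes are escaped in the raw response) are the ones that carry reviews. With the analysis done, we can write the request step by step.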

2. Fetching the review content

First assemble the POST parameters and the request URL so that paging can be automated later, then POST to the URL with those parameters:

import requests

headers = {
    'authority': 'www.amazon.it',
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36",
}
 
page = 1
post_data = {
    "sortBy": "recent",
    "reviewerType": "all_reviews",
    "formatType": "",
    "mediaType": "",
    "filterByStar": "",
    "filterByLanguage": "",
    "filterByKeyword": "",
    "shouldAppend": "undefined",
    "deviceType": "desktop",
    "canShowIntHeader": "undefined",
    "pageSize": "10",
    "asin": "B08GHGTGQ2",
}
# Payload parameters that control paging
post_data["pageNumber"] = page
post_data["reftag"] = f"cm_cr_getr_d_paging_btm_next_{page}"
post_data["scope"] = f"reviewsAjax{page}"
# URL for the requested page
spiderurl = f'https://www.amazon.it/hz/reviews-render/ajax/reviews/get/ref=cm_cr_getr_d_paging_btm_next_{page}'
res = requests.post(spiderurl, headers=headers, data=post_data)
if res and res.status_code == 200:
    res = res.content.decode('utf-8')
    print(res)
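The paging parameters can also be wrapped in a small helper so that every page is built the same way. This is only a sketch: `build_page_request`, `BASE_URL` and `BASE_PAYLOAD` are names invented here, and the `reviews-render` path is the one used in the later listings of this article.

```python
# Hypothetical helper (not part of the original script): build the URL and
# payload for a given review page without touching the shared template.
BASE_URL = "https://www.amazon.it/hz/reviews-render/ajax/reviews/get"

# Static part of the payload, copied from the request shown above.
BASE_PAYLOAD = {
    "sortBy": "recent",
    "reviewerType": "all_reviews",
    "formatType": "",
    "mediaType": "",
    "filterByStar": "",
    "filterByLanguage": "",
    "filterByKeyword": "",
    "shouldAppend": "undefined",
    "deviceType": "desktop",
    "canShowIntHeader": "undefined",
    "pageSize": "10",
    "asin": "B08GHGTGQ2",
}


def build_page_request(page, base_payload=BASE_PAYLOAD):
    """Return (url, payload) for the given page number."""
    ref = f"cm_cr_getr_d_paging_btm_next_{page}"
    payload = dict(base_payload)  # copy, so the template stays unchanged
    payload.update({
        "pageNumber": str(page),
        "reftag": ref,
        "scope": f"reviewsAjax{page}",
    })
    return f"{BASE_URL}/ref={ref}", payload


url, payload = build_page_request(2)
print(url)
print(payload["pageNumber"], payload["reftag"], payload["scope"])
```

The returned pair can be passed straight to requests.post(url, headers=headers, data=payload).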

We now have the raw response; the next step is to process it.

3. Processing the review data

In the raw response, each segment is separated by "&&&", and the fields inside each segment are separated by '","':

So we use Python's split method to break the string into lists:

# Split the response string into segments
contents = res.split('&&&')
for content in contents:
    infos = content.split('","')
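Splitting on '","' works, but it can cut in the wrong place if a review happens to contain that sequence. Below is a minimal alternative sketch, assuming each "&&&"-separated segment is a JSON array whose last element is an HTML fragment. That is an assumption about the response format, and `split_chunks`, `review_fragments` and the sample string are made up for illustration.

```python
import json


def split_chunks(raw):
    """Split the raw response on '&&&' and decode each segment as a JSON array."""
    rows = []
    for chunk in raw.split("&&&"):
        chunk = chunk.strip()
        if not chunk:
            continue
        try:
            row = json.loads(chunk)
        except json.JSONDecodeError:
            continue  # skip segments that are not valid JSON
        if isinstance(row, list):
            rows.append(row)
    return rows


def review_fragments(raw):
    """Return the HTML fragments that contain a review block."""
    return [
        row[-1]
        for row in split_chunks(raw)
        if row and isinstance(row[-1], str) and 'data-hook="review"' in row[-1]
    ]


# Invented sample that mimics the assumed layout of the response.
sample = (
    '["update","#cm_cr-review_list",""]\n&&&\n'
    '["append","#cm_cr-review_list","<div data-hook=\\"review\\">\\n<span>ok</span></div>"]\n&&&\n'
)

fragments = review_fragments(sample)
print(len(fragments))
print(fragments[0])
```

Because the JSON decoder handles the escaped quotes and newlines, the fragments need no further replace() calls before they are handed to a selector.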

Splitting on '","' produces a new list whose last element is the review HTML. After stripping the backslashes, the escaped "\n" sequences and the trailing '"]', that element can be parsed with CSS or XPath selectors:

from scrapy import Selector

for content in contents:
    infos = content.split('","')
    # The last field holds the HTML; strip the trailing '"]' and the escape characters
    info = infos[-1].replace('"]','').replace('\\n','').replace('\\','')
    # Only segments that contain a review block are parsed
    if 'data-hook="review"' in info:
        sel = Selector(text=info)
        data = {}
        data['username'] = sel.xpath('//span[@class="a-profile-name"]/text()').extract_first() # user name
        data['point'] = sel.xpath('//span[@class="a-icon-alt"]/text()').extract_first() # rating
        data['date'] = sel.xpath('//span[@data-hook="review-date"]/text()').extract_first() # date and location
        data['review'] = sel.xpath('//span[@data-hook="review-title"]/span/text()').extract_first() # review title
        data['detail'] = sel.xpath('//span[@data-hook="review-body"]').extract_first() # review body (HTML)
        # '//' is needed so the div is matched anywhere in the fragment
        image = sel.xpath('//div[@class="review-image-tile-section"]').extract_first()
        data['image'] = image if image else "not image" # images
        print(data)
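The extracted fields are still raw strings: the rating is a sentence and the review body is HTML. The sketch below shows one way to normalize them using only the standard library. It assumes the rating text looks like "4,0 su 5 stelle" (a guess at the amazon.it wording); `parse_rating` and `html_to_text` are hypothetical helpers, and the regex-based tag stripping is deliberately crude.

```python
import html
import re


def parse_rating(text):
    """Pull the first number out of a rating string, e.g. '4,0 su 5 stelle' -> 4.0."""
    if not text:
        return None
    match = re.search(r"\d+(?:[.,]\d+)?", text)
    if not match:
        return None
    return float(match.group(0).replace(",", "."))


def html_to_text(fragment):
    """Crude tag stripping: drop tags, unescape entities, collapse whitespace."""
    if not fragment:
        return ""
    no_tags = re.sub(r"<[^>]+>", " ", fragment)
    return " ".join(html.unescape(no_tags).split())


print(parse_rating("4,0 su 5 stelle"))
print(html_to_text('<span data-hook="review-body"><span>Ottimo<br>prodotto &amp; veloce</span></span>'))
```

With numeric ratings, the scores can be averaged or weighted directly instead of being parsed again later.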

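4. Putting the code together

4.1 Proxy setup

A stable IP proxy is the most useful tool for this kind of data collection. Amazon still cannot be reached reliably from mainland China, and connections often fail. Here I use an ipidea proxy to reach Amazon's Italian site. Proxies can be obtained either with account credentials or through an API, and the connection speed has been stable.

URL: http://www.ipidea.net/?utm-source=csdn&utm-keyword=?wb

The method that fetches a proxy: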

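    # Fetch a proxy IP from the provider's API
    def getApiIp(self):
        # Fetch exactly one IP (Italy)
        api_url = 'proxy API URL'  # placeholder: put the provider's API URL here
        try:
            # The request sits inside the try block so network errors are caught too
            res = requests.get(api_url, timeout=5)
            if res.status_code == 200:
                api_data = res.json()['data'][0]
                proxies = {
                    'http': 'http://{}:{}'.format(api_data['ip'], api_data['port']),
                    'https': 'http://{}:{}'.format(api_data['ip'], api_data['port']),
                }
                print(proxies)
                return proxies
            print('Failed to fetch a proxy')
        except Exception:
            print('Failed to fetch a proxy')
        return None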
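The step that turns the API response into a requests-style proxies dictionary can be separated out and checked without any network access. A small sketch, assuming the JSON layout implied by the method above (a "data" list whose entries have "ip" and "port" keys); `proxies_from_payload` is a name invented here, and the address in the example comes from a documentation-only IP range.

```python
def proxies_from_payload(payload):
    """Build a requests-style proxies dict from the decoded API response.

    Returns None when the payload does not have the expected shape.
    """
    try:
        entry = payload["data"][0]
        address = f"http://{entry['ip']}:{entry['port']}"
    except (KeyError, IndexError, TypeError):
        return None
    # The same HTTP proxy endpoint is used for both schemes, as in the method above.
    return {"http": address, "https": address}


example = {"data": [{"ip": "203.0.113.7", "port": 8080}]}
print(proxies_from_payload(example))
print(proxies_from_payload({"data": []}))
```

Returning None for malformed payloads lets the caller decide whether to retry or give up.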

4.2 Paging with a while loop

A while loop handles paging. Reviews are capped at 99 pages, so the loop breaks once page 99 has been processed:

    def getPLPage(self):
        while True:
            # Payload parameters that control paging
            self.post_data["pageNumber"] = self.page
            self.post_data["reftag"] = f"cm_cr_getr_d_paging_btm_next_{self.page}"
            self.post_data["scope"] = f"reviewsAjax{self.page}"
            # URL for the requested page
            spiderurl = f'https://www.amazon.it/hz/reviews-render/ajax/reviews/get/ref=cm_cr_getr_d_paging_btm_next_{self.page}'
            # getRes is our own request wrapper, shown in the complete code
            res = self.getRes(spiderurl, self.headers, '', self.post_data, 'POST')
            if res:
                res = res.content.decode('utf-8')
                # Split the response string into segments
                contents = res.split('&&&')
                for content in contents:
                    infos = content.split('","')
                    info = infos[-1].replace('"]','').replace('\\n','').replace('\\','')
                    # Only segments that contain a review block are parsed
                    if 'data-hook="review"' in info:
                        sel = Selector(text=info)
                        data = {}
                        data['username'] = sel.xpath('//span[@class="a-profile-name"]/text()').extract_first() # user name
                        data['point'] = sel.xpath('//span[@class="a-icon-alt"]/text()').extract_first() # rating
                        data['date'] = sel.xpath('//span[@data-hook="review-date"]/text()').extract_first() # date and location
                        data['review'] = sel.xpath('//span[@data-hook="review-title"]/span/text()').extract_first() # review title
                        data['detail'] = sel.xpath('//span[@data-hook="review-body"]').extract_first() # review body (HTML)
                        image = sel.xpath('//div[@class="review-image-tile-section"]').extract_first()
                        data['image'] = image if image else "not image" # images
                        print(data)
            # Stop after page 99, the last page that is served
            if self.page < 99:
                print('Next Page')
                self.page += 1
            else:
                break
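Many products have far fewer than 99 pages of reviews, so it can be worth stopping as soon as a page comes back empty. The sketch below isolates that paging logic from the network code. `iter_reviews` and `fetch_page` are hypothetical names, and `fetch_page` stands in for whatever function requests and parses one page.

```python
def iter_reviews(fetch_page, max_pages=99):
    """Yield reviews page by page, stopping at the first empty page or at max_pages.

    fetch_page(page) is assumed to return a list of parsed reviews for that page.
    """
    for page in range(1, max_pages + 1):
        reviews = fetch_page(page)
        if not reviews:
            break  # no reviews on this page: assume the end has been reached
        yield from reviews


# Stand-in for the real request: two pages of invented data, then nothing.
demo_pages = {1: ["review 1", "review 2"], 2: ["review 3"]}
print(list(iter_reviews(lambda page: demo_pages.get(page, []))))
```

Keeping the loop separate from the request code also makes the stopping rule easy to test with fake data, as the last two lines do.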

The complete code:

# coding=utf-8
import requests
from scrapy import Selector


class getReview():
    page = 1
    headers = {
        'authority': 'www.amazon.it',
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36",
    }
    # The asin value in post_data is hard-coded; it comes from this link:
    # "https://www.amazon.it/product-reviews/B08GHGTGQ2?ie=UTF8&pageNumber=1&reviewerType=all_reviews&pageSize=10&sortBy=recent"
    # The asin may change; if it has to be looked up, a GET request is enough.
    post_data = {
        "sortBy": "recent",
        "reviewerType": "all_reviews",
        "formatType": "",
        "mediaType": "",
        "filterByStar": "",
        "filterByLanguage": "",
        "filterByKeyword": "",
        "shouldAppend": "undefined",
        "deviceType": "desktop",
        "canShowIntHeader": "undefined",
        "pageSize": "10",
        "asin": "B08GHGTGQ2",
    }

    def getPLPage(self):
        while True:
            # Payload parameters that control paging
            self.post_data["pageNumber"] = self.page
            self.post_data["reftag"] = f"cm_cr_getr_d_paging_btm_next_{self.page}"
            self.post_data["scope"] = f"reviewsAjax{self.page}"
            # URL for the requested page
            spiderurl = f'https://www.amazon.it/hz/reviews-render/ajax/reviews/get/ref=cm_cr_getr_d_paging_btm_next_{self.page}'
            # Empty proxies argument: getRes fetches a proxy from the API itself
            res = self.getRes(spiderurl, self.headers, '', self.post_data, 'POST')
            if res:
                res = res.content.decode('utf-8')
                # Split the response string into segments
                contents = res.split('&&&')
                for content in contents:
                    infos = content.split('","')
                    info = infos[-1].replace('"]','').replace('\\n','').replace('\\','')
                    # Only segments that contain a review block are parsed
                    if 'data-hook="review"' in info:
                        sel = Selector(text=info)
                        data = {}
                        data['username'] = sel.xpath('//span[@class="a-profile-name"]/text()').extract_first() # user name
                        data['point'] = sel.xpath('//span[@class="a-icon-alt"]/text()').extract_first() # rating
                        data['date'] = sel.xpath('//span[@data-hook="review-date"]/text()').extract_first() # date and location
                        data['review'] = sel.xpath('//span[@data-hook="review-title"]/span/text()').extract_first() # review title
                        data['detail'] = sel.xpath('//span[@data-hook="review-body"]').extract_first() # review body (HTML)
                        image = sel.xpath('//div[@class="review-image-tile-section"]').extract_first()
                        data['image'] = image if image else "not image" # images
                        print(data)
            # Stop after page 99, the last page that is served
            if self.page < 99:
                print('Next Page')
                self.page += 1
            else:
                break

    # Fetch a proxy IP from the provider's API
    def getApiIp(self):
        # Fetch exactly one IP (Italy)
        api_url = 'proxy API URL'  # placeholder: put the provider's API URL here
        try:
            res = requests.get(api_url, timeout=5)
            if res.status_code == 200:
                api_data = res.json()['data'][0]
                proxies = {
                    'http': 'http://{}:{}'.format(api_data['ip'], api_data['port']),
                    'https': 'http://{}:{}'.format(api_data['ip'], api_data['port']),
                }
                print(proxies)
                return proxies
            print('Failed to fetch a proxy')
        except Exception:
            print('Failed to fetch a proxy')
        return None

    # Request wrapper: try up to three times, return None if every attempt fails
    def getRes(self, url, headers, proxies, post_data, method):
        use_api = not proxies  # no proxy passed in: fetch one from the API per attempt
        for i in range(3):
            if use_api:
                proxies = self.getApiIp()
            try:
                if method == 'POST':
                    res = requests.post(url, headers=headers, data=post_data, proxies=proxies)
                else:
                    res = requests.get(url, headers=headers, proxies=proxies)
                if res:
                    return res
            except Exception:
                print(f'Request attempt {i+1} failed')
        return None


if __name__ == '__main__':
    getReview().getPLPage()
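The three-attempt retry inside getRes can also be written as a small stand-alone helper, which makes it easy to try out without touching the network. This is a sketch only: `with_retries` and `flaky` are invented names, and `send` stands for any zero-argument function that performs one request.

```python
def with_retries(send, attempts=3):
    """Call send() up to `attempts` times and return the first non-None result.

    Returns None if every attempt raises an exception or returns None.
    """
    for attempt in range(1, attempts + 1):
        try:
            result = send()
        except Exception as exc:
            print(f"attempt {attempt} failed: {exc}")
            continue
        if result is not None:
            return result
    return None


# Simulated request that fails twice before succeeding.
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("simulated network error")
    return "ok"

print(with_retries(flaky))
```

In the scraper, send could be a lambda that wraps requests.post with the URL, headers, payload and proxies for the current page.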

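Summary

Fetching Amazon reviews came down to two pitfalls: the reviews are loaded through XHR requests, and the response needs processing before it can be used. Once the requests have been analyzed, collecting the data is fairly simple. Find the right request, use a stable IP proxy, and look for the pattern the data segments share, and the problem is solved.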
