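Aliyun Drive took off a while ago and gave away a lot of free storage. Its downloads are also not speed-limited, which is a big advantage over Baidu Netdisk. Over the last couple of days I came across a third-party site that can search resources shared on Aliyun Drive, but it does not sort its results by time. As a result, old resources whose links have already expired can end up near the top. Here I use Python to scrape the results and re-sort them.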

Page analysis

The site offers two search routes, search route one and search route two. This article uses search route two.

Open the Network tab of the browser's developer tools and a GET request to search.html stands out immediately.

The request carries several parameters, four of which matter:

page: the page number
keyword: the search keyword
category: the file category, one of all, video, image, doc (documents), audio, zip (archives), others; the script defaults to all
search_model: the search route
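
The four parameters above can be turned into a request URL as in the following minimal sketch. The helper name build_search_url is hypothetical, and the value 2 for search_model is taken from the complete script at the end of this article. The complete script also sends several other parameters, so the site may not accept this shortened query.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.alipanso.com/search.html'


def build_search_url(keyword, page=1, category='all', search_model=2):
    """Return the search URL for one result page (hypothetical helper)."""
    params = {
        'page': page,
        'keyword': keyword,
        'category': category,
        'search_model': search_model,
    }
    # urlencode percent-encodes non-ASCII keywords as UTF-8
    return BASE_URL + '?' + urlencode(params)


print(build_search_url('python', page=3))
```

requests does the same encoding internally when a dict is passed as params, which is what the script in this article relies on.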

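The developer tools also show that the link that jumps to Aliyun Drive for the real address sits on the result title. Parsing the a tag under each div (class=resource-item border-dashed-eee) with bs4 gives the address that jumps to the drive, and parsing the p tag under the same div gives the resource date.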
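
To make the parsing step concrete, here is a small sketch that runs bs4 on an inline HTML fragment. The fragment is made up for illustration: only the div class and the p tag with class em come from the script in this article, and the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure described above
SAMPLE_HTML = '''
<div class="resource-item border-dashed-eee">
  <a href="s/abc123">Python course</a>
  <p class="em">2021-11-20</p>
</div>
<div class="resource-item border-dashed-eee">
  <a href="s/def456">Entry without a date</a>
</div>
'''

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
items = []
for div in soup.find_all('div', class_='resource-item border-dashed-eee'):
    link = div.find('a')
    date_tag = div.find('p', class_='em')
    # Skip blocks that lack either the title link or the date
    if link is None or date_tag is None:
        continue
    items.append({
        'date': date_tag.text.strip(),
        'name': link.text.strip(),
        'href': link['href'],
    })

print(items)
```

Passing the full class string to class_ only matches when the class attribute is exactly that string. soup.select('div.resource-item.border-dashed-eee') is an alternative that does not depend on the order of the class names.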

Fetching and parsing

First install the third-party libraries the script needs: bs4 for parsing the page, and requests for fetching it.

pip3 install requests bs4

Below is the script that fetches and parses the pages, then sorts the results by date in descending order.

import requests
from bs4 import BeautifulSoup
import string


word = input('Enter the resource name to search for: ')

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
}

result_list = []
# Walk through at most the first 10 result pages
for i in range(1, 11):
    print('Searching page {}'.format(i))
    params = {
        'page': i,
        'keyword': word,
        'search_folder_or_file': 0,
        'is_search_folder_content': 0,
        'is_search_path_title': 0,
        'category': 'all',
        'file_extension': 'all',
        # 2 is the value used in the complete script (aliso.py) below
        'search_model': 2
    }
    response_html = requests.get('https://www.alipanso.com/search.html', headers=headers, params=params)
    response_data = response_html.content.decode()

    soup = BeautifulSoup(response_data, "html.parser")
    divs = soup.find_all('div', class_='resource-item border-dashed-eee')

    # No result blocks on this page: stop paging
    if len(divs) <= 0:
        break

    # The first matching div is skipped
    for div in divs[1:]:
        p = div.find('p', class_='em')
        if p is None:
            break

        download_url = 'https://www.alipanso.com/' + div.a['href']
        date = p.text.strip()
        name = div.a.text.strip()
        result_list.append({'date': date, 'name': name, 'url': download_url})

    # Nothing collected so far: stop paging
    if len(result_list) == 0:
        break

# Newest first; this compares the date strings as text
result_list.sort(key=lambda k: k.get('date'), reverse=True)
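
The sort above compares the date strings as text, which only gives chronological order when every date is zero-padded in the same layout. The following sketch shows one way to sort on parsed dates instead. It assumes the dates look like YYYY-MM-DD, which the article does not confirm. The helper name parse_date and the sample rows are made up for illustration.

```python
from datetime import date, datetime


def parse_date(text, fmt='%Y-%m-%d'):
    """Parse a date string; fall back to date.min when it does not match fmt."""
    try:
        return datetime.strptime(text.strip(), fmt).date()
    except ValueError:
        return date.min


# Made-up rows in the same shape as result_list
rows = [
    {'date': '2021-9-30', 'name': 'b', 'url': 'https://example.com/b'},
    {'date': '2021-11-02', 'name': 'a', 'url': 'https://example.com/a'},
    {'date': 'unknown', 'name': 'c', 'url': 'https://example.com/c'},
]

# Newest first; rows with an unparseable date sink to the end
rows.sort(key=lambda r: parse_date(r['date']), reverse=True)
print([r['name'] for r in rows])
```

A plain string sort would put '2021-9-30' ahead of '2021-11-02', because '9' compares greater than '1' as a character.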


Template

After scraping, each link still has to be copied into Google Chrome one by one, which is tedious. It would be much nicer to open a resource with a single click, so here I use Python's template mechanism to write an HTML file.

I built the template file with Element Plus, the Vue 3 version of Element UI. The key code is:

<body>
    <div id="app">
        <el-table :data="table" style="width: 100%" :row-class-name="tableRowClassName">
            <el-table-column prop="date" label="Date" width="180"> </el-table-column>
            <el-table-column prop="name" label="Name" width="600"> </el-table-column>
            <el-table-column label="Link">
              <template v-slot="scope">
                <a :href="scope.row.url"
                  target="_blank"
                  class="buttonText">{{scope.row.url}}</a>
              </template>
            </el-table-column>
        </el-table>
    </div>

    <script>
      const App = {
        data() {
          return {
              table: ${elements}
          };
        }
      };
      const app = Vue.createApp(App);
      app.use(ElementPlus);
      app.mount("#app");
    </script>
</body>

In Python, read this template file and replace the ${elements} placeholder with the parsed results from above, then write the output to a report.html file.

with open("aliso.html", encoding='utf-8') as t:
    template = string.Template(t.read())

final_output = template.substitute(elements=result_list)
with open("report.html", "w", encoding='utf-8') as output:
    output.write(final_output)
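
Substituting the list directly relies on Python's text form of a list of dicts also being a valid JavaScript array, which holds for plain strings but not for values such as None or True. The sketch below shows a more explicit variant that serializes with json.dumps first. The one-line template string and the sample row are stand-ins made up for illustration, not the real aliso.html.

```python
import json
import string

# Stand-in for the contents of aliso.html
template = string.Template('table: ${elements}')

# Made-up row in the same shape as result_list
rows = [{'date': '2021-11-20', 'name': 'He said "hi"', 'url': 'https://example.com/s/abc'}]

# json.dumps emits a literal that JavaScript can read; ensure_ascii=False keeps
# non-ASCII names readable in the generated file
elements = json.dumps(rows, ensure_ascii=False)
# Escape "</" so a name containing "</script>" cannot end the script block early
elements = elements.replace('</', '<\\/')

final_output = template.substitute(elements=elements)
print(final_output)
```

In the script from this article, the same idea would mean passing the json.dumps result to template.substitute in place of result_list.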

Sample result: clicking a link in the generated report.html jumps to the Aliyun Drive page.

Complete code

aliso.html

<html>
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width,initial-scale=1.0" />
    <script src="https://unpkg.com/vue@next"></script>
    <!-- import CSS -->
    <link rel="stylesheet" >
    <!-- import JavaScript -->
    <script src="https://unpkg.com/element-plus"></script>
    <title>Aliyun Drive resources</title>
  </head>
  <body>
    <div id="app">

        <el-table :data="table" style="width: 100%" :row-class-name="tableRowClassName">
            <el-table-column prop="date" label="Date" width="180"> </el-table-column>
            <el-table-column prop="name" label="Name" width="600"> </el-table-column>
            <el-table-column label="Link">
              <template v-slot="scope">
                <a :href="scope.row.url"
                  target="_blank"
                  class="buttonText">{{scope.row.url}}</a>
              </template>
            </el-table-column>
        </el-table>
    </div>

    <script>
      const App = {
        data() {
          return {
              table: ${elements}
            
          };
        }
      };
      const app = Vue.createApp(App);
      app.use(ElementPlus);
      app.mount("#app");
    </script>
  </body>
</html>

aliso.py

# -*- coding: UTF-8 -*-

import requests
from bs4 import BeautifulSoup
import string


word = input('Enter the resource name to search for: ')

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
}

result_list = []
# Walk through at most the first 10 result pages
for i in range(1, 11):
    print('Searching page {}'.format(i))
    params = {
        'page': i,
        'keyword': word,
        'search_folder_or_file': 0,
        'is_search_folder_content': 0,
        'is_search_path_title': 0,
        'category': 'all',
        'file_extension': 'all',
        'search_model': 2
    }
    response_html = requests.get('https://www.alipanso.com/search.html', headers=headers, params=params)
    response_data = response_html.content.decode()

    soup = BeautifulSoup(response_data, "html.parser")
    divs = soup.find_all('div', class_='resource-item border-dashed-eee')

    # No result blocks on this page: stop paging
    if len(divs) <= 0:
        break

    # The first matching div is skipped
    for div in divs[1:]:
        p = div.find('p', class_='em')
        if p is None:
            break

        download_url = 'https://www.alipanso.com/' + div.a['href']
        date = p.text.strip()
        name = div.a.text.strip()
        result_list.append({'date': date, 'name': name, 'url': download_url})

    # Nothing collected so far: stop paging
    if len(result_list) == 0:
        break

# Newest first; this compares the date strings as text
result_list.sort(key=lambda k: k.get('date'), reverse=True)
print(result_list)

with open("aliso.html", encoding='utf-8') as t:
    template = string.Template(t.read())

final_output = template.substitute(elements=result_list)
with open("report.html", "w", encoding='utf-8') as output:
    output.write(final_output)