
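A Small Python Tool for Collecting Article Statistics

Updated: 2023-02-14 14:08:55 Author: KoiC

Timely statistics make it easier to analyze what readers want from your content, to gauge the value of each article, and to get an indirect view of your own ability as a knowledge creator. This article builds a small article-statistics tool in Python; I hope it is useful to you.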

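Preface

Cnblogs (博客园) shows a simple set of blog statistics on each user's home page; the home page of the official cnblogs blog is one example.

These figures, however, are not enough to analyze anything more detailed.

At first I wanted to use cnblogs as a cloud notebook for my own study, but as I kept writing, I gradually came to see it as a platform for creating and sharing knowledge.

So since the New Year holiday I have wanted to build a tool similar to the article statistics that CSDN provides.

Statistics like these make it easier to analyze what readers want from the content, to gauge the value of each article, and to get an indirect view of my own ability as a knowledge creator.

That is enough preamble; let's get to the point!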

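Program

I wrote this program on a whim last night after reading another blogger's article, "Python爬虫实战-统计博客园阅读量问题" (a hands-on Python crawler for counting cnblogs page views), and I extended and modified that author's code. Because I want to present the article data more intuitively, I split the program into several modules so that features are easier to add and change later.

The program currently consists of just three .py files. It crawls the data, parses it, and writes the result to a txt file (I plan to switch to a more standard persistence method later).

Main program main.py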

from spider import spider
from store import write_data


# Set the blog name. For example, my blog is at https://www.cnblogs.com/KoiC, so I enter KoiC here
blog_name = 'KoiC'


if __name__ == '__main__':
    post_info = spider(blog_name)
    # print(post_info)
    write_data(post_info, blog_name)
    print('Done!')

Crawler module spider.py

import re
import time

import requests
from lxml import etree


def spider(blog_name):
    """
        Crawl the title and view count of every post on the blog
    """

    # Set the UA and the target blog URL
    headers = {
        "User-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.41"
    }
    url = "https://www.cnblogs.com/" + blog_name + "/default.html?page=%d"

    # Test request: fill in a page number and pass headers by keyword
    # (the second positional argument of requests.get is params, not headers)
    req = requests.get(url % 1, headers=headers)
    print('Test request status: %d' % req.status_code)

    print('Start crawling data...')

    post_info = []  # info for all posts

    # Crawl page by page
    for page_num in range(1, 999):

        # Build the URL of the current page
        new_url = url % page_num

        # Fetch the page
        req = requests.get(url=new_url, headers=headers)
        # print(req.status_code)

        tree = etree.HTML(req.text)

        # Extract the target data (title and view count of each post)
        count_list = tree.xpath('//div[@class="forFlow"]/div/div[@class="postDesc"]/span[1]/text()')
        title_list = tree.xpath('//div[@class="postTitle"]/a/span/text()')

        # Number of posts on this page
        post_count = len(count_list)
        # If the page has no posts, leave the loop
        if post_count == 0:
            break

        # Parse the target data
        for i in range(post_count):
            post_title = title_list[i].strip()  # strip surrounding spaces and newlines
            post_view_count = re.findall(r'\d+', count_list[i])  # regex extracts the view count

            single_post_info = [post_title, post_view_count[0]]  # data for a single post

            post_info.append(single_post_info)

        # Pause between pages to avoid hitting the site too quickly
        time.sleep(0.8)

    return post_info
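
The view-count step in spider.py indexes the first regex match directly, which would raise an IndexError if a post's text contained no digits. A slightly more defensive parser might look like the sketch below. The helper name `parse_view_count` and the sample strings are assumptions made for illustration; they are not part of the original program, and the exact text cnblogs returns may differ.

```python
import re


def parse_view_count(text):
    """Return the first integer found in text, or 0 if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else 0


# Hypothetical sample strings, similar in shape to a post's view-count text
print(parse_view_count('阅读(123)'))
print(parse_view_count('no digits here'))
```

Returning 0 for a missing number is only one possible choice; skipping the post or logging a warning would work equally well.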

Persistence module store.py

import os
import time


def write_data(post_info, blog_name):
    """
        Persist the data
    """

    print('Start writing data...')

    # Get the current time
    now_time = time.localtime(time.time())
    select_date = time.strftime('%Y-%m-%d', now_time)
    select_time = time.strftime('%Y-%m-%d %H:%M:%S ', now_time)

    # Build the directory path from the date
    file_path = './{:s}/{:s}'.format(str(now_time.tm_year), str(now_time.tm_mon))

    # exist_ok=True keeps makedirs from raising when the directory already exists
    os.makedirs(file_path, exist_ok=True)

    # Write the data
    try:
        # The with statement closes the file even if a write fails
        with open('{:s}/{:s}.txt'.format(file_path, select_date), 'a+', encoding='utf-8') as fp:
            fp.write('Views\t\t Post title\n')

            view_count = 0  # total views
            for single_post_info in post_info:
                view_count += int(single_post_info[1])
                fp.write('{:<12s}{:s}\n'.format(single_post_info[1], single_post_info[0]))

            fp.write('------Blog: {:s} Posts: {:d} Total views: {:d} Time: {:s}\n\n'.format(blog_name, len(post_info), view_count, select_time))
    except FileNotFoundError:
        print('Unable to open the specified file')
    except LookupError:
        print('The specified encoding is invalid')
    except UnicodeEncodeError:
        # The file is only written, never read, so an encode error is the relevant case
        print('Encoding error while writing the file')
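
The author mentions that a more standard persistence method is planned. One possible direction, sketched here on the assumption that a CSV file is acceptable, uses the standard `csv` module. The function name `write_csv`, the column names, and the sample rows are all hypothetical and do not come from the original program.

```python
import csv
import os
import tempfile


def write_csv(post_info, csv_path):
    """Write [title, view_count] pairs to csv_path and return the total views."""
    total = 0
    # newline='' lets the csv module control line endings itself
    with open(csv_path, 'w', newline='', encoding='utf-8') as fp:
        writer = csv.writer(fp)
        writer.writerow(['title', 'views'])
        for title, views in post_info:
            writer.writerow([title, int(views)])
            total += int(views)
    return total


# Made-up rows in the same shape that spider() returns
sample = [['First post', '120'], ['Second, with a comma', '35']]
path = os.path.join(tempfile.mkdtemp(), 'snapshot.csv')
print(write_csv(sample, path))
```

Compared with the fixed-width txt layout, a CSV file quotes titles that contain commas and can be loaded directly by spreadsheet or data-analysis tools.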

Results

The program creates a folder for the year, with a subfolder for the month, under the directory it is run from.
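
The folder layout that store.py produces can be sketched as follows. This is an illustrative helper rather than part of the original program: `snapshot_path` is a hypothetical name, and the month folder is not zero-padded because store.py builds it with `str(now_time.tm_mon)`.

```python
import time
from pathlib import Path


def snapshot_path(base_dir, when):
    """Build <base_dir>/<year>/<month>/<YYYY-MM-DD>.txt for a time.struct_time."""
    file_name = time.strftime('%Y-%m-%d', when) + '.txt'
    return Path(base_dir) / str(when.tm_year) / str(when.tm_mon) / file_name


# A fixed date, chosen only for illustration
when = time.strptime('2023-02-14', '%Y-%m-%d')
print(snapshot_path('.', when).as_posix())
```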

Inside the month folder is a txt file named after the date, which holds the collected statistics; I used my own blog as the example.

You can deploy the program on a server and run it on a schedule to track how the view counts grow over time.
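
Comparing two scheduled runs could look something like the sketch below. It assumes each snapshot is a list of [title, view_count] pairs, as returned by `spider()`, and that titles are unique within a blog. The function name `view_growth` and the sample data are hypothetical.

```python
def view_growth(old_snapshot, new_snapshot):
    """Map each title in new_snapshot to its view increase since old_snapshot."""
    old_views = {title: int(views) for title, views in old_snapshot}
    # A post missing from the old snapshot is treated as having started at 0 views
    return {
        title: int(views) - old_views.get(title, 0)
        for title, views in new_snapshot
    }


# Made-up snapshots from two consecutive runs
yesterday = [['Post A', '100'], ['Post B', '40']]
today = [['Post A', '130'], ['Post B', '40'], ['Post C', '7']]
print(view_growth(yesterday, today))
```

Keying on the title is the simplest option given the data the crawler collects; if titles are edited between runs, a post URL would be a more stable key.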
