Analyzing and identifying high-frequency words and keywords in articles/content with Python

Updated: 2023-09-02 09:48:27  Author: 青Cheng序員石頭

To find the high-frequency words and keywords of an article, you can use Python's nltk library together with the collections module, or the jieba library. This article shows how to do the analysis with each of the two approaches.

The nltk and collections libraries

First, install the nltk library. collections is part of the Python standard library, so it needs no installation and should not be installed with pip. Use the following command:

pip install nltk

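Next, download the stopwords and punkt data for nltk with the code below. Recent NLTK releases may ask for an additional resource; if word_tokenize() raises a LookupError, download the resource named in the error message.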

import nltk
nltk.download('stopwords')
nltk.download('punkt')

Once the download finishes, the following code reads the article and analyzes it:

import collections
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Read the article
with open('article.txt', 'r', encoding='utf-8') as f:
    article = f.read()
# Tokenize
tokens = word_tokenize(article)
# Remove stop words and non-alphabetic tokens (punctuation, numbers), normalizing case
stop_words = set(stopwords.words('english'))
filtered_tokens = [token.lower() for token in tokens
                   if token.isalpha() and token.lower() not in stop_words]
# Count word frequencies
word_freq = collections.Counter(filtered_tokens)
# Print the high-frequency words
print('Top 10 frequent words:')
for word, freq in word_freq.most_common(10):
    print(f'{word}: {freq}')
# Extract keywords (the FreqDist keys are simply the distinct tokens, unranked)
keywords = nltk.FreqDist(filtered_tokens).keys()
# Print the keywords
print('Keywords:')
for keyword in keywords:
    print(keyword)

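The code above first reads the article with open(), then splits it into tokens with word_tokenize(). It removes stop words using the stopwords data set, counts word frequencies with collections.Counter(), and prints the most frequent words. Finally, it uses nltk.FreqDist() to list the keywords. Note that the keys of a FreqDist are just the distinct remaining tokens, so this last step prints a vocabulary list rather than a weighted ranking.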
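The flow described above (tokenize, drop stop words, count) can be tried without downloading any NLTK data. The following is only a minimal sketch under stated assumptions: the regex tokenizer is a crude stand-in for word_tokenize(), and TOY_STOP_WORDS and top_words are hypothetical names with a deliberately tiny stop-word list, not part of nltk.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; NLTK's English list is much longer.
TOY_STOP_WORDS = {"the", "a", "is", "and", "of", "on", "in", "to"}


def top_words(text, n=3, stop_words=TOY_STOP_WORDS):
    """Return the n most frequent non-stop-words in text as (word, count) pairs."""
    # Crude tokenizer: runs of ASCII letters, lowercased (a stand-in for word_tokenize).
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop stop words before counting, as in the article's pipeline.
    filtered = [t for t in tokens if t not in stop_words]
    return Counter(filtered).most_common(n)


sample = "The cat sat on the mat. The cat is a happy cat, and the mat is warm."
print(top_words(sample, n=2))  # [('cat', 3), ('mat', 2)]
```

Because the stop words are removed before counting, frequent function words such as "the" never reach the Counter, which is what keeps them out of the top of the list.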

Note that article.txt in the code above must be replaced with the actual path of the article file.
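Since the file path is the part most likely to go wrong, it can help to wrap the read in a small helper that fails with a clear message. This is a sketch only; read_article is a hypothetical helper name, and the demo writes a throwaway file instead of assuming that any real article exists.

```python
import tempfile
from pathlib import Path


def read_article(path):
    """Read a UTF-8 text file and return its contents as a string."""
    file_path = Path(path)
    if not file_path.is_file():
        # Fail early with the offending path in the message.
        raise FileNotFoundError(f"Article file not found: {file_path}")
    return file_path.read_text(encoding="utf-8")


# Demo with a temporary file, so the example does not depend on a real article.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "article.txt"
    demo_path.write_text("hello world", encoding="utf-8")
    print(len(read_article(demo_path)))  # 11
```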

Implementation with the jieba library

# Import the required libraries
import jieba
import jieba.analyse
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Read the article
with open('./data/2.txt', 'r', encoding='utf-8') as f:
    article = f.read()
# Tokenize; skip whitespace, punctuation and other single-character tokens
words = [w for w in jieba.cut(article) if len(w.strip()) > 1]
# Count word frequencies
word_counts = Counter(words)
# Print the high-frequency words
print('High-frequency words:')
for word, count in word_counts.most_common(10):
    print(word, count)
# Print the keywords (TF-IDF weights; nouns, person names and place names only)
print('Keywords:')
keywords = jieba.analyse.extract_tags(article, topK=10, withWeight=True, allowPOS=('n', 'nr', 'ns'))
for keyword, weight in keywords:
    print(keyword, weight)
# Build the word cloud from the counted frequencies; font_path must point to a font with Chinese glyphs
wordcloud = WordCloud(font_path='msyh.ttc', background_color='white', width=800, height=600).generate_from_frequencies(word_counts)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

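Import jieba: the jieba library must be imported before its word segmentation functions can be used.
Read the article: open the file with Python's built-in open() function and read its contents with the read() method.
Tokenize: jieba's cut() method segments the article and returns a generator; iterating over it with a for loop yields the words one by one.
Count word frequencies: the Counter class from the standard-library collections module counts how many times each segmented word occurs.
Print the high-frequency words: based on the counts, print the words that occur most often; these are the high-frequency words.
Print the keywords: extract_tags() from jieba's analyse module weights each word with the TF-IDF algorithm and returns the words with the highest weights, which are the keywords. The allowPOS argument restricts the candidates to nouns, person names and place names.
Generate the word cloud: the wordcloud library draws a word cloud from the words of the article; the more frequent a word is, the larger it appears.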
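To make the TF-IDF weighting mentioned above more concrete, here is a minimal sketch that scores the words of one document against a toy corpus. It is an illustration under stated assumptions, not jieba's implementation: as far as I know, jieba takes its IDF values from a bundled table instead of computing them from your own documents, and tfidf_keywords, the smoothing formula and the sample documents are all hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus, top_k=3):
    """Rank the words of one tokenized document by TF-IDF against a small corpus.

    corpus is a list of tokenized documents (lists of words).
    """
    tf = Counter(doc_tokens)
    total = len(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        # Document frequency: how many corpus documents contain the word.
        df = sum(1 for doc in corpus if word in doc)
        # Smoothed IDF (an assumption of this sketch); it is 0 for a word found in every document.
        idf = math.log((n_docs + 1) / (df + 1))
        scores[word] = (count / total) * idf
    # Highest score first; sorted() is stable, so ties keep first-seen order.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Toy corpus of already-segmented documents.
docs = [
    ["python", "jieba", "python", "tutorial"],
    ["tutorial", "weather", "news"],
    ["tutorial", "football", "news"],
]
for word, weight in tfidf_keywords(docs[0], docs, top_k=2):
    print(word, round(weight, 3))
```

In this toy corpus, "tutorial" appears in every document, so its weight is zero even though it is as frequent in the first document as "jieba". That is the difference between a keyword and a merely frequent word.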
