
Chinese Word Segmentation and Word-Frequency Counting in Python: Implementation Steps

Updated: 2022-06-11 15:05:33  Author: 愛吃糖的月妖妖

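Word-frequency counting means taking a sentence or an article as input and counting how many times each word occurs in it. This article covers Chinese word segmentation and word-frequency counting in Python; readers who need it can use it as a reference.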

Preface

This article records some of the steps, with code, that I used when processing text in Python.

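I. Importing the text

I prepared a text file named abstract.txt.

Next, I downloaded stopword.txt from the internet (the stop-word list used with jieba segmentation).

I also added some words to it myself that I considered unhelpful.

In addition, I built my own dictionary, extraDict.txt.

With the preparation done, let's see how to use these files!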

II. Steps

1. Import the libraries

The code is as follows:

import jieba
from jieba.analyse import extract_tags
from sklearn.feature_extraction.text import TfidfVectorizer

2. Load the data

The code is as follows:

jieba.load_userdict('extraDict.txt')  # load the custom dictionary
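
For reference, a jieba user dictionary is a UTF-8 text file with one entry per line: the word, then an optional frequency, then an optional part-of-speech tag, separated by spaces. The entries below are made up for illustration and are not the contents of the author's extraDict.txt:

```text
詞頻統計 5 n
停用詞 3
結巴分詞
```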

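3. Load the stop-word list

def stopwordlist():
    # Read the stop words, one per line. The file name must match the stop-word
    # file you downloaded (the text above calls it stopword.txt).
    with open('chinesestopwords.txt', encoding='UTF-8') as f:
        stopwords = [line.strip() for line in f]
    # --- Extra stop words; adjust to your own data ---
    # Here the numbers 10 through 28 are added as strings.
    for i in range(19):
        stopwords.append(str(10 + i))
    # ----------------------

    return stopwords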
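
As a variation on the function above, the sketch below keeps the stop words in a set, which makes the later `word not in ...` checks faster than scanning a list. The names `load_stopwords`, `extra`, and the throwaway demo file are hypothetical and only serve the illustration; the real stop-word file is not reproduced here.

```python
import os
import tempfile


def load_stopwords(path, extra=()):
    """Return the stop words in `path` (one per line) plus `extra`, as a set."""
    with open(path, encoding="utf-8") as f:
        # Skip blank lines so the empty string never ends up in the set.
        words = {line.strip() for line in f if line.strip()}
    words.update(extra)
    return words


# Demo with a temporary file standing in for the real stop-word list.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "stopwords_demo.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("的\n了\n\n是\n")
    # Same supplement as in the article: the numbers 10 through 28 as strings.
    stopwords = load_stopwords(demo_path, extra=(str(n) for n in range(10, 29)))

print(len(stopwords))  # 3 words from the file plus 19 numbers
print("的" in stopwords, "28" in stopwords, "29" in stopwords)
```

Whether a set or a list is used does not change the result of the membership test, only its cost on long stop-word lists.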

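4. Segment the text and remove stop words (at this point, word frequencies can be counted directly with Python's built-in functions)

def seg_word(line):
    # seg = jieba.cut_for_search(line.strip())  # alternative: search-engine mode
    seg = jieba.cut(line.strip())
    temp = ""
    counts = {}
    wordstop = stopwordlist()  # note: the stop-word file is re-read on every call
    for word in seg:
        if word not in wordstop:
            if word != ' ':
                temp += word
                temp += '\n'
                counts[word] = counts.get(word, 0) + 1  # count how often each word occurs
    return temp  # return the segmented words, one per line
    # return str(sorted(counts.items(), key=lambda x: x[1], reverse=True)[:20])  # alternative: the 20 most frequent words and their counts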
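
In the function above, `counts` is rebuilt for every line and then discarded. A minimal sketch of one way to accumulate frequencies over a whole file is shown below, using `collections.Counter`. The function name `count_words` and the sample data are hypothetical. To keep the example runnable without jieba, the segmenter is passed in as the parameter `cut`, and `str.split` stands in for it on text that is already separated by spaces; with jieba installed, `jieba.lcut` could be passed instead.

```python
from collections import Counter


def count_words(lines, cut, stopwords):
    """Count non-stop-word tokens across all lines.

    `cut` is any callable that turns a string into an iterable of tokens.
    """
    counts = Counter()
    for line in lines:
        for token in cut(line.strip()):
            token = token.strip()
            # Drop empty tokens and stop words.
            if token and token not in stopwords:
                counts[token] += 1
    return counts


# Stand-in data: the tokens are separated by spaces so that str.split
# can play the role of the segmenter.
lines = ["文本 處理 的 過程\n", "文本 分詞 的 代碼\n"]
stopwords = {"的"}
counts = count_words(lines, str.split, stopwords)
print(counts.most_common(1))  # [('文本', 2)]
```

Because the counter lives outside the per-line loop, `counts.most_common(20)` would give the 20 most frequent words of the whole input rather than of a single line.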

5. Write the retained words (segmented, with stop words removed) to a txt file

def output(inputfilename, outputfilename):
    # The with statement closes both files even if an error occurs.
    with open(inputfilename, encoding='UTF-8', mode='r') as inputfile, \
            open(outputfilename, encoding='UTF-8', mode='w') as outputfile:
        for line in inputfile:
            line_seg = seg_word(line)
            outputfile.write(line_seg)
    return outputfilename  # path of the file that was written

6. Call the functions

if __name__ == '__main__':
    print("__name__", __name__)
    inputfilename = 'abstract.txt'
    outputfilename = 'a1.txt'
    output(inputfilename, outputfilename)

7. Result

After the script runs, a1.txt contains the retained words, one per line.

Appendix: given a passage, count how many times each word appears

First, the idea.

Take the following passage as an example:

Love is more than a word
it says so much.
When I see these four letters,
I almost feel your touch.
This is only happened since
I fell in love with you.
Why this word does this,
I haven’t got a clue.

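To count how many times each word appears, the idea is simple: split the string into words and walk through them once, with an empty dictionary count_dict ready to hold the totals. For each word, check whether it is already a key in count_dict. If it is not, add it as a key with the value 1; if it is, add 1 to the value stored under that key.

Now the code: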

# Define the string. Triple quotes are handy when a string spans several lines.
sentences = """
Love is more than a word
it says so much.
When I see these four letters,
I almost feel your touch.
This is only happened since
I fell in love with you.
Why this word does this,
I haven't got a clue.
"""
# Implementation
# Remove the commas. To strip many different symbols, use a loop; two calls are enough here.
sentences = sentences.replace(',', '')
sentences = sentences.replace('.', '')  # remove the periods
words = sentences.split()               # split into single words; the result is a list
# print(words)
# Note: counting is case-sensitive, so "Love" and "love" are separate keys.
count_dict = {}
for word in words:
    if word not in count_dict:    # first time this word is seen
        count_dict[word] = 1
    else:                         # the word is already in the dictionary
        count_dict[word] += 1
for key, value in count_dict.items():
    print(f"{key} appears {value} times")

Running the script prints one line per distinct word, together with its count.
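
The same idea can be written more compactly with the standard library. The sketch below is an alternative, not the author's method: it removes all ASCII punctuation except the apostrophe in one pass with `str.translate`, lowercases the text so that "Love" and "love" are counted together, and lets `collections.Counter` do the tallying. The function name `word_frequencies` is hypothetical.

```python
import string
from collections import Counter


def word_frequencies(text):
    """Count words case-insensitively, ignoring ASCII punctuation except apostrophes."""
    # Build a translation table that deletes every ASCII punctuation mark
    # except the apostrophe, so that "haven't" stays one word.
    to_delete = string.punctuation.replace("'", "")
    cleaned = text.translate(str.maketrans("", "", to_delete))
    return Counter(cleaned.lower().split())


poem = """
Love is more than a word
it says so much.
When I see these four letters,
I almost feel your touch.
This is only happened since
I fell in love with you.
Why this word does this,
I haven't got a clue.
"""

freq = word_frequencies(poem)
for word, n in freq.most_common(3):
    print(f"{word} appears {n} times")
```

Lowercasing is a design choice: it merges capitalized and lowercase forms, which is usually wanted for frequency counts but loses the distinction if case matters for the task.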

Summary

That is all for today. This article gave only a brief introduction to Chinese word segmentation and word-frequency counting in Python.
