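How to extract content keywords with Python
This article walks through a worked example of extracting keywords from content with Python, shared here for reference. The details are as follows:
This is a very efficient piece of Python code for extracting keywords from content. As written it only works on English text: Chinese has to be segmented into words first, which this code does not do. With a word-segmentation step added (and a part-of-speech tagger trained on Chinese rather than on the Brown corpus), the same approach should carry over.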
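To see why Chinese needs a separate segmentation step, here is a minimal sketch using only the standard library. `simple_tokenize` is a hypothetical stand-in for a real tokenizer (it is not `nltk.word_tokenize`), and the two sample sentences are made up for illustration. Splitting on whitespace and punctuation is enough to get word-level tokens for English, but a Chinese sentence has no spaces between words and comes back as one long token.

```python
import re

def simple_tokenize(text):
    """Hypothetical tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

english = "Swayy is a beautiful new dashboard."
chinese = "我喜欢自然语言处理"  # "I like natural language processing"

# English splits into separate words plus the final period.
print(simple_tokenize(english))

# Chinese has no whitespace between words, so the whole sentence is one token.
print(len(simple_tokenize(chinese)))
```

A tagger that works word by word cannot do anything useful with that single token, which is why a dedicated word segmenter would have to run before the tagging and merging steps.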
# coding=UTF-8
# Requires the NLTK Brown corpus and tokenizer data (fetch them once with nltk.download())
import nltk
from nltk.corpus import brown
# This is a fast and simple noun phrase extractor (based on NLTK)
# Feel free to use it, just keep a link back to this post
# http://thetokenizer.com/2013/05/09/efficient-way-to-extract-the-main-topics-of-a-sentence/
# Created by Shlomi Babluki
# May, 2013
# This is our fast Part of Speech tagger
#############################################################################
brown_train = brown.tagged_sents(categories='news')
regexp_tagger = nltk.RegexpTagger(
    [(r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),    # cardinal numbers (decimal point escaped)
     (r'(-|:|;)$', ':'),
     (r'\'*$', 'MD'),
     (r'(The|the|A|a|An|an)$', 'AT'),
     (r'.*able$', 'JJ'),
     (r'^[A-Z].*$', 'NNP'),
     (r'.*ness$', 'NN'),
     (r'.*ly$', 'RB'),
     (r'.*s$', 'NNS'),
     (r'.*ing$', 'VBG'),
     (r'.*ed$', 'VBD'),
     (r'.*', 'NN')                        # fallback: tag everything else as a noun
])
unigram_tagger = nltk.UnigramTagger(brown_train, backoff=regexp_tagger)
bigram_tagger = nltk.BigramTagger(brown_train, backoff=unigram_tagger)
#############################################################################
# This is our semi-CFG; Extend it according to your own needs
#############################################################################
cfg = {}
cfg["NNP+NNP"] = "NNP"
cfg["NN+NN"] = "NNI"
cfg["NNI+NN"] = "NNI"
cfg["JJ+JJ"] = "JJ"
cfg["JJ+NN"] = "NNI"
#############################################################################
class NPExtractor(object):

    def __init__(self, sentence):
        self.sentence = sentence

    # Split the sentence into single words/tokens
    def tokenize_sentence(self, sentence):
        tokens = nltk.word_tokenize(sentence)
        return tokens

    # Normalize brown corpus' tags ("NN", "NN-PL", "NNS" > "NN")
    def normalize_tags(self, tagged):
        n_tagged = []
        for t in tagged:
            if t[1] == "NP-TL" or t[1] == "NP":
                n_tagged.append((t[0], "NNP"))
                continue
            if t[1].endswith("-TL"):
                n_tagged.append((t[0], t[1][:-3]))
                continue
            if t[1].endswith("S"):
                n_tagged.append((t[0], t[1][:-1]))
                continue
            n_tagged.append((t[0], t[1]))
        return n_tagged

    # Extract the main topics from the sentence
    def extract(self):
        tokens = self.tokenize_sentence(self.sentence)
        tags = self.normalize_tags(bigram_tagger.tag(tokens))

        # Keep merging adjacent pairs until no rule in cfg applies
        merge = True
        while merge:
            merge = False
            for x in range(0, len(tags) - 1):
                t1 = tags[x]
                t2 = tags[x + 1]
                key = "%s+%s" % (t1[1], t2[1])
                value = cfg.get(key, '')
                if value:
                    merge = True
                    tags.pop(x)
                    tags.pop(x)
                    match = "%s %s" % (t1[0], t2[0])
                    pos = value
                    tags.insert(x, (match, pos))
                    break

        matches = []
        for t in tags:
            if t[1] == "NNP" or t[1] == "NNI":
            #if t[1] == "NNP" or t[1] == "NNI" or t[1] == "NN":
                matches.append(t[0])
        return matches
# Main method, just run "python np_extractor.py"
def main():
    sentence = "Swayy is a beautiful new dashboard for discovering and curating online content."
    np_extractor = NPExtractor(sentence)
    result = np_extractor.extract()
    print("This sentence is about: %s" % ", ".join(result))

if __name__ == '__main__':
    main()
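The merge loop in `extract` is easier to follow in isolation. The sketch below reuses the same five rules but runs on a hand-tagged sentence, so it needs neither NLTK nor the Brown corpus. The tags in `tagged` are written by hand as an assumption (a trained tagger may label some words differently), and `merge_tags` and `main_topics` are illustrative names that do not appear in the code above.

```python
# Same merge rules as the semi-CFG above.
CFG = {
    "NNP+NNP": "NNP",
    "NN+NN": "NNI",
    "NNI+NN": "NNI",
    "JJ+JJ": "JJ",
    "JJ+NN": "NNI",
}

def merge_tags(tags):
    """Repeatedly merge the first adjacent pair whose tags match a rule."""
    tags = list(tags)
    merged = True
    while merged:
        merged = False
        for i in range(len(tags) - 1):
            (w1, p1), (w2, p2) = tags[i], tags[i + 1]
            new_pos = CFG.get(p1 + "+" + p2)
            if new_pos:
                # Replace the pair with one combined token, then rescan from the start.
                tags[i:i + 2] = [(w1 + " " + w2, new_pos)]
                merged = True
                break
    return tags

def main_topics(tags):
    """Keep only proper nouns (NNP) and merged noun phrases (NNI)."""
    return [word for word, pos in merge_tags(tags) if pos in ("NNP", "NNI")]

# Hand-written tags (an assumption, not real tagger output).
tagged = [
    ("Swayy", "NNP"), ("is", "BEZ"), ("a", "AT"),
    ("beautiful", "JJ"), ("new", "JJ"), ("dashboard", "NN"),
    ("for", "IN"), ("discovering", "VBG"), ("and", "CC"),
    ("curating", "VBG"), ("online", "JJ"), ("content", "NN"), (".", "."),
]

print(main_topics(tagged))
# → ['Swayy', 'beautiful new dashboard', 'online content']
```

With these tags, "beautiful" and "new" first merge into one JJ, which then merges with "dashboard" into an NNI; "online content" becomes an NNI in a later pass. A lone NN that never merges is dropped, which matches the stricter of the two filters in `extract`.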
I hope this article is helpful for your Python programming.