Simple Natural Language Processing with Python and spaCy

Updated: 2022-06-15 08:58:54  Author: lsvih

This article walks through simple natural language processing with Python and spaCy, as a reference for readers who need one.

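Introduction

Natural language processing (NLP) is one of the most important parts of artificial intelligence. It plays a key role in many intelligent applications, such as chatbots, content extraction, multilingual translation, and opinion mining. Companies working on NLP have recognized that, when processing unstructured text data, accuracy is not the only concern: getting the desired result quickly matters too.

NLP is a broad field that covers subfields such as text classification, entity recognition, machine translation, question answering, and concept identification. In a recent article I discussed many of the tools and components used to implement NLP. That article mostly described NLTK (Natural Language Toolkit), a great library.

In this article I will introduce spaCy, currently one of the most powerful and advanced NLP libraries for Python.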

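1. Introduction to spaCy and installation

1.1 Introduction

spaCy is written in Cython (a C extension of Python, designed to give Python programs C-like performance), so it runs very efficiently. spaCy offers a concise API and builds its internals on pre-trained machine learning and deep learning models.

1.2 Installation

spaCy, together with its data and models, can be installed easily with pip and spaCy's own download tool. Use the following command to install spaCy: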

sudo pip install spacy

If you are using Python 3, use "pip3" instead of "pip".

Alternatively, you can download the source code, unpack it, and run the following command to install it:

python setup.py install

After installing spaCy, run the following command to download the English model and its data:

python -m spacy download en_core_web_sm

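2. spaCy pipeline and properties

spaCy and its various properties are used by creating a pipeline. The pipeline is created when the model is loaded. The spaCy package provides a variety of modules that contain information for language processing, such as vocabulary, trained vectors, syntax, and entities.

Below we load the default English model (the english-core-web model).

import spacy
nlp = spacy.load("en_core_web_sm")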

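The "nlp" object is used to create documents, obtain linguistic annotations, and access other NLP properties. First we create a document by loading text data into the pipeline. I use hotel review data from TripAdvisor. The data file is available for download.

# filename is the path to the downloaded review file
with open(filename, encoding='utf8') as f:
    document = nlp(f.read())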

The document is now an object produced by the spaCy English model, and it carries many attributes. You can list all the attributes of a document (or a token) with the following command:

dir(document)
>> [ 'doc', 'ents', … 'mem']

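It prints the document's many attributes, for example: tokens, token indices, part-of-speech tags, entities, vectors, sentiment, and words. Let's explore some of these attributes.

2.1 Tokenization

A processed spaCy document can be split into sentences, and each sentence can be split further into words (tokens). You can read the tokens by indexing or iterating over the document: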

# First token of the document
document[0]
>> Nice
# Fifth token from the end of the document
document[len(document)-5]
>> boston
# List the sentences in the document
list(document.sents)
>> [ Nice place Better than some reviews give it credit for.,
 Overall, the rooms were a bit small but nice.,
...
Everything was clean, the view was wonderful and it is very well located (the Prudential Center makes shopping and eating easy and the T is nearby for jaunts out and about the city).]

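2.2 Part-of-speech tagging (POS tags)

Part-of-speech tagging labels each word in a grammatically correct sentence with its part of speech. These tags can be used for information filtering, in statistical models, or for rule-based text parsing.

Let's look at all the POS tags in our document: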

# Collect all coarse POS tags (numeric ID -> name)
all_tags = {w.pos: w.pos_ for w in document}
>> {97: 'SYM', 98: 'VERB', 99: 'X', 101: 'SPACE', 82: 'ADJ', 83: 'ADP', 84: 'ADV', 87: 'CCONJ', 88: 'DET', 89: 'INTJ', 90: 'NOUN', 91: 'NUM', 92: 'PART', 93: 'PRON', 94: 'PROPN', 95: 'PUNCT'}
# Fine-grained tags for the first sentence of the document
for word in list(document.sents)[0]:
    print(word, word.tag_)
>> (Nice, 'JJ') (place, 'NN') (Better, 'NNP') (than, 'IN') (some, 'DT') (reviews, 'NNS') (give, 'VBP') (it, 'PRP') (credit, 'NN') (for, 'IN') (., '.')

Let's look at the most common words in the document. I have written some preprocessing and text-cleaning functions beforehand.

# Parameter definitions
noisy_pos_tags = ["PROP"]
min_token_length = 2
# Check whether a token is noise
def isNoise(token):
    is_noise = False
    if token.pos_ in noisy_pos_tags:
        is_noise = True
    elif token.is_stop:
        is_noise = True
    elif len(token.text) <= min_token_length:
        is_noise = True
    return is_noise
# Normalize a token's text: optionally lowercase it, then strip whitespace
def cleanup(token, lower=True):
    if lower:
        token = token.lower()
    return token.strip()
# Most common words in the reviews
from collections import Counter
cleaned_list = [cleanup(word.text) for word in document if not isNoise(word)]
Counter(cleaned_list).most_common(5)
>> [('hotel', 683), ('room', 652), ('great', 300), ('sheraton', 285), ('location', 271)]
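
To see how this kind of filter behaves without loading a model, here is a minimal sketch that runs on stand-in tokens. `FakeToken`, the `PUNCT` tag set, and the sample tokens are assumptions made up for illustration; only the three fields the filter reads are modeled, and nothing here calls spaCy.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a spaCy token: only the fields the filter reads.
FakeToken = namedtuple("FakeToken", ["text", "pos_", "is_stop"])

NOISY_POS_TAGS = {"PUNCT"}  # assumed tag set for this sketch
MIN_TOKEN_LENGTH = 2


def is_noise(token):
    """Return True for tokens that should not be counted."""
    return (
        token.pos_ in NOISY_POS_TAGS
        or token.is_stop
        or len(token.text) <= MIN_TOKEN_LENGTH
    )


def top_words(tokens, n=5):
    """Count the lowercased text of every non-noise token."""
    cleaned = [t.text.lower().strip() for t in tokens if not is_noise(t)]
    return Counter(cleaned).most_common(n)


# Made-up sample tokens, not taken from the review data.
tokens = [
    FakeToken("The", "DET", True),
    FakeToken("hotel", "NOUN", False),
    FakeToken("was", "AUX", True),
    FakeToken("great", "ADJ", False),
    FakeToken(",", "PUNCT", False),
    FakeToken("great", "ADJ", False),
    FakeToken("Hotel", "NOUN", False),
    FakeToken("TV", "NOUN", False),  # dropped: length <= MIN_TOKEN_LENGTH
]
print(top_words(tokens))
```

Note that the length check also discards short but meaningful tokens such as "TV", so the threshold is worth tuning for your own data.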

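2.3 Entity recognition

spaCy includes a fast entity recognition model that finds entity phrases in a document. It recognizes various types of entities, such as person names, locations, organizations, dates, and numbers. You can read these entities through the ".ents" attribute.

Let's get all the named entities in our document, grouped by type: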

# Every entity label that occurs in the document
labels = set([w.label_ for w in document.ents])
for label in labels:
    # Unique entity texts that carry this label
    entities = [cleanup(e.text, lower=False) for e in document.ents if label==e.label_]
    entities = list(set(entities))
    print(label, entities)
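
The loop above scans the entity list once per label. As an alternative, the same grouping can be done in a single pass with a dictionary. This is only a sketch: the (text, label) pairs below are hypothetical stand-ins for (e.text, e.label_), and the labels are illustrative rather than real model output.

```python
from collections import defaultdict

# Hypothetical (text, label) pairs standing in for (e.text, e.label_)
# from document.ents; the labels are illustrative, not model output.
ents = [
    ("Boston", "GPE"),
    ("Sheraton", "ORG"),
    ("Boston ", "GPE"),  # duplicate with trailing whitespace
    ("Prudential Center", "FAC"),
]


def group_entities(pairs):
    """Map each label to the sorted, de-duplicated entity texts that carry it."""
    grouped = defaultdict(set)
    for text, label in pairs:
        grouped[label].add(text.strip())
    return {label: sorted(texts) for label, texts in grouped.items()}


for label, entities in group_entities(ents).items():
    print(label, entities)
```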

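2.4 Dependency parsing

One of spaCy's most powerful features is fast and accurate dependency parsing through a lightweight API. The parser can also be used for sentence boundary detection and phrase chunking. Dependency relations can be read through attributes such as ".children", ".root", and ".ancestors".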

# Pick out every sentence that contains the word "hotel"
hotel = [sent for sent in document.sents if 'hotel' in sent.text.lower()]
# Print each token of one such sentence with its children in the dependency tree
sentence = hotel[2]
for word in sentence:
    print(word, ': ', str(list(word.children)))
>> A :  []
cab :  [A, from]
from :  [airport, to]
the :  []
airport :  [the]
to :  [hotel]
the :  []
hotel :  [the]
can :  []
be :  [cab, can, cheaper, .]
cheaper :  [than]
than :  [shuttles]
the :  []
shuttles :  [the, depending]
depending :  [time]
what :  []
time :  [what, of]
of :  [day]
the :  []
day :  [the, go]
you :  []
go :  [you]
. :  []
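
A dependency tree can be stored as one head index per token, and the children and ancestors of a token follow from that array. The sketch below rebuilds the first part of the sentence above in plain Python. The head indices are my reading of the printed output, the root pointing to itself is an assumed convention, and the helper names are made up; it illustrates the data structure, not spaCy's internals.

```python
# First eleven tokens of the sentence above, with the index of each token's head.
# Head indices are inferred from the printed children lists; the root ("be")
# is assumed to point to itself.
words = ["A", "cab", "from", "the", "airport", "to", "the", "hotel", "can", "be", "cheaper"]
heads = [1, 9, 1, 4, 2, 2, 7, 5, 9, 9, 9]


def children_of(heads):
    """Invert the head array: for each token, list the indices of its children."""
    children = [[] for _ in heads]
    for i, head in enumerate(heads):
        if head != i:  # skip the root, which is its own head
            children[head].append(i)
    return children


def ancestors_of(i, heads):
    """Follow head links from token i up to the root."""
    chain = []
    while heads[i] != i:
        i = heads[i]
        chain.append(i)
    return chain


children = children_of(heads)
for i, word in enumerate(words):
    print(word, ": ", [words[c] for c in children[i]])
print("ancestors of hotel:", [words[a] for a in ancestors_of(7, heads)])
```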

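Next, let's parse the dependencies of every sentence that contains the word "hotel" and check which adjectives people use for the hotel. I wrote a custom function that walks the dependency relations and filters the children by part-of-speech tag.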

# Find the most common words with a given POS tag that modify a given word
from collections import Counter
def pos_words(doc, token, ptag):
    sentences = [sent for sent in doc.sents if token in sent.text]
    pwrds = []
    for sent in sentences:
        for word in sent:
            if token in word.text:
                pwrds.extend([child.text.strip() for child in word.children
                              if child.pos_ == ptag])
    return Counter(pwrds).most_common(10)
pos_words(document, 'hotel', "ADJ")
>> [('other', 20), ('great', 10), ('good', 7), ('better', 6), ('nice', 6), ('different', 5), ('many', 5), ('best', 4), ('my', 4), ('wonderful', 3)]

2.5 Noun phrases (NP)

The dependency tree can also be used to generate noun phrases:

# Generate noun phrases: text, dependency label of the root, and the root's head
doc = nlp('I love data science on analytics vidhya')
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.dep_, chunk.root.head.text)
>> I nsubj love
   data science dobj love
   analytics pobj on

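3. Integrated word vectors

spaCy ships with built-in word vectors that capture the meaning expressed by words. The vectors are generated with GloVe, an unsupervised learning algorithm for obtaining vector representations of words.

Let's create some word vectors and do something interesting with them: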

from numpy import dot
from numpy.linalg import norm
import spacy
# Word vectors need a model that ships them, for example en_core_web_md
nlp = spacy.load("en_core_web_md")
# Vocabulary entry (and vector) for "apple"
apple = nlp.vocab['apple']
# Cosine similarity between two vectors
cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
# Every other lowercase word that has a vector
words = (nlp.vocab.strings[key] for key in nlp.vocab.vectors.keys())
others = [nlp.vocab[w] for w in words if w.islower() and w != "apple"]
others = [w for w in others if w.has_vector]
# Sort by similarity, most similar first
others.sort(key=lambda w: cosine(w.vector, apple.vector), reverse=True)
print("top most similar words to apple:")
for word in others[:10]:
    print(word.orth_)
>> apples iphone fruit juice cherry lemon banana pie mac orange
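
The ranking above rests entirely on cosine similarity, which compares the direction of two vectors and ignores their length. Here is a small library-free sketch of that calculation. The three-dimensional vectors are made up for illustration and have nothing to do with real GloVe vectors, and `most_similar` is a hypothetical helper name.

```python
import math


def cosine(v1, v2):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


# Made-up toy vectors, chosen so the geometry is easy to check by hand.
vectors = {
    "apple": [1.0, 2.0, 0.0],
    "apples": [2.0, 4.0, 0.0],  # same direction as "apple", twice the length
    "banana": [1.0, 1.0, 1.0],
    "car": [0.0, 0.0, 3.0],     # orthogonal to "apple"
}


def most_similar(word, vectors, n=3):
    """Rank every other word by cosine similarity to `word`, best first."""
    target = vectors[word]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


for other, score in most_similar("apple", vectors):
    print(other, round(score, 3))
```

Because only direction matters, "apples" scores about 1.0 even though its vector is twice as long, while the orthogonal "car" scores 0.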

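4. Machine learning on text with spaCy

Integrating spaCy into a machine learning model is simple and direct. Let's build a custom text classifier with sklearn. We will create a sklearn pipeline from cleaner, tokenizer, vectorizer, and classifier components. The tokenizer and vectorizer are built on custom modules written with spaCy.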

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
import string
import spacy
punctuations = string.punctuation
# Load a full English pipeline so that lemmas are available
parser = spacy.load("en_core_web_sm")
# Custom transformer that cleans the text before it reaches the spaCy tokenizer
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        return [clean_text(text) for text in X]
    def fit(self, X, y=None, **fit_params):
        return self
    def get_params(self, deep=True):
        return {}
# Basic utility function for cleaning text
def clean_text(text):
    return text.strip().lower()

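Now let's use spaCy's parser and some basic data-cleaning functions to create a custom tokenizer function. It is worth mentioning that you can use word vectors instead of text features (with deep learning models this gives a considerable improvement).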

# Create the spaCy tokenizer: parse the sentence and generate tokens
# A word-vector function could be used here instead
def spacy_tokenizer(sentence):
    tokens = parser(sentence)
    # Use the lowercased lemma, except for pronouns marked with the "-PRON-" lemma
    tokens = [tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_ for tok in tokens]
    # Drop stop words and punctuation
    tokens = [tok for tok in tokens if (tok not in stopwords and tok not in punctuations)]
    return tokens
# Create the vectorizer, which builds feature vectors with the custom spaCy tokenizer
vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, ngram_range=(1,1))
classifier = LinearSVC()
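
For intuition about what a count vectorizer with a custom tokenizer produces, here is a rough bag-of-words sketch in plain Python. It is not sklearn's implementation: `simple_tokenizer` is a hypothetical stand-in for spacy_tokenizer (no lemmatization), the stop list is a tiny assumed one, and the alphabetical vocabulary order is a choice made for this sketch.

```python
from collections import Counter

STOPWORDS = {"this", "is", "an", "i", "the"}  # tiny assumed stop list


def simple_tokenizer(sentence):
    """Hypothetical stand-in for spacy_tokenizer: lowercase, split on
    whitespace, strip punctuation, and drop stop words."""
    tokens = [tok.strip(".,!?").lower() for tok in sentence.split()]
    return [tok for tok in tokens if tok and tok not in STOPWORDS]


def fit_vocabulary(sentences):
    """Assign a column index to every distinct token, in alphabetical order."""
    vocab = sorted({tok for s in sentences for tok in simple_tokenizer(s)})
    return {tok: idx for idx, tok in enumerate(vocab)}


def transform(sentences, vocabulary):
    """Turn each sentence into a row of token counts, one column per vocabulary entry."""
    rows = []
    for sentence in sentences:
        counts = Counter(simple_tokenizer(sentence))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


docs = ["I love this sandwich.", "this is an amazing place!"]
vocabulary = fit_vocabulary(docs)
print(vocabulary)
print(transform(docs, vocabulary))
```

Each row is the feature vector the classifier would see for one sentence; tokens outside the fitted vocabulary are simply ignored at transform time.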

Now we can create the pipeline, load the data, and run the classification model.

# Create the pipeline: clean the text, tokenize, vectorize, and classify
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', vectorizer),
                 ('classifier', classifier)])
# Load sample data
train = [('I love this sandwich.', 'pos'),          
         ('this is an amazing place!', 'pos'),
         ('I feel very good about these beers.', 'pos'),
         ('this is my best work.', 'pos'),
         ("what an awesome view", 'pos'),
         ('I do not like this restaurant', 'neg'),
         ('I am tired of this stuff.', 'neg'),
         ("I can't deal with this", 'neg'),
         ('he is my sworn enemy!', 'neg'),          
         ('my boss is horrible.', 'neg')]
test =   [('the beer was good.', 'pos'),     
         ('I do not enjoy my job', 'neg'),
         ("I ain't feelin dandy today.", 'neg'),
         ("I feel amazing!", 'pos'),
         ('Gary is a good friend of mine.', 'pos'),
         ("I can't believe I'm doing this.", 'neg')]
# Fit the model and compute the accuracy
pipe.fit([x[0] for x in train], [x[1] for x in train])
pred_data = pipe.predict([x[0] for x in test])
for (sample, pred) in zip(test, pred_data):
    print(sample, pred)
print("Accuracy:", accuracy_score([x[1] for x in test], pred_data))
>>    ('the beer was good.', 'pos') pos
      ('I do not enjoy my job', 'neg') neg
      ("I ain't feelin dandy today.", 'neg') neg
      ('I feel amazing!', 'pos') pos
      ('Gary is a good friend of mine.', 'pos') pos
      ("I can't believe I'm doing this.", 'neg') neg
      Accuracy: 1.0

5. Comparison with other libraries

spaCy is a powerful, industrial-strength NLP package that meets the needs of most NLP tasks. You may wonder: why is that?

Let's compare spaCy with two other well-known NLP tools, CoreNLP and NLTK.

Feature support

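Feature	spaCy	NLTK	CoreNLP
Easy installation	Y	Y	Y
Python API	Y	Y	N
Multi-language support	N	Y	Y
Tokenization	Y	Y	Y
Part-of-speech tagging	Y	Y	Y
Sentence segmentation	Y	Y	Y
Dependency parsing	Y	N	Y
Entity recognition	Y	Y	Y
Integrated word vectors	Y	N	N
Sentiment analysis	Y	Y	Y
Coreference resolution	N	N	Y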

Speed: main functions (tokenizer, tagging, parsing)

Library	Tokenizer	Tagging	Parsing
spaCy	0.2ms	1ms	19ms
CoreNLP	2ms	10ms	49ms
NLTK	4ms	443ms	–
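
To read the speed table as relative factors, the timings can be divided by spaCy's. The sketch below does only that arithmetic on the numbers from the table; the dictionary layout and the helper name `slowdown_vs` are my own, and the missing NLTK parsing entry is kept as None.

```python
# Timings in milliseconds, copied from the table above; None marks the missing entry.
timings_ms = {
    "spaCy": {"tokenizer": 0.2, "tagging": 1, "parsing": 19},
    "CoreNLP": {"tokenizer": 2, "tagging": 10, "parsing": 49},
    "NLTK": {"tokenizer": 4, "tagging": 443, "parsing": None},
}


def slowdown_vs(baseline, timings):
    """For each other library, divide its time by the baseline's, per task."""
    base = timings[baseline]
    result = {}
    for lib, row in timings.items():
        if lib == baseline:
            continue
        result[lib] = {
            task: None if t is None else round(t / base[task], 1)
            for task, t in row.items()
        }
    return result


for lib, factors in slowdown_vs("spaCy", timings_ms).items():
    print(lib, factors)
```

By these numbers the gap is widest for tagging and narrowest for parsing.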

Accuracy: entity extraction results

Library	Precision	Recall	F-Score
spaCy	0.72	0.65	0.69
CoreNLP	0.79	0.73	0.76
NLTK	0.51	0.65	0.58
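
Assuming the F-Score column is the usual F1 (the harmonic mean of precision and recall), it can be recomputed from the other two columns. This sketch uses only the numbers in the table. Small differences in the second decimal are consistent with the precision and recall values themselves having been rounded.

```python
def f_score(precision, recall):
    """F1: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) copied from the table above.
reported = {
    "spaCy": (0.72, 0.65, 0.69),
    "CoreNLP": (0.79, 0.73, 0.76),
    "NLTK": (0.51, 0.65, 0.58),
}

for lib, (p, r, f) in reported.items():
    print(f"{lib}: reported F={f:.2f}, recomputed F={f_score(p, r):.3f}")
```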

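Conclusion

This article discussed spaCy, a Python library built entirely for NLP. Through several use cases we demonstrated its usability, speed, and accuracy. Finally, we compared it with two other well-known NLP libraries, CoreNLP and NLTK.

If you truly understand what this article covers, you should be able to tackle all kinds of challenging text data and NLP problems.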
