
Python Implementation of Mechanical Word Segmentation: Reverse Maximum Matching Code Example

Updated: 2017-12-13 14:39:19  Author: lalalawxt

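This article walks through a Python code example of the reverse maximum matching algorithm for mechanical (dictionary-based) word segmentation. It may be a useful reference for readers who need it.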

Reverse maximum matching

Every forward method has a reverse counterpart. For the forward maximum matching algorithm, see http://chabaoo.cn/article/127404.htm

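Reverse maximum matching is one of the basic algorithms for Chinese word segmentation. Because it is purely mechanical (dictionary lookup, with no statistical model), it is fast, and its output tends to agree with how people read Chinese better than forward maximum matching does. The method requires an existing dictionary. It scans the text from the end: each time it takes the last i characters as the candidate field, where i is the threshold chosen for the segmenter, that is, the maximum word length. If the candidate is not in the dictionary, it drops the first character of the candidate and tries again, until a match is found or only a single character remains, which is then emitted as a word on its own. The matched word is cut off the end of the text and the scan repeats on what is left. A larger threshold makes segmentation slower but lets longer dictionary words be matched, so accuracy generally improves, up to the length of the longest word in the dictionary; beyond that a larger threshold brings no further gain.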
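The scan described above can be sketched in a few lines. This is a minimal illustration, not the article's full program: the helper name `reverse_max_match`, the `max_len` parameter, and the toy dictionary are all assumptions made for the example.

```python
def reverse_max_match(sentence, dictionary, max_len):
    """Segment `sentence` by scanning from its end (hypothetical helper)."""
    words = []
    end = len(sentence)  # everything before `end` is still unsegmented
    while end > 0:
        # Try the longest candidate first, then drop its first character.
        for size in range(min(max_len, end), 0, -1):
            candidate = sentence[end - size:end]
            # A single character is always accepted as a fallback.
            if candidate in dictionary or size == 1:
                words.append(candidate)
                end -= size
                break
    words.reverse()  # words were collected from the end backwards
    return words


# Toy dictionary, invented for this example only.
toy_dictionary = {"研究", "研究生", "生命", "的", "起源"}
print(reverse_max_match("研究生命的起源", toy_dictionary, max_len=3))
# ['研究', '生命', '的', '起源']
```

Tracing the call: the scan first tries 的起源 (no match), then 起源 (match); next 生命的, 命的, then 的 (match); next 究生命, then 生命 (match); and finally 研究 (match). The longer word 研究生 is in the dictionary but is never tried, because 生 has already been consumed from the right.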

Python implementation of reverse maximum matching:

Sample of the text to be segmented:

Sample of the segmentation dictionary words.xlsx:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

'''
Word segmentation by reverse maximum matching; stop words are not removed.
'''
import codecs
import xlrd

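# Read the text to be segmented; readlines() returns a list of lines
def readfile(raw_file_path):
  # "ANSI" is a Windows-only codec alias (the system code page); on other
  # platforms, pass the file's actual encoding instead
  with codecs.open(raw_file_path,"r",encoding="ANSI") as f:
    raw_file=f.readlines()
    return raw_file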
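# Read the segmentation dictionary and return it as a list of words
def read_dic(dic_path):
  # Note: xlrd 2.0 and later only read .xls files, so opening words.xlsx
  # requires an older xlrd release
  excel = xlrd.open_workbook(dic_path)
  sheet = excel.sheets()[0]
  # Read the second column, skipping the first row (assumed to be a header)
  data_list = list(sheet.col_values(1))[1:]
  return data_list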
# Segment sentences by reverse maximum matching
def cut_words(raw_sentences,word_dic):
  word_cut=[]
  # A set makes each dictionary lookup constant time instead of a list scan
  word_set=set(word_dic)
  # The longest word in the dictionary sets the maximum candidate length
  max_length=max(len(word) for word in word_set)
  for sentence in raw_sentences:
    # strip() removes leading and trailing whitespace ('\n', '\r', '\t', ' ') so it is not segmented as text
    sentence=sentence.strip()
    # Number of characters in the sentence still to be segmented
    words_length = len(sentence)
    # Words cut from this sentence, collected from the end backwards
    cut_word_list=[]
    # Loop until the whole sentence has been segmented
    while words_length > 0:
      max_cut_length = min(words_length, max_length)
      for i in range(max_cut_length, 0, -1):
        # The slice covers indices words_length-i to words_length-1; the end index is exclusive, so it cannot run past the sentence
        new_word = sentence[words_length - i: words_length]
        # Accept a dictionary word, or fall back to a single character
        if new_word in word_set or i == 1:
          cut_word_list.append(new_word)
          words_length = words_length - i
          break
    # Matching ran from the end, so reverse to restore the original order
    cut_word_list.reverse()
    words="/".join(cut_word_list)
    # Strip any separator at the start of the line so that splitting the result back into a list later does not yield an empty string
    word_cut.append(words.lstrip("/"))
  return word_cut
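# Write the segmented text
def outfile(out_path,sentences):
  # Mode "a" appends to the existing file instead of overwriting it
  with codecs.open(out_path,"a","utf8") as f:
    for sentence in sentences:
      # Sentences were stripped in cut_words, so restore the line break
      f.write(sentence+"\n")
  print("well done!")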
def main():
  # Read the text to be segmented
  rawfile_path = r"逆向分詞文本.txt"
  raw_file=readfile(rawfile_path)
  # Read the segmentation dictionary
  wordfile_path = r"words.xlsx"
  words_dic = read_dic(wordfile_path)
  # Segment by reverse maximum matching
  content_cut = cut_words(raw_file,words_dic)
  # Write the result
  outfile_path = r"分詞結果.txt"
  outfile(outfile_path,content_cut)
if __name__=="__main__":
  main()
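
The introduction says that reverse matching tends to agree with natural reading better than forward matching. One way to see the difference is to run both directions over the same sentence. The sketch below is only an illustration under stated assumptions: `max_match`, its `reverse` flag, and the toy dictionary are invented for this example and are not part of the program above, and one sentence does not establish the general claim.

```python
def max_match(sentence, dictionary, max_len, reverse=False):
    """Maximum matching in either direction (hypothetical helper)."""
    words = []
    text = sentence
    while text:
        for size in range(min(max_len, len(text)), 0, -1):
            # Forward takes candidates from the start, reverse from the end.
            candidate = text[-size:] if reverse else text[:size]
            # Accept a dictionary word, or fall back to a single character.
            if candidate in dictionary or size == 1:
                words.append(candidate)
                text = text[:-size] if reverse else text[size:]
                break
    # Reverse matching collects words back to front, so restore the order.
    return words[::-1] if reverse else words


# Toy dictionary, invented for this example only.
toy_dictionary = {"的確", "確實", "實在", "在理"}
sentence = "他說的確實在理"

forward = max_match(sentence, toy_dictionary, max_len=2)
backward = max_match(sentence, toy_dictionary, max_len=2, reverse=True)
print("/".join(forward))   # 他/說/的確/實在/理
print("/".join(backward))  # 他/說/的/確實/在理
```

For this sentence the forward scan commits to 的確 early and leaves 理 stranded, while the reverse scan recovers 確實 and 在理, which is the reading a person would choose.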