A Detailed Guide to Calling Lucene through Its Java API in Spring Boot

Updated: 2017-11-09 08:32:43 Author: Peng Lei

This article explains how to call Lucene through its Java API in a Spring Boot project. It is shared here as a reference.

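Lucene is an open-source full-text search toolkit from the Apache Software Foundation; it began as a subproject of the Jakarta project. It is not a complete full-text search engine but a framework for building one: it provides a complete query engine and indexing engine, plus part of a text analysis engine (English and German, in the original description). Lucene's goal is to give developers an easy-to-use toolkit for adding full-text search to a target system, or for building a complete full-text search engine on top of it.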

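Full-text search overview

Suppose a folder or a disk holds many files (text files, Word documents, Excel sheets, PDFs) and we want to find the files that contain a given keyword. For example, searching for Lucene should return every file whose content contains Lucene. This is what full-text search means.

A natural approach is to build a mapping from keywords to the files that contain them. A figure borrowed from a slide deck shows how such a mapping is built.

Inverted index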

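With this mapping in mind, let's look at Lucene's architecture.

The diagram below appears in almost every introduction to Lucene, and it does capture the essentials.

As it shows, using Lucene comes down to two steps:

1 Build the index: IndexWriter indexes the various files and saves the result in the location where the index files are stored.

2 Look up the documents related to a keyword through the index.

Lucene implements the keyword-to-file mapping with exactly this "inverted index" technique.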
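The keyword-to-file mapping can be illustrated with a few lines of plain Java. This is only a toy sketch of the idea, not how Lucene stores its index: the class name InvertedIndexSketch, its methods, and the sample sentences are all made up for illustration, and splitting on non-alphanumeric characters stands in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index: maps each lower-cased word to the ids of the documents containing it. */
class InvertedIndexSketch {

    // term -> sorted set of document ids (the posting list for that term)
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    /** Splits the text on non-letter, non-digit characters and records docId under every word. */
    void add(int docId, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of the documents containing the word, in ascending order. */
    List<Integer> search(String word) {
        return new ArrayList<>(postings.getOrDefault(word.toLowerCase(Locale.ROOT), new TreeSet<>()));
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Lucene is a full-text search library");
        index.add(2, "Spring Boot starts applications quickly");
        index.add(3, "Calling Lucene from Spring Boot");

        System.out.println(index.search("lucene")); // [1, 3]
        System.out.println(index.search("spring")); // [2, 3]
        System.out.println(index.search("solr"));   // []
    }
}
```

A lookup never scans the documents themselves: it is a single map access on the term, which is why the index is built ahead of time.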

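Lucene's data model

Documents, fields, and tokens

A document is the atomic unit of indexing and searching in Lucene. A document is a container for one or more fields, and the fields hold the "real" content to be searched. A field's value is run through tokenization, which yields multiple tokens.

For example, the record for a novel (斗破蒼穹) can be treated as one document. That record contains several fields, such as the title (斗破蒼穹), author, synopsis, and last update time. Tokenizing the title field yields one or more tokens (斗, 破, 蒼, 穹).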

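Lucene file structure

Hierarchy

index
An index is stored in a single directory.

segment
An index can contain multiple segments. Segments are independent of one another; adding new documents may create a new segment, and segments can be merged into a new one.

document
The document is the basic unit of indexing. Documents are stored in segments, and one segment can hold many documents.

field
A document carries different kinds of information, which can be split into fields and indexed separately.

term
The smallest unit of the index: the data that remains after lexical analysis and language processing.

Forward information

The containment relationships from index down to term are stored level by level: index-->segment-->document-->field-->term.

Reverse information

The reverse information stores the mapping from the dictionary to the posting lists: term-->document.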
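The difference between the two directions can be sketched in plain Java. This is a simplified illustration under assumptions of my own, not Lucene's on-disk format: the class ForwardReverseSketch, the "field:term" key convention, and the sample documents are hypothetical, segments are left out, and whitespace splitting stands in for an analyzer.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy model of forward (document -> field -> term) and reverse (term -> document) information. */
class ForwardReverseSketch {

    // Forward information: docId -> (field name -> raw field value)
    final Map<Integer, Map<String, String>> forward = new LinkedHashMap<>();

    // Reverse information: "field:term" -> ids of the documents whose field contains the term
    final Map<String, TreeSet<Integer>> reverse = new TreeMap<>();

    void addDocument(int docId, Map<String, String> fields) {
        forward.put(docId, fields);
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // Whitespace splitting stands in for a real analyzer here
            for (String term : field.getValue().toLowerCase(Locale.ROOT).trim().split("\\s+")) {
                if (!term.isEmpty()) {
                    reverse.computeIfAbsent(field.getKey() + ":" + term, k -> new TreeSet<>()).add(docId);
                }
            }
        }
    }

    public static void main(String[] args) {
        ForwardReverseSketch index = new ForwardReverseSketch();

        Map<String, String> doc1 = new LinkedHashMap<>();
        doc1.put("title", "Spark");
        doc1.put("content", "Apache Spark is a fast engine");
        index.addDocument(1, doc1);

        Map<String, String> doc2 = new LinkedHashMap<>();
        doc2.put("title", "Lucene");
        doc2.put("content", "Apache Lucene is a search library");
        index.addDocument(2, doc2);

        System.out.println(index.forward.get(1).get("title"));   // Spark
        System.out.println(index.reverse.get("content:apache")); // [1, 2]
        System.out.println(index.reverse.get("title:spark"));    // [1]
    }
}
```

The forward map answers "what does document 1 contain?", while the reverse map answers "which documents contain this term in this field?", which is the question a search asks.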

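IndexWriter
One of the most important classes in Lucene. It adds documents to the index and controls various parameters of the indexing process.

Analyzer
Analyzes the text the search engine encounters. Common implementations include StandardAnalyzer, StopAnalyzer, and WhitespaceAnalyzer.

Directory
Where the index is stored. Lucene offers two locations, disk and memory, through the FSDirectory and RAMDirectory classes respectively. The index is normally kept on disk.

Document
Represents one unit to be indexed. Any file that should be indexed must first be converted into a Document object.

Field
A named part of a Document, such as the title or content fields used in the test cases in this article.

IndexSearcher
The most basic search tool in Lucene; every search goes through an IndexSearcher.

Query
Represents a search. Lucene supports fuzzy queries, phrase queries, combined (boolean) queries, and more, through classes such as TermQuery, BooleanQuery, RangeQuery (TermRangeQuery in current releases), and WildcardQuery.

QueryParser
Parses user input: it scans the string the user typed and produces a Query object.

Hits
After a search completes, the results must be returned and shown to the user; only then has the search served its purpose. Older Lucene releases represented the result set with instances of the Hits class. That class no longer exists in the 7.1.0 release used here; as the test code in this article shows, IndexSearcher.search returns a TopDocs whose scoreDocs array holds the matches.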

Test cases

GitHub code

The code is on GitHub; import the spring-boot-lucene-demo project.

github spring-boot-lucene-demo

Add dependencies

<!-- Query parsing against the analyzed index -->
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-queryparser</artifactId>
  <version>7.1.0</version>
</dependency>

<!-- Highlighting -->
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-highlighter</artifactId>
  <version>7.1.0</version>
</dependency>

<!-- smartcn Chinese analyzer (SmartChineseAnalyzer); depends on Lucene and must match the Lucene version -->
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analyzers-smartcn</artifactId>
  <version>7.1.0</version>
</dependency>

<!-- ik-analyzer Chinese analyzer -->
<dependency>
  <groupId>cn.bestwu</groupId>
  <artifactId>ik-analyzers</artifactId>
  <version>5.1.0</version>
</dependency>

<!-- MMSeg4j analyzer -->
<dependency>
  <groupId>com.chenlb.mmseg4j</groupId>
  <artifactId>mmseg4j-solr</artifactId>
  <version>2.4.0</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.solr</groupId>
      <artifactId>solr-core</artifactId>
    </exclusion>
  </exclusions>
</dependency>

Configure Lucene

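import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.After;
import org.junit.Before;

// The fields and methods in this article all belong to one JUnit 4 test class.

private Directory directory;

private IndexReader indexReader;

private IndexSearcher indexSearcher;

@Before
public void setUp() throws IOException {
  // Where the index is stored: indexDir/ under the current working directory
  directory = FSDirectory.open(Paths.get("indexDir/"));

  // Opening a reader fails if no index has been written yet, so check first
  if (DirectoryReader.indexExists(directory)) {
    // Create the index reader
    indexReader = DirectoryReader.open(directory);

    // Create a searcher over the reader to query the index
    indexSearcher = new IndexSearcher(indexReader);
  }
}

@After
public void tearDown() throws Exception {
  if (indexReader != null) {
    indexReader.close();
  }
  directory.close();
}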

/**
 * Runs the query, then prints the number of hits and each matching document.
 *
 * @param query the query to run
 * @throws IOException if the index cannot be read
 */
public void executeQuery(Query query) throws IOException {

  // Fetch at most the top 100 matches
  TopDocs topDocs = indexSearcher.search(query, 100);

  // Print the total number of matching documents
  System.out.println("Found " + topDocs.totalHits + " documents in total");
  for (ScoreDoc scoreDoc : topDocs.scoreDocs) {

    // Load the stored fields of the matching document
    Document document = indexSearcher.doc(scoreDoc.doc);
    System.out.println("id: " + document.get("id"));
    System.out.println("title: " + document.get("title"));
    System.out.println("content: " + document.get("content"));
  }
}

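/**
 * Prints the tokens an analyzer produces for a piece of text.
 *
 * @param analyzer the analyzer to run; it is closed before this method returns
 * @param text the text to tokenize
 * @throws IOException if tokenization fails
 */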
public void printAnalyzerDoc(Analyzer analyzer, String text) throws IOException {

  TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
  CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
  try {
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
      System.out.println(charTermAttribute.toString());
    }
    tokenStream.end();
  } finally {
    tokenStream.close();
    analyzer.close();
  }
}

Create the index

@Test
public void indexWriterTest() throws IOException {
  long start = System.currentTimeMillis();

  // Where the index is stored: indexDir/ under the current working directory
  Directory directory = FSDirectory.open(Paths.get("indexDir/"));

  // Recent Lucene releases (the author cites 6.6 and later) no longer need a Version argument, and the
  // analyzers have no-argument constructors, so the default StandardAnalyzer can be used directly.
  // The variable below is unused and kept only for reference.
  Version version = Version.LUCENE_7_1_0;

  //Analyzer analyzer = new StandardAnalyzer(); // standard analyzer, suited to English
  //Analyzer analyzer = new SmartChineseAnalyzer(); // Chinese analyzer
  //Analyzer analyzer = new ComplexAnalyzer(); // Chinese analyzer
  //Analyzer analyzer = new IKAnalyzer(); // Chinese analyzer

  Analyzer analyzer = new IKAnalyzer(); // Chinese analyzer

  // Create the index writer configuration
  IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);

  // Create the index writer
  IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

  // Create the Document that will be stored in the index

  Document doc = new Document();

  int id = 1;

  // Add the fields to the document
  doc.add(new IntPoint("id", id));
  doc.add(new StringField("title", "Spark", Field.Store.YES));
  doc.add(new TextField("content", "Apache Spark 是專為大規模數據處理而設計的快速通用的計算引擎", Field.Store.YES));
  doc.add(new StoredField("id", id));

  // Save the document to the index
  indexWriter.addDocument(doc);

  indexWriter.commit();
  // Close the writer
  indexWriter.close();

  long end = System.currentTimeMillis();
  System.out.println("Indexing took " + (end - start) + " ms");
}

Response

17:58:14.655 [main] DEBUG org.wltea.analyzer.dic.Dictionary - 加載擴展詞典：ext.dic
17:58:14.660 [main] DEBUG org.wltea.analyzer.dic.Dictionary - 加載擴展停止詞典：stopword.dic
Indexing took 879 ms

Delete documents

@Test
public void deleteDocumentsTest() throws IOException {
  //Analyzer analyzer = new StandardAnalyzer(); // standard analyzer, suited to English
  //Analyzer analyzer = new SmartChineseAnalyzer(); // Chinese analyzer
  //Analyzer analyzer = new ComplexAnalyzer(); // Chinese analyzer
  //Analyzer analyzer = new IKAnalyzer(); // Chinese analyzer

  Analyzer analyzer = new IKAnalyzer(); // Chinese analyzer

  // Create the index writer configuration
  IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);

  // Create the index writer
  IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

  // Delete the documents whose title field contains the term "Spark".
  // The return value is the operation's sequence number, not the number of deleted documents.
  long seqNo = indexWriter.deleteDocuments(new Term("title", "Spark"));

  // IndexWriter also provides:
  // deleteDocuments(Query... queries): delete the documents matching any of the queries
  // deleteDocuments(Term... terms): delete the documents containing any of the terms
  // deleteAll(): delete every document

  // Deletes issued through IndexWriter are not applied immediately. They are buffered and
  // only take effect when IndexWriter.commit() or IndexWriter.close() is called.

  indexWriter.commit();
  indexWriter.close();

  System.out.println("Delete complete: " + seqNo);
}

Response

Delete complete: 1
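The buffering behaviour described in the code comment can be pictured with a small toy model in plain Java. It is only an illustration of "queue now, apply on commit" and makes no claim about Lucene's internals; the class BufferedDeleteSketch and its methods are hypothetical. Unlike IndexWriter.deleteDocuments, the commit() here returns a count purely so the effect is easy to see.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of deletes that are buffered and applied to the visible state only on commit(). */
class BufferedDeleteSketch {

    // docId -> title: the state that readers can see
    private final Map<Integer, String> committed = new LinkedHashMap<>();

    // Titles queued for deletion but not yet applied
    private final List<String> pendingDeletes = new ArrayList<>();

    void addCommitted(int docId, String title) {
        committed.put(docId, title);
    }

    /** Queues a delete; the visible state is left untouched. */
    void deleteByTitle(String title) {
        pendingDeletes.add(title);
    }

    /** Applies every queued delete and returns how many documents were removed. */
    int commit() {
        int before = committed.size();
        committed.values().removeIf(pendingDeletes::contains);
        pendingDeletes.clear();
        return before - committed.size();
    }

    int visibleCount() {
        return committed.size();
    }

    public static void main(String[] args) {
        BufferedDeleteSketch index = new BufferedDeleteSketch();
        index.addCommitted(1, "Spark");
        index.addCommitted(2, "Lucene");

        index.deleteByTitle("Spark");
        System.out.println(index.visibleCount()); // 2, the delete is still pending
        System.out.println(index.commit());       // 1, one document removed
        System.out.println(index.visibleCount()); // 1
    }
}
```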

Update documents

/**
 * Update test.
 * An update is really a delete followed by an add.
 *
 * @throws IOException if the index cannot be written
 */
@Test
public void updateDocumentTest() throws IOException {
  //Analyzer analyzer = new StandardAnalyzer(); // standard analyzer, suited to English
  //Analyzer analyzer = new SmartChineseAnalyzer(); // Chinese analyzer
  //Analyzer analyzer = new ComplexAnalyzer(); // Chinese analyzer
  //Analyzer analyzer = new IKAnalyzer(); // Chinese analyzer

  Analyzer analyzer = new IKAnalyzer(); // Chinese analyzer

  // Create the index writer configuration
  IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);

  // Create the index writer
  IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

  Document doc = new Document();

  int id = 1;

  doc.add(new IntPoint("id", id));
  doc.add(new StringField("title", "Spark", Field.Store.YES));
  doc.add(new TextField("content", "Apache Spark 是專為大規模數據處理而設計的快速通用的計算引擎", Field.Store.YES));
  doc.add(new StoredField("id", id));

  // id is indexed as an IntPoint, which a Term cannot match, so the old document is
  // located through its title (a StringField) instead of new Term("id", "1").
  // The return value is the operation's sequence number, not a document count.
  long seqNo = indexWriter.updateDocument(new Term("title", "Spark"), doc);
  System.out.println("Document updated: " + seqNo);
  indexWriter.close();
}

Response

Document updated: 1

Search by term

/**
 * Search by term.
 * <p>
 * TermQuery is the simplest and most commonly used Query. It can be thought of as a "term search":
 * the most basic search in a search engine is looking up a single term in the index, and TermQuery does exactly that.
 * In Lucene the term is the basic unit of search. A term is essentially a name/value pair,
 * where the "name" is the field name and the "value" is a keyword contained in that field.
 *
 * @throws IOException if the index cannot be read
 */
@Test
public void termQueryTest() throws IOException {

  String searchField = "title";
  // Build a query for the exact term "Spark" in the title field
  TermQuery query = new TermQuery(new Term(searchField, "Spark"));

  // Run the query and print the number of hits
  executeQuery(query);
}

Response