Fuse.js模糊查詢算法學習指南

更新時間：2023年03月13日 10:29:57 作者：Kiera

這篇文章主要為大家介紹了Fuse.js模糊查詢算法學習指南，有需要的朋友可以借鑒參考下，希望能夠有所幫助，祝大家多多進步，早日升職加薪

Fuse.js是什么

最近在項目里用到了Fuse.js做模糊查詢，便對這個算法起了點好奇心，翻了翻源碼。

Fuse.js 是一個 JavaScript 庫，用于執(zhí)行模糊字符串搜索。它通過比較搜索字符串與目標字符串的相似度來找到最佳匹配。

Fuse.js 使用一種稱為 Bitap 算法的搜索算法來找到最佳匹配。Bitap 算法是一種用于字符串搜索的二進制算法，它通過比較二進制位來判斷字符串是否匹配，其中模式可以與目標有所不同。該算法采用位向量數(shù)據(jù)結構和按位比較以實現(xiàn)字符串匹配。

核心算法Bitap

Bitap算法是fuse.js中用于實現(xiàn)模糊搜索的核心算法之一，其主要思路是利用位運算來計算模式串和目標串之間的相似度。具體來說，Bitap算法首先將模式串轉換為二進制掩碼，并利用位運算來計算模式串和目標串之間的相似度，然后采用一些啟發(fā)式策略來提高算法的準確性和效率。

在fuse.js中，Bitap算法的實現(xiàn)主要在BitapSearch類中。接下來我將嘗試解析一下這個類。

1. 構造函數(shù)初始化

在構造函數(shù)中，會根據(jù)配置參數(shù)計算并設置一些內部變量，如模式串的二進制掩碼、距離閾值等。

const addChunk = (pattern, startIndex) => {
  this.chunks.push({
    pattern,
    alphabet: createPatternAlphabet(pattern),
    startIndex
  })
}

createPatternAlphabet 函數(shù)的作用是生成一個對象 mask，它的鍵是模式字符串中的字符，值是一個二進制數(shù)，表示該字符在模式字符串中的位置。這個字典用于后續(xù)的位運算匹配算法中，用于計算某個字符在目標字符串中出現(xiàn)的位置。

export default function createPatternAlphabet(pattern) {
  let mask = {}
  for (let i = 0, len = pattern.length; i < len; i += 1) {
    const char = pattern.charAt(i)
    mask[char] = (mask[char] || 0) | (1 << (len - i - 1))
  }
  return mask
}

| 表示按位或運算，可以理解為二進制中的||，只要某一二進制位有一個是1，就是1，如果都是0，則是0。

<< 表示左移運算。1 << (len - i - 1)表示將數(shù)字1左移len-i-1位。比如 len=4，i=2 ，將 1 左移 (4-2-1) 位，即左移 1 位，結果為 00000010，也就是十進制數(shù) 2。

以模式字符串"hello"為例，則 mask 對象可能如下所示：

{
  "h": 16, // 二進制00010000，表示 "h" 在模式字符串的第一個位置
  "e": 8,  // 00001000，第二個位置
  "l": 3,  // 00000011，第三和第四個位置
  "o": 1   // 00000001，第五個位置
}

2. 類暴露的searchIn方法

2.1 參數(shù)和工具函數(shù)

searchIn方法中，調用了search函數(shù)。可以看到，search函數(shù)接收了text目標字符串，以及pattern模式串和opions參數(shù)，用于在目標字符串中搜索模式串。

const { isMatch, score, indices } = search(text, pattern, alphabet, {
  location: location + startIndex,
  distance,
  threshold,
  findAllMatches,
  minMatchCharLength,
  includeMatches,
  ignoreLocation
})

fuse.js提供了這些參數(shù)的默認值，比如其中的FuzzyOptions：

export const FuzzyOptions = {
  location: 0,
  threshold: 0.6,
  distance: 100
}

我們重點關注threshold 參數(shù)，它表示匹配的閾值，取值范圍為 [0, 1]。如果匹配的得分小于閾值，則表示匹配失敗。在進行模式分配時，F(xiàn)use.js 會根據(jù)模式串的長度，以及 threshold 參數(shù)，計算出一個可以接受的最大編輯距離，即 distance 參數(shù)。如果兩個字符串的編輯距離超過了這個值，就認為它們不匹配。

具體來說，對于一個長度為 m 的模式串，計算出的最大編輯距離 d 約為 m * (1 - threshold)。例如，如果 threshold 為 0.6，模式串的長度為 4，則 d = 4 * (1 - 0.6) = 1.6，向下取整后得到 1。也就是說，對于一個長度為 4 的模式串，最多允許編輯距離為 1。

computeScore根據(jù)傳入的參數(shù)計算出當前匹配的得分，分數(shù)越低表示匹配程度越高。

export default function computeScore(
  pattern,
  {
    errors = 0,
    currentLocation = 0,
    expectedLocation = 0,
    distance = Config.distance,
    ignoreLocation = Config.ignoreLocation
  } = {}
) {
  const accuracy = errors / pattern.length
  if (ignoreLocation) {
    return accuracy
  }
  const proximity = Math.abs(expectedLocation - currentLocation)
  if (!distance) {
    // Dodge divide by zero error.
    return proximity ? 1.0 : accuracy
  }
  return accuracy + proximity / distance
}

accuracy = 錯誤數(shù)/模式長度，表示當前匹配的質量。proximity = 期望位置 - 當前匹配位置的絕對值，表示它們之間的距離。如果 distance 為 0，避開被除數(shù)為0的錯誤，判斷二者之間距離，返回闕值 1 或者匹配質量的分數(shù)。否則，根據(jù)錯誤數(shù)和期望位置和實際位置之間的距離，計算出匹配得分 score = accuracy + proximity / distance。

我們得到了匹配得分，現(xiàn)在讓我們回到search函數(shù)。

2.2 第一次循環(huán)：

while 循環(huán)在每次迭代中執(zhí)行以下操作：在 text 中搜索 pattern，并調用computeScore計算每個匹配的得分。該循環(huán)用來優(yōu)化搜索算法，不斷比較模式與文本中的字符串，直到找到最佳匹配為止。

let index
// Get all exact matches, here for speed up
while ((index = text.indexOf(pattern, bestLocation)) > -1) {
  let score = computeScore(pattern, {
    currentLocation: index,
    expectedLocation,
    distance,
    ignoreLocation
  })
  currentThreshold = Math.min(score, currentThreshold)
  bestLocation = index + patternLen
  if (computeMatches) {
    let i = 0
    while (i < patternLen) {
      matchMask[index + i] = 1
      i += 1
    }
  }
}

currentThreshold表示當前的閾值，用于控制什么樣的匹配可以被接受。它初始化為最大值，然后每次迭代都會被更新為當前最優(yōu)匹配的得分，以保證后續(xù)的匹配得分不會超過當前最優(yōu)解。同時，如果computeMatches 為true，則在 matchMask 數(shù)組中標記匹配，以便后續(xù)統(tǒng)計匹配數(shù)。

2.3 第二次循環(huán)

每次開始搜索前，重置幾個變量如bestLocation、binMax，計算掩碼mask的值，掩碼的長度等于搜索模式的長度 patternLen。

bestLocation = -1
let lastBitArr = []
let finalScore = 1
let binMax = patternLen + textLen
const mask = 1 << (patternLen - 1)

用一個for循環(huán)遍歷給定的搜索模式中的每個字符，計算出搜索模式的每個字符對應的掩碼值，這個掩碼用來進行位運算匹配。

for (let i = 0; i < patternLen; i += 1){  
    //...不急不急，后面一步步來分解。
}

二分查找算法更新區(qū)間端點

我們先看這個循環(huán)體內的一個while循環(huán)。一個熟悉的二分查找算法，還有一個老朋友computeScore函數(shù)：計算當前二分區(qū)間中間位置的得分。簡直就像是即將迷路的旅人見到了自己熟悉的物事。うれしい! 勝利在望了啊同志們！

let binMin = 0
let binMid = binMax  
while (binMin < binMid) {
  const score = computeScore(pattern, {
    errors: i,
    currentLocation: expectedLocation + binMid,
    expectedLocation,
    distance,
    ignoreLocation
  })
  if (score <= currentThreshold) {
    binMin = binMid
  } else {
    binMax = binMid
  }
  binMid = Math.floor((binMax - binMin) / 2 + binMin)
}

在這個循環(huán)中，每次計算二分區(qū)間中間位置的得分，然后根據(jù)當前得分和閾值來更新區(qū)間端點。這樣，循環(huán)會不斷縮小搜索范圍，直到找到最佳匹配或者搜索范圍縮小到為空為止。再用這個值賦值給binMax作為下一次二分搜索中的右端點：

// Use the result from this iteration as the maximum for the next. 
binMax = binMid

計算區(qū)間兩端的值

計算出左端點 start 和右端點 finish：

let start = Math.max(1, expectedLocation - binMid + 1)
let finish = findAllMatches
  ? textLen
  : Math.min(expectedLocation + binMid, textLen) + patternLen
// Initialize the bit array
let bitArr = Array(finish + 2)
bitArr[finish + 1] = (1 &lt;&lt; i) - 1

左端點 start 的值是 expectedLocation - binMid + 1 和 1 中的較大值，這樣可以保證搜索區(qū)間的左端點不會小于 1。右端點 finish 的值取決于變量 findAllMatches 和文本長度 textLen。如果 findAllMatches 為true，需要搜索整個文本，則將右端點 finish 設置為文本長度 textLen。否則，將右端點 finish 設置為 expectedLocation + binMid和 textLen 中的較小值，并加上搜索模式長度 patternLen，以便搜索可能包含匹配項的區(qū)間。

初始化二進制數(shù)組 bitArr，長度為 finish + 2。數(shù)組中的每個元素代表一位二進制數(shù)中的一位。在 bitArr 數(shù)組中，右端點 finish + 1 的元素被設置為一個二進制數(shù)，(1 << i) - 1確保其后i位均為 1，其余位為 0。在后面的算法中，用來存儲搜索模式和文本之間的匹配信息。

遍歷區(qū)間

從右往左遍歷文本中的每個字符。這個循環(huán)體的代碼很長，沒關系，繼續(xù)分解便是。

for (let j = finish; j >= start; j -= 1) {
  let currentLocation = j - 1
  let charMatch = patternAlphabet[text.charAt(currentLocation)]
  if (computeMatches) {
    // Speed up: quick bool to int conversion (i.e, `charMatch ? 1 : 0`)
    matchMask[currentLocation] = +!!charMatch
  }
  // First pass: exact match
  bitArr[j] = ((bitArr[j + 1] << 1) | 1) & charMatch
  // Subsequent passes: fuzzy match
  if (i) {
    bitArr[j] |=
      ((lastBitArr[j + 1] | lastBitArr[j]) << 1) | 1 | lastBitArr[j + 1]
  }
  if (bitArr[j] & mask) {
    finalScore = computeScore(pattern, {
      errors: i,
      currentLocation,
      expectedLocation,
      distance,
      ignoreLocation
    })
    // This match will almost certainly be better than any existing match.
    // But check anyway.
    if (finalScore <= currentThreshold) {
      // Indeed it is
      currentThreshold = finalScore
      bestLocation = currentLocation
      // Already passed `loc`, downhill from here on in.
      if (bestLocation <= expectedLocation) {
        break
      }
      // When passing `bestLocation`, don't exceed our current distance from `expectedLocation`.
      start = Math.max(1, 2 * expectedLocation - bestLocation)
    }
  }
}

先看第一段：

let currentLocation = j - 1
let charMatch = patternAlphabet[text.charAt(currentLocation)]
if (computeMatches) {
  // Speed up: quick bool to int conversion (i.e, `charMatch ? 1 : 0`)
  matchMask[currentLocation] = +!!charMatch
}
// First pass: exact match
bitArr[j] = ((bitArr[j + 1] << 1) | 1) & charMatch

這里根據(jù)該字符是否與模式串中的對應字符匹配，更新 bitArr 數(shù)組相應位置的值。

patternAlphabet[text.charAt(currentLocation)] 用于獲取當前位置的字符在模式串中最右邊出現(xiàn)的位置。如果該字符不在模式串中，則返回 undefined。然后，將這個位置記錄在 charMatch 變量中，以便在后面的匹配過程中使用。

（bitArr[j + 1] << 1 | 1）將右側位置的匹配狀態(tài)左移一位，將最后一位設為 1，保證右側位置的比特位都是 1。再用& charMatch和當前位置對應的字符是否匹配的比特位進行與運算。如果匹配，那么與運算的結果就是 1，否則是 0。這個過程實際上是在構建比特矩陣，用于后續(xù)的模糊匹配。

這里需要注意的是，由于 bitArr 數(shù)組的長度比文本串和模式串的長度都要長 2，因此 bitArr 數(shù)組中最后兩個位置的值都為 0，即 bitArr[finish + 1] 和 bitArr[finish + 2] 的值都為 0。

// Subsequent passes: fuzzy match
if (i) {
  bitArr[j] |=
    ((lastBitArr[j + 1] | lastBitArr[j]) << 1) | 1 | lastBitArr[j + 1]
}

這段代碼實現(xiàn)了模糊匹配的邏輯。lastBitArr初始化為空數(shù)組，在后面的代碼中，會看到被賦值為上一次循環(huán)的bitArr。

在第一次匹配只考慮完全匹配，bitArr[j] 只需要用到 bitArr[j+1]。但是在后續(xù)的匹配需要考慮字符不匹配的情況，那么就需要用到 lastBitArr 數(shù)組，它存儲了上一次匹配的結果。具體來說，對于當前位置 j，我們把左側、上側和左上側三個位置【這仨位置可以想象成看似矩陣實際是二維數(shù)組的左、左上、上，比如最長公共子序列那個算法】的匹配結果進行或運算，并左移一位。然后再和 1 或上一個特定的值（lastBitArr[j+1]），最終得到 bitArr[j] 的值。這樣就可以考慮字符不匹配的情況，實現(xiàn)模糊匹配的功能。

接下來，判斷當前位置的匹配結果是否滿足閾值要求，如果滿足，則更新最優(yōu)匹配位置。

if (bitArr[j] & mask) {
  finalScore = computeScore(pattern, { //...一些參數(shù)，這里省略  })
  // This match will almost certainly be better than any existing match.
  // But check anyway.
  if (finalScore <= currentThreshold) {
    // Indeed it is
    currentThreshold = finalScore
    bestLocation = currentLocation
    // Already passed `loc`, downhill from here on in.
    if (bestLocation <= expectedLocation) {
      break
    }
    // When passing `bestLocation`, don't exceed our current distance from `expectedLocation`.
    start = Math.max(1, 2 * expectedLocation - bestLocation)
  }
}

如果 bitArr[j] & mask 的結果為真，則說明當前位置匹配成功，接下來計算當前位置的得分 finalScore。如果 finalScore 小或等于當前閾值 currentThreshold ，說明當前匹配結果更優(yōu)，更新閾值和最優(yōu)匹配位置 bestLocation。

如果最優(yōu)匹配位置 bestLocation 小于等于期望位置 expectedLocation，說明已經(jīng)找到了期望位置的最優(yōu)匹配，跳出循環(huán)；否則更新搜索起點 start，保證在向左搜索時不超過當前距離期望位置的距離。

????判斷當前錯誤距離是否已經(jīng)超出了之前最好的匹配結果，如果已經(jīng)超出，則終止后續(xù)匹配，因為后續(xù)匹配的結果不可能更優(yōu)。

// No hope for a (better) match at greater error levels.
const score = computeScore(pattern, {
  errors: i + 1,
  currentLocation: expectedLocation,
  expectedLocation,
  distance,
  ignoreLocation
})
if (score > currentThreshold) {
  break
}
lastBitArr = bitArr

最后，真的最后了????：

const result = {
  isMatch: bestLocation >= 0,
  // Count exact matches (those with a score of 0) to be "almost" exact
  score: Math.max(0.001, finalScore)
}
if (computeMatches) {
  const indices = convertMaskToIndices(matchMask, minMatchCharLength)
  if (!indices.length) {
    result.isMatch = false
  } else if (includeMatches) {
    result.indices = indices
  }
}

convertMaskToIndices()函數(shù)將匹配掩碼轉換為匹配的索引數(shù)組。以上，我們得到了search的結果。

接下來，回到searchIn函數(shù)，我們會看到對result結果的一些其它處理。這里不再贅述。

基于動態(tài)規(guī)劃算法的Levenshtein算法

動態(tài)規(guī)劃（Dynamic Programming）常用于處理具有有重疊子問題和最優(yōu)子結構性質的問題，它將原問題分解成一系列子問題，通過求解子問題的最優(yōu)解來推算出原問題的最優(yōu)解。動態(tài)規(guī)劃算法兩個關鍵步驟：設計狀態(tài)轉移方程，用來表示狀態(tài)之間的關系；確定邊界，設置循環(huán)結束條件。

一個經(jīng)典的動態(tài)規(guī)劃算法例子，使用動態(tài)規(guī)劃算法實現(xiàn)斐波那契數(shù)列：

function fibonacci(n) {
  if (n === 0 || n === 1) return n;
  const dp = new Array(n + 1).fill(0);
  dp[0] = 0;
  dp[1] = 1;
  for (let i = 2; i <= n; i++) {
    dp[i] = dp[i - 1] + dp[i - 2];
  }
  return dp[n];
}

Levenshtein算法是一種用于計算兩個字符串之間的編輯距離的算法，即需要將一個字符串轉換為另一個字符串所需的最少編輯次數(shù)。編輯操作可以是插入、刪除或替換字符。

function levenshteinDistance(str1, str2) {
  const m = str1.length;
  const n = str2.length;
  const dp = [];
  for (let i = 0; i <= m; i++) {
    dp[i] = [i];
  }
  for (let j = 1; j <= n; j++) {
    dp[0][j] = j;
  }
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      const cost = str1[i-1] === str2[j-1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i-1][j] + 1,//刪除
        dp[i][j-1] + 1,
        dp[i-1][j-1] + cost
      );
    }
  }
  return dp[m][n];
}

讓我們照著下圖來分析如何求出dp[i][j]。

|   |   |     s |   i   |   t   |   t   |   i   |   n   |   g   |
|   | 0 | 1     | 2     | 3     | 4     | 5     | 6     | 7     |
| k | 1 | 1     | 2     | 3     | 4     | 5     | 6     | 7     |
| i | 2 | 2     | 1     | 2     | 3     | 4     | 5     | 6     |
| t | 3 | 3     | 2     | 1     | 2     | 3     | 4     | 5     |
| t | 4 | 4     | 3     | 2     | 1     | 2     | 3     | 4     |
| e | 5 | 5     | 4     | 3     | 2     | 2     | 3     | 4     |
| n | 6 | 6     | 5     | 4     | 3     | 3     | 2     | 3     |

假設我們要將字符串 str1 轉換為字符串 str2，并且我們已經(jīng)定義了一個二維數(shù)組 dp，其中 dp[i][j] 表示將字符串 str1 的前 i 個字符轉換為字符串 str2 的前 j 個字符所需的最少編輯次數(shù)。

為了求出 dp[i][j]，我們可以考慮將字符串 str1 的前 i 個字符轉換為字符串 str2 的前 j 個字符時，最后一步進行了什么操作?？赡艿牟僮饔腥N：

刪除字符串 str1 中的第 i 個字符，然后將剩余的字符轉換為字符串 str2 的前 j 個字符。這種情況下，dp[i][j] 就等于 dp[i-1][j] + 1，其中 dp[i-1][j] 表示將字符串 str1 的前 i-1 個字符轉換為字符串 str2 的前 j 個字符所需的最少編輯次數(shù)，再加上刪除字符的操作次數(shù) 1。
在字符串 str1 的第 i 個位置插入字符 str2[j]，然后將剩余的字符轉換為字符串 str2 的前 j 個字符。這種情況下，dp[i][j] 就等于 dp[i][j-1] + 1，其中 dp[i][j-1] 表示將字符串 str1 的前 i 個字符轉換為字符串 str2 的前 j-1 個字符所需的最少編輯次數(shù)，再加上插入字符的操作次數(shù) 1。
將字符串 str1 中的第 i 個字符替換為字符 str2[j]，然后將剩余的字符轉換為字符串 str2 的前 j 個字符。這種情況下，dp[i][j] 就等于 dp[i-1][j-1] + cost，其中 dp[i-1][j-1] 表示將字符串 str1 的前 i-1 個字符轉換為字符串 str2 的前 j-1 個字符所需的最少編輯次數(shù)，再加上替換字符的操作次數(shù) cost（如果 str1[i] 和 str2[j] 相同，那么 cost 就為 0，否則 cost 就為 1）。

上述三種操作中所需的最少編輯次數(shù)取最小值，便可作為將字符串 str1 的前 i 個字符轉換為字符串 str2 的前 j 個字符所需的最少編輯次數(shù)。

以上就是Fuse.js模糊查詢算法學習指南的詳細內容，更多關于Fuse.js模糊查詢算法的資料請關注腳本之家其它相關文章！

您可能感興趣的文章:

js 賦值包含單引號雙引號問題的解決方法
這篇文章主要介紹了js賦值包含單引號雙引號問題的解決方法,需要的朋友可以參考下
2014-02-02
javascript 內存模型實例詳解
這篇文章主要介紹了javascript 內存模型,結合實例形式詳細分析了javascript 內存模型相關概念、原理、操作技巧與注意事項,需要的朋友可以參考下
2020-04-04
JavaScript中的事件處理程序
這篇文章介紹了JavaScript中的事件處理程序，文中通過示例代碼介紹的非常詳細。對大家的學習或工作具有一定的參考借鑒價值，需要的朋友可以參考下
2022-04-04
JS實現(xiàn)橫向輪播圖（初級版）
這篇文章主要為大家詳細介紹了JS實現(xiàn)橫向輪播圖的初級版，文中示例代碼介紹的非常詳細，具有一定的參考價值，感興趣的小伙伴們可以參考一下
2020-01-01
Javascript & DHTML DOM基礎和基本API
DOM是文檔對象模型(Document Object Model，是基于瀏覽器編程（在本教程中，可以說就是DHTML編程）的一套API接口，W3C出臺的推薦標準，每個瀏覽器都有一些細微的差別，其中以Mozilla的瀏覽器最與標準接近。
2008-07-07
基于js原生和ajax的get和post方法以及jsonp的原生寫法實例
下面小編就為大家?guī)硪黄趈s原生和ajax的get和post方法以及jsonp的原生寫法實例。小編覺得挺不錯的，現(xiàn)在就分享給大家，也給大家做個參考。一起跟隨小編過來看看吧
2017-10-10
JavaScript返回0-1之間隨機數(shù)的方法
這篇文章主要介紹了JavaScript返回0-1之間隨機數(shù)的方法,涉及javascript中Math對象random方法的使用技巧,非常具有實用價值,需要的朋友可以參考下
2015-04-04
JavaScript數(shù)據(jù)結構與算法之檢索算法示例【二分查找法、計算重復次數(shù)】
這篇文章主要介紹了JavaScript數(shù)據(jù)結構與算法之檢索算法,結合實例形式分析了二分查找法、計算重復次數(shù)相關算法原理與使用技巧,需要的朋友可以參考下
2019-02-02
javascript 二分法(數(shù)組array)
擴展Array對象，為其添加二分法查找功能,提高查找效率。
2010-04-04
淺談JavaScript中this的指向更改
這篇文章主要介紹了淺談JavaScript中this的指向更改，文中通過示例代碼介紹的非常詳細，對大家的學習或者工作具有一定的參考學習價值，需要的朋友們下面隨著小編來一起學習學習吧
2020-07-07