快捷導(dǎo)航

bloom filter概念講解以及代碼分析

更新時(shí)間：2013年09月22日 09:30:01 作者：

Bloom filter 優(yōu)點(diǎn)就是它的插入和查詢(xún)時(shí)間都是常數(shù)，另外它查詢(xún)?cè)貐s不保存元素本身，具有良好的安全性

一. 簡(jiǎn)介
1.什么是bloom filter？
Bloom filter 是由 Howard Bloom 在 1970 年提出的二進(jìn)制向量數(shù)據(jù)結(jié)構(gòu)，它具有很好的空間和時(shí)間效率，被用來(lái)檢測(cè)一個(gè)元素是不是集合中的一個(gè)成員，這種檢測(cè)只會(huì)對(duì)在集合內(nèi)的數(shù)據(jù)錯(cuò)判，而不會(huì)對(duì)不是集合內(nèi)的數(shù)據(jù)進(jìn)行錯(cuò)判，這樣每個(gè)檢測(cè)請(qǐng)求返回有“在集合內(nèi)（可能錯(cuò)誤）”和“不在集合內(nèi)（絕對(duì)不在集合內(nèi)）”兩種情況，可見(jiàn) Bloom filter 是犧牲了正確率換取時(shí)間和空間。

2.bloom filter的計(jì)算方法？
如需要判斷一個(gè)元素是不是在一個(gè)集合中，我們通常做法是把所有元素保存下來(lái)，然后通過(guò)比較知道它是不是在集合內(nèi)，鏈表、樹(shù)都是基于這種思路，當(dāng)集合內(nèi)元素個(gè)數(shù)的變大，我們需要的空間和時(shí)間都線性變大，檢索速度也越來(lái)越慢。 Bloom filter 采用的是哈希函數(shù)的方法，將一個(gè)元素映射到一個(gè) m 長(zhǎng)度的陣列上的一個(gè)點(diǎn)，當(dāng)這個(gè)點(diǎn)是 1 時(shí)，那么這個(gè)元素在集合內(nèi)，反之則不在集合內(nèi)。這個(gè)方法的缺點(diǎn)就是當(dāng)檢測(cè)的元素很多的時(shí)候可能有沖突，解決方法就是使用 k 個(gè)哈希函數(shù)對(duì)應(yīng) k 個(gè)點(diǎn)，如果所有點(diǎn)都是 1 的話，那么元素在集合內(nèi)，如果有 0 的話，元素則不在集合內(nèi)。

3.bloom filter的特點(diǎn)？
Bloom filter 優(yōu)點(diǎn)就是它的插入和查詢(xún)時(shí)間都是常數(shù)，另外它查詢(xún)?cè)貐s不保存元素本身，具有良好的安全性。它的缺點(diǎn)也是顯而易見(jiàn)的，當(dāng)插入的元素越多，錯(cuò)判“在集合內(nèi)”的概率就越大了，另外 Bloom filter 也不能刪除一個(gè)元素，因?yàn)槎鄠€(gè)元素哈希的結(jié)果可能在 Bloom filter 結(jié)構(gòu)中占用的是同一個(gè)位，如果刪除了一個(gè)比特位，可能會(huì)影響多個(gè)元素的檢測(cè)。

二. 代碼實(shí)現(xiàn)
現(xiàn)下面在linux下實(shí)現(xiàn)了bloom filter的功能代碼：

復(fù)制代碼代碼如下:

// bloom.h:
#ifndef __BLOOM_H__
#define __BLOOM_H__

#include<stdlib.h>

typedef unsigned int (*hashfunc_t)(const char *);
typedef struct {
size_t asize;
unsigned char *a;
size_t nfuncs;
hashfunc_t *funcs;
} BLOOM;

BLOOM *bloom_create(size_t size, size_t nfuncs, ...);
int bloom_destroy(BLOOM *bloom);
int bloom_add(BLOOM *bloom, const char *s);
int bloom_check(BLOOM *bloom, const char *s);

#endif

// bloom.c:
#include<limits.h>
#include<stdarg.h>

#include"bloom.h"

#define SETBIT(a, n) (a[n/CHAR_BIT] |= (1<<(n%CHAR_BIT)))
#define GETBIT(a, n) (a[n/CHAR_BIT] & (1<<(n%CHAR_BIT)))

BLOOM *bloom_create(size_t size, size_t nfuncs, ...)
{
BLOOM *bloom;
va_list l;
int n;

if(!(bloom=malloc(sizeof(BLOOM)))) return NULL;
if(!(bloom->a=calloc((size+CHAR_BIT-1)/CHAR_BIT, sizeof(char)))) {
free(bloom);
return NULL;
}
if(!(bloom->funcs=(hashfunc_t*)malloc(nfuncs*sizeof(hashfunc_t)))) {
free(bloom->a);
free(bloom);
return NULL;
}

va_start(l, nfuncs);
for(n=0; n<nfuncs; ++n) {
bloom->funcs[n]=va_arg(l, hashfunc_t);
}
va_end(l);

bloom->nfuncs=nfuncs;
bloom->asize=size;

return bloom;
}

int bloom_destroy(BLOOM *bloom)
{
free(bloom->a);
free(bloom->funcs);
free(bloom);

return 0;
}

int bloom_add(BLOOM *bloom, const char *s)
{
size_t n;

for(n=0; n<bloom->nfuncs; ++n) {
SETBIT(bloom->a, bloom->funcs[n](s)%bloom->asize);
}

return 0;
}

int bloom_check(BLOOM *bloom, const char *s)
{
size_t n;

for(n=0; n<bloom->nfuncs; ++n) {
if(!(GETBIT(bloom->a, bloom->funcs[n](s)%bloom->asize))) return 0;
}

return 1;
}

// test.c:
#include<stdio.h>
#include<string.h>

#include"bloom.h"
//下面為兩種哈希算法函數(shù)
unsigned int sax_hash(const char *key)
{
unsigned int h=0;

while(*key) h^=(h<<5)+(h>>2)+(unsigned char)*key++;

return h;
}

unsigned int sdbm_hash(const char *key)
{
unsigned int h=0;
while(*key) h=(unsigned char)*key++ + (h<<6) + (h<<16) - h;
return h;
}

int main(int argc, char *argv[])
{
FILE *fp;
char line[1024];
char *p;
BLOOM *bloom;

if(argc<2) {
fprintf(stderr, "ERROR: No word file specified\n");
return EXIT_FAILURE;
}

if(!(bloom=bloom_create(2500000, 2, sax_hash, sdbm_hash))) {
fprintf(stderr, "ERROR: Could not create bloom filter\n");
return EXIT_FAILURE;
}

if(!(fp=fopen(argv[1], "r"))) {
fprintf(stderr, "ERROR: Could not open file %s\n", argv[1]);
return EXIT_FAILURE;
}

while(fgets(line, 1024, fp)) {
if((p=strchr(line, '\r'))) *p='\0';//回車(chē)
if((p=strchr(line, '\n'))) *p='\0';//換行

bloom_add(bloom, line);
}

fclose(fp);

while(fgets(line, 1024, stdin)) {
if((p=strchr(line, '\r'))) *p='\0';
if((p=strchr(line, '\n'))) *p='\0';

p=strtok(line, " \t,.;:\r\n?!-/()");
while(p) {
if(!bloom_check(bloom, p)) {
printf("No match for ford \"%s\"\n", p);
}
else
printf("match for ford \"%s\"\n",p);
p=strtok(NULL, " \t,.;:\r\n?!-/()");
}
}

bloom_destroy(bloom);

return EXIT_SUCCESS;
}

// Makefile:
all: bloom

bloom: bloom.o test.o
cc -o bloom -Wall -pedantic bloom.o test.o

bloom.o: bloom.c bloom.h
cc -o bloom.o -Wall -pedantic -ansi -c bloom.c

test.o: test.c bloom.h
cc -o test.o -Wall -pedantic -ansi -c test.c

您可能感興趣的文章: