一文詳細講解C++精妙的哈希算法

更新時間：2024年10月23日 09:33:25 作者：_小羊_

這篇文章主要介紹了C++精妙的哈希算法的相關(guān)資料,哈希結(jié)構(gòu)通過哈希函數(shù)將關(guān)鍵碼映射到表中的特定位置,以提高搜索效率,理想的哈希函數(shù)應(yīng)保證一致性、哈希值均勻分布、高計算效率與最小化沖突,文中通過代碼介紹的非常詳細,需要的朋友可以參考下

一、哈希結(jié)構(gòu)

1、哈希概念

AVL樹、紅黑樹等平衡樹搜索效率取決于搜索過程中的比較次數(shù)，一般時間復(fù)雜度為O(logN)，雖然平衡樹的搜索效率已經(jīng)很快，但如果可以不經(jīng)過任何比較或者常數(shù)次的比較后就能搜索到我們要找的元素，會極大的提高效率。

哈希結(jié)構(gòu)，是一種通過特定函數(shù)（哈希函數(shù)）將關(guān)鍵碼映射到表中的一個位置，那么在查找時通過該函數(shù)就可以很快的找到該元素。

但是上述的映射方法存在一個問題，就是不同的元素可能會映射到同一個位置，這時就發(fā)生了哈希沖突（也叫哈希碰撞），解決哈希沖突，是實現(xiàn)哈希結(jié)構(gòu)的關(guān)鍵。

2、哈希函數(shù)

引起哈希沖突的一個原因可能是：哈希函數(shù)設(shè)計不合理。哈希函數(shù)的設(shè)計要保證高效性和可靠性：

一致性：確保相同的輸入總是產(chǎn)生相同的輸出哈希值
均勻分布：哈希值應(yīng)在哈希表的地址空間中盡可能均勻分布，以減少哈希沖突
計算效率：哈希函數(shù)應(yīng)簡單且計算快速，以便在實際應(yīng)用中能夠快速執(zhí)行
沖突最小化：設(shè)計哈希函數(shù)時應(yīng)盡量減少哈希沖突的發(fā)生，以提高哈希表的性能

| 常見哈希函數(shù)：哈希函數(shù)是哈希表的核心，它決定了如何將關(guān)鍵字映射到哈希地址。

直接定制法：取關(guān)鍵字的某個線性函數(shù)為散列地址，Hash(Key)=A*Key+B。這種方法簡單、均勻，但需要事先知道關(guān)鍵字的分布情況
除留余數(shù)法：取一個不大于哈希表地址數(shù)m的質(zhì)數(shù)p，按照哈希函數(shù)Hash(key)=key%p將關(guān)鍵碼轉(zhuǎn)換成哈希地址。這種方法實現(xiàn)簡單，且當p選擇合理時，哈希沖突的概率較低
平方取中法：對關(guān)鍵字進行平方運算，然后抽取中間的幾位作為哈希地址。這種方法適用于不知道關(guān)鍵字分布情況，且位數(shù)不是很大的場景
折疊法：將關(guān)鍵字從左到右分割成位數(shù)相等的幾部分（最后一部分位數(shù)可以短些），然后將這幾部分疊加求和，并按哈希表表長取后幾位作為哈希地址。這種方法適用于關(guān)鍵字位數(shù)較多的情況

此外，還有隨機數(shù)法、數(shù)學(xué)分析法等哈希函數(shù)設(shè)計方法，可以根據(jù)具體應(yīng)用場景選擇合適的哈希函數(shù)。
哈希函數(shù)設(shè)計的越好，產(chǎn)生哈希沖突的可能性就越低，但是哈希沖突還是無可避免。

3、哈希沖突

解決哈希沖突的兩種常見方法是：閉散列（開放定址法）和開散列（鏈地址法）。

3.1 閉散列

當發(fā)生哈希沖突時，如果哈希表中還有空位置，就把key存放到?jīng)_突位置的“下一個”空位置去。找下一個空位置，常見的探測方法有線性探測、二次探測和雙重散列等。

| 線性探測： 從發(fā)生沖突的位置開始，依次向后探測，直到找到下一個空位置為止。

插入上圖中在插入15前，通過哈希函數(shù)得到映射位置為5，但是5位置被占了，就依次向后找，在7位置找到了一個空位置將15插入。
刪除閉散列解決哈希沖突時，不好隨便物理刪除某個元素，可以考慮標記的方法來偽刪除一個元素。

//每個位置都給標記
enum State
{
	EXIST,//存在
	DELETE,//刪除
	EMPTY//空
}

| 線性探測實現(xiàn)：

enum State
{
	EXIST,
	EMPTY,
	DELETE
};

template<class K, class V>
struct HashData
{
	pair<K, V> _kv;
	State _state = EMPTY;
};

template<class K, class V>
class HashTable
{
public:

	HashTable()
	{
		_tables.resize(10);//提前開10個位置
	} 
	
private:
	vector<HashData<K, V>> _tables;
	size_t _n = 0;//存儲元素個數(shù)
};

插入

關(guān)鍵碼對表的size()取模，不能對capacity()取模，因為哈希表支持[]訪問，只能訪問下標小于size()的元素。

散列表的載荷因子 = 表中的元素個數(shù) / 表的大小

當載荷因子達到某個臨界值，就需要擴容。載荷因子越大，產(chǎn)生沖突的可能性就越大，相反產(chǎn)生沖突的可能性就越小。通常載荷因子應(yīng)限制在0.7-0.8一下。

不能直接對原表進行擴容，無論是原地擴還是異地擴，都會把原數(shù)據(jù)拷貝過來。但是擴完容后元素的相對位置可能會發(fā)生改變，原本沖突的元素擴完容后就不沖突了，所以直接對原表進行擴容是不行的。

擴容有兩種方法：

方法一：新建原表兩倍大小的vector，遍歷原表的元素重新映射到vector中，再將新建的vector和原表的vector交換。
方法二：因為方法一還需要重寫一遍映射過程，所以可以直接新建一個哈希表，遍歷原表的元素插入到新建的哈希表中，最后交換兩個哈希表的vector，這個方法的好處是新建的哈希表復(fù)用了原哈希表的Insert。

方法一我們就不實現(xiàn)了，直接用更好一點的方法二：

bool Insert(const pair<K, V>& kv)
{
	if (_n * 10 / _tables.size() >= 7)//載荷因子達到一定的值進行擴容
	{
		HashTable<K, V> newHT;
		newHT._tables.resize(2 * _tables.size());
		for (int i = 0; i < _tables.size(); i++)
		{
			if (_tables[i]._state == EXIST)
			{
				newHT.Insert(_tables[i]._kv);
			}
		}
		_tables.swap(newHT._tables);
	}
	size_t hashi = kv.first % _tables.size();//確定映射位置
	while (_tables[hashi]._state == EXIST)
	{
		++hashi;
		hashi %= _tables.size();//防止越界
	}
	_tables[hashi]._kv = kv;
	_tables[hashi]._state = EXIST;
	++_n;
	return true;
}

但是現(xiàn)在還有個問題，在上面的代碼中要求我們的key可以取模，也就是key只能是無符號整數(shù)，如果是浮點數(shù)、字符串等上面的代碼就行不通，所以還需要想辦法將可能出現(xiàn)的浮點數(shù)、字符串等類型的key轉(zhuǎn)換為無符號的整型再做映射。

像浮點數(shù)等可以直接強轉(zhuǎn)為無符號整型，可以考慮用仿函數(shù)解決。字符串一般不能直接強轉(zhuǎn)為無符號整型，我們可以對字符串特殊處理，也就是模版特化，將字符串中字符的ASCII碼值加起來作為映射值。

但是這里還有個問題，將字符串中字符的ASCII碼值加起來也可能沖突，比如相同的字符按不同的順序組合起來的字符串。不過好在有專門的字符串哈希函數(shù)（字符串哈希函數(shù)有好多種，這里使用其中一種：BKDR Hash函數(shù)），這里就不做過多介紹了，有興趣的同學(xué)請百度了解。他給出的解決辦法是字符每次相加之前+31（31、131、1313、13131…都行）來盡可能減少沖突。

template<class K>
struct HashFunc //key強轉(zhuǎn)為整型
{
	size_t operator()(const K& key)
	{
		return (size_t)key;
	}
};

//對string類型特殊處理
template<>
struct HashFunc<string>
{
	size_t operator()(const string& s)
	{
		size_t hash = 0;
		for (auto e : s)
		{
			hash = hash * 31 + e;
		}
		return hash;
	}
};

刪除

刪除指定的元素，只需要找到該元素的位置，將該位置的狀態(tài)標記為DELETE即可。

bool Erase(const K& key)
{
	HashData<K, V>* ret = Find(key);
	if (ret == nullptr)
	{
		return false;
	}
	else
	{
		ret->_state = DELETE;
		return true;
	}
}

查找

查找過程需要注意的是，當在表中找到被查找的元素時還要判斷此位置是否被標記為已刪除，因為刪除操作我們并沒有實際物理上的刪除某個元素。

HashData<K, V>* Find(const K& key)
{
	Hash hs;
	size_t hashi = hs(key) % _tables.size();
	while (_tables[hashi]._state != EMPTY)
	{
		if (_tables[hashi]._state != DELETE 
		&& _tables[hashi]._kv.first == key)
		{
			return &_tables[hashi];
		}
		++hashi;
		hashi %= _tables.size();
	}
	return nullptr;
}

線性探測的優(yōu)點是簡單好理解，缺點是數(shù)據(jù)容易堆積，查找時可能需要多次比較。
閉散列 / 開放定址法我們就先實現(xiàn)到這里，它是一種零和博弈，和下面將要介紹的開散列 / 鏈地址法對比還是稍遜一籌。

3.2 開散列

通過哈希函數(shù)計算散列地址，具有相同映射地址的元素歸于同一子集合，每一個子集合稱為一個哈希桶，各個桶中的元素通過一個單鏈表鏈接起來，哈希表中存各鏈表的頭節(jié)點。開散列每個桶中存放的都是產(chǎn)生哈希沖突的元素。

template<class K>
struct HashFunc
{
	size_t operator()(const K& key)
	{
		return (size_t)key;
	}
};

template<>
struct HashFunc<string>
{
	size_t operator()(const string& s)
	{
		size_t hash = 0;
		for (auto e : s)
		{
			hash = hash * 31 + e;
		}
		return hash;
	}
};

template<class K, class V>
struct HashNode
{
	HashNode(const pair<K, V>& kv)
		:_kv(kv)
		,_next(nullptr)
	{}

	pair<K, V> _kv;
	HashNode<K, V>* _next;
};

template<class K, class V, class Hash = HashFunc<K>>
class HashTable
{
	typedef HashNode<K, V> Node;
public:
	HashTable()
	{
		_tables.resize(10, nullptr);
	}

	~HashTable()
	{
		for (int i = 0; i < _tables.size(); i++)
		{
			Node* pcur = _tables[i];
			while (pcur)
			{
				Node* next = pcur->_next;
				delete pcur;
				pcur = next;
			}
			_tables[i] = nullptr;
		}
	}

private:
	vector<Node*> _tables;
	size_t _n = 0;
};

插入

bool Insert(const pair<K, V>& kv)
{
	size_t hashi = kv.first % _tables.size();

	//負載因子==1就擴容
	if (_n == _tables.size())
	{
		HashTable<K, V> newHT;
		newHT._tables.resize(2 * _tables.size(), nullptr);
		for (int i = 0; i < _tables.size(); i++)
		{
			Node* pcur = _tables[i];
			while (pcur)
			{
				newHT.Insert(pcur->_kv);
				pcur = pcur->_next;
			}
		}
		_tables.swap(newHT._tables);
	}
	Node* newnode = new Node(kv);

	//頭插
	newnode->_next = _tables[hashi];
	_tables[hashi] = newnode;
	++_n;
	return true;
}

上面的擴容過程雖然可行，但是不夠好。假如原表中有很多個節(jié)點，新建新表擴容后復(fù)用Insert就要new很多個節(jié)點再插入，這實際上是很有消耗的。因為原節(jié)點和新new的節(jié)點并無差別，所以可以直接將原表中的節(jié)點拿下來頭插到新表中，這樣就不用再new新節(jié)點。

| 優(yōu)化：

bool Insert(const pair<K, V>& kv)
{
	size_t hashi = hs(kv.first) % _tables.size();
	
	//負載因子==1就擴容
	if (_n == _tables.size())
	{
		vector<Node*> newtables(2 * _tables.size(), nullptr);
		for (int i = 0; i < _tables.size(); i++)
		{
			Node* pcur = _tables[i];
			while (pcur)
			{
				Node* next = pcur->_next;
				size_t hashi = pcur->_kv.first % newtables.size();
				pcur->_next = newtables[hashi];
				newtables[hashi] = pcur;
				pcur = next;
			}
			_tables[i] = nullptr;
		}
		_tables.swap(newtables);
	}
	Node* newnode = new Node(kv);

	//頭插
	newnode->_next = _tables[hashi];
	_tables[hashi] = newnode;
	++_n;
	return true;
}

刪除

學(xué)習(xí)過鏈表我們知道，單鏈表的頭刪和其他位置的刪除需要分開處理，因為其他位置刪除節(jié)點后要將前后節(jié)點鏈接起來，而單鏈表的頭節(jié)點沒有前一個節(jié)點。

bool Erase(const K& key)
{
	Hash hs;
	size_t hashi = hs(key) % _tables.size();
	Node* pcur = _tables[hashi];
	Node* prev = nullptr;
	while (pcur)
	{
		if (pcur->_kv.first == key)
		{
			if (prev == nullptr)
			{
				_tables[hashi] = pcur->_next;
			}
			else
			{
				prev->_next = pcur->_next;
			}
			delete pcur;
			--_n;
			return true;
		}
		prev = pcur;
		pcur = pcur->_next;
	}
	return false;
}

4、完整代碼

namespace open_address
{
	enum State
	{
		EXIST,
		EMPTY,
		DELETE
	};

	template<class K, class V>
	struct HashData
	{
		pair<K, V> _kv;
		State _state = EMPTY;
	};

	template<class K>
	struct HashFunc //key強轉(zhuǎn)為整型
	{
		size_t operator()(const K& key)
		{
			return (size_t)key;
		}
	};

	//對string類型特殊處理
	template<>
	struct HashFunc<string>
	{
		size_t operator()(const string& s)
		{
			size_t hash = 0;
			for (auto e : s)
			{
				hash = hash * 31 + e;
			}
			return hash;
		}
	};

	template<class K, class V, class Hash = HashFunc<K>>
	class HashTable
	{
	public:

		HashTable()
		{
			_tables.resize(10);//提前開10個位置
		}

		bool Insert(const pair<K, V>& kv)
		{
			//去冗余
			if (Find(kv.first))
			{
				return false;
			}

			if (_n * 10 / _tables.size() >= 7)//載荷因子達到一定的值進行擴容
			{
				HashTable<K, V, Hash> newHT;
				newHT._tables.resize(2 * _tables.size());
				for (int i = 0; i < _tables.size(); i++)
				{
					if (_tables[i]._state == EXIST)
					{
						newHT.Insert(_tables[i]._kv);
					}
				}
				_tables.swap(newHT._tables);
			}

			Hash hs;
			size_t hashi = hs(kv.first) % _tables.size();//確定映射位置
			while (_tables[hashi]._state == EXIST)
			{
				++hashi;
				hashi %= _tables.size();//防止越界
			}
			_tables[hashi]._kv = kv;
			_tables[hashi]._state = EXIST;
			++_n;
			return true;
		}

		HashData<K, V>* Find(const K& key)
		{
			Hash hs;
			size_t hashi = hs(key) % _tables.size();
			while (_tables[hashi]._state != EMPTY)
			{
				if (_tables[hashi]._state != DELETE && _tables[hashi]._kv.first == key)
				{
					return &_tables[hashi];
				}
				++hashi;
				hashi %= _tables.size();
			}
			return nullptr;
		}

		bool Erase(const K& key)
		{
			HashData<K, V>* ret = Find(key);
			if (ret == nullptr)
			{
				return false;
			}
			else
			{
				ret->_state = DELETE;
				return true;
			}
		}
	private:
		vector<HashData<K, V>> _tables;
		size_t _n = 0;//存儲元素個數(shù)
	};
}

namespace close_address
{
	template<class K>
	struct HashFunc
	{
		size_t operator()(const K& key)
		{
			return (size_t)key;
		}
	};

	template<>
	struct HashFunc<string>
	{
		size_t operator()(const string& s)
		{
			size_t hash = 0;
			for (auto e : s)
			{
				hash = hash * 31 + e;
			}
			return hash;
		}
	};

	template<class K, class V>
	struct HashNode
	{
		HashNode(const pair<K, V>& kv)
			:_kv(kv)
			, _next(nullptr)
		{}

		pair<K, V> _kv;
		HashNode<K, V>* _next;
	};

	template<class K, class V, class Hash = HashFunc<K>>
	class HashTable
	{
		typedef HashNode<K, V> Node;
	public:
		HashTable()
		{
			_tables.resize(10, nullptr);
		}

		~HashTable()
		{
			for (int i = 0; i < _tables.size(); i++)
			{
				Node* pcur = _tables[i];
				while (pcur)
				{
					Node* next = pcur->_next;
					delete pcur;
					pcur = next;
				}
				_tables[i] = nullptr;
			}
		}

		bool Insert(const pair<K, V>& kv)
		{
			Hash hs;
			size_t hashi = hs(kv.first) % _tables.size();

			負載因子==1就擴容
			//if (_n == _tables.size())
			//{
			//	HashTable<K, V> newHT;
			//	newHT._tables.resize(2 * _tables.size(), nullptr);
			//	for (int i = 0; i < _tables.size(); i++)
			//	{
			//		Node* pcur = _tables[i];
			//		while (pcur)
			//		{
			//			newHT.Insert(pcur->_kv);
			//			pcur = pcur->_next;
			//		}
			//	}
			//	_tables.swap(newHT._tables);
			//}

			//負載因子==1就擴容
			if (_n == _tables.size())
			{
				vector<Node*> newtables(2 * _tables.size(), nullptr);
				for (int i = 0; i < _tables.size(); i++)
				{
					Node* pcur = _tables[i];
					while (pcur)
					{
						Node* next = pcur->_next;//記錄下一個節(jié)點
						size_t hashi = hs(pcur->_kv.first) % newtables.size();//映射新表的相對位置
						pcur->_next = newtables[hashi];//頭插
						newtables[hashi] = pcur;
						pcur = next;
					}
					_tables[i] = nullptr;
				}
				_tables.swap(newtables);
			}
			Node* newnode = new Node(kv);

			//頭插
			newnode->_next = _tables[hashi];
			_tables[hashi] = newnode;
			++_n;
			return true;
		}

		Node* Find(const K& key)
		{
			Hash hs;
			size_t hashi = hs(key) % _tables.size();
			Node* pcur = _tables[hashi];
			while (pcur)
			{
				if (key == pcur->_kv.first)
				{
					return pcur;
				}
				pcur = pcur->_next;
			}
			return nullptr;
		}

		bool Erase(const K& key)
		{
			Hash hs;
			size_t hashi = hs(key) % _tables.size();
			Node* pcur = _tables[hashi];
			Node* prev = nullptr;
			while (pcur)
			{
				if (pcur->_kv.first == key)
				{
					if (prev == nullptr)
					{
						_tables[hashi] = pcur->_next;
					}
					else
					{
						prev->_next = pcur->_next;
					}
					delete pcur;
					--_n;
					return true;
				}
				prev = pcur;
				pcur = pcur->_next;
			}
			return false;
		}
	private:
		vector<Node*> _tables;
		size_t _n = 0;
	};
}