K-Nearest Neighbors (KNN): an implementation with sklearn and Python

Updated: 2020-02-24 10:35:22  Author: zcc_TPJH

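Overview of the k-nearest neighbors algorithm

Put simply, the k-nearest neighbors algorithm classifies a sample by measuring the distances between feature values.

The k-nearest neighbors algorithm

Advantages: high accuracy, insensitive to outliers, no assumptions about the input data.

Disadvantages: high computational complexity and high space complexity.

Applicable data types: numeric and nominal values.

The k-nearest neighbors algorithm (kNN) works as follows. We start with a collection of sample data, also called the training set, in which every sample carries a label, so we know which class each sample belongs to. When a new, unlabeled sample arrives, each of its features is compared with the corresponding features of the samples in the training set, and the algorithm extracts the class labels of the most similar samples (the nearest neighbors). In general only the k most similar samples are considered, which is where the k in k-nearest neighbors comes from; k is usually an integer no greater than 20. Finally, the class that occurs most often among those k samples is assigned to the new sample.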

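General workflow of the k-nearest neighbors algorithm

Collect data: any method can be used.

Prepare data: numeric values are required for the distance computation, preferably in a structured data format.

Analyze data: any method can be used.

Train the algorithm: this step does not apply to k-nearest neighbors.

Test the algorithm: compute the error rate.

Use the algorithm: first supply the sample data and the structured output, then run k-nearest neighbors to decide which class each input belongs to, and finally apply any follow-up processing to the computed classes.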

Let's now work through some code to understand the KNN algorithm:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

# Prepare the data set
iris=datasets.load_iris()
X=iris.data
print('X:\n',X)
Y=iris.target
print('Y:\n',Y)

# This is a binary classification problem, so keep only the rows with Y=0 or Y=1,
# and take the first two columns of X from those rows
x=X[Y<2,:2]
print(x.shape)
print('x:\n',x)
y=Y[Y<2]
print('y:\n',y)
# Points with target=0 are drawn in red and points with target=1 in green;
# the horizontal axis is the first column of the data, the vertical axis the second
plt.scatter(x[y==0,0],x[y==0,1],color='red')
plt.scatter(x[y==1,0],x[y==1,1],color='green')
# The point to be predicted is drawn in blue
plt.scatter(5.6,3.2,color='blue')
x_1=np.array([5.6,3.2])
plt.title('Red: label 0, green: label 1, blue: point to predict')

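As the figure shows, we want to predict the class of the blue point. We use the Euclidean distance formula, d(a, b) = sqrt(sum_i (a_i - b_i)^2), to compute the distance between two points.

Once the distances from the blue point to all other points have been computed, we sort the data in ascending order of distance, count the classes of the k nearest points, and return the class with the most votes as the class of the blue point.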

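# Compute the Euclidean distance from every sample to the query point
distances=[np.sqrt(np.sum((x_t-x_1)**2)) for x_t in x]
# Sorted distances (kept for inspection only)
d=np.sort(distances)
# argsort returns the indices that would sort the array
nearest=np.argsort(distances)
k=6
topk_y=[y[i] for i in nearest[:k]]
from collections import Counter
# Count the labels in topk_y; Counter behaves like a dictionary
votes=Counter(topk_y)
print(votes)
# Take the label with the most votes
predict_y=votes.most_common(1)[0][0]
print(predict_y)
plt.show()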

Counter({1: 4, 0: 2})
1

The result shows that with k=6, of the 6 points nearest to the blue point, 4 are green and 2 are red, so the blue point is predicted to be green (label 1).
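
The distance-then-vote procedure above can also be written in vectorized NumPy. The following is only a minimal sketch on a made-up toy data set; the helper name `knn_vote`, the points, and the labels are assumptions for illustration and are not part of the original code.

```python
from collections import Counter

import numpy as np


def knn_vote(x_train, y_train, query, k):
    """Return the majority label among the k training points closest to query."""
    # Euclidean distance from the query to every training point at once
    distances = np.linalg.norm(x_train - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: three points near the origin (label 0), three far away (label 1)
x_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

label_near_origin = knn_vote(x_toy, y_toy, np.array([0.5, 0.5]), k=3)
label_far = knn_vote(x_toy, y_toy, np.array([5.5, 5.5]), k=3)
print(label_near_origin, label_far)
```

Computing all distances with one `np.linalg.norm` call avoids the Python-level loop, which tends to matter once the training set grows.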

The functionality implemented above can be wrapped in a class:

KNN.py

import numpy as np
from collections import Counter


class KNNClassifier:
    def __init__(self,k):
        assert k>=1,'k must be valid'
        self.k=k
        self._x_train=None
        self._y_train=None

    def fit(self,x_train,y_train):
        # kNN has no real training step: it only stores the training data
        self._x_train=x_train
        self._y_train=y_train
        return self

    def _predict(self,x):
        # Predict the label of a single sample x
        d=[np.sqrt(np.sum((x_i-x)**2)) for x_i in self._x_train]
        nearest=np.argsort(d)
        top_k=[self._y_train[i] for i in nearest[:self.k]]
        votes=Counter(top_k)
        return votes.most_common(1)[0][0]

    def predict(self,X_predict):
        # Predict the label of every sample in X_predict
        y_predict=[self._predict(x1) for x1 in X_predict]
        return np.array(y_predict)

    def __repr__(self):
        return 'knn(k=%d):'%self.k

    def score(self,x_test,y_test):
        # Accuracy: the fraction of test samples predicted correctly
        y_predict=self.predict(x_test)
        return sum(y_predict==y_test)/len(x_test)

Model selection: splitting the data into a training set and a test set

model_selection.py

import numpy as np


def train_test_split(x,y,test_ratio=0.2,seed=None):
    # Seed the random number generator when a seed is given (0 is a valid seed)
    if seed is not None:
        np.random.seed(seed)
    # Generate a random ordering of the sample indices
    shuffled_indexes=np.random.permutation(len(x))
    print(shuffled_indexes)
    # The test set takes test_ratio of the samples (20% by default)
    test_size=int(test_ratio*len(x))
    test_indexes=shuffled_indexes[:test_size]
    train_indexes=shuffled_indexes[test_size:]
    x_test=x[test_indexes]
    y_test=y[test_indexes]
    x_train=x[train_indexes]
    y_train=y[train_indexes]
    return x_train,x_test,y_train,y_test

'''
The train_test_split in sklearn:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=666)
'''
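
As a quick check of the sklearn variant mentioned in the comment above, the following sketch splits the iris data with `test_size=0.2` and a fixed `random_state`. The second call and its variable names are added here only to illustrate that a fixed seed reproduces the same split.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x, y = iris.data, iris.target

# Two calls with the same random_state should produce the same split
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=666)
x_train2, x_test2, y_train2, y_test2 = train_test_split(
    x, y, test_size=0.2, random_state=666)

print(x_train.shape, x_test.shape)   # 120 training rows and 30 test rows
print((x_test == x_test2).all())     # identical test sets
```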

Below we measure the model's accuracy in two different ways:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
iris=datasets.load_iris()
x=iris.data
y=iris.target

# Using sklearn's own train_test_split and KNeighborsClassifier
x_train,x_test,y_train,y_test=train_test_split(x,y)
knn_classifier=KNeighborsClassifier(6)
knn_classifier.fit(x_train,y_train)
y_predict=knn_classifier.predict(x_test)
scores=knn_classifier.score(x_test,y_test)
print('acc:{}'.format(sum(y_predict==y_test)/len(x_test)),scores)


# Using the versions we wrote ourselves
# (this import shadows sklearn's train_test_split from here on)
from model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(x,y)
from KNN import KNNClassifier
# metrics.py is a local helper module that the article does not show
from metrics import accuracy_score
my_knn=KNNClassifier(k=6)
my_knn.fit(X_train,y_train)
y_predict=my_knn.predict(X_test)
print(accuracy_score(y_test,y_predict))
score=my_knn.score(X_test,y_test)
print(score)
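
The local `metrics` module imported above is not shown in the article. A minimal sketch of what its `accuracy_score` might look like, assuming it simply returns the fraction of predictions that match the true labels (the body below is a guess at the helper, not the author's code):

```python
import numpy as np


def accuracy_score(y_true, y_predict):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_predict = np.asarray(y_predict)
    assert y_true.shape[0] == y_predict.shape[0], \
        'y_true and y_predict must have the same length'
    return np.sum(y_true == y_predict) / len(y_true)


# Three of the four hypothetical predictions are correct
example_accuracy = accuracy_score([0, 1, 1, 0], [0, 1, 0, 0])
print(example_accuracy)
```

Saved as metrics.py next to KNN.py and model_selection.py, a function like this would let the script above run as written.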

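Once we have an accuracy figure, improving accuracy on the test set further requires tuning the model.

Hyperparameters: parameters that must be set before the algorithm runs (good values are found through domain knowledge, empirical defaults, or experimental search).

Model parameters: parameters learned while the algorithm runs.

KNN has no model parameters, and the k in KNN is a typical hyperparameter. We will use experimental search to find good hyperparameters.

Finding the best k: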

def main():
    from sklearn import datasets
    digits=datasets.load_digits()
    x=digits.data
    y=digits.target
    from sklearn.model_selection import train_test_split
    x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=666)
    from sklearn.neighbors import KNeighborsClassifier
    # Search for the best k
    best_k=-1
    best_score=0
    for i in range(1,11):
        knn_clf=KNeighborsClassifier(n_neighbors=i)
        knn_clf.fit(x_train,y_train)
        scores=knn_clf.score(x_test,y_test)
        if scores>best_score:
            best_score=scores
            best_k=i
    print('Best k: %d, best score: %.4f'%(best_k,best_score))

if __name__ == '__main__':
    main()

Best k: 4, best score: 0.9917

Are there any other hyperparameters?

From the sklearn documentation:

sklearn.neighbors.KNeighborsClassifier

Parameters:

n_neighbors : int, optional (default = 5)

Number of neighbors to use by default for kneighbors queries.

weights : str or callable, optional (default = ‘uniform')

weight function used in prediction. Possible values:

‘uniform' : uniform weights. All points in each neighborhood are weighted equally.

‘distance' : weight points by the inverse of their distance. in this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.

[callable] : a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights.

algorithm : {‘auto', ‘ball_tree', ‘kd_tree', ‘brute'}, optional

Algorithm used to compute the nearest neighbors:

‘ball_tree' will use BallTree

‘kd_tree' will use KDTree

‘brute' will use a brute-force search.

‘auto' will attempt to decide the most appropriate algorithm based on the values passed to fit method.

Note: fitting on sparse input will override the setting of this parameter, using brute force.

leaf_size : int, optional (default = 30)

Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.

p : integer, optional (default = 2)

Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

metric : string or callable, default ‘minkowski'

the distance metric to use for the tree. The default metric is minkowski, and with p=2 is equivalent to the standard Euclidean metric. See the documentation of the DistanceMetric class for a list of available metrics.

metric_params : dict, optional (default = None)

Additional keyword arguments for the metric function.

n_jobs : int, optional (default = 1)

The number of parallel jobs to run for neighbors search. If -1, then the number of jobs is set to the number of CPU cores. Doesn't affect fit method.

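n_neighbors: defaults to 5. This is the k in k-NN, i.e. the number of nearest points to select.

weights: defaults to uniform. It accepts uniform, distance, or a user-defined function. uniform means equal weights: all neighboring points carry the same weight. distance means unequal weights: nearer points have more influence than farther ones. A user-defined function receives an array of distances and returns an array of weights with the same shape.

algorithm: the fast nearest-neighbor search algorithm. The default, auto, lets the estimator pick a suitable search algorithm itself. You can also specify ball_tree, kd_tree, or brute. brute is brute-force search, i.e. a linear scan, which becomes very slow when the training set is large. kd_tree builds a kd-tree, a binary tree that stores the data for fast retrieval; it is built by splitting at the median, each node corresponds to a hyperrectangle, and it is efficient when the number of dimensions is below about 20. ball_tree was devised to overcome the kd-tree's breakdown in high dimensions; it partitions the sample space by a centroid C and a radius r, and each node is a hypersphere.

leaf_size: defaults to 30. This is the leaf size passed to the kd-tree or ball tree. Its value affects the speed of building and querying the tree as well as the memory needed to store it. The best value depends on the nature of the problem.

metric: the distance metric. The default is minkowski, which with p=2 is the Euclidean distance.

p: the power parameter of the Minkowski metric. In the previous section we measured distance with the Euclidean formula, but other metrics exist, such as the Manhattan distance. The default is 2, which gives the Euclidean distance; setting it to 1 gives the Manhattan distance.

metric_params: additional keyword arguments for the metric function. This can usually be left at the default, None.

n_jobs: the parallelism setting, i.e. the number of parallel jobs used for the neighbor search. It defaults to 1 in the version documented above. If it is -1, all CPU cores are used.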
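
To make the parameter list concrete, here is a small sketch that sets several of these parameters explicitly and reads them back with `get_params`. The particular values are arbitrary choices for illustration, not recommendations.

```python
from sklearn.neighbors import KNeighborsClassifier

# Arbitrary illustrative settings for the parameters discussed above
knn_clf = KNeighborsClassifier(
    n_neighbors=3,        # k
    weights='distance',   # closer neighbours get larger weights
    algorithm='kd_tree',  # force a kd-tree instead of 'auto'
    leaf_size=30,
    p=1,                  # Manhattan distance
)

params = knn_clf.get_params()
print(params['n_neighbors'], params['weights'], params['algorithm'], params['p'])
```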

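With or without distance weighting?

When K=3, as the figure shows, the blue points win the majority vote, so the green point is assigned to the blue class. This classification takes the three points nearest to the green point into account but ignores how far each of them is from it. The figure shows that the red point is actually the closest one. To take distance into account we attach a weight to each neighbor: the closer the neighbor, the larger the weight. The reciprocal of the distance is commonly used as the weight, w = 1/d.

In addition, distance-weighted KNN also helps resolve tied votes.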
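
The effect of weighting can be checked with a small worked example. The neighbor distances below are hypothetical values chosen to mirror the situation described above (one close red point, two more distant blue points); they are not taken from the article's figure.

```python
from collections import defaultdict

# Hypothetical 3 nearest neighbours: (label, distance to the query point)
neighbors = [('red', 1.0), ('blue', 3.0), ('blue', 4.0)]

uniform_votes = defaultdict(float)
weighted_votes = defaultdict(float)
for label, distance in neighbors:
    uniform_votes[label] += 1.0              # every neighbour counts the same
    weighted_votes[label] += 1.0 / distance  # closer neighbours count more

uniform_winner = max(uniform_votes, key=uniform_votes.get)
weighted_winner = max(weighted_votes, key=weighted_votes.get)
print(uniform_winner, weighted_winner)
```

With equal weights blue wins 2 to 1, while with reciprocal weights red scores 1 against 1/3 + 1/4 = 7/12 for blue, so red wins.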

Searching for the best value of the hyperparameter weights, out of ['uniform','distance']:

def main():
    from sklearn import datasets
    digits=datasets.load_digits()
    x=digits.data
    y=digits.target
    from sklearn.model_selection import train_test_split
    x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=666)
    from sklearn.neighbors import KNeighborsClassifier
    # Search for the best k and weights
    best_k=-1
    best_score=0
    best_method=''
    for method in ['uniform','distance']:
        for i in range(1,11):
            knn_clf=KNeighborsClassifier(n_neighbors=i,weights=method)
            knn_clf.fit(x_train,y_train)
            scores=knn_clf.score(x_test,y_test)
            if scores>best_score:
                best_score=scores
                best_k=i
                best_method=method
    print('Best k: %d, best score: %.4f, best method: %s'%(best_k,best_score,best_method))

if __name__ == '__main__':
    main()

More general definitions of distance give us another hyperparameter, p, the order of the Minkowski distance (sum_i |a_i - b_i|^p)^(1/p).
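
The role of p can be illustrated with a short function. This is a sketch of the standard Minkowski distance, with the helper name `minkowski_distance` and the two example vectors chosen here for illustration: p=1 gives the Manhattan distance and p=2 the Euclidean distance.

```python
import numpy as np


def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)


a, b = [0.0, 0.0], [3.0, 4.0]
manhattan = minkowski_distance(a, b, p=1)   # |3| + |4|
euclidean = minkowski_distance(a, b, p=2)   # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```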

Searching for the best p for the Minkowski distance:

def main():
    from sklearn import datasets
    digits=datasets.load_digits()
    x=digits.data
    y=digits.target
    from sklearn.model_selection import train_test_split
    x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=666)
    from sklearn.neighbors import KNeighborsClassifier
    # Search for the best k and p (with weights fixed to 'distance')
    best_k=-1
    best_score=0
    best_p=-1

    for i in range(1,11):
        for p in range(1,6):
            knn_clf=KNeighborsClassifier(n_neighbors=i,weights='distance',p=p)
            knn_clf.fit(x_train,y_train)
            scores=knn_clf.score(x_test,y_test)
            if scores>best_score:
                best_score=scores
                best_k=i
                best_p=p
    print('Best k: %d, best score: %.4f, best p: %d'%(best_k,best_score,best_p))

if __name__ == '__main__':
    main()

Best k: 3, best score: 0.9889, best p: 2

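The examples above show that hyperparameters may be coupled to one another: in the program above, for instance, p was only searched together with weights='distance'. How can we list all the hyperparameters we care about in one place and find the best combination in a single run? Grid search solves this problem, and sklearn provides it as GridSearchCV.

The grid search provided by GridSearchCV exhaustively generates candidates from a grid of parameter values specified with the param_grid parameter. For example, the following param_grid: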

param_grid = [
 {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
 {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
 ]

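specifies that two grids should be explored: one with a linear kernel and C values in [1, 10, 100, 1000], and a second one with an RBF kernel and the cross-product of C values in [1, 10, 100, 1000] and gamma values in [0.001, 0.0001].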
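
How many candidates such a grid produces can be checked with sklearn's `ParameterGrid`, which expands a `param_grid` into the individual parameter combinations. A small sketch using the grid above:

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

candidates = list(ParameterGrid(param_grid))
# 4 linear candidates plus 4 * 2 rbf candidates
print(len(candidates))
print(candidates[0])
```

Each dictionary in the list is expanded independently, so parameters such as gamma only appear in the candidates of the grid that declares them.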

In our example:

def main():
    import numpy as np
    from sklearn import datasets
    digits=datasets.load_digits()
    x=digits.data
    y=digits.target
    from sklearn.model_selection import train_test_split
    x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=666)
    from sklearn.neighbors import KNeighborsClassifier

    # Define the parameters that the grid search should explore
    param_grid=[
        {
            'weights':['uniform'],
            'n_neighbors':[i for i in range(1,11)]
        },
        {
            'weights':['distance'],
            'n_neighbors':[i for i in range(1,11)],
            'p':[i for i in range(1,6)]
        }
    ]
    knn_clf=KNeighborsClassifier()
    from sklearn.model_selection import GridSearchCV
    # n_jobs sets how many cores to use; -1 uses every core of the machine in parallel.
    # verbose prints progress information during the search.
    grid_search=GridSearchCV(knn_clf,param_grid,n_jobs=-1,verbose=1)
    grid_search.fit(x_train,y_train)
    # The best classifier found by the grid search
    print(grid_search.best_estimator_)
    # The parameters of the best classifier
    print(grid_search.best_params_)
    # The cross-validation score of the best classifier
    print(grid_search.best_score_)
    knn_clf=grid_search.best_estimator_
    print(knn_clf.score(x_test,y_test))

if __name__ == '__main__':
    main()

Fitting 3 folds for each of 60 (10+50) candidates, totalling 180 fits
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 1.6s
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed: 30.6s finished
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
  metric_params=None, n_jobs=1, n_neighbors=3, p=3,
  weights='distance')
{'n_neighbors': 3, 'weights': 'distance', 'p': 3}
0.985386221294
0.983333333333

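When measuring distances, another very important concept is feature scaling (data normalization).

The example above shows that when discovery time is measured in days, the distance between samples is dominated by discovery time, whereas when it is measured in years, the distance is dominated by tumor size. In other words, if we compute distances directly on unprocessed data, the result is biased.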
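
A small numeric sketch makes the effect visible. The two samples below are hypothetical values invented for illustration (tumor size in centimeters and time since discovery), not data from the article.

```python
import numpy as np

# Hypothetical samples: [tumor size in cm, time since discovery in days]
sample_a_days = np.array([1.0, 200.0])
sample_b_days = np.array([5.0, 100.0])

# The same two samples with the time expressed in years instead
sample_a_years = np.array([1.0, 200.0 / 365.0])
sample_b_years = np.array([5.0, 100.0 / 365.0])

dist_days = np.linalg.norm(sample_a_days - sample_b_days)
dist_years = np.linalg.norm(sample_a_years - sample_b_years)
print(dist_days, dist_years)
```

In days the distance is roughly 100 and comes almost entirely from the time difference; in years it is roughly 4 and comes almost entirely from the size difference.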

Data normalization

Solution: map all the data onto the same scale.

import numpy as np
import matplotlib.pyplot as plt
# Min-max normalization: map the values into [0, 1]
x=np.random.randint(0,100,size=100)
print(x)
normal_x=(x-np.min(x))/(np.max(x)-np.min(x))
print(normal_x)
# Mean-variance normalization (standardization): zero mean, unit standard deviation per column
x2=np.random.randint(0,100,(50,2))
print(x2)
x2=np.array(x2,dtype=float)
x2[:,0]=(x2[:,0]-np.mean(x2[:,0]))/np.std(x2[:,0])
x2[:,1]=(x2[:,1]-np.mean(x2[:,1]))/np.std(x2[:,1])
plt.scatter(x2[:,0],x2[:,1])
plt.show()

x:[49 27 88 47 6 89 9 98 17 72 46 46 80 62 28 38 0 27 22 14 2 79 70 73 15
 57 85 6 11 76 59 62 23 32 82 78 0 45 8 82 13 81 99 61 43 21 45 61 93 63
 66 57 78 60 50 8 29 63 74 8 25 55 10 69 3 77 41 24 15 23 21 31 36 78 94
 52 12 1 23 99 8 12 15 37 75 75 27 14 31 75 6 56 29 96 23 0 22 98 86 10]
normal_x:[ 0.49494949 0.27272727 0.88888889 0.47474747 0.06060606 0.8989899
 0.09090909 0.98989899 0.17171717 0.72727273 0.46464646 0.46464646
 0.80808081 0.62626263 0.28282828 0.38383838 0.  0.27272727
 0.22222222 0.14141414 0.02020202 0.7979798 0.70707071 0.73737374
 0.15151515 0.57575758 0.85858586 0.06060606 0.11111111 0.76767677
 0.5959596 0.62626263 0.23232323 0.32323232 0.82828283 0.78787879
 0.  0.45454545 0.08080808 0.82828283 0.13131313 0.81818182
 1.  0.61616162 0.43434343 0.21212121 0.45454545 0.61616162
 0.93939394 0.63636364 0.66666667 0.57575758 0.78787879 0.60606061
 0.50505051 0.08080808 0.29292929 0.63636364 0.74747475 0.08080808
 0.25252525 0.55555556 0.1010101 0.6969697 0.03030303 0.77777778
 0.41414141 0.24242424 0.15151515 0.23232323 0.21212121 0.31313131
 0.36363636 0.78787879 0.94949495 0.52525253 0.12121212 0.01010101
 0.23232323 1.  0.08080808 0.12121212 0.15151515 0.37373737
 0.75757576 0.75757576 0.27272727 0.14141414 0.31313131 0.75757576
 0.06060606 0.56565657 0.29292929 0.96969697 0.23232323 0.
 0.22222222 0.98989899 0.86868687 0.1010101 ]

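StandardScaler in sklearn

from sklearn import datasets
from sklearn.model_selection import train_test_split
# Assumption: the iris data set is used here; the original snippet relies on
# a labeled x and y being defined beforehand
iris=datasets.load_iris()
x=iris.data
y=iris.target
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=666)
# StandardScaler in sklearn
from sklearn.preprocessing import StandardScaler
standardscaler=StandardScaler()
standardscaler.fit(x_train)
# Mean of each feature
print(standardscaler.mean_)
# Standard deviation of each feature (scale_ holds the standard deviation, not the variance)
print(standardscaler.scale_)
# Transform both sets with the statistics learned from the training set
x_train_standard=standardscaler.transform(x_train)
x_test_standard=standardscaler.transform(x_test)
print(x_train_standard)
from sklearn.neighbors import KNeighborsClassifier
knn_clf=KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(x_train_standard,y_train)
scores=knn_clf.score(x_test_standard,y_test)
print(scores)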
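
One detail worth checking is that the scaler is fitted on the training set only and then reused for the test set. The following sketch verifies this on a tiny made-up array (the feature values are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test features
x_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
x_test = np.array([[2.0, 400.0]])

scaler = StandardScaler().fit(x_train)   # statistics come from the training set only
x_train_std = scaler.transform(x_train)
x_test_std = scaler.transform(x_test)    # reuses the training mean and std

print(scaler.mean_)                 # per-feature training mean
print(x_train_std.mean(axis=0))     # approximately 0 for each feature
print(x_train_std.std(axis=0))      # approximately 1 for each feature
print(x_test_std)
```

The transformed test row is not forced to zero mean: it is expressed relative to the training statistics, which is what the classifier will also see for new data.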

Summary: