Machine Learning: Data Cleaning and Six Missing-Value Handling Methods
1 Data Cleaning
1.1 Concept
Data cleaning is the process of identifying and then correcting or removing data in a dataset that is inaccurate, incomplete, inconsistently formatted, or in violation of business rules. It is a core part of data preprocessing: its goal is to raise data quality and ensure consistency and accuracy, providing a reliable foundation for downstream work such as data analysis, data mining, and machine learning. Cleaning is an iterative process and may take several rounds of adjustment and tuning to reach a satisfactory result.
1.2 Importance
Data cleaning is a critical stage of data processing: it turns raw data into a usable, reliable, and meaningful form for further analysis and mining.
It is also a key step in data science and analytics, because it directly determines the accuracy and reliability of every downstream result. Dirty data can lead to wrong conclusions and wrong decisions.
1.3 Points to check
- 1. Completeness: check each record for null values and confirm that every expected field is present.
- 2. Comprehensiveness: inspect all values of a column; comparing the maximum, minimum, mean, and the field's definition helps judge whether the data covers the expected range.
- 3. Validity: check that each value's type, content, and magnitude obey the preset rules. For example, a human age above 1000 years is invalid.
- 4. Uniqueness: check for duplicated records, e.g. the same person recorded multiple times.
- 5. Reliability: check whether the class labels themselves are trustworthy.
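As a minimal sketch (with hypothetical column names and values), the completeness, validity, and uniqueness checks above can be expressed in pandas:

```python
import pandas as pd

# Hypothetical toy records for illustration
df = pd.DataFrame({'age': [25, 1200, 30, 30],
                   'name': ['Ann', 'Bob', 'Cid', 'Cid']})

# Completeness: nulls per column
missing = df.isnull().sum()

# Validity: a human age above 1000 violates the rule above
invalid = df[df['age'] > 1000]

# Uniqueness: fully duplicated records
dupes = df.duplicated().sum()
# invalid has 1 row; dupes == 1
```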
2 Null Checking, Type Conversion, and Standardization
2.1 Missing values already stored as NaN
- null_num = data.isnull() marks each cell, returning True where the value is missing
- null_all = null_num.sum() counts the missing values per column
2.2 Missing values stored as placeholder strings
When missing values are recorded as the string 'NA', replace the placeholder first and then count. (Note that an empty string '' is not recognized by isnull(); in the full pipeline below, pd.to_numeric(..., errors='coerce') later turns such entries into NaN.)
data.replace('NA', '', inplace=True)
null_num = data.isnull()
null_all = null_num.sum()
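On toy data, the difference between the two replacement targets is easy to see; replacing the placeholder with np.nan (rather than '') makes the nulls countable immediately:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['1', 'NA', '3'],
                   'b': ['NA', '2', '3']})

# np.nan is recognized by isnull(); the empty string '' would not be
df.replace('NA', np.nan, inplace=True)

null_all = df.isnull().sum()
# one missing value in each column
```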
2.2.1 Results
Debug results: (screenshots of null_num, null_all, and the processed data omitted)
2.3 Type conversion and standardization
- Convert the feature columns to numeric:
pd.to_numeric(data, errors='coerce')
- Standardize:
scaler = StandardScaler()
x_all_z = scaler.fit_transform(x_all)
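A self-contained sketch of this step on toy feature values: non-numeric entries become NaN under coerce, and (only so StandardScaler can run here) the NaN is filled with the column mean:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

x_all = pd.DataFrame({'f1': ['1', '2', 'bad', '4'],
                      'f2': ['10', '20', '30', '40']})
# errors='coerce': anything that cannot be parsed as a number becomes NaN
for column_name in x_all.columns:
    x_all[column_name] = pd.to_numeric(x_all[column_name], errors='coerce')

scaler = StandardScaler()
x_all_z = scaler.fit_transform(x_all.fillna(x_all.mean()))
# each column of x_all_z now has mean 0 and unit variance
```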
Debug result: (screenshot of x_all_z omitted)
3 Six Missing-Value Handling Methods
3.1 Data description
A sample of the data: the first column is the row number (序號), the last column is the class label, and the remaining columns are feature variables. (screenshot omitted)
3.2 Imports and variables
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
train_data, train_label, test_data, test_label are the training-set features and labels and the test-set features and labels; the label column is named 礦物類型.
3.3 Deleting incomplete rows (complete-case analysis)
# Complete-case analysis: drop every row that contains a missing value
def cca_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    # reset_index() renumbers the rows consecutively
    data = data.reset_index(drop=True)
    # dropna() removes rows containing any null
    df_filled = data.dropna()
    # renumber again after the deletions
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型

def cca_test_fill(train_data, train_label, test_data, test_label):
    data = pd.concat([test_data, test_label], axis=1)
    data = data.reset_index(drop=True)
    df_filled = data.dropna()
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型
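A toy run of the deletion idea (hypothetical feature values; the label column keeps the name used throughout, 礦物類型):

```python
import numpy as np
import pandas as pd

train_data = pd.DataFrame({'f1': [1.0, np.nan, 3.0],
                           'f2': [4.0, 5.0, np.nan]})
train_label = pd.Series([0, 1, 0], name='礦物類型')

data = pd.concat([train_data, train_label], axis=1).reset_index(drop=True)
# only rows with no missing value survive
df_filled = data.dropna().reset_index(drop=True)
x, y = df_filled.drop('礦物類型', axis=1), df_filled.礦物類型
# exactly one fully observed row remains
```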
3.4 Mean filling
# Mean filling
# Fill the training set, one class at a time, with that class's own means
def mean_train_method(data):
    # column means of this subset
    fill_values = data.mean()
    # fillna() replaces each null with the corresponding column mean
    return data.fillna(fill_values)

def mean_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    data = data.reset_index(drop=True)
    A = data[data['礦物類型'] == 0]
    B = data[data['礦物類型'] == 1]
    C = data[data['礦物類型'] == 2]
    D = data[data['礦物類型'] == 3]
    A = mean_train_method(A)
    B = mean_train_method(B)
    C = mean_train_method(C)
    D = mean_train_method(D)
    df_filled = pd.concat([A, B, C, D])
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型

# Fill the test set using the TRAINING set's per-class means
def mean_test_method(train_data, test_data):
    fill_values = train_data.mean()
    return test_data.fillna(fill_values)

def mean_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    A_train = train_data_all[train_data_all['礦物類型'] == 0]
    B_train = train_data_all[train_data_all['礦物類型'] == 1]
    C_train = train_data_all[train_data_all['礦物類型'] == 2]
    D_train = train_data_all[train_data_all['礦物類型'] == 3]
    A_test = test_data_all[test_data_all['礦物類型'] == 0]
    B_test = test_data_all[test_data_all['礦物類型'] == 1]
    C_test = test_data_all[test_data_all['礦物類型'] == 2]
    D_test = test_data_all[test_data_all['礦物類型'] == 3]
    # the test set is filled from training-set statistics
    A = mean_test_method(A_train, A_test)
    B = mean_test_method(B_train, B_test)
    C = mean_test_method(C_train, C_test)
    D = mean_test_method(D_train, D_test)
    df_filled = pd.concat([A, B, C, D])
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型
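The per-class logic above can be sketched more compactly with groupby on toy data: each class's NaN is filled with that class's own column mean, not the global mean.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'f1': [1.0, np.nan, 10.0, np.nan],
                     '礦物類型': [0, 0, 1, 1]})
filled = pd.concat(
    [g.fillna(g.mean()) for _, g in data.groupby('礦物類型')]
).reset_index(drop=True)
# class 0's NaN becomes 1.0, class 1's becomes 10.0
```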
3.5 Median filling
# Median filling
def median_train_method(data):
    fill_values = data.median()
    return data.fillna(fill_values)

def median_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    data = data.reset_index(drop=True)
    A = data[data['礦物類型'] == 0]
    B = data[data['礦物類型'] == 1]
    C = data[data['礦物類型'] == 2]
    D = data[data['礦物類型'] == 3]
    A = median_train_method(A)
    B = median_train_method(B)
    C = median_train_method(C)
    D = median_train_method(D)
    df_filled = pd.concat([A, B, C, D])
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型

# the test set is filled with the TRAINING set's per-class medians
def median_test_method(train_data, test_data):
    fill_values = train_data.median()
    return test_data.fillna(fill_values)

def median_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    A_train = train_data_all[train_data_all['礦物類型'] == 0]
    B_train = train_data_all[train_data_all['礦物類型'] == 1]
    C_train = train_data_all[train_data_all['礦物類型'] == 2]
    D_train = train_data_all[train_data_all['礦物類型'] == 3]
    A_test = test_data_all[test_data_all['礦物類型'] == 0]
    B_test = test_data_all[test_data_all['礦物類型'] == 1]
    C_test = test_data_all[test_data_all['礦物類型'] == 2]
    D_test = test_data_all[test_data_all['礦物類型'] == 3]
    A = median_test_method(A_train, A_test)
    B = median_test_method(B_train, B_test)
    C = median_test_method(C_train, C_test)
    D = median_test_method(D_train, D_test)
    df_filled = pd.concat([A, B, C, D])
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型
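Why prefer the median over the mean? It is robust to outliers, as a one-column toy example shows:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 100.0, None])
# the mean (26.5) is dragged up by the outlier 100; the median (2.5) is not
filled = s.fillna(s.median())
```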
3.6 Mode filling
# Mode filling
def mode_train_method(data):
    # apply() runs the lambda on every column:
    # take the first mode if one exists, otherwise None
    fill_values = data.apply(lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else None)
    return data.fillna(fill_values)

def mode_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    data = data.reset_index(drop=True)
    A = data[data['礦物類型'] == 0]
    B = data[data['礦物類型'] == 1]
    C = data[data['礦物類型'] == 2]
    D = data[data['礦物類型'] == 3]
    A = mode_train_method(A)
    B = mode_train_method(B)
    C = mode_train_method(C)
    D = mode_train_method(D)
    df_filled = pd.concat([A, B, C, D])
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型

# the test set is filled with the TRAINING set's per-class modes
def mode_test_method(train_data, test_data):
    fill_values = train_data.apply(lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else None)
    return test_data.fillna(fill_values)

def mode_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    A_train = train_data_all[train_data_all['礦物類型'] == 0]
    B_train = train_data_all[train_data_all['礦物類型'] == 1]
    C_train = train_data_all[train_data_all['礦物類型'] == 2]
    D_train = train_data_all[train_data_all['礦物類型'] == 3]
    A_test = test_data_all[test_data_all['礦物類型'] == 0]
    B_test = test_data_all[test_data_all['礦物類型'] == 1]
    C_test = test_data_all[test_data_all['礦物類型'] == 2]
    D_test = test_data_all[test_data_all['礦物類型'] == 3]
    A = mode_test_method(A_train, A_test)
    B = mode_test_method(B_train, B_test)
    C = mode_test_method(C_train, C_test)
    D = mode_test_method(D_train, D_test)
    df_filled = pd.concat([A, B, C, D])
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型
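The mode lambda deserves a note: Series.mode() returns a (possibly empty, possibly multi-valued) Series, so the code takes the first value when one exists. A toy check:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'f1': [1.0, 1.0, 2.0, np.nan]})
# first mode if any, else None (guards against all-NaN columns)
fill_values = df.apply(lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else None)
filled = df.fillna(fill_values)
# the NaN is replaced by the most frequent value, 1.0
```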
3.7 線性填充
def lr_train_fill(train_data,train_label):
train_data_all = pd.concat([train_data,train_label],axis=1)
train_data_all = train_data_all.reset_index(drop=True)
train_data_x = train_data_all.drop('礦物類型',axis=1)
# 計算空值個數
null_num = train_data_x.isnull().sum()
# 根據空值個數排列列名
null_num_sorted = null_num.sort_values(ascending=True)
filling_feature = []
for i in null_num_sorted.index:
filling_feature.append(i)
# 該列空值個數不為0
if null_num_sorted[i] != 0:
# x為去除當前含空列的其他列特征數據
x = train_data_x[filling_feature].drop(i,axis=1)
# y為含空列所有數據
y = train_data_x[i]
# 空列行索引列表
row_numbers_null_list = train_data_x[train_data_x[i].isnull()].index.tolist()
# 訓練集x為去除空行的x
x_train = x.drop(row_numbers_null_list)
# 訓練集y為去除空行的y
y_train = y.drop(row_numbers_null_list)
# 測試集空行的x數據
x_test = x.iloc[row_numbers_null_list]
lr = LinearRegression()
lr.fit(x_train,y_train)
# 預測空值結果
y_pr = lr.predict(x_test)
train_data_x.loc[row_numbers_null_list,i] = y_pr
print(f'完成訓練數據集中的{i}列數據清洗')
return train_data_x,train_data_all.礦物類型
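The mechanics of one loop iteration, reduced to a two-column toy frame: the missing entry of x2 is predicted from x1 by a fitted line.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'x1': [1.0, 2.0, 3.0, 4.0],
                   'x2': [2.0, 4.0, np.nan, 8.0]})   # x2 = 2 * x1
null_rows = df[df['x2'].isnull()].index.tolist()
x_train = df.drop(null_rows)[['x1']]   # rows where x2 is known
y_train = df.drop(null_rows)['x2']
lr = LinearRegression()
lr.fit(x_train, y_train)
df.loc[null_rows, 'x2'] = lr.predict(df.loc[null_rows, ['x1']])
# the filled value lies on the line y = 2x, i.e. 6.0 at x1 = 3
```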
def lr_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    train_data_x = train_data_all.drop('礦物類型', axis=1)
    test_data_x = test_data_all.drop('礦物類型', axis=1)
    null_num = test_data_x.isnull().sum()
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []
    for i in null_num_sorted.index:
        filling_feature.append(i)
        if null_num_sorted[i] != 0:
            x_train = train_data_x[filling_feature].drop(i, axis=1)
            y_train = train_data_x[i]
            x_test = test_data_x[filling_feature].drop(i, axis=1)
            row_numbers_null_list = test_data_x[test_data_x[i].isnull()].index.tolist()
            x_test = x_test.iloc[row_numbers_null_list]
            lr = LinearRegression()
            # the model is fitted on TRAINING data, then fills the test set
            lr.fit(x_train, y_train)
            y_pr = lr.predict(x_test)
            test_data_x.loc[row_numbers_null_list, i] = y_pr
            print(f'Finished filling column {i} of the test set')
    return test_data_x, test_data_all.礦物類型
3.8 Random-forest filling
# Random-forest filling: the same column-by-column scheme as the linear
# version, with RandomForestRegressor as the imputation model
def Random_train_fill(train_data, train_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    train_data_x = train_data_all.drop('礦物類型', axis=1)
    null_num = train_data_x.isnull().sum()
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []
    for i in null_num_sorted.index:
        filling_feature.append(i)
        if null_num_sorted[i] != 0:
            x = train_data_x[filling_feature].drop(i, axis=1)
            y = train_data_x[i]
            row_numbers_null_list = train_data_x[train_data_x[i].isnull()].index.tolist()
            x_train = x.drop(row_numbers_null_list)
            y_train = y.drop(row_numbers_null_list)
            x_test = x.iloc[row_numbers_null_list]
            rf = RandomForestRegressor(n_estimators=100, max_features=0.8,
                                       random_state=314, n_jobs=-1)
            rf.fit(x_train, y_train)
            y_pr = rf.predict(x_test)
            train_data_x.loc[row_numbers_null_list, i] = y_pr
            print(f'Finished filling column {i} of the training set')
    return train_data_x, train_data_all.礦物類型

def Random_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    train_data_x = train_data_all.drop('礦物類型', axis=1)
    test_data_x = test_data_all.drop('礦物類型', axis=1)
    null_num = test_data_x.isnull().sum()
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []
    for i in null_num_sorted.index:
        filling_feature.append(i)
        if null_num_sorted[i] != 0:
            x_train = train_data_x[filling_feature].drop(i, axis=1)
            y_train = train_data_x[i]
            x_test = test_data_x[filling_feature].drop(i, axis=1)
            row_numbers_null_list = test_data_x[test_data_x[i].isnull()].index.tolist()
            x_test = x_test.iloc[row_numbers_null_list]
            rf = RandomForestRegressor(n_estimators=100, max_features=0.8,
                                       random_state=314, n_jobs=-1)
            # fit on training data, then fill the test set
            rf.fit(x_train, y_train)
            y_pr = rf.predict(x_test)
            test_data_x.loc[row_numbers_null_list, i] = y_pr
            print(f'Finished filling column {i} of the test set')
    return test_data_x, test_data_all.礦物類型
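Swapping in RandomForestRegressor follows the same pattern as the linear sketch; unlike the linear model, it can track non-linear feature relations. A toy example with an assumed quadratic relation:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 200)
df = pd.DataFrame({'x1': x1, 'x2': x1 ** 2})   # non-linear relation
df.loc[0, 'x2'] = np.nan                       # knock out one value
known = df['x2'].notna()
rf = RandomForestRegressor(n_estimators=100, max_features=0.8,
                           random_state=314, n_jobs=-1)
rf.fit(df.loc[known, ['x1']], df.loc[known, 'x2'])
df.loc[~known, 'x2'] = rf.predict(df.loc[~known, ['x1']])
```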
4 Saving the Data
The data produced by each filling method should be saved to its own file; just change the text inside the [ ] brackets of the filename accordingly.
Code:
data_train = pd.concat([ov_x_train, ov_y_train], axis=1).sample(frac=1, random_state=4)
data_test = pd.concat([x_test_fill, y_test_fill], axis=1).sample(frac=1, random_state=4)
data_train.to_excel(r'./data_train_test//訓練數據集[隨機森林回歸].xlsx', index=False)
data_test.to_excel(r'./data_train_test//測試數據集[隨機森林回歸].xlsx', index=False)
5 Putting It All Together
For convenience, the filling functions above are packaged in a separate module named file_data.
Full script:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import file_data
from imblearn.over_sampling import SMOTE
data = pd.read_excel('礦物數據.xls')
data = data[data['礦物類型'] != 'E']
# isnull() marks missing cells as True
null_num = data.isnull()
# count missing values per column
null_all = null_num.sum()
x_all = data.drop('礦物類型', axis=1).drop('序號', axis=1)
y_all = data.礦物類型
# map the class labels to integers
label_dict = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
encod_labels = [label_dict[label] for label in y_all]
# wrap the encoded labels in a Series
y_all = pd.Series(encod_labels, name='礦物類型')
# convert every feature column to numeric; unparseable entries become NaN
for column_name in x_all.columns:
    x_all[column_name] = pd.to_numeric(x_all[column_name], errors='coerce')
# standardize
scaler = StandardScaler()
x_all_z = scaler.fit_transform(x_all)
x_all = pd.DataFrame(x_all_z, columns=x_all.columns)
x_train, x_test, y_train, y_test = \
    train_test_split(x_all, y_all, test_size=0.3, random_state=50000)
### Uncomment one block below to select the filling method
# complete-case analysis (row deletion)
# x_train_fill, y_train_fill = file_data.cca_train_fill(x_train, y_train)
# x_test_fill, y_test_fill = file_data.cca_test_fill(x_train_fill, y_train_fill, x_test, y_test)
# mean
# x_train_fill, y_train_fill = file_data.mean_train_fill(x_train, y_train)
# x_test_fill, y_test_fill = file_data.mean_test_fill(x_train_fill, y_train_fill, x_test, y_test)
# median
# x_train_fill, y_train_fill = file_data.median_train_fill(x_train, y_train)
# x_test_fill, y_test_fill = file_data.median_test_fill(x_train_fill, y_train_fill, x_test, y_test)
# mode
# x_train_fill, y_train_fill = file_data.mode_train_fill(x_train, y_train)
# x_test_fill, y_test_fill = file_data.mode_test_fill(x_train_fill, y_train_fill, x_test, y_test)
# linear regression
# x_train_fill, y_train_fill = file_data.lr_train_fill(x_train, y_train)
# x_test_fill, y_test_fill = file_data.lr_test_fill(x_train_fill, y_train_fill, x_test, y_test)
# random-forest regression
x_train_fill, y_train_fill = file_data.Random_train_fill(x_train, y_train)
x_test_fill, y_test_fill = file_data.Random_test_fill(x_train_fill, y_train_fill, x_test, y_test)
# SMOTE oversampling to balance the classes
oversampler = SMOTE(k_neighbors=1, random_state=42)
ov_x_train, ov_y_train = oversampler.fit_resample(x_train_fill, y_train_fill)
# shuffle and save
data_train = pd.concat([ov_x_train, ov_y_train], axis=1).sample(frac=1, random_state=4)
data_test = pd.concat([x_test_fill, y_test_fill], axis=1).sample(frac=1, random_state=4)
data_train.to_excel(r'./data_train_test//訓練數據集[隨機森林回歸].xlsx', index=False)
data_test.to_excel(r'./data_train_test//測試數據集[隨機森林回歸].xlsx', index=False)
Run results: (screenshots omitted)
到此這篇關于機器學習之數據清洗及六種缺值處理方式小結的文章就介紹到這了,更多相關機器學習數據清洗缺值內容請搜索腳本之家以前的文章或繼續(xù)瀏覽下面的相關文章希望大家以后多多支持腳本之家!