An Example Implementation of Data Cleaning with pandas

Updated: 2024-07-23 08:26:27  Author: 寫代碼的大學生

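Data cleaning is an essential step in data science and data analysis. It means preprocessing the data before analysis to ensure its quality and consistency. The pandas library for Python is a common choice for this work because it provides a rich set of data manipulation and cleaning functions.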

1. Import the required libraries

import pandas as pd
from pandas import DataFrame
import numpy as np

2. Handling missing data

There are two kinds of missing data:

None
np.nan (NaN)

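Why does data analysis use the floating-point null rather than the object-type null?

Data analysis routinely applies arithmetic to the raw data. If the missing values in the raw data are stored as NaN, they do not interfere with or interrupt those computations.
NaN can take part in arithmetic (the result is NaN).
None cannot take part in arithmetic (it raises a TypeError).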
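
The difference between the two nulls can be checked directly. The snippet below is a minimal sketch; the variable names are illustrative only and do not come from the original article.

```python
import numpy as np

# np.nan is an ordinary float, so arithmetic with it succeeds and yields NaN.
nan_type = type(np.nan)
nan_sum = np.nan + 1

# None is Python's null object; arithmetic with it raises TypeError.
try:
    none_sum = None + 1
    none_error = None
except TypeError as exc:
    none_error = type(exc).__name__

# NumPy also offers NaN-aware reductions that skip missing values.
values = np.array([1.0, np.nan, 3.0])
plain_total = values.sum()        # NaN propagates through a plain sum
skip_total = np.nansum(values)    # NaN is ignored here

print(nan_type, nan_sum, none_error, plain_total, skip_total)
```

This is why NaN is the more convenient marker for numeric data: a computation keeps running and the missing value simply propagates or is skipped.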

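# Build a 7x5 frame of random integers; cast to float so the columns can hold NaN
df = DataFrame(data=np.random.randint(0,100,size=(7,5))).astype(float)
# Both None and np.nan end up stored as NaN in a float column
df.iloc[2,3] = None
df.iloc[4,2] = np.nan
df.iloc[5,4] = None
df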
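
To see how pandas stores the two kinds of null, here is a small sketch on a deterministic frame. The frame `demo` and its column names are made up for illustration; the frame is created as float up front so that writing a missing value needs no dtype change.

```python
import numpy as np
import pandas as pd

# A fixed 4x3 float frame: column a = 0, 3, 6, 9 and so on.
demo = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3), columns=["a", "b", "c"])
demo.iloc[1, 0] = None      # None is stored as NaN in a float column
demo.iloc[2, 2] = np.nan

stored = demo.iloc[1, 0]
missing_per_column = demo.isnull().sum()
column_means = demo.mean()  # NaN is skipped by default (skipna=True)

print(missing_per_column)
print(column_means)
```

Once stored, a None and an np.nan are indistinguishable in a numeric column, and reductions such as mean() simply skip them.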

3. Handling null values in pandas

isnull
notnull
any
all
dropna
fillna

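# Which rows contain null values?
# any(axis=1) checks each row for at least one null
df.isnull().any(axis=1) # any is applied to every row of the result returned by isnull
# Rows marked True are the ones that contain missing data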
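
The axis argument is easy to mix up, so here is a small sketch on a fixed frame showing what any and all return for each direction. The frame `small` and its columns are illustrative assumptions.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [4.0, 5.0, np.nan],
    "z": [7.0, 8.0, 9.0],
})
mask = small.isnull()

rows_with_null = mask.any(axis=1)             # one flag per row
cols_with_null = mask.any(axis=0)             # one flag per column
complete_rows = small.notnull().all(axis=1)   # True only if the whole row is non-null

print(rows_with_null.tolist())
print(cols_with_null.to_dict())
print(complete_rows.tolist())
```

In short, isnull pairs naturally with any (is there at least one null?) and notnull pairs with all (is everything present?).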

df.notnull()
df.notnull().all(axis=1)
# Use the boolean Series as a row indexer on the source data
df.loc[df.notnull().all(axis=1)]
# Get the rows that contain null values
df.loc[df.isnull().any(axis=1)]
# Get the row index labels of the rows that contain null values
indexs = df.loc[df.isnull().any(axis=1)].index
indexs
df.drop(labels=indexs,axis=0)
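
The list above also names dropna and fillna, which the walkthrough does not demonstrate. The following is a minimal sketch on a fixed frame (the name `frame` and its values are assumptions for illustration): dropna gives the same rows as the boolean-index approach, while fillna and ffill replace nulls instead of discarding rows.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10.0, 20.0, np.nan, 40.0],
})

# dropna(axis=0) removes every row that contains a null,
# which matches filtering with notnull().all(axis=1).
via_mask = frame.loc[frame.notnull().all(axis=1)]
via_dropna = frame.dropna(axis=0)

# Filling keeps all rows and substitutes a value for each null.
filled_const = frame.fillna(0)             # a constant
filled_mean = frame.fillna(frame.mean())   # each column's own mean
filled_ffill = frame.ffill()               # the previous valid value in the column

print(via_dropna)
print(filled_ffill)
```

Whether to drop or fill depends on the data: dropping is simplest when few rows are affected, while filling preserves the row count at the cost of inventing values.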

4. Case study

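Data description:

The data are temperature readings from one cold-storage room. Columns 1-7 correspond to seven temperature sensors, each sampled once per minute.

Processing goal:

Use the four required sensors (1-4) to build a model of the room's temperature field and estimate the readings of sensors 5-7.
Each cold-storage room would then need only four sensors instead of seven.
f(1-4) --> y(5-7)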

Processing steps:

1. The raw data contain dropped frames and must be preprocessed;
2. Plot with matplotlib;
3. Build a logistic regression model.

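There is no standard answer. Work according to your own understanding and briefly describe your process in writing. Thank you for your cooperation.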

The test data file is testData.xlsx.

data = pd.read_excel('./data/testData.xlsx').drop(labels=['none','none1'],axis=1)
data

data.shape
# Drop the rows that contain null values and compare the shapes
data.dropna(axis=0).shape
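
The remaining steps of the task are not shown in the article, so the following is only a rough sketch of how f(1-4) --> y(5-7) might be approached after the dropped frames are removed. Everything here is an assumption: the data are synthetic stand-ins for testData.xlsx, the linear coupling `mix` is hypothetical, and ordinary least squares is swapped in for the logistic regression named in the task because the targets in this sketch are continuous temperatures.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the real file: columns 1-7 are the seven sensors.
n = 200
base = rng.normal(loc=-18.0, scale=1.0, size=(n, 4))
mix = rng.normal(size=(4, 3))   # hypothetical linear coupling between sensors
targets = base @ mix + 0.5 + rng.normal(scale=0.01, size=(n, 3))
readings = pd.DataFrame(np.hstack([base, targets]), columns=[1, 2, 3, 4, 5, 6, 7])

# Simulate dropped frames in three rows, then remove the affected rows.
readings.iloc[[3, 50, 120], [1, 4, 6]] = np.nan
clean = readings.dropna(axis=0)

# Fit y(5-7) from sensors 1-4 by least squares, with an intercept column.
X = np.column_stack([clean[[1, 2, 3, 4]].to_numpy(), np.ones(len(clean))])
Y = clean[[5, 6, 7]].to_numpy()
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)

pred = X @ coef
max_abs_error = np.abs(pred - Y).max()
print(clean.shape, coef.shape, max_abs_error)
```

On real data the relationship may not be linear, so the fit quality should be checked on held-out rows before any sensors are removed.
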
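
5. Handling duplicate data

df = DataFrame(data=np.random.randint(0,100,size=(8,6)))
# Overwrite three rows with identical values to create duplicates
df.iloc[1] = [1,1,1,1,1,1]
df.iloc[3] = [1,1,1,1,1,1]
df.iloc[5] = [1,1,1,1,1,1]
df
# Detect which rows duplicate an earlier row
df.duplicated(keep='first')
df.loc[~df.duplicated(keep='first')]
# Remove the duplicates in a single step
df.drop_duplicates(keep='first')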
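
The keep and subset parameters change which rows count as duplicates. Here is a short sketch on a fixed frame; `records` and its columns are invented for illustration.

```python
import pandas as pd

records = pd.DataFrame({
    "device": ["A", "A", "B", "B", "C"],
    "temp": [1.5, 1.5, 2.0, 2.5, 3.0],
})

# keep="first" flags only the later copies; keep=False flags every copy.
first_kept = records.duplicated(keep="first")
all_marked = records.duplicated(keep=False)

# subset restricts the comparison to selected columns.
by_device = records.drop_duplicates(subset=["device"], keep="last")

# reset_index(drop=True) renumbers the rows after removal.
deduped = records.drop_duplicates(keep="first").reset_index(drop=True)

print(first_kept.tolist())
print(by_device)
```

Note that drop_duplicates returns a new frame and leaves the original untouched, so the result has to be assigned if it is needed later.
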
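
6. Handling outliers

df = DataFrame(data=np.random.random(size=(1000,3)),columns=['A','B','C'])
df.head()
# Define the outlier condition: a value in column C greater than twice the standard deviation of C
twice_std = df['C'].std() * 2
twice_std
# Keep only the rows that do not meet the outlier condition
df.loc[~(df['C'] > twice_std)]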
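
The rule above compares the raw values in column C against twice the standard deviation. A commonly used variant, shown here only as an alternative sketch and not as the article's method, measures how far each value lies from the column mean. The seeded normal sample and the names `center`, `spread`, and `is_outlier` are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
sample = pd.DataFrame(
    rng.normal(loc=10.0, scale=2.0, size=(1000, 3)),
    columns=["A", "B", "C"],
)

center = sample["C"].mean()
spread = sample["C"].std()

# Flag values more than two standard deviations away from the mean.
is_outlier = (sample["C"] - center).abs() > 2 * spread
kept = sample.loc[~is_outlier]

print(int(is_outlier.sum()), len(kept))
```

For roughly normal data this removes only the tails on both sides, whereas a threshold on the raw values cuts off one side and depends on where the data are centered.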
