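How to test whether a sample follows a normal distribution in Python
Before running a t-test or an F-test, we usually require the sample to be approximately normally distributed. This article covers two ways to check that: visualization and statistical tests.
Visualization
We can plot the sample and look at whether its estimated probability density resembles a normal curve. This gives a first, informal impression of whether the sample is normally distributed.
The code is as follows: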
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Generate a simulated data set with pandas and numpy
s = pd.DataFrame(np.random.randn(500), columns=['value'])
print(s.shape)  # (500, 1)

# Create the figure
fig = plt.figure(figsize=(10, 6))

# Subplot 1: scatter plot of the values against their index
ax1 = fig.add_subplot(2, 1, 1)
ax1.scatter(s.index, s['value'])
ax1.grid()  # add grid lines

# Subplot 2: histogram with a density curve on a secondary y-axis
ax2 = fig.add_subplot(2, 1, 2)
s.hist(bins=30, alpha=0.5, ax=ax2)
s.plot(kind='kde', secondary_y=True, ax=ax2)  # the kde plot requires scipy
ax2.grid()  # add grid lines

# Show the figure
plt.show()
In the resulting figure, the histogram and the density curve suggest that the generated data are approximately normally distributed.
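A quantile-quantile (Q-Q) comparison is another common visual check, although it is not part of the walkthrough above. The sketch below assumes scipy.stats.probplot is acceptable for this purpose; the helper name qq_fit and the seeded samples are hypothetical and exist only for illustration. It compares the ordered sample values with theoretical normal quantiles and returns the least-squares fit, where a correlation r close to 1 means the points lie near a straight line.

```python
import numpy as np
from scipy import stats


def qq_fit(sample):
    """Return (slope, intercept, r) of the normal Q-Q line for `sample`.

    `qq_fit` is a hypothetical helper name used only for this illustration.
    """
    # probplot returns ((theoretical_quantiles, ordered_values),
    #                   (slope, intercept, r))
    (_, _), (slope, intercept, r) = stats.probplot(np.asarray(sample), dist="norm")
    return slope, intercept, r


rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
normal_sample = rng.normal(size=500)
skewed_sample = rng.exponential(size=500)

print("normal sample r:", qq_fit(normal_sample)[2])
print("skewed sample r:", qq_fit(skewed_sample)[2])
```

To draw the Q-Q plot instead of only computing the fit, probplot also accepts a plot argument, for example plot=ax with a matplotlib Axes.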
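For more convincing evidence we can use statistical tests. The functions used here come from scipy.stats.
Statistical tests
1) kstest
scipy.stats.kstest can be used to test whether a sample follows a normal, exponential, gamma, or other continuous distribution. An excerpt of its source (signature and docstring) is: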
def kstest(rvs, cdf, args=(), N=20, alternative='two-sided', mode='approx'):
    """
    Perform the Kolmogorov-Smirnov test for goodness of fit.

    This performs a test of the distribution F(x) of an observed
    random variable against a given distribution G(x). Under the null
    hypothesis the two distributions are identical, F(x)=G(x). The
    alternative hypothesis can be either 'two-sided' (default), 'less'
    or 'greater'. The KS test is only valid for continuous distributions.

    Parameters
    ----------
    rvs : str, array or callable
        If a string, it should be the name of a distribution in `scipy.stats`.
        If an array, it should be a 1-D array of observations of random
        variables.
        If a callable, it should be a function to generate random variables;
        it is required to have a keyword argument `size`.
    cdf : str or callable
        If a string, it should be the name of a distribution in `scipy.stats`.
        If `rvs` is a string then `cdf` can be False or the same as `rvs`.
        If a callable, that callable is used to calculate the cdf.
    args : tuple, sequence, optional
        Distribution parameters, used if `rvs` or `cdf` are strings.
    N : int, optional
        Sample size if `rvs` is string or callable. Default is 20.
    alternative : {'two-sided', 'less','greater'}, optional
        Defines the alternative hypothesis (see explanation above).
        Default is 'two-sided'.
    mode : 'approx' (default) or 'asymp', optional
        Defines the distribution used for calculating the p-value.

          - 'approx' : use approximation to exact distribution of test statistic
          - 'asymp' : use asymptotic distribution of test statistic

    Returns
    -------
    statistic : float
        KS test statistic, either D, D+ or D-.
    pvalue : float
        One-tailed or two-tailed p-value.
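A minimal sketch of calling kstest against a fully specified normal distribution. The parameter values (loc 5.0, scale 2.0), the seeded samples, and the variable names are assumptions made for illustration. One caveat worth keeping in mind: when the mean and standard deviation are estimated from the same sample that is being tested, the reported p-value tends to be too large, so that usage is better treated as a rough screen.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # fixed seed so the run is repeatable
loc, scale = 5.0, 2.0           # assumed known parameters (illustrative)
normal_sample = rng.normal(loc=loc, scale=scale, size=500)
uniform_sample = rng.uniform(0.0, 1.0, size=500)

# args supplies the distribution parameters (loc, scale) for 'norm'
stat_n, p_n = stats.kstest(normal_sample, "norm", args=(loc, scale))

# With no args, the sample is compared against the standard normal N(0, 1)
stat_u, p_u = stats.kstest(uniform_sample, "norm")

print(f"normal  sample: D={stat_n:.4f}, p={p_n:.4f}")
print(f"uniform sample: D={stat_u:.4f}, p={p_u:.4g}")
```

The uniform sample is included only as a contrast: its values lie in [0, 1], so its empirical distribution is far from the standard normal and the test rejects it.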
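2) normaltest
scipy.stats.normaltest is designed specifically to test whether a sample follows a normal distribution. An excerpt of its source (signature and docstring) is: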
def normaltest(a, axis=0, nan_policy='propagate'):
    """
    Test whether a sample differs from a normal distribution.

    This function tests the null hypothesis that a sample comes
    from a normal distribution. It is based on D'Agostino and
    Pearson's [1]_, [2]_ test that combines skew and kurtosis to
    produce an omnibus test of normality.

    Parameters
    ----------
    a : array_like
        The array containing the sample to be tested.
    axis : int or None, optional
        Axis along which to compute test. Default is 0. If None,
        compute over the whole array `a`.
    nan_policy : {'propagate', 'raise', 'omit'}, optional
        Defines how to handle when input contains nan. 'propagate' returns nan,
        'raise' throws an error, 'omit' performs the calculations ignoring nan
        values. Default is 'propagate'.

    Returns
    -------
    statistic : float or array
        ``s^2 + k^2``, where ``s`` is the z-score returned by `skewtest` and
        ``k`` is the z-score returned by `kurtosistest`.
    pvalue : float or array
        A 2-sided chi squared probability for the hypothesis test.
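The docstring says the statistic is s^2 + k^2, built from the z-scores of skewtest and kurtosistest. The sketch below rebuilds it by hand to make that relationship concrete. The seeded sample and the variable names are assumptions for illustration, and the p-value line assumes the chi-squared distribution with 2 degrees of freedom that the docstring refers to.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # fixed seed so the run is repeatable
sample = rng.normal(size=500)

# Omnibus test as provided by scipy
stat, p = stats.normaltest(sample)

# Rebuild the statistic from its two components
z_skew, _ = stats.skewtest(sample)
z_kurt, _ = stats.kurtosistest(sample)
manual_stat = z_skew**2 + z_kurt**2

# Upper-tail probability of a chi-squared variable with 2 degrees of freedom
manual_p = stats.chi2.sf(manual_stat, df=2)

print("normaltest:", stat, p)
print("by hand:   ", manual_stat, manual_p)
```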
3) shapiro
scipy.stats.shapiro is also dedicated to normality testing. An excerpt of its source (signature and docstring) is:
def shapiro(x):
    """
    Perform the Shapiro-Wilk test for normality.

    The Shapiro-Wilk test tests the null hypothesis that the
    data was drawn from a normal distribution.

    Parameters
    ----------
    x : array_like
        Array of sample data.

    Returns
    -------
    W : float
        The test statistic.
    p-value : float
        The p-value for the hypothesis test.
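A short sketch of calling shapiro on one normal and one clearly non-normal sample. The seeded samples and variable names are assumptions for illustration. The result is unpacked as a pair, which matches the (W, p-value) return described in the docstring.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)  # fixed seed so the run is repeatable
normal_sample = rng.normal(size=200)
skewed_sample = rng.exponential(size=200)

# Unpack the (W, p-value) pair returned by shapiro
w_normal, p_normal = stats.shapiro(normal_sample)
w_skewed, p_skewed = stats.shapiro(skewed_sample)

print(f"normal sample: W={w_normal:.4f}, p={p_normal:.4f}")
print(f"skewed sample: W={w_skewed:.4f}, p={p_skewed:.4g}")
```

W is at most 1, and values close to 1 indicate agreement with a normal distribution, so the skewed sample yields a visibly smaller W.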
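Next we generate simulated data in the same way as in the first part and apply all three functions to it. We treat p > 0.05 as no evidence against normality. The code is as follows: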
import numpy as np
import pandas as pd
from scipy import stats

# Generate a simulated data set with pandas and numpy
s = pd.DataFrame(np.random.randn(500), columns=['value'])
print(s.shape)  # (500, 1)

# Sample mean
u = s['value'].mean()
# Sample standard deviation
std = s['value'].std()

print('scipy.stats.kstest result:----------------------------------------------------')
print(stats.kstest(s['value'], 'norm', (u, std)))
print('scipy.stats.normaltest result:----------------------------------------------------')
print(stats.normaltest(s['value']))
print('scipy.stats.shapiro result:----------------------------------------------------')
print(stats.shapiro(s['value']))
The test results are as follows (the exact numbers vary between runs because the data are random):
scipy.stats.kstest result:----------------------------------------------------
KstestResult(statistic=0.01596290473494305, pvalue=0.9995623150120069)
scipy.stats.normaltest result:----------------------------------------------------
NormaltestResult(statistic=0.5561685865675511, pvalue=0.7572329891688141)
scipy.stats.shapiro result:----------------------------------------------------
(0.9985257983207703, 0.9540967345237732)
All three tests return p-values greater than 0.05, so we fail to reject the null hypothesis: the simulated data are consistent with a normal distribution.
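The decision rule above can be wrapped in a small helper. This is a sketch under stated assumptions: the function name normality_report, the alpha default of 0.05, and the seeded exponential sample are all hypothetical, and the kstest call estimates the mean and standard deviation from the sample in the same way as the example code. A p-value above alpha only means the test found no evidence against normality; it does not prove the data are normal.

```python
import numpy as np
from scipy import stats


def normality_report(sample, alpha=0.05):
    """Run the three tests and flag which ones reject normality at `alpha`.

    `normality_report` is a hypothetical helper used only for illustration.
    """
    x = np.asarray(sample, dtype=float)
    # Parameters estimated from the sample, mirroring the example code
    mean, std = x.mean(), x.std(ddof=1)
    p_values = {
        "kstest": stats.kstest(x, "norm", args=(mean, std))[1],
        "normaltest": stats.normaltest(x)[1],
        "shapiro": stats.shapiro(x)[1],
    }
    return {
        name: {"pvalue": float(p), "reject_normality": bool(p < alpha)}
        for name, p in p_values.items()
    }


# A clearly non-normal sample, used as a contrast to the simulated normal data
skewed = np.random.default_rng(4).exponential(size=500)
report = normality_report(skewed)
for name, result in report.items():
    print(name, result)
```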