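How to Test Whether a Sample Follows a Normal Distribution in Python
Before running a t-test or an F-test, we usually require the sample to be approximately normally distributed. This article introduces two ways to check whether a sample follows a normal distribution: visualization and statistical tests.
Visualization
As a first rough check, we can plot the sample and see whether its estimated probability density looks like that of a normal distribution.
The code is as follows: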
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Generate simulated data with pandas and numpy
s = pd.DataFrame(np.random.randn(500), columns=['value'])
print(s.shape)  # (500, 1)

# Create the figure
fig = plt.figure(figsize=(10, 6))

# Subplot 1: scatter plot of the sample values
ax1 = fig.add_subplot(2, 1, 1)
ax1.scatter(s.index, s.values)
plt.grid()  # add grid lines

# Subplot 2: histogram with a density curve
ax2 = fig.add_subplot(2, 1, 2)
s.hist(bins=30, alpha=0.5, ax=ax2)
s.plot(kind='kde', secondary_y=True, ax=ax2)  # density curve on a secondary y-axis
plt.grid()  # add grid lines

# Show the figure
plt.show()
The resulting plot is shown below:

From the plot, the generated data appears to be approximately normally distributed.
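A histogram can be hard to judge by eye, so a normal probability (Q-Q) plot is a common complementary check. The sketch below is not part of the original code. It assumes `scipy.stats.probplot`, and the helper name `qq_correlation` and the exponential comparison sample are illustrative choices. It reports only the correlation of the Q-Q points, where values close to 1 suggest the points lie near a straight line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)               # fixed seed, for reproducibility only
normal_sample = rng.standard_normal(500)     # same kind of data as above
skewed_sample = rng.exponential(size=500)    # deliberately non-normal, for contrast


def qq_correlation(x):
    """Correlation between the ordered sample and the normal quantiles.

    Hypothetical helper: it wraps scipy.stats.probplot, which returns
    ((theoretical_quantiles, ordered_values), (slope, intercept, r)).
    """
    (_, _), (_, _, r) = stats.probplot(x, dist="norm")
    return r


r_normal = qq_correlation(normal_sample)
r_skewed = qq_correlation(skewed_sample)
print(f"normal sample: r = {r_normal:.4f}")
print(f"skewed sample: r = {r_skewed:.4f}")
```

To actually draw the Q-Q plot, `probplot` accepts a `plot` argument, for example `stats.probplot(x, dist="norm", plot=plt)` with `matplotlib.pyplot` imported as `plt`.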
For a more convincing result, we can use statistical tests. The ones used here are functions in scipy.stats.
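Statistical tests
1) kstest
The scipy.stats.kstest function can test whether a sample follows a normal, exponential, gamma, or other distribution. Its signature and docstring are: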
def kstest(rvs, cdf, args=(), N=20, alternative='two-sided', mode='approx'):
    """
    Perform the Kolmogorov-Smirnov test for goodness of fit.

    This performs a test of the distribution F(x) of an observed
    random variable against a given distribution G(x). Under the null
    hypothesis the two distributions are identical, F(x)=G(x). The
    alternative hypothesis can be either 'two-sided' (default), 'less'
    or 'greater'. The KS test is only valid for continuous distributions.

    Parameters
    ----------
    rvs : str, array or callable
        If a string, it should be the name of a distribution in `scipy.stats`.
        If an array, it should be a 1-D array of observations of random
        variables.
        If a callable, it should be a function to generate random variables;
        it is required to have a keyword argument `size`.
    cdf : str or callable
        If a string, it should be the name of a distribution in `scipy.stats`.
        If `rvs` is a string then `cdf` can be False or the same as `rvs`.
        If a callable, that callable is used to calculate the cdf.
    args : tuple, sequence, optional
        Distribution parameters, used if `rvs` or `cdf` are strings.
    N : int, optional
        Sample size if `rvs` is string or callable. Default is 20.
    alternative : {'two-sided', 'less','greater'}, optional
        Defines the alternative hypothesis (see explanation above).
        Default is 'two-sided'.
    mode : 'approx' (default) or 'asymp', optional
        Defines the distribution used for calculating the p-value.
        - 'approx' : use approximation to exact distribution of test statistic
        - 'asymp' : use asymptotic distribution of test statistic

    Returns
    -------
    statistic : float
        KS test statistic, either D, D+ or D-.
    pvalue : float
        One-tailed or two-tailed p-value.
    """
2) normaltest
The scipy.stats.normaltest function is designed specifically to test whether a sample follows a normal distribution. Its signature and docstring are:
def normaltest(a, axis=0, nan_policy='propagate'):
    """
    Test whether a sample differs from a normal distribution.

    This function tests the null hypothesis that a sample comes
    from a normal distribution. It is based on D'Agostino and
    Pearson's [1]_, [2]_ test that combines skew and kurtosis to
    produce an omnibus test of normality.

    Parameters
    ----------
    a : array_like
        The array containing the sample to be tested.
    axis : int or None, optional
        Axis along which to compute test. Default is 0. If None,
        compute over the whole array `a`.
    nan_policy : {'propagate', 'raise', 'omit'}, optional
        Defines how to handle when input contains nan. 'propagate' returns nan,
        'raise' throws an error, 'omit' performs the calculations ignoring nan
        values. Default is 'propagate'.

    Returns
    -------
    statistic : float or array
        ``s^2 + k^2``, where ``s`` is the z-score returned by `skewtest` and
        ``k`` is the z-score returned by `kurtosistest`.
    pvalue : float or array
        A 2-sided chi squared probability for the hypothesis test.
    """
3) shapiro
The scipy.stats.shapiro function is also dedicated to normality testing. Its signature and docstring are:
def shapiro(x):
    """
    Perform the Shapiro-Wilk test for normality.

    The Shapiro-Wilk test tests the null hypothesis that the
    data was drawn from a normal distribution.

    Parameters
    ----------
    x : array_like
        Array of sample data.

    Returns
    -------
    W : float
        The test statistic.
    p-value : float
        The p-value for the hypothesis test.
    """
Next, we generate simulated data in the same way as in the first part and apply these three test functions to check whether the sample follows a normal distribution, using p > 0.05 as the criterion. The code is as follows:
import numpy as np
import pandas as pd
from scipy import stats  # missing in the original listing; required for stats.*

# Generate simulated data with pandas and numpy
s = pd.DataFrame(np.random.randn(500), columns=['value'])
print(s.shape)  # (500, 1)

# Sample mean
u = s['value'].mean()
# Sample standard deviation
std = s['value'].std()

print('scipy.stats.kstest result:----------------------------------------------------')
print(stats.kstest(s['value'], 'norm', (u, std)))
print('scipy.stats.normaltest result:----------------------------------------------------')
print(stats.normaltest(s['value']))
print('scipy.stats.shapiro result:----------------------------------------------------')
print(stats.shapiro(s['value']))
The test results are as follows:
scipy.stats.kstest result:----------------------------------------------------
KstestResult(statistic=0.01596290473494305, pvalue=0.9995623150120069)
scipy.stats.normaltest result:----------------------------------------------------
NormaltestResult(statistic=0.5561685865675511, pvalue=0.7572329891688141)
scipy.stats.shapiro result:----------------------------------------------------
(0.9985257983207703, 0.9540967345237732)
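All three tests give p-values greater than 0.05, so we cannot reject the null hypothesis; that is, the generated data is consistent with a normal distribution. Note that the kstest call uses a mean and standard deviation estimated from the same sample being tested, which tends to inflate its p-value, so that particular result should be read with some caution.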
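Because the kstest call estimates the mean and standard deviation from the very sample it tests, it is worth checking how often each test rejects data that really is normal. The following is a minimal sketch under assumed settings (200 trials, samples of size 100, a fixed seed). The function names `normality_pvalues` and `rejection_rates` are hypothetical and not part of scipy.

```python
import numpy as np
from scipy import stats


def normality_pvalues(x):
    """Return the p-values of the three tests for a 1-D sample (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    mean, std = x.mean(), x.std(ddof=1)
    return {
        # Parameters are estimated from x itself, as in the code above.
        "kstest": stats.kstest(x, "norm", args=(mean, std))[1],
        "normaltest": stats.normaltest(x)[1],
        "shapiro": stats.shapiro(x)[1],
    }


def rejection_rates(n_trials=200, n=100, alpha=0.05, seed=0):
    """Fraction of truly normal samples that each test rejects at level alpha."""
    rng = np.random.default_rng(seed)  # assumed seed, for reproducibility only
    counts = {"kstest": 0, "normaltest": 0, "shapiro": 0}
    for _ in range(n_trials):
        pvals = normality_pvalues(rng.standard_normal(n))
        for name, p in pvals.items():
            counts[name] += int(p < alpha)
    return {name: c / n_trials for name, c in counts.items()}


rates = rejection_rates()
for name, rate in rates.items():
    print(f"{name}: rejected {rate:.1%} of normal samples")
```

A well-calibrated test should reject roughly 5% of truly normal samples at alpha = 0.05. In this setup kstest typically rejects far fewer, which suggests its p-values are overstated when the distribution parameters are estimated from the data.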

