Conducting a Hypothesis Test

  • Let's say a company claims that children who consume its protein product generally grow taller than those who do not.
  • To establish the claim, the company needs to provide evidence. For example, if kids grow about 6 inches per year on average, then the company must show that kids who consume its product grow more than 6 inches per year on average, and that the difference is statistically significant.
  • This kind of evidence has to be statistically verified, and the procedure is usually called a hypothesis test. In this specific case, we assume that protein drinks have no effect on height growth unless the data provide statistically sound evidence that the drink does have an effect.
  • A hypothesis test uses the collected data either to establish a new belief or theory (alternate hypothesis) or to retain the existing belief or theory (null hypothesis).
  • In this scenario:

    • Null Hypothesis: Height growth in children who consume the product == Height growth in children not consuming the product
    • Alternate Hypothesis: Height growth in children who consume the product > Height growth in children not consuming the product
  • Let's assume that an established study says that kids of age 5 or 6 typically grow at an average of 6 inches per year with a standard deviation of 2 inches.
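The two hypotheses can also be written in symbols. The notation here is an assumption introduced for illustration and does not appear in the original: $\mu_p$ stands for the mean yearly height growth of kids who consume the product, and $\mu_0$ for the established mean of 6 inches per year.

```latex
H_0:\ \mu_p = \mu_0
\qquad \text{versus} \qquad
H_1:\ \mu_p > \mu_0,
\qquad \mu_0 = 6 \text{ inches per year}
```

Because the alternate hypothesis only looks for growth above $\mu_0$, this is a one-sided (right-tailed) test.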
In [1]:
import numpy as np

Kids not consuming the product

We have height growth data measured on 100 kids who do not consume the product.

In [2]:
#kids_growth = np.round( np.random.normal( 6, 2, 100 ), 2 )
In [3]:
kids_growth = [  8.1 ,   6.82,   6.46,   5.29,   6.63,   4.42,   4.64,   7.94,
       5.69,   5.94,   6.91,   3.81,   2.96,   8.27,   1.74,   6.17,
       8.98,   7.66,   9.1 ,   3.8 ,   7.26,   6.87,   4.36,   3.21,
       7.77,   6.72,   7.64,   7.19,   8.47,   5.58,   3.07,   8.58,
       7.64,   6.28,   4.17,   1.94,   4.95,   3.65,   5.23,   5.62,
       5.22,   7.91,   7.98,   8.02,   2.94,  10.46,   4.55,   6.93,
       6.88,   7.15,   0.27,   4.33,   7.3 ,   5.35,   7.83,   9.07,
       8.39,   3.69,   3.66,   8.33,   5.92,   3.79,   6.85,   5.33,
       3.4 ,   7.7 ,   8.28,   3.14,   2.21,   5.55,   4.64,   5.83,
       6.68,   6.39,   5.76,   9.73,   6.81,   5.45,   4.92,   5.14,
       8.71,   5.66,   7.52,   5.65,   5.96,   6.91,   7.05,   2.44,
       8.27,   7.1 ,   6.63,   6.69,   4.84,   6.52,   5.53,   5.49,
       7.08,   3.94,   2.88,   4.76]

Mean and standard deviation

In [4]:
np.mean( kids_growth )
Out[4]:
5.9694000000000003
In [5]:
np.std( kids_growth )
Out[5]:
1.9575085287170528

Average height growth is about 5.97 inches, with a standard deviation of about 1.96 inches.
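One detail worth knowing about the call above: np.std divides by n by default (ddof=0), which treats the data as the whole population. If the 100 kids are regarded as a sample from a larger population, the sample standard deviation (ddof=1) is the usual choice. A minimal sketch of the difference, using only the first eight values of kids_growth to keep it short:

```python
import numpy as np

# First eight values of kids_growth, used here only to keep the example short
growth = np.array([8.1, 6.82, 6.46, 5.29, 6.63, 4.42, 4.64, 7.94])

population_std = np.std(growth)        # default ddof=0: divide by n
sample_std = np.std(growth, ddof=1)    # ddof=1: divide by n - 1

print("Population std (ddof=0):", round(population_std, 3))
print("Sample std (ddof=1):    ", round(sample_std, 3))
```

With 100 observations the two values differ by only about half a percent, so the conclusions in this notebook would likely not change either way.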

In [6]:
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
In [7]:
sn.distplot( kids_growth, hist=False )
plt.title( "Distplot of height growth for kids not consuming the product")
/Users/manaranjan/anaconda/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
Out[7]:
<matplotlib.text.Text at 0x118fa7ba8>

95% interval for individual height growth

In [8]:
from scipy import stats
In [9]:
stats.norm.interval( 0.95, np.mean( kids_growth ), np.std( kids_growth ))
Out[9]:
(2.1327537842845867, 9.8060462157154138)

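Observation: Assuming height growth is normally distributed, about 95% of kids' height growth falls between 2.13 and 9.81 inches. Note that this is the range of individual values, not a confidence interval for the mean.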
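The interval above is simply the mean plus or minus about 1.96 standard deviations. A small sketch that reproduces it by hand, using the mean and standard deviation reported in Out[4] and Out[5] (rounded to four decimals here, which is an assumption of this example):

```python
from scipy import stats

mean = 5.9694   # from Out[4]
std = 1.9575    # from Out[5], rounded

# z value that leaves 2.5% in each tail of a standard normal distribution
z = stats.norm.ppf(0.975)

lower = mean - z * std
upper = mean + z * std

print("z:", round(z, 2))
print("Interval:", round(lower, 2), "to", round(upper, 2))
```

This should agree with stats.norm.interval( 0.95, mean, std ) up to rounding.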

Observing average height growth in a group

  • If the company wants to provide evidence, it cannot take one child and show that the child's height growth is above average. Typically, a group of kids is chosen at random to consume the product, and their height growth is observed. Of course, not every child will grow by the same amount; there will be variance in growth. So it needs to be established that the average height growth in this group is higher than what is typically observed in groups of kids who do not consume the product. Only then can it be taken as evidence.
  • So, to judge the claim, we must first know how much the average height growth varies from group to group in the sample above. To do that, we will randomly select sets of kids (let's say 30 at a time) and measure the average growth in height and its variance.

Sampling Distribution

  • A sampling distribution is the distribution of means from multiple samples drawn from the same population. If we repeatedly take a random set of children and compute the average growth in height, what will the distribution look like?
In [10]:
from numpy import random
In [11]:
kids_avg_growth = []

for num in range( 0, 100 ):
  kids_avg_growth.append( np.mean( random.choice( kids_growth, 30 ) ) )

Mean and standard deviation of sampling distribution

In [28]:
print( "Average growth in height: ", round( np.mean( kids_avg_growth ), 2 ) )
Average growth in height:  5.98
In [30]:
print( "Standard Deviation is: ", round( np.std( kids_avg_growth ), 2 ) )
Standard Deviation is:  0.35
In [14]:
sn.distplot( kids_avg_growth, hist=False )
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x116bb8ba8>
  • So, if we randomly select groups of 30 kids and observe their growth in height, the average height growth will be around 6 inches with a standard deviation of about 0.35 inches.
  • Now we will take various scenarios and discuss the claim.
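The spread of the sampling distribution can be checked against the standard error formula: population standard deviation divided by the square root of the sample size. With the values above, that is roughly 1.96 / sqrt(30), or about 0.36, close to the simulated 0.35. The sketch below illustrates this under stated assumptions: it uses a synthetic population generated like the commented-out line in In [2] instead of the actual kids_growth list, a fixed seed, and more resamples than the notebook's 100 for a steadier estimate.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Synthetic stand-in for kids_growth: 100 values with mean 6 and std 2
population = np.round(rng.normal(6, 2, 100), 2)

sample_size = 30
n_resamples = 2000  # the notebook uses 100; more resamples reduce noise

# Means of repeated random samples, drawn with replacement
sample_means = [
    np.mean(rng.choice(population, sample_size)) for _ in range(n_resamples)
]

simulated_se = np.std(sample_means)
theoretical_se = np.std(population) / np.sqrt(sample_size)

print("Simulated standard error:  ", round(simulated_se, 3))
print("Theoretical standard error:", round(theoretical_se, 3))
```

The two numbers should be close but not identical, since the simulated value depends on the random draws.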

Scenario 1: Kids who consume XYZ's protein drink

  • Now let's say the height growth observed in a controlled group of children who consumed the product over a year is as follows. Can the company claim that the drink has an effect on height growth?
In [15]:
xyz_kids_growth = [  8.88,   3.92,   5.79,   5.93,   6.15,   6.03,   3.77,   3.87,
       7.15,   8.51,   3.97,   6.3 ,  10.45,   6.74,   5.98,   3.76,
       8.36,   9.05,   4.49,   5.43,   8.41,   6.44,  10.13,   7.36,
       3.87,   7.85,   7.03,   5.15,   5.08,   2.56]

Comparing the height growth between two different groups

In [16]:
sn.distplot( xyz_kids_growth, hist = False, color = 'b' )
sn.distplot( kids_growth, hist = False, color = 'g' )
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x1195fbba8>
  • It is observed that some children have higher growth compared to the group of children not consuming the product. But is that sufficient to establish the claim? Or are we observing this purely because of chance and the particular sample we selected?

Are we observing a new behavior or similar behavior as before?

Average growth in height

In [27]:
print( "Average growth in height: ", round( np.mean( xyz_kids_growth ),2 ) )
Average growth in height:  6.28

What is the probability that we have already observed this behavior before?

In [24]:
p_value = 1 - stats.norm.cdf( np.mean( xyz_kids_growth ),
             np.mean( kids_avg_growth ),
             np.std( kids_avg_growth ))
In [26]:
print( "p-value is: ", round( p_value, 2 ) )
p-value is:  0.2

Note:

  • By convention, an outcome that has less than 5% probability can be considered as observing new behavior, i.e. not part of the existing population we are comparing against, and can be considered part of a new population. But what we observed here is not a rare event: about 1 time out of 5 we would observe this kind of average growth in a group of children who do not consume the product. So there is no evidence that the product has an effect.
  • The cut-off probability of 5% is called the $\alpha$ (alpha) value. The actual probability value is called the p-value. If the p-value is less than $\alpha$, the null hypothesis is rejected in favor of the alternative hypothesis; otherwise the null hypothesis is retained.

Conclusion: Yes, we have observed this behavior before: there is about a 20% probability (the p-value) that a group of children who do not consume this product also grows at this average rate.
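The p-value for Scenario 1 can also be worked out step by step through a z-score, which makes the comparison with $\alpha$ explicit. This is a minimal sketch that plugs in the rounded values printed earlier (6.28, 5.98 and 0.35); the variable names are chosen here for illustration.

```python
from scipy import stats

sample_mean = 6.28   # average growth of the kids consuming the product
null_mean = 5.98     # mean of the sampling distribution
std_error = 0.35     # standard deviation of the sampling distribution
alpha = 0.05         # conventional cut-off

# How many standard errors the observed mean lies above the expected mean
z = (sample_mean - null_mean) / std_error

# Right-tail probability; sf(z) is the same as 1 - cdf(z)
p_value = stats.norm.sf(z)

reject_null = p_value < alpha

print("z-score:", round(z, 2))
print("p-value:", round(p_value, 2))
print("Reject null hypothesis:", reject_null)
```

Since the p-value of about 0.2 is well above 0.05, the null hypothesis is retained in this scenario.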

Scenario 2: Kids who consume XYZ's protein drink

  • Let's say the company observed the following height growth in the children. Let's compare.
In [19]:
xyz_kids_growth_1 = [  5.23,   7.67,   8.61,  10.19,   7.25,   8.45,   6.76,   7.14,
       5.58,   4.08,   5.95,   3.5 ,   8.46,   4.32,   5.5 ,  10.03,
       6.89,   7.55,   9.38,   7.61,   8.38,   9.24,   8.3 ,   0.26,
       9.64,   5.51,   1.1 ,   5.89,   9.63,   7.74]
In [20]:
sn.distplot( xyz_kids_growth_1, hist = False, color = 'b' )
sn.distplot( kids_growth, hist = False, color = 'g' )
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x119732f98>

Are we observing a new behavior or similar behavior as before?

In [21]:
np.mean( xyz_kids_growth_1 )
Out[21]:
6.8613333333333317

What is the probability that we have already observed this behavior before?

In [22]:
1 - stats.norm.cdf( np.mean( xyz_kids_growth_1 ),
             np.mean( kids_avg_growth ),
             np.std( kids_avg_growth ))
Out[22]:
0.0057444007480100812

Conclusion: Yes, we have observed this behavior before, but there is only about a 0.6% probability of children growing at this average rate in normal circumstances, i.e. when the children do not consume any specific product.

  • In this case, we may conclude that what is observed is very rare in children who do not consume this product and can be considered a new behavior. And it can be assumed that the new behavior is due to consuming the product, because the experiment was done on a controlled group.

Using stats APIs

  • The test can also be done with the stats API by directly comparing the height growth samples observed in the two groups, as follows.
In [23]:
stats.ttest_ind( xyz_kids_growth_1, kids_growth )
Out[23]:
Ttest_indResult(statistic=2.0504490204801282, pvalue=0.04236194561087387)
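  • As the p-value (about 0.042) is less than 5%, it can be concluded that the drink has an effect. Note that ttest_ind performs a two-sided test by default, so this p-value is not directly comparable to the one-sided probability computed earlier. This uses Scenario 2 as explained above.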
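Since the alternate hypothesis here is one-sided (growth is greater), the matching one-sided p-value can be recovered from the reported t statistic. The sketch below assumes the default pooled-variance t-test, whose degrees of freedom are n1 + n2 - 2 = 30 + 100 - 2 = 128; the statistic and two-sided p-value are copied from Out[23].

```python
from scipy import stats

t_stat = 2.0504490204801282          # statistic from Out[23]
two_sided_p = 0.04236194561087387    # pvalue from Out[23]

# Degrees of freedom for the pooled (equal variance) two-sample t-test
df = 30 + 100 - 2

# Right-tail probability of the t distribution: the one-sided p-value
one_sided_p = stats.t.sf(t_stat, df)

print("One-sided p-value:   ", round(one_sided_p, 4))
print("Half of two-sided p: ", round(two_sided_p / 2, 4))
```

The one-sided value is half the two-sided one because the t statistic is positive. Recent SciPy versions (1.6 and later) also accept alternative='greater' in stats.ttest_ind, which should give the one-sided result directly.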