## Conducting a Hypothesis Test

- Let's say a company claims that children who consume its protein product generally grow taller than children who do not.
- To establish this claim, the company needs to provide evidence. For example, if kids grow about 6 inches per year on average, the company must show that kids who consume its product grow more than 6 inches per year on average, and that the difference is statistically significant.

- This kind of evidence has to be statistically verified, and the procedure is usually called a **Hypothesis Test**. In this specific case, we assume that protein drinks have no effect on height growth unless the data provide sufficient, statistically verified evidence that the drink does have an effect.
- A **Hypothesis Test** uses the collected data either to establish a new belief or theory *(alternate hypothesis)* or to continue to accept the existing belief or theory *(null hypothesis)*.
- In this scenario:
  - **Null Hypothesis**: *Height growth in children who consume the product == Height growth in children not consuming the product*
  - **Alternate Hypothesis**: *Height growth in children who consume the product > Height growth in children not consuming the product*
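
Stated formally, the two hypotheses might be written as follows. The symbols $\mu_{\text{product}}$ and $\mu_{\text{no product}}$ are introduced here only for illustration and denote the mean yearly height growth of kids who do and do not consume the product:

$$
H_0: \mu_{\text{product}} = \mu_{\text{no product}} \qquad H_1: \mu_{\text{product}} > \mu_{\text{no product}}
$$

Because the alternate hypothesis points in one direction only, this is a one-sided (right-tailed) test.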

In [1]:

```
import numpy as np
```

## Kids not consuming the product

We have height growth data (in inches per year) measured on 100 kids who do not consume the product.

In [2]:

```
#kids_growth = np.round( np.random.normal( 6, 2, 100 ), 2 )
```

In [3]:

```
kids_growth = [ 8.1 , 6.82, 6.46, 5.29, 6.63, 4.42, 4.64, 7.94,
5.69, 5.94, 6.91, 3.81, 2.96, 8.27, 1.74, 6.17,
8.98, 7.66, 9.1 , 3.8 , 7.26, 6.87, 4.36, 3.21,
7.77, 6.72, 7.64, 7.19, 8.47, 5.58, 3.07, 8.58,
7.64, 6.28, 4.17, 1.94, 4.95, 3.65, 5.23, 5.62,
5.22, 7.91, 7.98, 8.02, 2.94, 10.46, 4.55, 6.93,
6.88, 7.15, 0.27, 4.33, 7.3 , 5.35, 7.83, 9.07,
8.39, 3.69, 3.66, 8.33, 5.92, 3.79, 6.85, 5.33,
3.4 , 7.7 , 8.28, 3.14, 2.21, 5.55, 4.64, 5.83,
6.68, 6.39, 5.76, 9.73, 6.81, 5.45, 4.92, 5.14,
8.71, 5.66, 7.52, 5.65, 5.96, 6.91, 7.05, 2.44,
8.27, 7.1 , 6.63, 6.69, 4.84, 6.52, 5.53, 5.49,
7.08, 3.94, 2.88, 4.76]
```

### Mean and standard deviation

In [4]:

```
np.mean( kids_growth )
```

Out[4]:

In [5]:

```
np.std( kids_growth )
```

Out[5]:

#### Average height growth is about 5.97 inches, with a standard deviation of about 1.95 inches.

In [6]:

```
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
```

In [7]:

```
# kdeplot draws the same density curve as the deprecated distplot( ..., hist=False )
sn.kdeplot( kids_growth )
plt.title( "Density of height growth for kids not consuming the product" )
```

Out[7]:

### 95% confidence interval

In [8]:

```
from scipy import stats
```

In [9]:

```
stats.norm.interval( 0.95, np.mean( kids_growth ), np.std( kids_growth ))
```

Out[9]:

#### Observation: if growth is normally distributed, about 95% of kids' height growth falls between 2.13 and 9.80 inches.
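
The interval above can be reproduced by hand as mean ± z × standard deviation. The following is a minimal sketch that assumes the rounded mean (5.97) and standard deviation (1.95) quoted earlier, so its bounds may differ slightly from the notebook's own output. The variable names are illustrative.

```python
from scipy import stats

# Rounded summary values quoted in the text (assumed here, not recomputed)
mean_growth = 5.97
std_growth = 1.95

# Interval returned by SciPy for a normal distribution
low, high = stats.norm.interval(0.95, loc=mean_growth, scale=std_growth)

# Same interval by hand: 2.5% of the probability is left in each tail
z = stats.norm.ppf(0.975)   # roughly 1.96
manual_low = mean_growth - z * std_growth
manual_high = mean_growth + z * std_growth

print(round(low, 2), round(high, 2))
print(round(manual_low, 2), round(manual_high, 2))
```

Note that this range describes the growth of individual kids, not the uncertainty in the average growth.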

## Observing average height growth in a group

### Sampling Distribution

- A sampling distribution is the distribution of means from multiple samples drawn from the same population. If we randomly pick a group of children and compute their average growth in height, and repeat this many times, what will the distribution of those averages look like?

In [10]:

```
from numpy import random
```

In [11]:

```
kids_avg_growth = []
# Draw 100 random groups of 30 kids (with replacement) and store each group's mean
for num in range( 0, 100 ):
    kids_avg_growth.append( np.mean( random.choice( kids_growth, 30 ) ) )
```

### Mean and standard deviation of sampling distribution

In [28]:

```
print( "Average growth in height: ", round( np.mean( kids_avg_growth ), 2 ) )
```

In [30]:

```
print( "Standard Deviation is: ", round( np.std( kids_avg_growth ), 2 ) )
```
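
The standard deviation of the sampling distribution is expected to be close to the population standard deviation divided by the square root of the group size (the standard error). The sketch below checks this on a hypothetical, synthetic population rather than the notebook's data. The seed, the number of repetitions, and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed so the sketch is repeatable

# Hypothetical population: 100 yearly growth values around 6 inches
population = rng.normal(6, 2, 100)

sample_size = 30
sample_means = [
    np.mean(rng.choice(population, sample_size))   # sampled with replacement
    for _ in range(5000)
]

observed_spread = np.std(sample_means)
expected_spread = np.std(population) / np.sqrt(sample_size)

print("Std of sample means:", round(observed_spread, 3))
print("Population std / sqrt(n):", round(expected_spread, 3))
```

The two printed values should agree closely, which is why averages over groups of 30 vary far less than individual measurements.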

In [14]:

```
# Density of the group averages (kdeplot replaces the deprecated distplot)
sn.kdeplot( kids_avg_growth )
```

Out[14]:

- Now we will take various scenarios and discuss the claim.

## Scenario 1: Kids who consume XYZ's protein drink

In [15]:

```
xyz_kids_growth = [ 8.88, 3.92, 5.79, 5.93, 6.15, 6.03, 3.77, 3.87,
7.15, 8.51, 3.97, 6.3 , 10.45, 6.74, 5.98, 3.76,
8.36, 9.05, 4.49, 5.43, 8.41, 6.44, 10.13, 7.36,
3.87, 7.85, 7.03, 5.15, 5.08, 2.56]
```

### Comparing the height growth between two different groups

In [16]:

```
# Blue: kids consuming the product, green: kids not consuming it
sn.kdeplot( xyz_kids_growth, color = 'b' )
sn.kdeplot( kids_growth, color = 'g' )
```

Out[16]:

### Are we observing a new behavior or similar behavior as before?

### Average growth in height

In [27]:

```
print( "Average growth in height: ", round( np.mean( xyz_kids_growth ),2 ) )
```

### What is the probability that we have already observed this behavior before?

In [24]:

```
p_value = 1 - stats.norm.cdf( np.mean( xyz_kids_growth ),
np.mean( kids_avg_growth ),
np.std( kids_avg_growth ))
```

In [26]:

```
print( "p-value is: ", round( p_value, 2 ) )
```

#### Note:

- By convention, an observation that has less than a 5% probability of occurring in the existing population is treated as new behavior, i.e. it is considered to come from a new population rather than the existing population we are comparing against. What we observed here is not a rare event: about 1 time out of 4 we would see this kind of growth in children who do not consume the product. So the data do not show an effect of the product.
- The cut-off probability of 5% is called the $\alpha$ (alpha) value. The actual probability is called the p-value. If the p-value is less than $\alpha$, the null hypothesis is rejected in favor of the alternate hypothesis; otherwise the null hypothesis is retained.
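
The decision rule can be sketched as a small function. The example below is an illustration under stated assumptions, not the notebook's own calculation: it uses the rounded mean and standard deviation quoted earlier, an analytic standard error for groups of 30 instead of the simulated sampling distribution, and 6.28 as the mean of the 30 Scenario 1 values. The function name `decide` is hypothetical, and the p-value it prints will differ somewhat from the simulated one.

```python
import numpy as np
from scipy import stats

def decide(p_value, alpha=0.05):
    """Return the test decision for a given p-value and significance level."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "retain the null hypothesis"

# Assumed values for illustration: rounded figures quoted in the text
null_mean = 5.97                      # average growth without the product
standard_error = 1.95 / np.sqrt(30)   # spread of means for groups of 30
observed_mean = 6.28                  # mean of the 30 Scenario 1 values

# One-sided p-value: chance of a group mean at least this large under the null
p_value = stats.norm.sf(observed_mean, loc=null_mean, scale=standard_error)

print("p-value:", round(p_value, 2))
print("Decision:", decide(p_value))
```

`stats.norm.sf` is the survival function, equivalent to `1 - stats.norm.cdf` as used in the cell above.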

## Scenario 2: Kids who consume XYZ's protein drink

- Let's say the company observed the following height growth in the children. Let's compare.

In [19]:

```
xyz_kids_growth_1 = [ 5.23, 7.67, 8.61, 10.19, 7.25, 8.45, 6.76, 7.14,
5.58, 4.08, 5.95, 3.5 , 8.46, 4.32, 5.5 , 10.03,
6.89, 7.55, 9.38, 7.61, 8.38, 9.24, 8.3 , 0.26,
9.64, 5.51, 1.1 , 5.89, 9.63, 7.74]
```

In [20]:

```
# Blue: Scenario 2 kids consuming the product, green: kids not consuming it
sn.kdeplot( xyz_kids_growth_1, color = 'b' )
sn.kdeplot( kids_growth, color = 'g' )
```

Out[20]:

### Are we observing a new behavior or similar behavior as before?

In [21]:

```
np.mean( xyz_kids_growth_1 )
```

Out[21]:

### What is the probability that we have already observed this behavior before?

In [22]:

```
1 - stats.norm.cdf( np.mean( xyz_kids_growth_1 ),
np.mean( kids_avg_growth ),
np.std( kids_avg_growth ))
```

Out[22]:

## Using stats APIs

In [23]:

```
stats.ttest_ind( xyz_kids_growth_1, kids_growth )
```

Out[23]:
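
By default `stats.ttest_ind` runs a two-sided test, while the alternate hypothesis here is one-sided. The sketch below shows how the two might be compared, assuming SciPy 1.6 or newer (where the `alternative` parameter is available) and using synthetic stand-in data rather than the notebook's measurements. The names `control` and `treated` are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two groups (not the notebook's data)
control = rng.normal(6, 2, 100)
treated = rng.normal(7, 2, 30)

# Default call: two-sided test, equal variances assumed
two_sided = stats.ttest_ind(treated, control)

# One-sided test matching "treated > control" (needs SciPy 1.6 or newer)
one_sided = stats.ttest_ind(treated, control, alternative="greater")

print("two-sided p-value:", two_sided.pvalue)
print("one-sided p-value:", one_sided.pvalue)
```

When the test statistic is positive, the one-sided p-value is half the two-sided one, so the choice of alternative can change the decision at a given $\alpha$.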
