Rossmann Store Sales Behavior

  • Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.
  • This dataset is taken from a kaggle competition and is available at the following link

  • The dataset contains historical sales data for 1,115 Rossmann stores.

    • train.csv - historical data including Sales
    • store.csv - supplemental information about the stores

Understanding the impact of promos and upcoming holidays on sales

  • Does average sales during promos days are higher than sales during non-promo days?
  • Does the influence of upcoming holidays have any impact on average sales
  • Does the promos have different impact on different store types?
  • Does the impact of sales for upcoming holiday differ for different weekdays?

Establish the sales behavior using hypothesis test

Loading the dataset

In [1]:
import pandas as pd
In [2]:
rossmann_df = pd.read_csv( '../Pandas/train.csv' )
/Users/manaranjan/anaconda/lib/python3.5/site-packages/IPython/core/interactiveshell.py:2717: DtypeWarning: Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)
In [3]:
rossmann_df.head( 5 )
Out[3]:
Store DayOfWeek Date Sales Customers Open Promo StateHoliday SchoolHoliday
0 1 5 2015-07-31 5263 555 1 1 0 1
1 2 5 2015-07-31 6064 625 1 1 0 1
2 3 5 2015-07-31 8314 821 1 1 0 1
3 4 5 2015-07-31 13995 1498 1 1 0 1
4 5 5 2015-07-31 4822 559 1 1 0 1
In [4]:
stores_df = pd.read_csv( '../Pandas/store.csv' )
In [5]:
stores_df.head( 5 )
Out[5]:
Store StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear PromoInterval
0 1 c a 1270.0 9.0 2008.0 0 NaN NaN NaN
1 2 a a 570.0 11.0 2007.0 1 13.0 2010.0 Jan,Apr,Jul,Oct
2 3 a a 14130.0 12.0 2006.0 1 14.0 2011.0 Jan,Apr,Jul,Oct
3 4 c c 620.0 9.0 2009.0 0 NaN NaN NaN
4 5 a a 29910.0 4.0 2015.0 0 NaN NaN NaN

Remove entries with zero sales

In [6]:
rossmann_df = rossmann_df[ rossmann_df.Sales > 0 ]

Are the sales higher when promos running?

In [7]:
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
In [8]:
sn.distplot( rossmann_df[rossmann_df.Promo == 1].Sales, color = 'r' )
sn.distplot( rossmann_df[rossmann_df.Promo == 0].Sales, color = 'b' )
/Users/manaranjan/anaconda/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x122bd5828>

Average Sales for promo and non-promo days

In [9]:
avg_sales_non_promo = rossmann_df[rossmann_df.Promo == 0].Sales.mean()
avg_sales_promo = rossmann_df[rossmann_df.Promo == 1].Sales.mean()
In [10]:
print( "Average sales during promo days:", round( avg_sales_promo, 2 ) )
print( "Average sales during non promo days:", round( avg_sales_non_promo, 2 ) )
Average sales during promo days: 8228.74
Average sales during non promo days: 5929.83

Note:

There are overlaps in terms of sales number for promo and non-promo days. There some days sales are higher compared to non-promo days. But do we observe an higher average sales during promo days. Let's do a hypothesis test to understand.

Sales during promo days are higher?

Hypothesis Test:

  • $H_{0}$ : Average sales during non-promo days = Average sales during promo days
  • $H_{A}$ : Average Sales during Promo Days > Average sales during non-promo days
In [11]:
from scipy import stats
In [12]:
stats.ttest_ind( rossmann_df[rossmann_df.Promo == 1].Sales,
              rossmann_df[rossmann_df.Promo == 0].Sales )
Out[12]:
Ttest_indResult(statistic=363.89481333533848, pvalue=0.0)

Note:

p-value is less than 0.05, which indicates that the null hypothesis is rejected in favor of alternate hypothesis. This means the average sales during promo days are higher than the average sales during non promo days.

Merge sales data with stores data

In [13]:
rossmann_new_df = rossmann_df.merge( stores_df, on = ['Store'], how = 'inner')
In [14]:
rossmann_new_df.head( 5 )
Out[14]:
Store DayOfWeek Date Sales Customers Open Promo StateHoliday SchoolHoliday StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear PromoInterval
0 1 5 2015-07-31 5263 555 1 1 0 1 c a 1270.0 9.0 2008.0 0 NaN NaN NaN
1 1 4 2015-07-30 5020 546 1 1 0 1 c a 1270.0 9.0 2008.0 0 NaN NaN NaN
2 1 3 2015-07-29 4782 523 1 1 0 1 c a 1270.0 9.0 2008.0 0 NaN NaN NaN
3 1 2 2015-07-28 5011 560 1 1 0 1 c a 1270.0 9.0 2008.0 0 NaN NaN NaN
4 1 1 2015-07-27 6102 612 1 1 0 1 c a 1270.0 9.0 2008.0 0 NaN NaN NaN

Compare the sales distribution for different store type and analyze promo effect

In [15]:
sn.boxplot( x = 'StoreType', y = 'Sales', hue = 'Promo', data = rossmann_new_df )
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x11498ea58>
In [16]:
def isSalesHigherDuringPromos( storeType ):
  store_type_sales_df = rossmann_new_df[rossmann_new_df.StoreType == storeType ]
  return stats.ttest_ind( store_type_sales_df[store_type_sales_df.Promo == 1].Sales,
              store_type_sales_df[store_type_sales_df.Promo == 0].Sales )
In [17]:
isSalesHigherDuringPromos( 'a' )
Out[17]:
Ttest_indResult(statistic=276.54664703848425, pvalue=0.0)
In [18]:
isSalesHigherDuringPromos( 'b' )
Out[18]:
Ttest_indResult(statistic=20.778835096313788, pvalue=1.2915804424519336e-94)
In [19]:
isSalesHigherDuringPromos( 'c' )
Out[19]:
Ttest_indResult(statistic=123.88357113852456, pvalue=0.0)
In [20]:
isSalesHigherDuringPromos( 'd' )
Out[20]:
Ttest_indResult(statistic=235.83789503346077, pvalue=0.0)

Note:

As p-values are less than 0.05, sales are higher during promos across all store types.

Are the sales higher when these is a holiday next day?

Create a new feature called next_day_school_holiday

In [21]:
rossmann_new_df['next_day_school_holiday'] = rossmann_new_df.groupby(['Store'])['SchoolHoliday'].shift( -1 )
In [22]:
rossmann_new_df[['Store', 'SchoolHoliday', 'next_day_school_holiday']].head( 6 )
Out[22]:
Store SchoolHoliday next_day_school_holiday
0 1 1 1.0
1 1 1 1.0
2 1 1 1.0
3 1 1 1.0
4 1 1 0.0
5 1 0 0.0

Hypothesis Test:

  • $H_{0}$ : Average sales when the next day is not a school holiday = Average sales when the next day is a school holiday
  • $H_{A}$ : Average sales when the next day is not a school holiday > Average sales when the next day is a school holiday
In [23]:
next_day_holiday = rossmann_new_df[(rossmann_new_df.next_day_school_holiday == 1) &
                                 (rossmann_new_df.SchoolHoliday == 0)]
next_day_no_holiday = rossmann_new_df[(rossmann_new_df.next_day_school_holiday == 0) &
                                 (rossmann_new_df.SchoolHoliday == 0)]
In [24]:
sn.set(rc={"figure.figsize": (6, 4)});

sn.distplot( next_day_holiday.Sales, color = 'r' )
sn.distplot( next_day_no_holiday.Sales, color = 'b' )
/Users/manaranjan/anaconda/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x115384a58>
In [25]:
stats.ttest_ind( next_day_holiday.Sales,
              next_day_no_holiday.Sales )
Out[25]:
Ttest_indResult(statistic=-66.447988837996007, pvalue=0.0)

Surprisingly the sales are lower if the next day is a holiday

In [26]:
sn.set(rc={"figure.figsize": (10, 6)});

sn.boxplot( x = 'DayOfWeek',
         y = 'Sales',
         hue = 'next_day_school_holiday',
         data = pd.concat( [next_day_holiday, next_day_no_holiday] ) )
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x11c465278>

Impact of weekday on sales if the next day is a holiday

In [27]:
def isSalesHigherForNextDayHoliday( dayOfWeek ):
  next_day_holiday_sales = next_day_holiday[next_day_holiday.DayOfWeek == dayOfWeek].Sales
  next_day_no_holiday_sales = next_day_no_holiday[next_day_no_holiday.DayOfWeek == dayOfWeek].Sales

  return stats.ttest_ind( next_day_holiday_sales, next_day_no_holiday_sales, equal_var = False )
In [28]:
for day in range( 1, 8 ):
  t_stat, p_val = isSalesHigherForNextDayHoliday( day )
  print( "For Day :",
        day ,
        "t-stats: ",
        round( t_stat, 3 ),
        "  p_vals: ",
        round( p_val, 3 ) )
For Day : 1 t-stats:  1.896   p_vals:  0.066
For Day : 2 t-stats:  4.594   p_vals:  0.0
For Day : 3 t-stats:  4.815   p_vals:  0.0
For Day : 4 t-stats:  0.935   p_vals:  0.35
For Day : 5 t-stats:  2.515   p_vals:  0.014
For Day : 6 t-stats:  -11.326   p_vals:  0.0
For Day : 7 t-stats:  nan   p_vals:  nan

Note:

  • Interesting observations on the impact of "if the next day is a holiday"

    • It has no impact if it is a monday or thursday.
    • Sales are higher if it is a tuesday, wednesday or friday
    • Sales are lower if it is a saturday

Manually create multiple samples and compare the sampling distributions

In [29]:
import numpy as np
In [37]:
def samplingMeansForDaysOfWeek( dayOfWeek ):
  next_day_holiday_sales = next_day_holiday[next_day_holiday.DayOfWeek == dayOfWeek].Sales

  holiday_sampling_mean = []

  for a in range( 1, 50 ):
      holiday_sampling_mean.append( next_day_holiday_sales.sample( frac = 0.1 ).mean() )

  next_day_no_holiday_sales = next_day_no_holiday[next_day_no_holiday.DayOfWeek == dayOfWeek].Sales

  no_holiday_sampling_mean = []

  for a in range( 1, 50 ):
      no_holiday_sampling_mean.append( next_day_no_holiday_sales.sample( frac = 0.1 ).mean() )

  sn.set(rc={"figure.figsize": (6, 4)});

  sn.distplot( holiday_sampling_mean, color = 'g' )
  plt.axvline(np.mean( holiday_sampling_mean ), color='g', linestyle='dashed', linewidth=2)
  sn.distplot( no_holiday_sampling_mean, color= 'r' )
  plt.axvline(np.mean( no_holiday_sampling_mean ), color='r', linestyle='dashed', linewidth=2)
  plt.show()
In [38]:
samplingMeansForDaysOfWeek( 1 )
/Users/manaranjan/anaconda/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
In [39]:
samplingMeansForDaysOfWeek( 2 )
/Users/manaranjan/anaconda/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
In [40]:
samplingMeansForDaysOfWeek( 3 )
/Users/manaranjan/anaconda/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
In [41]:
samplingMeansForDaysOfWeek( 4 )
/Users/manaranjan/anaconda/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
In [42]:
samplingMeansForDaysOfWeek( 5 )
/Users/manaranjan/anaconda/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
In [43]:
samplingMeansForDaysOfWeek( 6 )
/Users/manaranjan/anaconda/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j

Note:

  • The above observations, concur with the t-test results.