Applying Student's t-test in SciPy to a data science task

Preface

I don’t use the statistical functions in scipy.stats very often, but today I had a use case for Student’s t-test, so I’m taking a note here.

Data and Story

We have 43,019 users and their navigation time (from Piwik); 234 of them, or about 0.5%, made a purchase in the end.

>>> df.head()
   purchased  first_payment_date  navigation_secs   user_id
0       True 2016-01-11 02:03:59        1648       15***689
1       True 2016-01-12 23:20:41          33       16***417
2       True 2016-01-27 19:14:27          24       16***354
3       True 2016-02-02 17:50:17         243       17***327
4       True 2016-02-09 10:50:29         261       17***961
>>> df['purchased'].value_counts()
False    42785
True       234
dtype: int64

We want to check whether there is a significant difference between the navigation time of purchasing customers and that of the others, so we can confirm or reject our suspicion that people who stay longer on our site have a higher probability of purchasing.

Look into the data

The mean navigation times of purchased and non-purchased users are 807 and 911 seconds, respectively, and the median navigation times are 363 and 529 seconds.

>>> df[df['purchased']==False].describe()
      navigation_secs
count    42785.000000
mean       911.361692
std       1056.342158
min          1.000000
25%        178.000000
50%        529.000000
75%       1294.000000
max      13418.000000

>>> df[df['purchased']==True].describe()
     navigation_secs
count     234.000000
mean      807.337607
std      1097.884513
min         5.000000
25%       122.500000
50%       363.000000
75%      1047.000000
max      7532.000000
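As a side note, the same summary can be read off in one call with a pandas groupby. This is just a small sketch, assuming the df shown above:

>>> # one row per group (False/True) with count, mean, and median navigation seconds
>>> df.groupby('purchased')['navigation_secs'].agg(['count', 'mean', 'median'])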

Isn’t that nice? Can we just close the terminal now and tell others that we found an important insight?

Let us use some statistics

Well, not yet. We should first apply some statistical methods, if we still remember what we learned back in college.

Student’s t-test is a good fit for our scenario, as the code below shows:

>>> from scipy import stats
>>> purchase_navigate_secs = df[df['purchased']==True].navigation_secs
>>> nonpurchase_navigate_secs = df[df['purchased']==False].navigation_secs

>>> stats.ttest_ind(purchase_navigate_secs, nonpurchase_navigate_secs)
(array(-1.5019605524327022), 0.13311463698221435)
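Note that ttest_ind pools the variances of the two groups by default. Since the two groups here differ a lot in size, Welch’s t-test (passing equal_var=False, a standard ttest_ind parameter) is arguably the safer variant; a quick sketch with the same series:

>>> # Welch's t-test: does not assume the two groups share one variance
>>> stats.ttest_ind(purchase_navigate_secs, nonpurchase_navigate_secs, equal_var=False)

With the standard deviations above, the Welch p-value comes out slightly larger, so the conclusion below does not change.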

Interpret the p-value

The p-value is 0.13, which is bigger than the commonly used threshold of 0.05. What does it mean?
“We failed to reject the null hypothesis.”
That is what statisticians would tell you.

“In English, please?”
The null hypothesis here is that the means of the two populations are equal, so failing to reject it means that we don’t have enough evidence to say they are not equal.

“What does a 0.13 p-value mean here? Does it represent an error rate, so 13% is simply too high for us?”
No, it means that, if the null hypothesis were true, we would have a 13% probability of getting results like ours. That is different from an error rate, which is how it is often misunderstood.
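As a sanity check on that interpretation, the two-sided p-value can be rebuilt from the t statistic and the pooled degrees of freedom. A small sketch, using the statistic reported above and df = 234 + 42785 - 2:

>>> t_stat = -1.5019605524327022
>>> dof = 234 + 42785 - 2             # pooled degrees of freedom of the equal-variance t-test
>>> 2 * stats.t.sf(abs(t_stat), dof)  # two-sided tail probability, roughly 0.133 as above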

Conclusion

Unfortunately, we can’t say that people who stay longer on our site have a higher probability of purchasing, even though it looks like a straightforward statement. But why doesn’t Student’s t-test support us? I found that we can get a p-value below 0.05 when we simply double the 234 purchased users:

>>> type(purchase_navigate_secs)
<class 'pandas.core.series.Series'>
>>> more_purchase_navigate_secs = list(purchase_navigate_secs)*2
>>> len(more_purchase_navigate_secs)
468
# note: we can't just use purchase_navigate_secs*2, because multiplying a pandas Series by two multiplies each element by two

>>> stats.ttest_ind(nonpurchase_navigate_secs, more_purchase_navigate_secs)[1]
0.034188449197738648

>>> stats.ttest_ind(nonpurchase_navigate_secs, list(purchase_navigate_secs)*3)[1]
0.0097019140978811327

So, time will tell; we’ll know once we collect enough data. Let’s see.
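If we want a rough sense of how much more data “enough” might be, a power analysis is a cleaner route than duplicating rows. This is just a sketch assuming statsmodels is available, plugging in an approximate effect size of (911 - 807) / 1056 ≈ 0.1 taken from the summary statistics above:

from statsmodels.stats.power import TTestIndPower

# effect size is a rough Cohen's d estimated from the summaries above (an assumption)
# ratio keeps the observed non-purchaser/purchaser proportion
required_purchasers = TTestIndPower().solve_power(
    effect_size=0.1, alpha=0.05, power=0.8,
    ratio=42785 / 234.0, alternative='two-sided')
print(required_purchasers)

Under these assumptions the answer is on the order of several hundred purchasers, which lines up with the doubling/tripling experiment above.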

End Notes

I was wondering whether I should add this kind of hypothesis testing as another layer in the feature engineering/feature selection phase, to filter out some biased features. But I think this depends on the underlying ML models, since non-linear models have enough capacity to handle what we simplified in this story.

Read more

Other - chi-square test for frequencies in a contingency table

chi2_contingency: docs
This function computes the chi-square statistic and p-value for the hypothesis test of independence of the observed frequencies in a contingency table.

Goal: test the null hypothesis that the click ratios of male and female users are equal

from scipy.stats import chi2_contingency

case1 = [
    [40, 60],
    [30, 70]
]
case2 = [
    [400, 600],
    [300, 700]
]
case3 = [
    [40, 60],
    [25, 75]
]

print(chi2_contingency(case1)[1])  # prints 0.18 p-value
print(chi2_contingency(case2)[1])  # prints 0.000003 p-value
print(chi2_contingency(case3)[1])  # prints 0.03 p-value
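For reference, chi2_contingency returns four values (chi-square statistic, p-value, degrees of freedom, expected frequencies), which is why the snippets above index [1] to pick out the p-value:

chi2, p, dof, expected = chi2_contingency(case1)
print(chi2, p, dof)
print(expected)  # expected counts under the independence assumption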

[note] Recommended book list for programmers

Intro

I use this post to keep an ongoing record of good software-development-related books I have read.

Book list

Other

https://91-tdd.hackpad.com/91--SCin8rM6vpI