Applying Student's t-test in SciPy to a data science task

Preface

I don’t use the statistical functions in scipy.stats very often. Today I had a use case for Student’s t-test, so I am taking notes here.

Data and Story

We have 43019 users and their navigation times (from Piwik). Of these, 234 (about 0.5%) made a purchase in the end.

>>> df.head()
   purchased  first_payment_date  navigation_secs   user_id
0       True 2016-01-11 02:03:59        1648       15***689
1       True 2016-01-12 23:20:41          33       16***417
2       True 2016-01-27 19:14:27          24       16***354
3       True 2016-02-02 17:50:17         243       17***327
4       True 2016-02-09 10:50:29         261       17***961
>>> df['purchased'].value_counts()
False    42785
True       234
dtype: int64

We want to check whether there is a significant difference between the navigation time of purchasing customers and that of the others, so that we can test our suspicion that people who stay longer on our site have a higher probability of purchasing.

Look into the data

The mean navigation times of purchasing and non-purchasing users are 807 and 911 seconds, respectively, and the medians are 363 and 529 seconds.

>>> df[df['purchased']==False].describe()
      navigation_secs
count    42785.000000
mean       911.361692
std       1056.342158
min          1.000000
25%        178.000000
50%        529.000000
75%       1294.000000
max      13418.000000

>>> df[df['purchased']==True].describe()
     navigation_secs
count     234.000000
mean      807.337607
std      1097.884513
min         5.000000
25%       122.500000
50%       363.000000
75%      1047.000000
max      7532.000000

Isn’t it good? Fortunately, we can close the terminal and tell others we just found an important insight, right?

Let us use some statistics

Well, not yet. We should first apply some statistical methods, if we still remember what we learned back in college.

Student’s t-test is a good fit for our scenario, as the code below shows:

>>> from scipy import stats
>>> purchase_navigate_secs = df[df['purchased']==True].navigation_secs
>>> nonpurchase_navigate_secs = df[df['purchased']==False].navigation_secs

>>> stats.ttest_ind(purchase_navigate_secs, nonpurchase_navigate_secs)
(array(-1.5019605524327022), 0.13311463698221435)
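
As a cross-check, the same statistic can be recomputed from the summary numbers that describe() printed earlier. The sketch below feeds those means, standard deviations and counts to scipy.stats.ttest_ind_from_stats. The variable and helper names are mine, and the Welch variant (equal_var=False) is added only for comparison; it is not part of the original analysis.

```python
from scipy import stats

# Summary statistics copied from the describe() output above
# (pandas reports the sample standard deviation, ddof=1).
purchased = dict(mean=807.337607, std=1097.884513, nobs=234)
not_purchased = dict(mean=911.361692, std=1056.342158, nobs=42785)


def t_test_from_summary(a, b, equal_var=True):
    """Two-sample t-test computed from summary statistics only."""
    return stats.ttest_ind_from_stats(
        a["mean"], a["std"], a["nobs"],
        b["mean"], b["std"], b["nobs"],
        equal_var=equal_var,
    )


# Student's t-test (pooled variance), which is also the default of ttest_ind
t_student, p_student = t_test_from_summary(purchased, not_purchased)
# Welch's t-test, which does not assume equal variances
t_welch, p_welch = t_test_from_summary(purchased, not_purchased, equal_var=False)

print(f"Student: t={t_student:.3f}, p={p_student:.3f}")
print(f"Welch:   t={t_welch:.3f}, p={p_welch:.3f}")
```

With these inputs the Student variant should land very close to the t ≈ -1.50 and p ≈ 0.13 reported above, since both standard deviations are similar and pooling them changes little.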

Interpret p-value

The p-value is 0.13, which is bigger than the commonly used threshold of 0.05. What does that mean?
“We failed to reject the null hypothesis.”
That is what statisticians would tell you.

“In English, please?”
The null hypothesis here is that the means of the two populations are equal, so failing to reject it means that we don’t have enough evidence to say they are not equal.

“What does a p-value of 0.13 mean here?” Does it represent an error rate, so that 13% is simply too high for us?
No. It means that, if the null hypothesis were true, we would have a 13% probability of getting results at least as extreme as ours. That is different from an error rate, a point many people misunderstand.
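
One way to make this reading concrete is a small simulation. The sketch below uses synthetic normal data (not the Piwik data) in which the null hypothesis is true by construction, and counts how often a t-test returns a p-value of 0.13 or less. The group sizes, the seed and the number of runs are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

n_sims = 2000                # number of simulated experiments
n_small, n_large = 50, 500   # hypothetical group sizes, not the real ones

# Both groups are drawn from the SAME normal distribution,
# so the null hypothesis (equal means) is true by construction.
group_a = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n_small))
group_b = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n_large))

# One t-test per simulated experiment (row-wise).
_, p_values = stats.ttest_ind(group_a, group_b, axis=1)

# Share of experiments whose p-value is 0.13 or smaller.
share = float(np.mean(p_values <= 0.13))
print(f"share of null experiments with p <= 0.13: {share:.3f}")
```

The share should come out near 0.13: when there is truly no difference, roughly 13% of experiments still look at least this extreme.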

Conclusion

Unfortunately, we can’t say that people who stay longer on our site have a higher probability of purchasing, which looked like a straightforward statement. But why doesn’t Student’s t-test support us? I found that we get a p-value below 0.05 when we simply duplicate the data of the 234 purchasing users, which shows how strongly the result depends on sample size:

>>> type(purchase_navigate_secs)
<class 'pandas.core.series.Series'>
>>> more_purchase_navigate_secs = list(purchase_navigate_secs)*2
>>> len(more_purchase_navigate_secs)
468
# note: purchase_navigate_secs*2 would not work here, because multiplying a pandas Series by two doubles each element instead of repeating the Series

>>> stats.ttest_ind(nonpurchase_navigate_secs, more_purchase_navigate_secs)[1]
0.034188449197738648

>>> stats.ttest_ind(nonpurchase_navigate_secs, list(purchase_navigate_secs)*3)[1]
0.0097019140978811327
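
Why does repetition push the p-value down? The group means do not move, but the standard error shrinks as the sample count grows. The sketch below replays the experiment from the describe() summary alone. The helper name and the small adjustment of the standard deviation for repeated data are my own additions, so treat it as an illustration rather than the original computation.

```python
import math
from scipy import stats

# Summary statistics copied from the describe() output earlier in the post.
P_MEAN, P_STD, P_N = 807.337607, 1097.884513, 234        # purchased
NP_MEAN, NP_STD, NP_N = 911.361692, 1056.342158, 42785   # not purchased


def p_value_with_repeats(k):
    """p-value when the purchased sample is repeated k times (k=1: original)."""
    n = k * P_N
    # Repeating data keeps the mean, but the ddof=1 standard deviation
    # changes slightly because the divisor becomes k*P_N - 1.
    std = P_STD * math.sqrt(k * (P_N - 1) / (n - 1))
    result = stats.ttest_ind_from_stats(NP_MEAN, NP_STD, NP_N, P_MEAN, std, n)
    return float(result[1])


for k in (1, 2, 3):
    print(k, round(p_value_with_repeats(k), 4))
```

The printed values should track the p-values shown above for the original, doubled and tripled samples. Keep in mind that repeated rows carry no new information, so this only demonstrates the effect of sample size; it is not evidence for the hypothesis.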

So time will tell: we’ll know once we collect enough data. Let’s see.

End Notes

I was wondering whether I should add this kind of hypothesis testing as another layer in the feature engineering/feature selection phase to filter out some biased features, but I think that depends on the underlying ML model, since non-linear models have more capacity to handle what we simplified in this story.
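
To make the idea concrete, here is a minimal sketch of what such a filter could look like: run a t-test per feature and keep the ones below a threshold. Everything in it is hypothetical (the synthetic data, the column names, the helper and the 0.05 cut-off), and it is not something I have applied to the real data.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)  # arbitrary seed
n = 400

# Synthetic data: 'informative' is shifted for positive rows, 'noise' is not.
label = np.repeat([True, False], n // 2)
features = pd.DataFrame({
    "informative": rng.normal(loc=np.where(label, 1.0, 0.0), scale=1.0),
    "noise": rng.normal(loc=0.0, scale=1.0, size=n),
})


def t_test_p_values(frame, mask):
    """p-value of a two-sample t-test for every column, split by mask."""
    return pd.Series({
        col: float(stats.ttest_ind(frame.loc[mask, col], frame.loc[~mask, col])[1])
        for col in frame.columns
    })


p_values = t_test_p_values(features, label)
selected = p_values[p_values < 0.05].index.tolist()
print(p_values.round(4).to_dict(), selected)
```

A filter like this only sees differences in means, one feature at a time, which is exactly the simplification a non-linear model may not need.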

Read more

Other - chi-square statistic test for frequencies in the contingency table

chi2_contingency: docs
This function computes the chi-square statistic and p-value for the hypothesis test of independence of the observed frequencies in the contingency table.

Goal: test the null hypothesis that the click ratios of male and female users are equal.

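from scipy.stats import chi2_contingency

# 2x2 contingency tables of observed counts, one row per group (male, female)
case1 = [
    [40, 60],
    [30, 70]
]
# same ratios as case1, with ten times as many observations
case2 = [
    [400, 600],
    [300, 700]
]
# same sample size as case1, with a larger gap between the two ratios
case3 = [
    [40, 60],
    [25, 75]
]

# chi2_contingency returns (statistic, p-value, dof, expected); index 1 is the p-value
print(chi2_contingency(case1)[1])  # p-value of about 0.18
print(chi2_contingency(case2)[1])  # p-value of about 0.000003
print(chi2_contingency(case3)[1])  # p-value of about 0.03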
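
To see where these p-values come from, the statistic for case1 can be rebuilt by hand. This is a sketch under the assumption that the default settings are used, in which chi2_contingency applies Yates' continuity correction to 2x2 tables. The variable names are mine.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

observed = np.array([[40, 60],
                     [30, 70]])  # case1 from above

# Expected counts under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Yates' continuity correction, which chi2_contingency applies by default
# to 2x2 tables: shrink each |observed - expected| by 0.5.
diff = np.abs(observed - expected) - 0.5
statistic = float((diff ** 2 / expected).sum())

# 2x2 table -> (2 - 1) * (2 - 1) = 1 degree of freedom
p_manual = float(chi2.sf(statistic, df=1))
p_scipy = float(chi2_contingency(observed)[1])

print(round(p_manual, 4), round(p_scipy, 4))
```

Both numbers should agree and round to the 0.18 shown for case1. Passing correction=False to chi2_contingency would drop the 0.5 adjustment and give a somewhat smaller p-value.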

[note] Recommended reading list for programmers

Intro

I'll use this post to keep a running record of good books I have read that relate to software development.

Book list

Other

https://91-tdd.hackpad.com/91--SCin8rM6vpI

[note] MLDM Monday | How AlphaGo's Go algorithm works

Taiwan R User Group meetup - event page
Date: 2016-03-21
Topic: MLDM Monday | How AlphaGo's Go algorithm works
Lecturer: Mark Chang
Slide: here
Video: here

Preface

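The recent match between AlphaGo and Lee Sedol has been incredibly exciting and closely followed. The topic of this meetup was announced two weeks ago, and registration quickly hit the cap of 160 people, roughly twice the usual MLDM Monday attendance. In the days before the event the organizers kept reminding everyone that the venue's capacity was limited, that the weather would be bad, and that there would be a live stream on the day with the video released soon afterwards, all to encourage people to spread out XD. It is the most popular MLDM Monday meetup I have attended so far. The speaker, Mark Chang, also put in real effort and released a few dozen pages of beautifully made slides several days in advance, and the discussion on the meetup page was quite lively. A session well worth looking forward to.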

160 people signed up

Note

Ten minutes before the start, quite a few people had already arrived.


What would you do if you were the CEO

picture credit: EP.53 of the hot Chinese TV series 'Nirvana in Fire' 琅琊榜

I recall that in the exit interview at my previous company there was a nice question: if you were on the senior management team, what would you have done to improve the working environment? We can also see an ‘Advice to Management’ section in every review on Glassdoor. It’s always easier to criticize than to create, so here comes the question: will you become the person you hate once you are in that position? I try to ask myself: am I really able to avoid making similar mistakes?

In one of the classic scenes of the 琅琊榜 TV series, pictured above, the emperor says to the avenger: ‘When he is in this position, he will change as well. The world that Hsieh Lin wanted, I couldn’t give him, but neither could King Chi. No one will ever be able to provide it’.

It’s difficult for people not to complain about their supervisors, policies or decisions. I try to be sympathetic, think objectively about the reasons behind them, and keep reminding myself of the things I could probably do better if I had the chance.

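Have we become the adults that our younger selves hoped we would become? Looking at who we are now, would our younger selves laugh? - 20th Century Boys

You never have to become, in the course of growing up, an adult you dislike. - from a commentary on The Little Prince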

I wrote so much advice for this question in my exit interview...