HW2
Q10
Hypothesis testing
Problem
Decide whether the tweets from Clinton and the tweets from her aides are samples from the same distribution.
Solution
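Merge the two samples and treat the result as the population, then repeatedly resample from it without replacement (a permutation test): shuffle the pooled data and split it into two groups of the original sizes. If the two original samples come from the same distribution, the difference between them should look like a typical difference between two resampled groups. The difference is measured by the total variation distance (TVD), and the p-value is the fraction of shuffles whose TVD exceeds the observed one.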
import numpy as np

# Expand a vector of per-category counts into one category index per observation
def expand_counts(source_counts):
    return np.repeat(np.arange(len(source_counts)), source_counts)

# Total variation distance between two count vectors over the same categories
def tvd(sample0, sample1):
    return 1/2*sum(np.abs(sample0/sum(sample0) - sample1/sum(sample1)))

# Permutation test: fraction of shuffles whose TVD exceeds the observed TVD
def testing(sample0, sample1, n_iter=100000):
    first = expand_counts(sample0)
    second = expand_counts(sample1)
    population = np.append(first, second)
    tvds = []
    for _ in range(n_iter):
        shuffled = np.random.permutation(population)
        # minlength is the number of categories, so a and b always line up
        a = np.bincount(shuffled[:len(first)], minlength=len(sample0))
        b = np.bincount(shuffled[len(first):], minlength=len(sample0))
        tvds.append(tvd(a, b))
    actual_tvd = tvd(sample0, sample1)
    return np.count_nonzero(np.array(tvds) > actual_tvd) / n_iter

testing(clinton_pivoted["True"], clinton_pivoted["False"], 100000)  # 0.00133 in my run
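To see how the test behaves without the tweet data, here is a minimal self-contained sketch on made-up counts. The function name `permutation_p_value`, the seed, and all the counts are assumptions for illustration only. It also counts shuffles with TVD greater than or equal to the observed value, which is a slightly more conservative choice than the strict comparison.

```python
import numpy as np

def tvd(counts0, counts1):
    """Total variation distance between two count vectors over the same categories."""
    p = np.asarray(counts0) / np.sum(counts0)
    q = np.asarray(counts1) / np.sum(counts1)
    return 0.5 * np.abs(p - q).sum()

def permutation_p_value(counts0, counts1, n_iter=2000, seed=0):
    """Permutation-test p-value for 'both count vectors come from one distribution'."""
    rng = np.random.default_rng(seed)
    counts0 = np.asarray(counts0)
    counts1 = np.asarray(counts1)
    k = len(counts0)                      # number of categories
    n0 = int(counts0.sum())               # size of the first sample
    # Pooled population: one category label per observation
    labels = np.repeat(np.arange(k), counts0 + counts1)
    observed = tvd(counts0, counts1)
    hits = 0
    for _ in range(n_iter):
        shuffled = rng.permutation(labels)
        a = np.bincount(shuffled[:n0], minlength=k)
        b = np.bincount(shuffled[n0:], minlength=k)
        hits += int(tvd(a, b) >= observed)
    return hits / n_iter

# Hypothetical counts over three categories (not the real tweet data)
base = [50, 30, 20]
similar = [48, 33, 19]
different = [10, 20, 70]

p_similar = permutation_p_value(base, similar)
p_different = permutation_p_value(base, different)
print(p_similar, p_different)
```

With these made-up numbers the similar pair should give a large p-value and the clearly different pair a p-value near zero, which is the same pattern as the small p-value reported for the Clinton data.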
Q11
Bayes' rule
Problem
Let the events be $$ C = \text{tweet by Clinton} $$ and $$ W = \text{tweet from the web client} $$. The goal is to find $$ P(C|W) $$.
Solution
By the definition of conditional probability (the form of Bayes' rule needed here, since $$ P(C,W) = P(W|C)P(C) $$), we know that $$ P(C|W) = \frac{P(C,W)}{P(W)} $$
wc = "Twitter Web Client"
# Number of web-client tweets written by Clinton; proportional to P(C,W)
tweet_by_clinton_and_tweet_in_wc = clinton_pivoted.loc[wc, "True"]
# Number of all web-client tweets; proportional to P(W) with the same constant
tweet_in_wc = clinton_pivoted.loc[wc].sum()
# P(C,W) / P(W): the total tweet count cancels, so the ratio of counts is enough
probability_clinton = tweet_by_clinton_and_tweet_in_wc / tweet_in_wc
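The calculation above divides raw counts rather than probabilities. A minimal sketch on made-up counts shows why that is fine: the joint-over-marginal form, the Bayes' rule form $$ \frac{P(W|C)P(C)}{P(W)} $$, and the plain ratio of counts all agree. The dictionary layout, the `"TweetDeck"` row, and every number here are assumptions for illustration, not the real data.

```python
from fractions import Fraction

# Hypothetical tweet counts per (client, author); not the real data
counts = {
    ("Twitter Web Client", "clinton"): 30,
    ("Twitter Web Client", "aides"): 90,
    ("TweetDeck", "clinton"): 10,
    ("TweetDeck", "aides"): 70,
}
wc = "Twitter Web Client"
total = sum(counts.values())

n_cw = counts[(wc, "clinton")]                                             # C and W
n_w = sum(n for (client, _), n in counts.items() if client == wc)          # W
n_c = sum(n for (_, author), n in counts.items() if author == "clinton")   # C

# Route 1: P(C,W) / P(W)
p_joint = Fraction(n_cw, total) / Fraction(n_w, total)
# Route 2: Bayes' rule, P(W|C) P(C) / P(W)
p_bayes = Fraction(n_cw, n_c) * Fraction(n_c, total) / Fraction(n_w, total)
# Route 3: ratio of counts, because the total cancels
p_counts = Fraction(n_cw, n_w)

print(p_joint, p_bayes, p_counts)
```

Exact fractions are used so the three routes can be compared for equality without floating-point noise.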
Q12-17
Copy paste
Nothing new compared to Q1-Q9.
Just replace Clinton with Trump.
Q18
Can we deduce results from small samples?
Problem
There are no retweets sent from Trump's aides using Android. Can we say that Trump's aides never retweet on Android?
Note
The mention of the bootstrap in the question is misleading. The essence of the problem is whether we can draw a conclusion from a limited sample.
The solution
It says: "If we'd seen 1 million retweets by Trump aides, it might be okay to make this conclusion. But we have seen only 177, so the conclusion seems a bit premature."
My opinion
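Actually, it depends on the confidence level and the margin of error (the width of the confidence interval) you are willing to accept. If you choose a 95% confidence level with a 10% margin of error, the sample size required to draw the conclusion is 96, so by that criterion 177 observations are already enough.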
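The figure of 96 is consistent with the usual sample-size formula for estimating a proportion, $$ n = \frac{z^2 p(1-p)}{e^2} $$, with the worst case $$ p = 0.5 $$. That this formula is where the number comes from is an assumption, and the function name `required_sample_size` is made up for this sketch. Note that the formula sizes a sample for estimating a proportion to within the margin $$ e $$, which is a weaker statement than showing a rate is exactly zero.

```python
from statistics import NormalDist

def required_sample_size(confidence=0.95, margin=0.10, p=0.5):
    """Sample size for estimating a proportion p to within +/- margin (normal approximation)."""
    # Two-sided critical value, about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z ** 2 * p * (1 - p) / margin ** 2

n = required_sample_size()
print(round(n, 2))  # about 96.04
```

Tightening the margin to 5% roughly quadruples the requirement, since the margin enters the formula squared.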
DS100 HW2 Episode II(10-18)
By Weiyüen Wu