HW2

Q10

Hypothesis testing

Problem

Test the hypothesis that the tweets from Clinton and the tweets from her aides are samples from the same distribution.

Solution

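Pool the two samples and treat the pooled data as the population, then repeatedly resample from it without replacement, splitting it into two groups with the original sizes. If the two original samples come from the same distribution, the difference between the two resampled groups should typically be comparable to the difference between the original samples. The difference is measured with the total variation distance (TVD), and the result is the fraction of resamples whose TVD exceeds the observed TVD.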

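import numpy as np

# Expand a vector of per-category counts into one category label per observation
def expand_counts(source_counts):
    return np.repeat(np.arange(len(source_counts)), source_counts)

# Total variation distance (TVD) between two count vectors over the same categories
def tvd(sample0, sample1):
    return 1/2 * np.sum(np.abs(sample0/np.sum(sample0) - sample1/np.sum(sample1)))

def testing(sample0, sample1, n_iter=100000):
    first = expand_counts(sample0)
    second = expand_counts(sample1)
    population = np.append(first, second)
    # Number of categories, so that both count vectors always have the same length
    n_categories = len(sample0)

    tvds = []
    for _ in range(n_iter):
        # Shuffle the pooled labels and split them into two groups of the original sizes
        shuffled = np.random.permutation(population)
        a = np.bincount(shuffled[:len(first)], minlength=n_categories)
        b = np.bincount(shuffled[len(first):], minlength=n_categories)
        tvds.append(tvd(a, b))

    # p-value: fraction of shuffles whose TVD exceeds the observed TVD
    actual_tvd = tvd(sample0, sample1)
    return np.count_nonzero(np.array(tvds) > actual_tvd) / n_iter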

# One run returned 0.00133; the exact value varies between runs because the shuffles are random
testing(clinton_pivoted["True"], clinton_pivoted["False"], 100000)
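
As a sanity check of the procedure, here is a minimal self-contained sketch run on hypothetical counts (not the tweet data). It restates the same shuffle-and-split test with a seeded generator so the run is repeatable; the name `permutation_p_value`, the `seed` parameter, and the example count vectors are assumptions made for illustration.

```python
import numpy as np

def tvd(counts_a, counts_b):
    """Total variation distance between two count vectors over the same categories."""
    p = np.asarray(counts_a) / np.sum(counts_a)
    q = np.asarray(counts_b) / np.sum(counts_b)
    return 0.5 * np.sum(np.abs(p - q))

def permutation_p_value(counts_a, counts_b, n_iter=2000, seed=0):
    """Fraction of shuffles whose TVD exceeds the observed TVD."""
    rng = np.random.default_rng(seed)
    n_categories = len(counts_a)
    # One category label per observation, for each sample
    labels_a = np.repeat(np.arange(n_categories), counts_a)
    labels_b = np.repeat(np.arange(n_categories), counts_b)
    pooled = np.concatenate([labels_a, labels_b])
    observed = tvd(counts_a, counts_b)

    exceed = 0
    for _ in range(n_iter):
        shuffled = rng.permutation(pooled)
        a = np.bincount(shuffled[:len(labels_a)], minlength=n_categories)
        b = np.bincount(shuffled[len(labels_a):], minlength=n_categories)
        if tvd(a, b) > observed:
            exceed += 1
    return exceed / n_iter

# Hypothetical counts over three categories
similar = permutation_p_value([50, 30, 20], [48, 33, 19])
different = permutation_p_value([90, 5, 5], [10, 45, 45])
print(similar, different)
```

Two samples with nearly identical proportions should give a large p-value, while clearly different proportions should give a p-value near zero.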

Q11

Bayes' rule

Problem

Define the events $$ C = tweet\ by\ Clinton $$ and $$ W = tweet\ from\ web\ client $$. The objective is to find $$ P(C|W) $$.

Solution

By the definition of conditional probability, from which Bayes' rule follows, we know that $$ P(C|W) = \frac{P(C,W)}{P(W)} $$
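
The code that follows works with tweet counts rather than probabilities. A short derivation of why that is enough, assuming each probability is estimated by its empirical frequency; the symbols $$ N $$ (total number of tweets), $$ n_{C,W} $$ (web client tweets by Clinton) and $$ n_W $$ (all web client tweets) are notation introduced here for illustration:

$$ P(C|W) = \frac{P(C,W)}{P(W)} = \frac{n_{C,W}/N}{n_W/N} = \frac{n_{C,W}}{n_W} $$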

wc = "Twitter Web Client"

# Count of web client tweets written by Clinton, proportional to P(C,W)
tweet_by_clinton_and_tweet_in_wc = clinton_pivoted.loc[wc, "True"]

# Count of all web client tweets, proportional to P(W)
tweet_in_wc = clinton_pivoted.loc[wc].sum()

# P(C,W) / P(W); the total number of tweets cancels in the ratio
probability_clinton = tweet_by_clinton_and_tweet_in_wc / tweet_in_wc

Q12-17

Copy paste

Nothing new compared to Q1-Q9

Just replace Clinton with Trump

Q18

Can we deduce results from small samples?

Problem

There are no retweets sent from Trump's aides using Android. Can we say that Trump's aides never retweet on Android?

Note

The bootstrap mentioned in the question is misleading. The essence of the problem is whether we can draw a conclusion from a limited sample.

The solution

The provided solution says: "If we'd seen 1 million retweets by Trump aides, it might be okay to make this conclusion. But we have seen only 177, so the conclusion seems a bit premature."

My opinion

It actually depends on the chosen confidence level and confidence interval (margin of error). With a 95% confidence level and a margin of error of ±10%, the sample size required to draw the conclusion is about 96, fewer than the 177 retweets observed.
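
The figure of 96 can be reproduced with the standard sample-size formula for estimating a proportion, $$ n = \frac{z^2 p(1-p)}{e^2} $$. This is only a sketch under assumptions not stated above: that this formula is where the number comes from, that the worst-case proportion $$ p = 0.5 $$ is used, and that the 10% figure is a margin of error of ±10%. The function name `required_sample_size` is hypothetical.

```python
import math
from statistics import NormalDist

def required_sample_size(confidence_level, margin_of_error, p=0.5):
    """Sample size for estimating a proportion (normal approximation, large population)."""
    # Two-sided critical value, about 1.96 for a 95% confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    return z ** 2 * p * (1 - p) / margin_of_error ** 2

n = required_sample_size(0.95, 0.10)
print(round(n, 2))   # unrounded requirement, about 96.04
print(math.ceil(n))  # rounded up to a whole observation
```

The unrounded value is about 96.04, which gives 96 when rounded to the nearest integer; rounding up, as is usual for sample sizes, gives 97.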