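Null hypothesis: tweets from Clinton and tweets from her aides are samples from the same distribution.
To test it, merge the two samples and treat the result as the population, then repeatedly split that population at random (sampling without replacement) into two groups of the original sizes. If the original two samples come from the same distribution, the difference between them should look like a typical difference between two random groups; an observed difference larger than nearly all of the shuffled differences is evidence against the hypothesis.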
import numpy as np

# Expand a vector of category counts into one entry per observation,
# e.g. [2, 1] -> [0, 0, 1]
def expand_counts(source_counts):
    return np.repeat(np.arange(len(source_counts)), source_counts)

# Total variation distance (TVD) between two count vectors over the same categories
def tvd(sample0, sample1):
    return 0.5 * np.sum(np.abs(sample0 / np.sum(sample0) - sample1 / np.sum(sample1)))

# Permutation test: fraction of shuffles whose TVD exceeds the observed TVD
def testing(sample0, sample1, n_iter=100000):
    first = expand_counts(sample0)
    second = expand_counts(sample1)
    population = np.append(first, second)
    n_categories = len(sample0)
    tvds = []
    for _ in range(n_iter):
        shuffled = np.random.permutation(population)
        # minlength is the number of categories, so both count vectors line up
        a = np.bincount(shuffled[:len(first)], minlength=n_categories)
        b = np.bincount(shuffled[len(first):], minlength=n_categories)
        tvds.append(tvd(a, b))
    actual_tvd = tvd(sample0, sample1)
    return np.count_nonzero(np.array(tvds) > actual_tvd) / n_iter
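# p-value from the permutation test; the original run gave 0.00133
# (the exact value varies slightly between runs because the shuffles are random)
testing(clinton_pivoted["True"], clinton_pivoted["False"], 100000)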
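For very small samples the permutation distribution can be enumerated exactly instead of being approximated by random shuffles, which makes the idea easy to check by hand. The following is a minimal sketch under stated assumptions: the counts `toy_a` and `toy_b` are made up, the helper names `tvd_from_counts` and `exact_permutation_p_value` are hypothetical, and it counts splits whose TVD is at least as large as the observed one (`>=`), whereas the function above uses a strict `>`.

```python
from fractions import Fraction
from itertools import combinations


def tvd_from_counts(counts_a, counts_b):
    """Total variation distance between two count vectors over the same categories."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    return sum(abs(Fraction(a, total_a) - Fraction(b, total_b))
               for a, b in zip(counts_a, counts_b)) / 2


def exact_permutation_p_value(counts_a, counts_b):
    """Enumerate every split of the pooled items into groups of the original sizes."""
    n_categories = len(counts_a)
    # One entry per observation, labelled with its category index.
    pooled = [cat for counts in (counts_a, counts_b)
              for cat, n in enumerate(counts) for _ in range(n)]
    size_a = sum(counts_a)
    observed = tvd_from_counts(counts_a, counts_b)
    extreme = total = 0
    for chosen in combinations(range(len(pooled)), size_a):
        chosen_set = set(chosen)
        split_a = [0] * n_categories
        split_b = [0] * n_categories
        for i, cat in enumerate(pooled):
            (split_a if i in chosen_set else split_b)[cat] += 1
        total += 1
        if tvd_from_counts(split_a, split_b) >= observed:
            extreme += 1
    return Fraction(extreme, total)


# Made-up counts over two categories, for illustration only.
toy_a = [3, 1]
toy_b = [1, 3]
print(tvd_from_counts(toy_a, toy_b))            # 1/2
print(exact_permutation_p_value(toy_a, toy_b))  # 17/35
```

With only four observations per group, 34 of the 70 possible splits are at least as extreme as the observed one, so the toy data gives no evidence against the null hypothesis. Full enumeration stops being feasible quickly as the samples grow, which is why random shuffling is used on the real data.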
Let the events $$ C = \text{tweet by Clinton} $$ and $$ W = \text{tweet from web client} $$; the objective is to find $$ P(C|W) $$.
By the definition of conditional probability (the identity from which Bayes' rule is derived), $$ P(C|W) = \frac{P(C,W)}{P(W)} $$.
wc = "Twitter Web Client"
# Number of web-client tweets written by Clinton (proportional to P(C,W))
tweet_by_clinton_and_tweet_in_wc = clinton_pivoted.loc[wc, "True"]
# Number of all web-client tweets (proportional to P(W))
tweet_in_wc = clinton_pivoted.loc[wc].sum()
# P(C,W) / P(W): the common total cancels, so the ratio of counts equals P(C|W)
probability_clinton = tweet_by_clinton_and_tweet_in_wc / tweet_in_wc
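The code divides raw counts rather than probabilities. A short derivation of why that gives the same answer, treating probabilities as relative frequencies in the data and writing $$ n_{CW} $$ for the number of web-client tweets by Clinton, $$ n_W $$ for the number of web-client tweets, and $$ N $$ for the total number of tweets (these symbols are introduced here for illustration and do not appear in the original):

$$ P(C|W) = \frac{P(C,W)}{P(W)} = \frac{n_{CW}/N}{n_W/N} = \frac{n_{CW}}{n_W} $$

The total $$ N $$ cancels, so it never needs to be computed.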
There are no retweets sent from Trump's aides using Android. Can we say that Trump's aides never retweet on Android?
The mention of the bootstrap in the question is misleading. The essence of the problem is whether we can draw a conclusion from a limited sample.
The question says: "If we'd seen 1 million retweets by Trump aides, it might be okay to make this conclusion. But we have seen only 177, so the conclusion seems a bit premature."
It actually depends on the confidence level and the margin of error you are willing to accept. If you choose a 95% confidence level with a margin of error of ±10%, the sample size required to draw the conclusion is about 96, which is fewer than the 177 retweets observed.
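The figure of 96 is consistent with the usual sample-size formula for estimating a proportion, n = z² · p · (1 − p) / e², using the worst case p = 0.5. That the number was obtained this way is an assumption, and `required_sample_size` is a hypothetical helper written for this sketch:

```python
from statistics import NormalDist


def required_sample_size(confidence, margin, p=0.5):
    """Sample size for estimating a proportion to within +/- margin.

    Hypothetical helper: n = z^2 * p * (1 - p) / margin^2, where p = 0.5
    is the worst case (it maximizes p * (1 - p)).
    """
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z ** 2 * p * (1 - p) / margin ** 2


n = required_sample_size(confidence=0.95, margin=0.10)
print(round(n, 2))  # 96.04
```

Tightening the margin raises the requirement quickly: the sample size grows with the inverse square of the margin, so halving the margin roughly quadruples it.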