Q12

Task

Given a Poisson-process simulation of a site's updates, see what happens if we visit the website only once an hour, and report what you find.

Solution

Plot the points. The graphs show that when the update rate is less than 1/2, the crawler's result closely matches the simulated result; at higher rates the two diverge.
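A minimal sketch of this experiment, assuming updates follow a homogeneous Poisson process and the crawler only learns whether the page changed since its previous hourly visit. The helper names `simulate_update_times` and `hourly_hits` are hypothetical, not the names used in the assignment.

```python
import numpy as np

def simulate_update_times(rate, hours, rng):
    """Update times of a Poisson process with `rate` updates/hour on [0, hours)."""
    # Hypothetical helper: draw the total count, then place updates uniformly.
    n_updates = rng.poisson(rate * hours)
    return np.sort(rng.uniform(0, hours, size=n_updates))

def hourly_hits(update_times, hours):
    """Boolean array: did at least one update land in each one-hour window?"""
    counts, _ = np.histogram(update_times, bins=np.arange(hours + 1))
    return counts > 0

rng = np.random.default_rng(0)
hours = 10_000
for rate in (0.1, 0.5, 2.0):
    times = simulate_update_times(rate, hours, rng)
    hits = hourly_hits(times, hours)
    print(f"rate={rate}: true updates/hour={len(times) / hours:.3f}, "
          f"seen by hourly crawler={hits.mean():.3f}")
```

Comparing the two printed columns for small and large rates gives the same picture as the plot: the hourly crawler can record at most one change per visit, so it falls behind once several updates tend to land in the same hour.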

Q13

Task

Simulate the visits for a particular rate \(\lambda\), then use the MLE to estimate \(\lambda\) and see whether the two match.

Solution

Use the function mentioned in Q12 together with the MLE formula \(\lambda^{\text{MLE}} = \frac{n}{N}\) derived earlier.
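A minimal sketch of the estimate, assuming \(n\) is the number of hourly visits that found a change and \(N\) is the total number of visits. The function `naive_mle` is a hypothetical stand-in for the Q12 function, which is not reproduced here; it samples the number of updates per hour directly as Poisson counts.

```python
import numpy as np

def naive_mle(rate, n_visits, rng):
    """Estimate the rate as n / N from simulated hourly visits (hypothetical helper)."""
    # Number of updates in each hour; a visit only sees whether it is nonzero.
    updates_per_hour = rng.poisson(rate, size=n_visits)
    n_hits = np.count_nonzero(updates_per_hour)
    return n_hits / n_visits

rng = np.random.default_rng(0)
for rate in (0.05, 0.25, 1.0, 4.0):
    print(rate, round(naive_mle(rate, 10_000, rng), 3))
```

Because `n_hits` can never exceed `n_visits`, the printed estimate stays at or below 1 no matter how large the true rate is.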

Q13

Task

Explain why the estimated rates seem to level off at 1.

Solution

Because we check for updates only once an hour, each check can register at most one change, so \(n \le N\) and the estimate \(\frac{n}{N}\) cannot exceed 1.

Task

How far off is the estimate from the truth for \(\lambda\) less than 0.25?

Solution

Not far off: the relative error is at most \( \frac{0.4-0.32}{0.4} = 20\% \).

Q14

Task

What is the chance that a \( Poisson(\lambda) \) random variable is equal to 0? What is the chance that it's greater than or equal to 1?

Solution

Because the Poisson probability mass function is

\( P(k) = \frac{\lambda^k}{k!} e^{-\lambda} \)

we simply set \(k = 0\) and get \(P(0) = e^{-\lambda} \) and \(P(k \ge 1) = 1 - e^{-\lambda}\).
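A quick numerical check of the two probabilities, as a sketch: the rate `lam = 0.7` is an arbitrary value chosen for illustration, and the simulated frequencies should land close to the closed-form values.

```python
import math
import numpy as np

lam = 0.7  # arbitrary rate, chosen only for illustration
p_zero = math.exp(-lam)          # P(X = 0)
p_at_least_one = 1 - p_zero      # P(X >= 1)

rng = np.random.default_rng(0)
draws = rng.poisson(lam, size=200_000)
print("P(X = 0):  formula", p_zero, " simulated", np.mean(draws == 0))
print("P(X >= 1): formula", p_at_least_one, " simulated", np.mean(draws >= 1))
```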

Q16

Task

We assume that the number of windows in which an update is observed follows a binomial distribution with success probability \( p = 1 - e^{-\lambda} \). Show that the MLE of \( \lambda \) is \(\lambda^* = \log \left(\frac{n}{N - n} + 1 \right) \).

Solution
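One way to carry out the derivation, as a sketch: assume each of the \(N\) windows independently contains at least one update with probability \(1 - e^{-\lambda}\) (the Q14 result), and that \(n\) of the \(N\) windows show an update.

```latex
\begin{align*}
L(\lambda) &= \binom{N}{n} \left(1 - e^{-\lambda}\right)^{n} \left(e^{-\lambda}\right)^{N-n} \\
\log L(\lambda) &= \log\binom{N}{n} + n \log\left(1 - e^{-\lambda}\right) - (N - n)\lambda \\
\frac{d}{d\lambda} \log L(\lambda) &= \frac{n e^{-\lambda}}{1 - e^{-\lambda}} - (N - n)
  = \frac{n}{e^{\lambda} - 1} - (N - n) = 0 \\
e^{\lambda^*} &= \frac{n}{N - n} + 1 \\
\lambda^* &= \log\left(\frac{n}{N - n} + 1\right)
\end{align*}
```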

Q17

Task

Add the new estimated \( \lambda ^ * \) to the dataframe and plot a histogram.

Solution

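import numpy as np

# Modified MLE; the 0.5 added to each count keeps the ratio finite when every crawl saw an update.
n_updates = crawl_stats['number of updates']
n_crawls = crawl_stats['number of crawls']
crawl_stats["modified mle"] = np.log((n_updates + 0.5) / (n_crawls - n_updates + 0.5) + 1)
crawl_stats["modified mle"].hist();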

Q18

Task

Show how accurate our estimates of \( \lambda ^* \) are.

Solution

First, for every page \( i \) we assume the true change rate \( r_i \) equals our estimate \( \lambda_i^+ \) and use it to simulate changes and positive checks. We then re-estimate \(r_i\) from the simulated data and check whether the new estimate is approximately equal to \( \lambda_i^+ \).
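The procedure above can be sketched for a single page as follows. This is a minimal illustration, not the assignment's code: the helpers `modified_mle` and `simulate_estimates` and the page with 30 hits in 200 checks are hypothetical, and each check is assumed to be a hit with probability \(1 - e^{-\lambda}\) as in Q14.

```python
import numpy as np

def modified_mle(n_hits, n_checks):
    """Modified MLE with 0.5 added to each count (same formula as the Q17 code)."""
    return np.log((n_hits + 0.5) / (n_checks - n_hits + 0.5) + 1)

def simulate_estimates(rate, n_checks, n_reps, rng):
    """Re-estimate `rate` from `n_reps` simulated crawl histories (hypothetical helper)."""
    # Assumption: each check is a hit with probability 1 - exp(-rate).
    p_hit = 1 - np.exp(-rate)
    sim_hits = rng.binomial(n_checks, p_hit, size=n_reps)
    return modified_mle(sim_hits, n_checks)

rng = np.random.default_rng(0)
lam_plus = modified_mle(30, 200)   # hypothetical page: 30 hits in 200 checks
boot = simulate_estimates(lam_plus, 200, 1000, rng)
print("estimate:", lam_plus)
print("mean and spread of re-estimates:", boot.mean(), boot.std())
```

If the re-estimates cluster tightly around `lam_plus`, the estimator is behaving well for a page with that rate and that number of checks.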

Q19

Task

Finish the plotting part of Q18.

Solution

Q20

Task

Just looking at the graph is not enough to measure the quality of an estimate. One way to measure estimation quality quantitatively is the RMSE (root mean squared error).

Solution

Same technique as above. Use the following expression to calculate the RMSE:

np.mean((bootstrap_estimates - e)**2)**0.5
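As a small illustration of that expression, here is a sketch that wraps it in a function. The name `rmse` and the toy numbers are hypothetical; `truth` plays the role of `e` in the expression above.

```python
import numpy as np

def rmse(estimates, truth):
    """Root mean squared error of `estimates` around the single value `truth`."""
    return np.mean((np.asarray(estimates) - truth) ** 2) ** 0.5

# Toy values for illustration only: four re-estimates of a rate whose reference value is 1.0.
print(rmse([0.9, 1.1, 1.0, 1.2], 1.0))
```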

Q21

Task

Create a visualization to display the RMSEs you computed. **Then,** create another visualization to see the relationship (across the 1000 pages) between RMSE and the modified MLE.

Solution

A good way to display a series of values is a histogram, and a good way to display a relationship between two variables is a scatter plot.

Q22

Task

Design a test to answer one of these questions:

1. How much did the rate of changes vary by hour of the day?
2. How much did the rate of changes vary by day of the week?

Solution

My solution: Assume the rate of change is fixed within a given day of the week. Then partition the dataset into 7 groups, one per day of the week, and apply the technique above to each group to get a rough estimate of that day's rate. Finally, plot the seven estimates to see the trend.
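A minimal sketch of the partitioning step on toy data. Everything here is an assumption made for illustration: one check per hour for 8 weeks, the per-day `true_rates` used to generate the toy hits, and the helper `modified_mle`; the real dataset's layout is not shown in these notes.

```python
import numpy as np

def modified_mle(n_hits, n_checks):
    """Modified MLE with 0.5 added to each count (hypothetical helper)."""
    return np.log((n_hits + 0.5) / (n_checks - n_hits + 0.5) + 1)

# Hypothetical data: one check per hour for 8 weeks, starting at hour 0 of day 0.
rng = np.random.default_rng(0)
n_hours = 24 * 7 * 8
day_of_week = (np.arange(n_hours) // 24) % 7

# Assumed rates per day of week, used only to generate the toy hits.
true_rates = np.array([0.1, 0.1, 0.2, 0.2, 0.3, 0.05, 0.05])
hits = rng.random(n_hours) < 1 - np.exp(-true_rates[day_of_week])

# Partition the checks into 7 groups and estimate a rate for each group.
for day in range(7):
    mask = day_of_week == day
    est = modified_mle(hits[mask].sum(), mask.sum())
    print(day, round(float(est), 3))
```

The seven printed estimates are what would then be plotted against day of the week.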