Given a Poisson-process simulation of a site's updates, see what happens if we visit the website only once per hour, and show what you find.
Plotting the points, the graphs show that when the update rate is less than 1/2, the crawler's result and the simulated result match quite closely. Above that rate, they diverge.
Simulate the visits for a particular rate \(\lambda\), then use the MLE to estimate \(\lambda\) and see whether the two match.
Use the function mentioned in Q12 and the MLE formula \(\lambda^{\text{MLE}} = \frac{n}{N}\) derived earlier.
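A minimal sketch of this experiment, assuming that the number of updates in each one-hour window is Poisson with the given rate and that the crawler only learns whether at least one update happened since its last visit. The helper `simulate_positive_checks` is hypothetical and stands in for the Q12 function, which is not shown here; the rates and the number of crawls are illustrative.

```python
import numpy as np

def simulate_positive_checks(rate, n_hours, rng):
    """Count hourly visits that see at least one update (hypothetical helper)."""
    # The number of updates in each one-hour window is Poisson(rate).
    updates_per_hour = rng.poisson(rate, size=n_hours)
    # An hourly crawler only sees whether the page changed, not how many times.
    return int(np.count_nonzero(updates_per_hour))

rng = np.random.default_rng(0)
N = 10_000  # number of hourly crawls (illustrative)
estimates = {}
for rate in [0.1, 0.25, 0.5, 1.0, 2.0, 4.0]:
    n = simulate_positive_checks(rate, N, rng)
    estimates[rate] = n / N  # naive MLE n / N
    print(f"true rate {rate:4.2f}  estimate {estimates[rate]:.3f}")
```

For small rates the estimate stays close to the true rate; for large rates it saturates near 1, because at most one positive check can be recorded per crawl.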
Explain why the estimated rates seem to level off at 1.
Because we check for updates only once an hour, we can record at most one change per check, so \( n \le N \) and the estimate \( \frac{n}{N} \) cannot be larger than 1.
How far off is the estimate from the truth for \(\lambda\) less than 0.25?
Not far off: at most about \( \frac{0.4-0.32}{0.4} = 20\% \).
What is the chance that a \( Poisson(\lambda) \) random variable is equal to 0? What is the chance that it's greater than or equal to 1?
Because \( P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} \), we simply set \(k = 0\) and get \(e^{-\lambda}\) for the first probability and \(1 - e^{-\lambda}\) for the second.
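A quick numerical check of these two probabilities, as a sketch: it compares the closed forms with the empirical frequencies from simulated Poisson draws. The rate `lam = 0.5` and the sample size are illustrative choices, not values from the data.

```python
import numpy as np

lam = 0.5  # illustrative rate, not taken from the data
rng = np.random.default_rng(1)
draws = rng.poisson(lam, size=200_000)

p_zero_exact = np.exp(-lam)          # P(X = 0)
p_positive_exact = 1 - np.exp(-lam)  # P(X >= 1)

p_zero_sim = np.mean(draws == 0)      # fraction of draws equal to 0
p_positive_sim = np.mean(draws >= 1)  # fraction of draws that are at least 1

print(p_zero_exact, p_zero_sim)
print(p_positive_exact, p_positive_sim)
```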
We assume that the number of windows in which an update is observed follows a binomial distribution with success probability \( p = 1 - e^{-\lambda} \). Show that the MLE of \( \lambda \) is \(\lambda^* = \log \left(\frac{n}{N - n} + 1 \right) \)
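One way to carry out the derivation, taking the probability of a positive check in a window to be \( 1 - e^{-\lambda} \) (the chance of at least one update, from the previous question), with \( n \) positive checks out of \( N \) crawls. The log-likelihood, up to a constant that does not depend on \( \lambda \), is

\[ \ell(\lambda) = n \log\left(1 - e^{-\lambda}\right) - (N - n)\lambda . \]

Setting the derivative to zero gives

\[ \frac{d\ell}{d\lambda} = \frac{n e^{-\lambda}}{1 - e^{-\lambda}} - (N - n) = 0 \quad\Longrightarrow\quad \frac{n}{e^{\lambda} - 1} = N - n , \]

and solving for \( \lambda \):

\[ e^{\lambda^*} = \frac{n}{N - n} + 1 \quad\Longrightarrow\quad \lambda^* = \log\left(\frac{n}{N - n} + 1\right) . \]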
Add the new estimated \( \lambda ^ * \) to the dataframe and plot a histogram.
# Adding 0.5 to both counts keeps the ratio finite when every crawl saw an update (n == N).
crawl_stats["modified mle"] = np.log(
    (crawl_stats['number of updates'] + 0.5) / (crawl_stats['number of crawls'] - crawl_stats['number of updates'] + 0.5) + 1
)
crawl_stats["modified mle"].hist();
Show how accurate our estimates \( \lambda ^* \) are.
First, for every page \( i \), we assume the true change rate \( r_i \) equals our estimate \( \lambda_i^* \) and use it to generate simulated changes and positive checks. We then re-estimate \( r_i \) from the simulated data to see whether the new estimate is approximately equal to \( \lambda_i^* \).
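The procedure above is a parametric bootstrap, and can be sketched for a single page as follows. This is only an illustration: the helper names, the example rate, and the crawl count are hypothetical, and drawing the number of positive checks directly from a binomial distribution is a shortcut that assumes each hourly check is positive with probability \( 1 - e^{-\lambda} \).

```python
import numpy as np

def modified_mle(n, N):
    """Smoothed estimate log((n + 0.5) / (N - n + 0.5) + 1)."""
    return np.log((n + 0.5) / (N - n + 0.5) + 1)

def bootstrap_estimates_for_page(rate_hat, n_crawls, n_boot, rng):
    """Re-estimate the rate from n_boot simulated crawl histories (hypothetical helper)."""
    # Under the model, each hourly check is positive with probability 1 - exp(-rate_hat).
    p_positive = 1 - np.exp(-rate_hat)
    # Simulate the number of positive checks for each bootstrap replicate.
    simulated_n = rng.binomial(n_crawls, p_positive, size=n_boot)
    return modified_mle(simulated_n, n_crawls)

rng = np.random.default_rng(2)
rate_hat = 0.3  # illustrative estimate for one page
boot = bootstrap_estimates_for_page(rate_hat, n_crawls=700, n_boot=1000, rng=rng)
print(boot.mean(), boot.std())
```

If the re-estimated rates cluster tightly around `rate_hat`, the estimator is stable for a page with that rate and that many crawls.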
Finish the plotting part of Q18.
Looking at the graph alone is not enough to measure the quality of an estimate. One way to measure it quantitatively is the root mean squared error (RMSE).
Same technique as above.
Use the following expression to calculate the RMSE:
np.mean((bootstrap_estimates - e)**2)**0.5
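As a small worked example of the same expression, wrapped in a function: the name `rmse` and the toy numbers are made up for illustration, with `e` playing the role of the reference value the bootstrap estimates are compared against.

```python
import numpy as np

def rmse(estimates, truth):
    """Root mean squared error of the estimates around a reference value."""
    estimates = np.asarray(estimates, dtype=float)
    return np.mean((estimates - truth) ** 2) ** 0.5

# Toy values, not real bootstrap output.
toy_bootstrap_estimates = np.array([0.28, 0.31, 0.33, 0.30])
e = 0.30
toy_rmse = rmse(toy_bootstrap_estimates, e)
print(toy_rmse)
```

The squared deviations here are 0.0004, 0.0001, 0.0009 and 0, so the mean is 0.00035 and the RMSE is its square root, roughly 0.0187.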
Create a visualization to display the RMSEs you computed. **Then,** create another visualization to see the relationship (across the 1000 pages) between RMSE and the modified MLE.
A good way to display a series of values is a histogram, and a good way to display a relationship is a scatter plot.
Design a test to answer one of these questions:
1. How much did the rate of changes vary by hour of the day?
2. How much did the rate of changes vary by day of the week?
My solution: We assume the rate of change is fixed within each day of the week. We can then apply the technique above, partitioning the dataset into 7 groups by day of the week, to get a rough estimate of the rate for each day. Finally, we plot the estimates to see the trend.
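A sketch of the day-of-week partition on synthetic data. Everything here is assumed for illustration: the eight weeks of hourly checks, the made-up "true" rates (weekdays faster than weekends), and the convention that hour 0 falls on day 0. With the real crawl log, the positive-check indicator and the day labels would come from the data instead.

```python
import numpy as np

def modified_mle(n, N):
    """Smoothed estimate log((n + 0.5) / (N - n + 0.5) + 1)."""
    return np.log((n + 0.5) / (N - n + 0.5) + 1)

rng = np.random.default_rng(3)
n_hours = 24 * 7 * 8  # eight weeks of hourly checks (toy data)
day_of_week = (np.arange(n_hours) // 24) % 7

# Hypothetical ground truth: days 0-4 change faster than days 5-6.
true_rate = np.where(day_of_week < 5, 0.6, 0.1)
# A check is positive with probability 1 - exp(-rate) under the Poisson model.
positive = rng.random(n_hours) < 1 - np.exp(-true_rate)

# Estimate one rate per day of the week from that day's checks only.
rates_by_day = np.array([
    modified_mle(positive[day_of_week == d].sum(), (day_of_week == d).sum())
    for d in range(7)
])
print(np.round(rates_by_day, 2))
```

Plotting `rates_by_day` against the day index would then show whether the rate varies across the week.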