Q12
Task
Given a Poisson-process simulation of a site's updates, see what happens if we visit the website only once an hour, and report what you find.
Solution
Plot the points. The graphs show that when the update rate is less than 1/2, the crawler's results closely match the simulated results; at higher rates, the two diverge.
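A minimal sketch of this setup, assuming the rate is measured in updates per hour so that the number of updates in each one-hour window is an independent Poisson draw. The function name `simulate_hourly_crawl`, the example rates, and the seed are hypothetical and not taken from the assignment.

```python
import numpy as np

def simulate_hourly_crawl(lam, n_hours, rng):
    """Return (true update counts per hour, 0/1 flags seen by an hourly crawler)."""
    # For a Poisson process with rate lam per hour, the number of updates
    # in each one-hour window is an independent Poisson(lam) draw.
    updates_per_hour = rng.poisson(lam, size=n_hours)
    # An hourly visit only reveals whether the page changed since the last
    # visit, not how many times it changed.
    observed_change = (updates_per_hour > 0).astype(int)
    return updates_per_hour, observed_change

rng = np.random.default_rng(0)
for lam in (0.1, 0.5, 2.0):  # illustrative rates only
    updates, seen = simulate_hourly_crawl(lam, n_hours=10_000, rng=rng)
    # Compare the true number of updates with the number the crawler noticed.
    print(lam, updates.sum(), seen.sum())
```

At small rates the two counts are close because an hour rarely contains more than one update; at larger rates the crawler misses the extra updates within an hour.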
Q13
Task
Simulate the visits for a particular rate \(\lambda\), then use MLE to estimate \(\lambda\) and see whether the two match.
Solution
Use the function from Q12 and the MLE formula \(\lambda^{\text{MLE}} = \frac{n}{N}\) derived earlier, where \(n\) is the number of visits on which an update was observed and \(N\) is the total number of visits.
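As a rough illustration, the sketch below computes \(\frac{n}{N}\) on simulated hourly visits for a few rates. The helper `naive_mle`, the chosen rates, and the sample size are assumptions made for this example only.

```python
import numpy as np

def naive_mle(observed_change):
    """n / N: the fraction of hourly visits on which a change was seen."""
    return observed_change.sum() / observed_change.size

rng = np.random.default_rng(1)
true_rates = np.array([0.05, 0.25, 1.0, 3.0])  # illustrative rates only
estimates = np.array([
    # A visit sees a change when the hour contained at least one update.
    naive_mle(rng.poisson(lam, size=20_000) > 0)
    for lam in true_rates
])
print(np.round(estimates, 3))
```

Since the estimate is a fraction of visits, it can never exceed 1, however large the true rate is.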
Q13
Task
Explain why the estimated rates seem to level off at 1
Solution
Because we check for updates only once an hour, we can observe at most one update per hour, so the estimate cannot be larger than 1.
Task
How far off is the estimate from the truth for \(\lambda\) less than 0.25?
Solution
Not very far off: at most \( \frac{0.4-0.32}{0.4} = 20\% \).
Q14
Task
What is the chance that a \( Poisson(\lambda) \) random variable is equal to 0? What is the chance that it's greater than or equal to 1?
Solution
Because \( P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} \), we simply set \(k = 0\) and get \( P(X = 0) = e^{-\lambda} \) and \( P(X \ge 1) = 1 - e^{-\lambda} \).
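A quick numerical check of these two probabilities, comparing the Poisson formula against simulated draws. The value of `lam` and the helper `poisson_pmf` are chosen only for illustration.

```python
import math
import numpy as np

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 0.7  # illustrative rate
p_zero = poisson_pmf(0, lam)   # equals exp(-lam)
p_at_least_one = 1 - p_zero    # equals 1 - exp(-lam)

# Empirical check: the fraction of simulated draws equal to 0.
draws = np.random.default_rng(2).poisson(lam, size=200_000)
print(p_zero, (draws == 0).mean())
print(p_at_least_one, (draws >= 1).mean())
```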
Q16
Task
We assume that the number of windows in which we observe an update follows a binomial distribution with success probability \( p = 1 - e^{-\lambda} \). Show that the MLE of \( \lambda \) is \(\lambda^* = \log \left(\frac{n}{N - n} + 1 \right) \)
Solution
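One possible derivation, sketched under the assumption that each of the \(N\) crawls independently observes an update with probability \(1 - e^{-\lambda}\) (the Q14 result), and that \(n\) of them do:

```latex
\begin{align*}
L(\lambda) &= \binom{N}{n} \left(1 - e^{-\lambda}\right)^{n} \left(e^{-\lambda}\right)^{N - n} \\
\log L(\lambda) &= \log \binom{N}{n} + n \log\left(1 - e^{-\lambda}\right) - (N - n)\lambda \\
\frac{d}{d\lambda} \log L(\lambda) &= \frac{n e^{-\lambda}}{1 - e^{-\lambda}} - (N - n) = \frac{n}{e^{\lambda} - 1} - (N - n) \\
0 = \frac{n}{e^{\lambda^*} - 1} - (N - n) &\implies e^{\lambda^*} = \frac{n}{N - n} + 1 \\
\lambda^* &= \log\left(\frac{n}{N - n} + 1\right)
\end{align*}
```

Equivalently, the binomial MLE \(\hat{p} = \frac{n}{N}\) can be plugged into \(p = 1 - e^{-\lambda}\) and solved for \(\lambda\), which gives the same expression.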
Q17
Task
Add the new estimate \( \lambda^* \) to the dataframe and plot a histogram.
Solution
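# Modified MLE: log(n / (N - n) + 1). The + 0.5 terms keep the ratio
# finite when every crawl saw an update (n == N).
crawl_stats["modified mle"] = np.log(
    (crawl_stats['number of updates'] + 0.5) / (crawl_stats['number of crawls'] - crawl_stats['number of updates'] + 0.5) + 1
)
# Histogram of the estimates; the trailing semicolon hides the notebook's text output.
crawl_stats["modified mle"].hist();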
Q18
Task
Show how accurate our estimates \( \lambda^* \) are.
Solution
First, for every page \( i \), we assume the true change rate \( r_i \) equals our estimate \( \lambda_i^* \) and use it to generate simulated changes and positive checks. We then re-estimate \( r_i \) from the simulated data and see whether the new estimate is approximately equal to \( \lambda_i^* \).
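The procedure can be sketched for a single page as follows. This is only an illustration: the function names, the example rate, the 720 crawls, and the 1000 replicates are assumptions, and the estimator reuses the smoothed formula from the Q17 code.

```python
import numpy as np

def modified_mle(n_updates, n_crawls):
    # Same smoothed formula as the Q17 code (0.5 added to both counts).
    return np.log((n_updates + 0.5) / (n_crawls - n_updates + 0.5) + 1)

def bootstrap_estimates_for_page(lam_hat, n_crawls, n_boot, rng):
    """Re-estimate the rate on data simulated as if lam_hat were the truth."""
    # Each crawl sees a change with probability 1 - exp(-lam_hat).
    p_change = 1 - np.exp(-lam_hat)
    # Number of positive checks in each simulated crawl history.
    simulated_updates = rng.binomial(n_crawls, p_change, size=n_boot)
    return modified_mle(simulated_updates, n_crawls)

rng = np.random.default_rng(3)
boot = bootstrap_estimates_for_page(lam_hat=0.3, n_crawls=720, n_boot=1000, rng=rng)
print(boot.mean(), boot.std())
```

If the re-estimated values cluster tightly around the rate that generated them, the estimator is behaving well for that page.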
Q19
Task
Finish the plotting part of Q18.
Solution
Q20
Task
Just looking at a graph is not enough to measure the quality of an estimate. One way to measure estimation quality quantitatively is the RMSE.
Solution
Same technique as above.
Use the following to calculate the RMSE:
np.mean((bootstrap_estimates - e)**2)**0.5
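Wrapped as a small function, the same expression might look like the sketch below. The wrapper name `rmse` and the sample numbers are hypothetical; `e` is kept as the reference value the bootstrap estimates are compared against.

```python
import numpy as np

def rmse(bootstrap_estimates, e):
    """Root mean squared error of the bootstrap estimates around e."""
    bootstrap_estimates = np.asarray(bootstrap_estimates, dtype=float)
    return np.mean((bootstrap_estimates - e) ** 2) ** 0.5

# Toy usage with made-up numbers.
print(rmse([0.28, 0.31, 0.33], 0.30))
```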
Q21
Task
Create a visualization to display the RMSEs you computed. **Then,** create another visualization to see the relationship (across the 1000 pages) between RMSE and the modified MLE.
Solution
A good way to display a series of values is a histogram, and a good way to display a relationship is a scatter plot.
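A possible sketch of both plots with matplotlib. The per-page estimates here are placeholder values drawn at random, and the 720 crawls per page and 200 bootstrap replicates are assumptions; with the real data, the arrays would come from the earlier questions instead.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(4)
n_pages, n_crawls, n_boot = 1000, 720, 200       # assumed sizes
lam_hat = rng.uniform(0.01, 1.5, size=n_pages)   # placeholder modified MLEs

# Bootstrap each page: simulate positive checks, re-estimate, compute RMSE.
p_change = 1 - np.exp(-lam_hat)
sim = rng.binomial(n_crawls, p_change[:, None], size=(n_pages, n_boot))
boot = np.log((sim + 0.5) / (n_crawls - sim + 0.5) + 1)
rmses = np.sqrt(((boot - lam_hat[:, None]) ** 2).mean(axis=1))

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))
ax_hist.hist(rmses, bins=30)                     # distribution of RMSEs
ax_hist.set_xlabel("RMSE")
ax_hist.set_ylabel("number of pages")
ax_scatter.scatter(lam_hat, rmses, s=5)          # RMSE against the estimate
ax_scatter.set_xlabel("modified MLE")
ax_scatter.set_ylabel("RMSE")
fig.tight_layout()
```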
Q22
Task
Design a test to get the answer of one of these questions:
1. How much did the rate of changes vary by hour of the day?
2. How much did the rate of changes vary by day of the week?
Solution
My solution: assume the rate of change for a single day is fixed. Then we can apply the technique above after partitioning the dataset into 7 groups, one per day of the week, to get a rough estimate for each day. Finally, we can plot the estimates to see the trend.
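A toy sketch of this partitioning idea on simulated data. Everything here is an assumption for illustration: the per-day rates, the 8 weeks of hourly crawls, the choice that hour 0 falls on day 0, and the variable names.

```python
import numpy as np

def modified_mle(n_updates, n_crawls):
    # Same smoothed formula as the Q17 code (0.5 added to both counts).
    return np.log((n_updates + 0.5) / (n_crawls - n_updates + 0.5) + 1)

rng = np.random.default_rng(5)
n_weeks = 8
# Assumed rates per day of week (index 0..6), used only to generate toy data.
true_rate_by_day = np.array([0.4, 0.4, 0.4, 0.4, 0.4, 0.1, 0.1])

hours = np.arange(n_weeks * 7 * 24)
day_of_week = (hours // 24) % 7  # day index of each hourly crawl
# A crawl sees a change when its hour contained at least one update.
observed_change = rng.poisson(true_rate_by_day[day_of_week]) > 0

# Partition the crawls by day of week and estimate a rate for each group.
estimates = np.array([
    modified_mle(observed_change[day_of_week == d].sum(), (day_of_week == d).sum())
    for d in range(7)
])
print(np.round(estimates, 2))
```

The same grouping works for hour of the day by replacing the day index with `hours % 24`.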
DS100 HW5 Episode II (12-22)
By Weiyüen Wu