Challenges, Concerns, &
A little bit about Machine Learning
Yu-Chang Ho (Andy)
Oct. 17 & 22, 2019
UC Davis
Let's go through the solutions I prepared for you.
Web Crawler
Content Parsing
Basic Cleaning
World Wide Web (WWW)
Web-scraper
| API | Web-scraping |
|---|---|
| Granted to use the data | Not granted to use the data |
| No need to make a parser | Need to design a parser for the webpage source |
| No need to clean the data | Need to make sure the data obtained is correct |
| Easy to use (from a programmer's perspective), but not always provided | Some companies can prevent the data from being scraped |
| Has rules to follow, so not a security issue for the website | Can be seen as a kind of attack on their service! |
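To make the contrast concrete, here is a minimal sketch of both approaches, assuming the `requests` and `beautifulsoup4` packages; the API endpoint, page URL, and CSS selector are hypothetical placeholders, not from the slides.

```python
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# API: the provider hands you structured data you are allowed to use.
resp = requests.get("https://api.example.com/v1/posts", params={"limit": 10})
posts = resp.json()  # already clean JSON, no parser needed

# Web-scraping: fetch raw HTML, then parse and verify it yourself.
html = requests.get("https://example.com/posts").text
soup = BeautifulSoup(html, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.select("h2.post-title")]
```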
It's a browser automation tool, mainly designed for testing.
Because of the features it offers, we can also use it to perform web-scraping.
The problem is the performance:
Time-consuming!!!
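The slides don't name the tool, but Selenium is a common example of this kind of browser-automation software. A minimal scraping sketch, assuming the `selenium` package with a Chrome driver available; the URL and selector are placeholders:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # launches a real browser -- this is why it's slow
try:
    driver.get("https://example.com/listings")
    # Content rendered by JavaScript is visible here,
    # which a plain HTTP client would never see.
    rows = driver.find_elements(By.CSS_SELECTOR, "div.listing")
    for row in rows:
        print(row.text)
finally:
    driver.quit()
```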
Database
Multi-threaded Crawler
Multi-threaded Parser
Basic Data Cleaning
Normalization
Visualization!
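A minimal sketch of how the crawl, parse, and clean steps above could be threaded, assuming `requests` and `beautifulsoup4`; the URLs and selector are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor
import requests
from bs4 import BeautifulSoup

URLS = [f"https://example.com/page/{i}" for i in range(1, 21)]

def fetch(url):
    # Crawler: download one page; threads overlap the network waits.
    return url, requests.get(url, timeout=10).text

def parse(html):
    # Parser: pull the fields we care about out of the raw HTML.
    soup = BeautifulSoup(html, "html.parser")
    return [t.get_text(strip=True) for t in soup.select("h2.title")]

with ThreadPoolExecutor(max_workers=8) as pool:
    for url, html in pool.map(fetch, URLS):
        titles = parse(html)
        # Basic cleaning / normalization: drop empties, lowercase.
        titles = [t.lower() for t in titles if t]
        print(url, titles)
```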
Web-scraper
Error Handling & Retry
De-dupe
Aggregation
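A minimal sketch of the error-handling/retry and de-dupe steps, assuming `requests`; the retry count, backoff factor, and URLs are illustrative, not from the slides:

```python
import time
import requests

def fetch_with_retry(url, retries=3, backoff=2.0):
    # Error handling & retry: try again with exponential backoff.
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(backoff ** attempt)

seen = set()
records = []
for url in ["https://example.com/a", "https://example.com/a"]:
    html = fetch_with_retry(url)
    if html not in seen:  # de-dupe: keep one copy of identical pages
        seen.add(html)
        records.append(html)
print(len(records))  # 1 -- the duplicate was dropped before aggregation
```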
Male
Female
Unknown
Provide a categorized dataset beforehand as a "training dataset" for model training.
After the model is trained, feed it the data you would like labeled and retrieve the results from the trained model.
Try to increase the accuracy of the identification.
No "training dataset" at all: give the machine examples and let it figure out the "pattern" or "rule" by itself.
$ python retrain.py --image_dir=../training_imgs
# Wait until the process completes.

$ python label_image.py \
    --graph=/tmp/output_graph.pb --labels=/tmp/output_labels.txt \
    --input_layer=Placeholder \
    --output_layer=final_result \
    --image ./*.jpg
# Wait until the process completes.

https://github.com/hippoandy/UCDavis_CMN189_F2019_Seminar_Webscraping
My contact e-mail: ycaho@ucdavis.edu