Hui Hu Ph.D.
Department of Epidemiology
College of Public Health and Health Professions & College of Medicine
March 16, 2020
Web Scraping
Internet
HTML, CSS, JavaScript, ...
PHP, MySQL, ...
HTTP, Request, Response, GET, POST, ...
Transmission Control Protocol (TCP):
A stream socket is a type of interprocess communications socket or network socket which provides a connection-oriented, sequenced, and unique flow of data without record boundaries, with well-defined mechanisms for creating and destroying connections and for detecting errors.
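As a rough illustration of "sequenced ... without record boundaries", the sketch below sends two separate messages through a connected pair of stream sockets and reads them back as one continuous run of bytes. It uses `socket.socketpair()` from the Python standard library so that it runs without a network; the function name `stream_demo` and the message text are made up for this example.

```python
import socket

def stream_demo():
    """Send two messages over a stream socket and read back one byte stream."""
    # A connected pair of stream sockets living in the same process.
    sender, receiver = socket.socketpair()

    # Two separate sends: the stream preserves their order,
    # but it does not remember where one message ended and the next began.
    sender.sendall(b"hello ")
    sender.sendall(b"world")
    sender.shutdown(socket.SHUT_WR)  # tell the other side we are done sending

    received = b""
    while True:
        chunk = receiver.recv(1024)
        if not chunk:  # empty bytes means the sender has finished
            break
        received += chunk

    sender.close()
    receiver.close()
    return received
```

Calling `stream_demo()` returns the bytes of both messages joined together, which is why protocols built on stream sockets have to define their own way of marking where a message ends.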
Internet
Socket: www.ufl.edu (128.227.9.48)
Port 23: Login
Port 80: Web Server
Port 25
Sometimes the port number is shown in the URL (when the server is running on a non-standard port)
http://www.ufl.edu/about
protocol: http
host: www.ufl.edu
document: /about
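The same breakdown can be done in code. The sketch below uses `urllib.parse.urlsplit` from the Python standard library to pull a URL apart into protocol, host, port, and document. The helper name `describe_url`, the `DEFAULT_PORTS` table, and the port 8080 in the usage note are illustrative assumptions, not part of the slides.

```python
from urllib.parse import urlsplit

# Standard ports used when a URL does not show one (illustrative table).
DEFAULT_PORTS = {"http": 80, "https": 443}

def describe_url(url):
    """Split a URL into the parts labeled on the slide, plus the port."""
    parts = urlsplit(url)
    # parts.port is None when the URL does not spell out a port.
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "port": port,
        "document": parts.path,
    }
```

For `http://www.ufl.edu/about` this gives protocol `http`, host `www.ufl.edu`, port 80, and document `/about`. A URL such as `http://www.ufl.edu:8080/about` (a hypothetical non-standard port) would report port 8080 instead.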
GET http://www.ufl.edu/index.html
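To make the request concrete, here is a minimal sketch of the bytes a client might send for this GET. It assumes HTTP/1.1, where the request line usually carries only the document path and the host goes in a `Host` header. The function name `build_get_request` and the `Connection: close` header are choices made for this example.

```python
from urllib.parse import urlsplit

def build_get_request(url):
    """Build the raw bytes of a simple HTTP/1.1 GET request for a URL."""
    parts = urlsplit(url)
    path = parts.path or "/"  # an empty path means the root document
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {parts.netloc}",
        "Connection: close",  # ask the server to close the socket when done
        "",                   # blank line marks the end of the headers
        "",
    ]
    # HTTP separates lines with carriage return + line feed.
    return "\r\n".join(lines).encode("ascii")
```

These bytes could then be written to a stream socket connected to the host on port 80, and the response read back from the same socket.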
Parsing HTML
Get: the client requests a page from the server, and the server responds with HTML
import urllib.request

# Open the page and print it line by line (each line arrives as bytes)
fhand = urllib.request.urlopen('http://www.ufl.edu/index.html')
for line in fhand:
    print(line.decode().strip())
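Once the HTML has been retrieved, it still has to be parsed. One possible approach, sketched below, uses `html.parser.HTMLParser` from the Python standard library to collect the `href` of every link. The class name `LinkCollector`, the helper `extract_links`, and the `SAMPLE_HTML` text are invented for this example, and the sample is a literal string so the code runs without a network connection.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html_text):
    """Return the link targets found in a string of HTML."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links

# Made-up page used only to exercise the parser.
SAMPLE_HTML = """<html><body>
<h1>Example</h1>
<a href="/about">About</a>
<a href="http://www.ufl.edu/index.html">Home</a>
</body></html>"""
```

`extract_links(SAMPLE_HTML)` returns the two link targets in the order they appear. The same function could be given the decoded text of a page fetched with `urllib`.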
Python Dictionary → Serialize → Wire Protocol → De-Serialize → Java HashMap
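The Python side of this round trip can be sketched with the standard `json` module. JSON is assumed here as the wire format; the slide itself does not name one. The `record` dictionary and the helper names `serialize` and `deserialize` are made up for illustration.

```python
import json

# An in-memory Python dictionary (made-up example data).
record = {"name": "example", "ports": [23, 80, 25], "active": True}

def serialize(obj):
    """Turn a Python object into text that can travel over the wire."""
    return json.dumps(obj, sort_keys=True)

def deserialize(text):
    """Turn wire-format text back into Python objects."""
    return json.loads(text)

wire_text = serialize(record)
restored = deserialize(wire_text)
```

`wire_text` is a plain string that a program in another language could parse into its own structure (for example a Java HashMap), while `restored` is an equal copy of the original dictionary.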
Increase the limit here: e.g. 1000