Hui Hu Ph.D.
Department of Epidemiology
College of Public Health and Health Professions & College of Medicine
February 25, 2019
Web Scraping
Internet
HTML, CSS, JavaScript, ...
PHP, MySQL, ...
HTTP, Request, Response, GET, POST, ...
Transport Control Protocol (TCP):
A stream socket is a type of interprocess communications socket or network socket which provides a connection-oriented, sequenced, and unique flow of data without record boundaries, with well-defined mechanisms for creating and destroying connections and for detecting errors.
Internet
Socket
www.ufl.edu
128.227.9.48
23
80
25
Login
Web Server
Sometimes, the port number is shown (if the server is running on a non-standard port)
http://www.ufl.edu/about
protocol
host
document
GET http://www.ufl.edu/index.html
Python has built-in support for TCP sockets through the socket library
import socket
mysocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysocket.connet( ('www.ufl.edu', 80) )
mysocket.send('GET http://www.ufl.edu/index.html \n\n')
while True:
data = mysocket.recv(512)
if ( len(data) < 1 ):
break
print data
mysocket.close()
We can also use the urllib library, which turns urls to files
import urllib
fhand = urllib.urlopen('http://www.ufl.edu/index.html')
for line in fhand:
print line.strip()
Parsing HTML
Server
Get
HTML
import urllib
fhand = urllib.urlopen('http://www.ufl.edu/index.html')
for line in fhand:
print line.strip()
Get
HTML
import urllib
from bs4 import *
html = urllib.urlopen('http://www.ufl.edu/index.html').read()
soup = BeautifulSoup(html,"html5lib")
tags = soup('a')
for tag in tags:
print tag.get('href',None)
Python Dictionary
Java HashMap
Wire Protocol
Serialize
De-Serialize
Increase the limit here: e.g. 1000