PHC7065 CRITICAL SKILLS IN DATA MANIPULATION FOR POPULATION SCIENCE
Access Web Data
Hui Hu Ph.D.
Department of Epidemiology
College of Public Health and Health Professions & College of Medicine
March 16, 2020
Lab: Access Web Data
PHP, MySQL, ...
HTTP, Request, Response, GET, POST, ...
Transmission Control Protocol (TCP)
- Built on top of IP (Internet Protocol)
- Assumes IP may lose some data in transit, so it keeps a copy of what was sent and retransmits anything that goes missing
- Provides a reliable pipe
Stream Sockets / TCP Connections
A stream socket is a type of interprocess communications socket or network socket which provides a connection-oriented, sequenced, and unique flow of data without record boundaries, with well-defined mechanisms for creating and destroying connections and for detecting errors.
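As a rough illustration of "sequenced flow of data without record boundaries", the sketch below uses Python's `socket.socketpair()` to obtain two already-connected stream sockets inside one process. That is an assumption made so the example needs no network; a real TCP connection behaves the same way from the reader's point of view. The message contents are made up.

```python
import socket

# socketpair() returns two connected stream sockets in the same process.
left, right = socket.socketpair()

# Two separate writes on one end...
left.sendall(b"first message;")
left.sendall(b"second message;")
left.close()  # closing tells the reader that the stream has ended

# ...are read on the other end as one ordered run of bytes.
chunks = []
while True:
    data = right.recv(1024)
    if not data:  # empty bytes means the peer closed the connection
        break
    chunks.append(data)
right.close()

received = b"".join(chunks)
print(received)
```

The reader cannot tell where one `sendall` ended and the next began, which is why application protocols built on stream sockets define their own delimiters, such as the blank line that ends an HTTP header.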
TCP Port (continued)
Sometimes the port number is shown in the URL (if the server is running on a non-standard port)
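A small sketch of how a client might work out which port to connect to, using the standard library's `urllib.parse.urlsplit`. The helper name `effective_port` and the `:8080` URL are hypothetical and exist only for illustration.

```python
from urllib.parse import urlsplit

# Well-known default ports, used when the URL does not name one.
DEFAULT_PORTS = {"http": 80, "https": 443}

def effective_port(url):
    """Return the port a client would connect to for this URL."""
    parts = urlsplit(url)
    # parts.port is None when the URL has no explicit ":port" part
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS[parts.scheme]

print(effective_port("http://www.ufl.edu/index.html"))       # 80
print(effective_port("http://www.ufl.edu:8080/index.html"))  # 8080
```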
- What can we do with the TCP socket?
- Application protocols:
- a set of rules that all parties follow so that we can predict each other's behavior and not bump into each other
- World Wide Web
HTTP (Hypertext Transfer Protocol)
- The set of rules that allows browsers to retrieve web documents from servers over the internet
- The dominant application protocol on the internet
- Invented for the web to retrieve HTML, images, documents, ...
- Extended to carry data in addition to documents, e.g. RSS, web services, ...
- Basic flow:
- make a connection
- request a document
- retrieve the document
- close the connection
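The four steps of the basic flow can be sketched with the standard library's `http.client`. So that the example runs without internet access, it starts a throwaway server on the local machine first. The handler `OnePageHandler`, the page content, and the function name `run_basic_flow` are all assumptions made for the demo, not part of any real site.

```python
import http.client
import http.server
import threading

PAGE = b"<html><body><h1>Hello</h1></body></html>"

class OnePageHandler(http.server.BaseHTTPRequestHandler):
    """Stand-in server: answers every GET with the same small page."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet

def run_basic_flow():
    # Port 0 lets the operating system pick any free port.
    server = http.server.HTTPServer(("127.0.0.1", 0), OnePageHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    try:
        # 1. make a connection (opened when the first request is sent)
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        # 2. request a document
        conn.request("GET", "/index.html")
        # 3. retrieve the document
        response = conn.getresponse()
        status = response.status
        document = response.read()
        # 4. close the connection
        conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return status, document

status, document = run_basic_flow()
print(status, document.decode())
```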
Retrieving Data from the Server
- Each time the user opens a new page, the browser makes a connection to the server and issues a "GET" request to retrieve the content of the page at the specified URL
- The server returns the HTML document to the browser, which formats and displays the HTML document to the user
Making an HTTP Request
- Connect to the server, e.g. www.ufl.edu
- a "hand shake"
- Request a document
- GET http://www.ufl.edu/index.html
- Port 80 is the non-encrypted HTTP port
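A minimal sketch of the same steps at the socket level. One difference from the bullet above is worth naming: the request line here uses the path (`/index.html`) plus a `Host` header, which is the usual HTTP/1.1 form, rather than the full URL. The function names `build_get_request` and `fetch` are hypothetical helpers. Only the request is built and printed when the block runs; `fetch` needs network access and is left for the reader to call.

```python
import socket

def build_get_request(host, path="/"):
    """Build the bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",  # ask the server to close when it is done
        "",                   # blank line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")

def fetch(host, path="/", port=80, timeout=10):
    """Connect to host:port, send a GET, and return the raw response bytes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_get_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)

print(build_get_request("www.ufl.edu", "/index.html").decode("ascii"))
```

Calling `fetch("www.ufl.edu", "/index.html")` returns the status line, headers, and body as raw bytes. Many sites now answer plain port-80 requests with a redirect to HTTPS, so the body may be short.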
Web Scraping
- A program or script pretends to be a browser, retrieves web pages, and extracts information from them
- Search engines scrape web pages - "web crawling"
import urllib.request

# Python 3: open the URL like a file and print each line of the page
fhand = urllib.request.urlopen('http://www.ufl.edu/index.html')
for line in fhand:
    print(line.decode().strip())
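Retrieving a page is only half of scraping; the other half is extracting information from the HTML. The sketch below pulls link targets out of a page with the standard library's `html.parser`. The class name `LinkCollector` and the sample HTML are made up so that the example runs without a network connection.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)

# Hand-written sample page standing in for a downloaded document.
SAMPLE_HTML = """
<html><body>
<h1>News</h1>
<a href="/story1.html">First story</a>
<a href="http://www.ufl.edu/">UF home</a>
<a name="top">Anchor without a link</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
print(collector.links)  # ['/story1.html', 'http://www.ufl.edu/']
```

A crawler would repeat this loop: fetch a page, collect its links, then fetch the pages those links point to.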
Why Web Scraping?
- Get data:
- e.g. social network data
- Get your own data from some system that has no export capability
- Monitor a site for new information
- Crawl the web to make a search engine
- HTML is meant to be displayed to people; it is not really intended for consumption by an application that only wants the data
- We need an agreed way to represent data going between applications and across networks
- Wire Protocol
- Two common wire formats: XML and JSON
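To make the two wire formats concrete, the sketch below serializes one small record both ways with the standard library and reads each back. The record and the element name `person` are made up for illustration.

```python
import json
import xml.etree.ElementTree as ET

record = {"name": "Alice", "age": 30}  # made-up example record

# JSON: the dictionary maps directly onto a JSON object.
json_text = json.dumps(record)
print(json_text)  # {"name": "Alice", "age": 30}

# XML: one child element per field under a <person> root.
person = ET.Element("person")
for key, value in record.items():
    ET.SubElement(person, key).text = str(value)
xml_text = ET.tostring(person, encoding="unicode")
print(xml_text)  # <person><name>Alice</name><age>30</age></person>

# Reading both back on the receiving side.
from_json = json.loads(json_text)
from_xml = {child.tag: child.text for child in ET.fromstring(xml_text)}
print(from_json, from_xml)
```

JSON keeps the number as a number, while the XML text comes back as the string `"30"`, so the receiver of XML has to know which fields to convert.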
- Most web applications use services
- They call services offered by other applications, e.g. charging a credit card
- Services publish the "rules" which must be followed by applications to make use of the service
Application Programming Interface
An application programming interface (API) is a set of subroutine definitions, protocols, and tools for building application software
- In general terms, it's a set of clearly defined methods of communication between various software components
- Common web service technologies:
- SOAP - Simple Object Access Protocol
- REST - Representational State Transfer
Google Geocoding API
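A hedged sketch of how a script might talk to the Geocoding API: build the request URL, then pull the coordinates out of the JSON reply. The endpoint and the `address` and `key` parameters follow Google's public documentation. The helper names, the `YOUR_API_KEY` placeholder, and the sample reply are assumptions; the reply is hand-written with rounded, illustrative coordinates and is not real API output.

```python
import json
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"

def build_geocode_url(address, api_key):
    """Return the request URL for one address lookup."""
    query = urlencode({"address": address, "key": api_key})
    return GEOCODE_ENDPOINT + "?" + query

def first_location(response_text):
    """Return (lat, lng) of the first result, or None if there is none."""
    data = json.loads(response_text)
    if data.get("status") != "OK" or not data.get("results"):
        return None
    location = data["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]

# Hand-written sample reply (illustrative values only).
SAMPLE_RESPONSE = """
{
  "results": [
    {
      "formatted_address": "Gainesville, FL, USA",
      "geometry": {"location": {"lat": 29.65, "lng": -82.32}}
    }
  ],
  "status": "OK"
}
"""

print(build_geocode_url("Gainesville, FL", "YOUR_API_KEY"))
print(first_location(SAMPLE_RESPONSE))  # (29.65, -82.32)
```

In a real script the URL would be passed to `urllib.request.urlopen` and the decoded body handed to `first_location`.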
Security and Rate Limiting
- The data provided by these APIs is usually valuable
- The data providers might
- limit the number of requests per day,
- or demand an API "key",
- or charge for usage
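On the client side, a script can avoid exceeding a daily request limit by counting its own calls. The sketch below is one simple way to do that. The class `DailyQuota` and the limit of 3 are hypothetical, and real providers enforce their own limits on the server regardless of what the client counts.

```python
import datetime

class DailyQuota:
    """Allow at most `limit` requests per calendar day."""

    def __init__(self, limit, today=datetime.date.today):
        self.limit = limit
        self._today = today  # injectable clock, which makes testing easier
        self._day = today()
        self._used = 0

    def try_acquire(self):
        """Return True and count the request if quota remains, else False."""
        now = self._today()
        if now != self._day:  # a new day started: reset the counter
            self._day = now
            self._used = 0
        if self._used >= self.limit:
            return False
        self._used += 1
        return True

quota = DailyQuota(limit=3)
for attempt in range(5):
    if quota.try_acquire():
        print("request", attempt, "allowed")  # the API call would go here
    else:
        print("request", attempt, "skipped: daily limit reached")
```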
Increase the limit here: e.g. 1000
Lab: Access Web Data
By Hui Hu