PHC6194 SPATIAL EPIDEMIOLOGY
Introduction to Spatial Database, Relational Database, and SQL
Hui Hu Ph.D.
Department of Epidemiology
College of Public Health and Health Professions & College of Medicine
January 15, 2020
CARTO Account
CARTO for students accounts
R Markdown and Notebook
R Markdown
-
R Markdown provides an authoring framework for data science.
-
You can use a single R Markdown file to both
save and execute code, and generate high quality reports that can be shared with an audience.
R Notebook
- An R Notebook is an R Markdown document with chunks that can be executed independently and interactively, with output visible immediately beneath the input.
Create a Notebook
- File->New File-> R Notebook
- By default, RStudio enables inline output (Notebook mode) on all R Markdown documents, so you can interact with any R Markdown document as though it were a notebook.
Introduction to Spatial Data
Introduction to Relational Database
SQL
In-class Exercises
Introduction to Spatial Data
Where is something?
Finding directions
Going beyond getting directions
- Mapping can be a great resource for analyzing patterns in data
- Example: by plotting the addresses of pizza lovers, a national pizza chain can assess where to locate the next grand opening
- What if we want to differentiate pizza lovers by income level?
- could use pushpins of different symbols to indicate income
Spatial databases come to the rescure
- A spatial database has column data types specifically designed to store objects in space:
- these variables can be added to database tables
- The information stored is usually geographic in nature:
- a point location
- the boundary of a lake
- A spatial database also provides specific functions and indexes for querying and manipulating the data
- usually through a query language such as SQL
- Often used as just a storage container for spatial data, but can also do much more than that
Spatial Databases
- Although not necessarily, most spatial databases are relational in nature
- Storage, analysis, and organize
- Example:
- generate a list of neighborhoods ranked by the number of high income pizza lovers who have more than two children
- incorporate additional data from varied sources (e.g. location and rating of existing pizzerias)
- More examples:
- get the number of houses within two miles of the coastline requiring evacuation in the event of a hurricane
- how many households would be affected by the noise of a newly proposed runway
PostgreSQL and PostGIS
- PostGIS is a free and open source library that spatially enables the free and open source PostgreSQL object-relational database management system (ORDBMS)
- The foundation of PostGIS is the PostgreSQL
- Both PostGIS and PostgreSQL conform to industry standards more closely than most products
- you are not siimply learning how to use a set of products; you are learning about industry standards that will help you understand other commercial and open source geospatial databases and mapping tools
Introduction to Relational Database
Relational Databases
- Relational databases model data by storing rows and columns in tables
- The power of the relational database lies in its ability to efficiently retrieve data from those tables
- in particular where there are multiple tables and relationships between those tables involved in the query
Terminology
- Database: contains many tables
- Table (or relation): contains rows and columns
- Row (or tuple): a set of fields that generally represents an object like a person or a music track
- Column(or attribute, field): one of possibly many elements of data corresponding to the object represented by the row
Column
Row
Table
- A table is defined as a set of rows that have the same columns
- A row usually represents an object and information about the object
- Objects are typically physical objects or concepts
- All the data referenced by a column are in the same domain and conform to the same constraints
SQL
- Structured Query Language: the language used to issue commands to the database
-
Create a table
-
Retrieve some data
-
Update data
- Delete data
Common Database Systems
- Oracle: large, commercial, enterprise-scale
- MySQL: simpler but very fast and scalable, commercial open source
- SqlServer: Microsoft
- Many other open source projects: SQLite, PostgreSQL
SQL
Part 1
SELECT and FROM
- Two required ingredients in any SQL query:
- SELECT
- FROM
- They have to be in that order
- SELECT indicates which columns you'd like to view
- FROM identifies the table that they live in
SELECT and FROM
LIMIT
- Why limit results?
- simply be to see what a particular table looks like
- Use the LIMIT command
WHERE
- WHERE: limit the results in all columns to rows that satisfy the condition
- Use the WHERE command
Comparison Operators
Arithmetic in SQL
Logical Operators
- LIKE: allows you to match similar values, instead of exact values
- IN: allows you to specify a list of values you'd like to include
- BETWEEN: allows you to select only rows within a certain range
- IS NULL: allows you to slect rows that contain no data ina given column
- AND: allows you to select only rows that satisfy two conditions
- OR: allows you to select rows that satisfy either of two conditions
- NOT: allows you to select rows that do not match a certain condition
LIKE
- _: used to substitute an individual character
- %: used to represent any character or set of characters
- ILIKE: ignore case when you're matching values
IN
- Allows you to specify a list of values that you'd like to include in the results
BETWEEN
- Allows you to select only rows that are within a specific range
- Has to be paired with the AND operator
IS NULL
- Allows you to exclude rows with missing data from your results
AND
- Allows you to select only rows that satisfy two conditions
OR
- Allows you to select rows that satisfy either of two conditions
NOT
- Can be put before any conditional statement to select rows for which that statement is false
ORDER BY
- Allows you to reorder your results based on the data in one or more columns
Lab: PostgreSQL
Part 1
git pull
SQL
Part 2
Database Design
- The goal is to avoid mistakes and design clean and easily understood databases
- Usually starts with a picture
Building a Data Model
- Draw a picture of the data objects, and then figure out how to represent the objects and their relationships
- Basic rule: don't put the same string data in twice - use a relationship instead
- Model the real world: if there is one thing in the real world, there should be one copy of that thing in the database
- For each piece of information:
- is the column an object or an attribute of another object
- we need to define relationships between objects once we define objects
Subject |
---|
id |
name |
gender |
age |
race/ethnicity |
county_id |
state_id |
County |
---|
id |
name |
median household income |
state_id |
State |
---|
id |
name |
policy |
policy start date |
Primary Key
Logical Key
Foreign Key
Power of Relational Database
- The relational database can be read through very quickly, even for very large amounts of data
- removed all replicated data and replaced them with references to a single copy of each bit of data
- When you want to query some data, the data will be read from a number of tables linked by these foreign keys
JOIN
- JOIN links across several tables as part of a select operation
- You must tell SQL how to use the keys that make the connection between the tables using an ON clause
SELECT subject.name, state.name as state_name, state.policy
FROM subject
JOIN state ON subject.state_id=state.id
;
A More Complex Example
SELECT subject.name, state.name as state_name, county.income
FROM subject
JOIN state ON subject.state_id=state.id
JOIN county ON subject.state_id=county.state_id
AND subject.county_id=county.id
;
Aggregate Functions
- COUNT
- SUM
- MIN
- MAX
- AVG
SELECT COUNT(id) AS n
FROM subject;
SELECT COUNT(id) AS n, state_id
FROM subject
GROUP BY state_id;
DISTINCT
SELECT COUNT(DISTINCT state_id) AS nstate
FROM subject
;
Having
SELECT MAX(income) AS maxIncome, state_id
FROM county
GROUP BY state_id
HAVING MAX(income) > 90000;
What's the difference between WHERE and HAVING?
Query Clause Order
- SELECT
- FROM
- WHERE
- GROUP BY
- HAVING
- ORDER BY
CASE
- Handles if/then logic in SQL
SELECT name,
gender,
CASE WHEN age BETWEEN 1 AND 4 THEN 1
WHEN age BETWEEN 5 AND 8 THEN 2
WHEN age > 8 THEN 3
ELSE NULL
END AS recodeage
FROM subject
;
Lab: PostgreSQL
Part 2
In-class Exercises
PHC6194-Spring2020-Lecture2
By Hui Hu
PHC6194-Spring2020-Lecture2
Slides for Lecture 2, Spring 2020, PHC6194 Spatial Epidemiology
- 866