PHC6194 SPATIAL EPIDEMIOLOGY

Introduction to Spatial Database, Relational Database, and SQL

Hui Hu Ph.D.

Department of Epidemiology

College of Public Health and Health Professions & College of Medicine

January 15, 2020

CARTO Account

CARTO for students accounts

R Markdown and Notebook

R Markdown

  • R Markdown provides an authoring framework for data science.
     

  • You can use a single R Markdown file to both

    save and execute code, and generate high quality reports that can be shared with an audience.

R Notebook

  • An R Notebook is an R Markdown document with chunks that can be executed independently and interactively, with output visible immediately beneath the input.

Create a Notebook

  • File->New File-> R Notebook
     
  • By default, RStudio enables inline output (Notebook mode) on all R Markdown documents, so you can interact with any R Markdown document as though it were a notebook.

Introduction to Spatial Data

Introduction to Relational Database

SQL


In-class Exercises

Introduction to Spatial Data

Where is something?

Finding directions

Going beyond getting directions

  • Mapping can be a great resource for analyzing patterns in data
     
  • Example: by plotting the addresses of pizza lovers, a national pizza chain can assess where to locate the next grand opening
     
  • What if we want to differentiate pizza lovers by income level?
    - could use pushpins of different symbols to indicate income

Spatial databases come to the rescure

  • A spatial database has column data types specifically designed to store objects in space:
    -  these variables can be added to database tables
     
  • The information stored is usually geographic in nature:
    -  a point location
    -  the boundary of a lake
     
  • A spatial database also provides specific functions and indexes for querying and manipulating the data
    -  usually through a query language such as SQL
     
  • Often used as just a storage container for spatial data, but can also do much more than that

Spatial Databases

  • Although not necessarily, most spatial databases are relational in nature
     
  • Storage, analysis, and organize
     
  • Example:
    -  generate a list of neighborhoods ranked by the number of high income pizza lovers who have more than two children
    -  incorporate additional data from varied sources (e.g. location and rating of existing pizzerias)
     
  • More examples:
    -  get the number of houses within two miles of the coastline requiring evacuation in the event of a hurricane
    -  how many households would be affected by the noise of a newly proposed runway

PostgreSQL and PostGIS

  • PostGIS is a free and open source library that spatially enables the free and open source PostgreSQL object-relational database management system (ORDBMS)
     
  • The foundation of PostGIS is the PostgreSQL
     
  • Both PostGIS and PostgreSQL conform to industry standards more closely than most products
    -  you are not siimply learning how to use a set of products; you are learning about industry standards that will help you understand other commercial and open source geospatial databases and mapping tools

Introduction to Relational Database

Relational Databases

  • Relational databases model data by storing rows and columns in tables
     
  • The power of the relational database lies in its ability to efficiently retrieve data from those tables
    -  in particular where there are multiple tables and relationships between those tables involved in the query

Terminology

  • Database: contains many tables
     
  • Table (or relation): contains rows and columns
     
  • Row (or tuple): a set of fields that generally represents an object like a person or a music track
     
  • Column(or attribute, field): one of possibly many elements of data corresponding to the object represented by the row

Column

Row

Table

  • A table is defined as a set of rows that have the same columns
  • A row usually represents an object and information about the object
  • Objects are typically physical objects or concepts
  • All the data referenced by a column are in the same domain and conform to the same constraints

SQL

  • Structured Query Language: the language used to issue commands to the database
     
  • Create a table
     
  • Retrieve some data
     
  • Update data
     
  • Delete data

Common Database Systems

  • Oracle: large, commercial, enterprise-scale
     
  • MySQL: simpler but very fast and scalable, commercial open source
     
  • SqlServer: Microsoft
     
  • Many other open source projects: SQLite, PostgreSQL

SQL

Part 1

SELECT and FROM

  • Two required ingredients in any SQL query:
    -  SELECT
    -  FROM
     
  • They have to be in that order
    -  SELECT indicates which columns you'd like to view
    -  FROM identifies the table that they live in

SELECT and FROM

LIMIT

  • Why limit results?
    -  simply be to see what a particular table looks like
     
  • Use the LIMIT command

WHERE

  • WHERE: limit the results in all columns to rows that satisfy the condition
     
  • Use the WHERE command

Comparison Operators

Arithmetic in SQL

Logical Operators

  • LIKE: allows you to match similar values, instead of exact values
  • IN: allows you to specify a list of values you'd like to include
  • BETWEEN: allows you to select only rows within a certain range
  • IS NULL: allows you to slect rows that contain no data ina given column
  • AND: allows you to select only rows that satisfy two conditions
  • OR: allows you to select rows that satisfy either of two conditions
  • NOT: allows you to select rows that do not match a certain condition

LIKE

  • _: used to substitute an individual character
     
  • %: used to represent any character or set of characters
     
  • ILIKE: ignore case when you're matching values

IN

  • Allows you to specify a list of values that you'd like to include in the results

BETWEEN

  • Allows you to select only rows that are within a specific range
  • Has to be paired with the AND operator

IS NULL

  • Allows you to exclude rows with missing data from your results

AND

  • Allows you to select only rows that satisfy two conditions

OR

  • Allows you to select rows that satisfy either of two conditions

NOT

  • Can be put before any conditional statement to select rows for which that statement is false

ORDER BY

  • Allows you to reorder your results based on the data in one or more columns

Lab: PostgreSQL
Part 1

git pull

SQL

Part 2

Database Design

  • The goal is to avoid mistakes and design clean and easily understood databases
     
  • Usually starts with a picture

Building a Data Model

  • Draw a picture of the data objects, and then figure out how to represent the objects and their relationships
     
  • Basic rule: don't put the same string data in twice - use a relationship instead
     
  • Model the real world: if there is one thing in the real world, there should be one copy of that thing in the database
     
  • For each piece of information:
    -  is the column an object or an attribute of another object
    -  we need to define relationships between objects once we define objects
Subject
id
name
gender
age
race/ethnicity
county_id
state_id
County
id
name
median household income
state_id
State
id
name
policy
policy start date

Primary Key

Logical Key

Foreign Key

Power of Relational Database

  • The relational database can be read through very quickly, even for very large amounts of data
    -  removed all replicated data and replaced them with references to a single copy of each bit of data
     
  • When you want to query some data, the data will be read from a number of tables linked by these foreign keys

JOIN

  • JOIN links across several tables as part of a select operation
     
  • You must tell SQL how to use the keys that make the connection between the tables using an ON clause
SELECT subject.name, state.name as state_name, state.policy
FROM subject
JOIN state ON subject.state_id=state.id
;

A More Complex Example

SELECT subject.name, state.name as state_name, county.income
FROM subject
JOIN state ON subject.state_id=state.id
JOIN county ON subject.state_id=county.state_id
           AND subject.county_id=county.id
;

Aggregate Functions

  • COUNT
  • SUM
  • MIN
  • MAX
  • AVG
SELECT COUNT(id) AS n 
FROM subject;
SELECT COUNT(id) AS n, state_id
FROM subject
GROUP BY state_id;

DISTINCT

SELECT COUNT(DISTINCT state_id) AS nstate
FROM subject
;

Having

SELECT MAX(income) AS maxIncome, state_id
FROM county
GROUP BY state_id
HAVING MAX(income) > 90000;

What's the difference between WHERE and HAVING?

Query Clause Order

  1. SELECT
  2. FROM
  3. WHERE
  4. GROUP BY
  5. HAVING
  6. ORDER BY

CASE

  • Handles if/then logic in SQL
SELECT name,
       gender,
       CASE WHEN age BETWEEN 1 AND 4 THEN 1
       WHEN age BETWEEN 5 AND 8 THEN 2
       WHEN age > 8 THEN 3
       ELSE NULL
       END AS recodeage
FROM subject
;

Lab: PostgreSQL

Part 2

In-class Exercises