PHC7065 CRITICAL SKILLS IN DATA MANIPULATION FOR POPULATION SCIENCE

Data Models and Relational Database

Hui Hu Ph.D.

Department of Epidemiology

College of Public Health and Health Professions & College of Medicine

February 4, 2019

Relational SQL

 

Lab: SQL Part 2

Relational SQL

Database Design

  • The goal is to avoid mistakes and design clean and easily understood databases
     
  • Usually starts with a picture

Building a Data Model

  • Draw a picture of the data objects, and then figure out how to represent the objects and their relationships
     
  • Basic rule: don't put the same string (character) data in twice - use a relationship instead
     
  • Model the real world: if there is one thing in the real world, there should be one copy of that thing in the database
Subject
id
name
gender
age
race/ethnicity
county_id
state_id
County
id
name
median household income
state_id
State
id
name
policy
policy start date

Primary Key

Logical Key

Foreign Key

Power of Relational Database

  • The relational database can be read through very quickly, even for very large amounts of data
    -  all replicated data are removed and replaced with references to a single copy of each bit of data
     
  • When you want to query some data, the data will be read from a number of tables linked by the foreign keys

JOIN

  • JOIN links across several tables as part of a select operation
     
  • You must tell SQL how to use the keys that make the connection between the tables using an ON clause
SELECT Subject.name, State.name as State_name, State.policy
FROM Subject
JOIN State ON Subject.state_id=State.id
;

A More Complex Example

SELECT Subject.name, State.name as State_name, County.income
FROM Subject
JOIN State 
JOIN County
ON Subject.state_id=State.id
    AND Subject.state_id=County.state_id
    AND Subject.county_id=County.id
;

Aggregate Functions

  • COUNT
  • SUM
  • MIN
  • MAX
  • AVG
SELECT COUNT(id) AS n 
FROM Subject;
SELECT COUNT(id) AS n, state_id
FROM Subject
GROUP BY state_id;

DISTINCT

SELECT COUNT(DISTINCT state_id) AS nState
FROM Subject
;

Having

SELECT MAX(income) AS maxIncome, state_id
FROM County
GROUP BY state_id
HAVING MAX(income) > 90000;

What's the difference between WHERE and HAVING?

Query Clause Order

  1. SELECT
  2. FROM
  3. WHERE
  4. GROUP BY
  5. HAVING
  6. ORDER BY

CASE

  • Handles if/then logic in SQL
SELECT name,
       gender,
       CASE WHEN age BETWEEN 1 AND 4 THEN 1
       WHEN age BETWEEN 5 AND 8 THEN 2
       WHEN age > 8 THEN 3
       ELSE NULL
       END AS recodeAge
FROM Subject
;

SQL Part 2

git pull

PHC7065-Spring2019-Lecture4

By Hui Hu

PHC7065-Spring2019-Lecture4

Slides for Lecture 4, Spring 2019, PHC7065 Critical Skills in Data Manipulation for Population Science

  • 802