PHC7065 CRITICAL SKILLS IN DATA MANIPULATION FOR POPULATION SCIENCE
Data Models and Relational Database
Hui Hu Ph.D.
Department of Epidemiology
College of Public Health and Health Professions & College of Medicine
January 29, 2018
Relational SQL
Lab: SQL Part 2
Relational SQL
Database Design
- The goal is to avoid mistakes and to design clean, easily understood databases
- Design usually starts with a picture
Building a Data Model
- Draw a picture of the data objects, and then figure out how to represent the objects and their relationships
- Basic rule: don't store the same string data twice; use a relationship instead
- Model the real world: if there is one thing in the real world, there should be one copy of that thing in the database
- For each piece of information:
  - Is it an object in its own right, or an attribute of another object?
  - Once the objects are defined, define the relationships between them
Subject |
---|
id |
name |
gender |
age |
race/ethnicity |
county_id |
state_id |
County |
---|
id |
name |
income (median household income) |
state_id |
State |
---|
id |
name |
policy |
policy start date |
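- Primary Key: the id column that uniquely identifies each row of a table
- Logical Key: the column the outside world uses to look a row up, such as name
- Foreign Key: a column that points to the primary key of another table, such as county_id and state_id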
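The three tables and their keys can be sketched as a schema. This is a minimal sketch, not the course's actual database: it assumes SQLite (through Python's built-in sqlite3 module), shortens the column names (income, race_ethnicity, policy_start_date), and uses made-up rows.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the course database.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when this is on

conn.executescript("""
CREATE TABLE State (
    id                INTEGER PRIMARY KEY,          -- primary key
    name              TEXT UNIQUE,                  -- logical key
    policy            TEXT,
    policy_start_date TEXT
);
CREATE TABLE County (
    id       INTEGER PRIMARY KEY,                   -- primary key
    name     TEXT,                                  -- logical key
    income   INTEGER,                               -- median household income
    state_id INTEGER REFERENCES State(id)           -- foreign key
);
CREATE TABLE Subject (
    id             INTEGER PRIMARY KEY,             -- primary key
    name           TEXT,                            -- logical key
    gender         TEXT,
    age            INTEGER,
    race_ethnicity TEXT,
    county_id      INTEGER REFERENCES County(id),   -- foreign key
    state_id       INTEGER REFERENCES State(id)     -- foreign key
);
""")

# Made-up rows: one state, one county in it, one subject living there.
conn.execute("INSERT INTO State (id, name) VALUES (1, 'State A')")
conn.execute("INSERT INTO County (id, name, income, state_id) VALUES (10, 'County X', 40000, 1)")
conn.execute("INSERT INTO Subject (id, name, county_id, state_id) VALUES (100, 'Ann', 10, 1)")

# A subject that points to a state that does not exist is rejected.
try:
    conn.execute("INSERT INTO Subject (id, name, county_id, state_id) VALUES (101, 'Bob', 10, 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

n_subjects = conn.execute("SELECT COUNT(*) FROM Subject").fetchone()[0]
print(orphan_rejected, n_subjects)
```

Each state name is stored once, in State; Subject and County hold only the integer state_id that refers to it.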
Power of Relational Database
- A relational database can be read very quickly, even when it holds very large amounts of data
- Replicated data are removed and replaced with references to a single copy of each piece of data
- When you query the data, they are read from several tables linked by foreign keys
JOIN
- JOIN links across several tables as part of a select operation
- Use an ON clause to tell SQL which keys connect the tables
SELECT Subject.name, State.name AS State_name, State.policy
FROM Subject
JOIN State ON Subject.state_id=State.id
;
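To see the JOIN run, the query can be tried on a tiny in-memory database. This is a sketch under assumptions: SQLite via Python's sqlite3 module, trimmed-down tables, and made-up rows. The last query leaves out the ON clause to show what goes wrong without it.

```python
import sqlite3

# Assumption: trimmed-down Subject and State tables with hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE State (id INTEGER PRIMARY KEY, name TEXT, policy TEXT);
CREATE TABLE Subject (id INTEGER PRIMARY KEY, name TEXT, state_id INTEGER);
INSERT INTO State VALUES (1, 'State A', 'policy 1'), (2, 'State B', 'policy 2');
INSERT INTO Subject VALUES (1, 'Ann', 1), (2, 'Bob', 2), (3, 'Cal', 1);
""")

# ON keeps only the pairs where the subject's foreign key matches the state's primary key.
rows = conn.execute("""
SELECT Subject.name, State.name AS State_name, State.policy
FROM Subject
JOIN State ON Subject.state_id = State.id
ORDER BY Subject.id
""").fetchall()
for row in rows:
    print(row)

# Without ON, SQLite pairs every subject with every state (3 x 2 rows).
n_without_on = conn.execute("SELECT COUNT(*) FROM Subject JOIN State").fetchone()[0]
print(n_without_on)
```

With ON there is one row per subject; without it, every subject is combined with every state.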
A More Complex Example
SELECT Subject.name, State.name AS State_name, County.income
FROM Subject
JOIN State
ON Subject.state_id=State.id
JOIN County
ON Subject.county_id=County.id
AND Subject.state_id=County.state_id
;
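A runnable sketch of the three-table join, written with one ON clause per JOIN, the form most database engines accept. It assumes SQLite via Python's sqlite3 module, trimmed-down tables, and made-up rows; income stands for the county's median household income.

```python
import sqlite3

# Assumption: hypothetical rows; only the columns the query needs are created.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE State (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE County (id INTEGER PRIMARY KEY, name TEXT, income INTEGER, state_id INTEGER);
CREATE TABLE Subject (id INTEGER PRIMARY KEY, name TEXT, county_id INTEGER, state_id INTEGER);
INSERT INTO State VALUES (1, 'State A'), (2, 'State B');
INSERT INTO County VALUES (10, 'County X', 40000, 1), (20, 'County Y', 60000, 2);
INSERT INTO Subject VALUES (100, 'Ann', 10, 1), (101, 'Bob', 20, 2);
""")

# Each JOIN states its own matching rule in its own ON clause.
rows = conn.execute("""
SELECT Subject.name, State.name AS State_name, County.income
FROM Subject
JOIN State  ON Subject.state_id = State.id
JOIN County ON Subject.county_id = County.id
           AND Subject.state_id = County.state_id
ORDER BY Subject.id
""").fetchall()
for row in rows:
    print(row)
```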
Aggregate Functions
- COUNT
- SUM
- MIN
- MAX
- AVG
SELECT COUNT(id) AS n
FROM Subject;
SELECT COUNT(id) AS n, state_id
FROM Subject
GROUP BY state_id;
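The aggregate functions can be tried with and without GROUP BY. This is a sketch, assuming SQLite via Python's sqlite3 module and a made-up Subject table with three rows.

```python
import sqlite3

# Assumption: hypothetical subjects; two live in state 1 and one in state 2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Subject (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, state_id INTEGER);
INSERT INTO Subject VALUES (1, 'Ann', 34, 1), (2, 'Bob', 51, 1), (3, 'Cal', 27, 2);
""")

# Without GROUP BY, the aggregates collapse the whole table into one row.
overall = conn.execute("SELECT COUNT(id) AS n, SUM(age) FROM Subject").fetchone()
print(overall)

# With GROUP BY, there is one result row per state_id.
by_state = conn.execute("""
SELECT state_id, COUNT(id) AS n, MIN(age), MAX(age), AVG(age)
FROM Subject
GROUP BY state_id
ORDER BY state_id
""").fetchall()
for row in by_state:
    print(row)
```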
DISTINCT
SELECT COUNT(DISTINCT state_id) AS nState
FROM Subject
;
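The effect of DISTINCT inside COUNT can be checked on a few rows. This is a sketch, assuming SQLite via Python's sqlite3 module and made-up data in which one subject has a missing state_id.

```python
import sqlite3

# Assumption: hypothetical subjects; state_id is 1, 1, 2 and NULL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Subject (id INTEGER PRIMARY KEY, name TEXT, state_id INTEGER);
INSERT INTO Subject VALUES (1, 'Ann', 1), (2, 'Bob', 1), (3, 'Cal', 2), (4, 'Dee', NULL);
""")

n_rows, n_with_state, n_states = conn.execute("""
SELECT COUNT(*),                    -- every row
       COUNT(state_id),             -- rows where state_id is not NULL
       COUNT(DISTINCT state_id)     -- different non-NULL state_id values
FROM Subject
""").fetchone()
print(n_rows, n_with_state, n_states)
```

COUNT(state_id) counts repeated values once per row, while COUNT(DISTINCT state_id) counts each value once; both skip NULL.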
HAVING
SELECT MAX(income) AS maxIncome, state_id
FROM County
GROUP BY state_id
HAVING MAX(income) > 90000;
What's the difference between WHERE and HAVING?
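One way to see the difference: WHERE filters individual rows before they are grouped, while HAVING filters whole groups after the aggregates are computed. The sketch below assumes SQLite via Python's sqlite3 module and a made-up County table; the 50000 threshold is arbitrary.

```python
import sqlite3

# Assumption: hypothetical counties, two in each of two states.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE County (id INTEGER PRIMARY KEY, name TEXT, income INTEGER, state_id INTEGER);
INSERT INTO County VALUES
    (1, 'County W', 40000, 1),
    (2, 'County X', 95000, 1),
    (3, 'County Y', 45000, 2),
    (4, 'County Z', 48000, 2);
""")

# WHERE: drop low-income counties first, then count what is left in each state.
with_where = conn.execute("""
SELECT state_id, COUNT(*) AS n
FROM County
WHERE income > 50000
GROUP BY state_id
""").fetchall()

# HAVING: count all counties per state, then keep states whose top income is high.
with_having = conn.execute("""
SELECT state_id, COUNT(*) AS n
FROM County
GROUP BY state_id
HAVING MAX(income) > 50000
""").fetchall()

print(with_where)
print(with_having)
```

Both queries return only state 1, but WHERE counts just its one county above the threshold, whereas HAVING counts both of its counties.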
Query Clause Order
- SELECT
- FROM
- WHERE
- GROUP BY
- HAVING
- ORDER BY
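A single query that uses all six clauses in the order listed above can be sketched as follows. It assumes SQLite via Python's sqlite3 module and made-up subjects; the age cutoff of 18 and the minimum group size of 2 are arbitrary choices for illustration.

```python
import sqlite3

# Assumption: hypothetical subjects spread across three states.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Subject (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, state_id INTEGER);
INSERT INTO Subject VALUES
    (1, 'Ann', 20, 1), (2, 'Bob', 30, 1), (3, 'Cal', 10, 1),
    (4, 'Dee', 25, 2), (5, 'Eve', 40, 2), (6, 'Fay', 50, 2),
    (7, 'Gus', 19, 3);
""")

rows = conn.execute("""
SELECT state_id, COUNT(id) AS n     -- columns to return
FROM Subject                        -- source table
WHERE age >= 18                     -- keep adult rows only
GROUP BY state_id                   -- one group per state
HAVING COUNT(id) >= 2               -- keep states with at least 2 adults
ORDER BY n DESC                     -- largest groups first
""").fetchall()
print(rows)
```

Writing the clauses in a different order, for example HAVING before GROUP BY, is a syntax error.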
CASE
- Handles if/then logic in SQL
SELECT name,
       gender,
       CASE WHEN age BETWEEN 1 AND 4 THEN 1
            WHEN age BETWEEN 5 AND 8 THEN 2
            WHEN age > 8 THEN 3
            ELSE NULL
       END AS recodeAge
FROM Subject
;
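The recoding can be checked on a few made-up subjects. This is a sketch, assuming SQLite via Python's sqlite3 module; the rows are hypothetical and include an age of 0 to show which values reach the ELSE branch.

```python
import sqlite3

# Assumption: hypothetical subjects, one per CASE branch plus one with age 0.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Subject (id INTEGER PRIMARY KEY, name TEXT, gender TEXT, age INTEGER);
INSERT INTO Subject VALUES
    (1, 'Ann', 'F', 3), (2, 'Bob', 'M', 6), (3, 'Cal', 'M', 12), (4, 'Dee', 'F', 0);
""")

# The WHEN branches are tested top to bottom; the first one that matches wins.
rows = conn.execute("""
SELECT name,
       gender,
       CASE WHEN age BETWEEN 1 AND 4 THEN 1
            WHEN age BETWEEN 5 AND 8 THEN 2
            WHEN age > 8 THEN 3
            ELSE NULL
       END AS recodeAge
FROM Subject
ORDER BY id
""").fetchall()
for row in rows:
    print(row)
```

An age of 0 matches none of the WHEN branches, so it is recoded as NULL, which Python shows as None.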
SQL Part 2
git pull
PHC7065-Spring2018-Lecture3
By Hui Hu
Slides for Lecture 3, Spring 2018, PHC7065 Critical Skills in Data Manipulation for Population Science