PHC7065 CRITICAL SKILLS IN DATA MANIPULATION FOR POPULATION SCIENCE
Basic Structured Query Language
Hui Hu Ph.D.
Department of Epidemiology
College of Public Health and Health Professions & College of Medicine
January 27, 2020
Relational Database and SQL
Lab: SQL Part 1
Relational Database and SQL
Relational Databases
- Relational databases model data by storing rows and columns in tables
- The power of the relational database lies in its ability to efficiently retrieve data from those tables
- in particular where there are multiple tables and relationships between those tables involved in the query
Terminology in Relational Database
- Database: contains many tables
- Table (or relation): contains rows and columns
- Row (or tuple): a set of fields that generally represents an object like a person or a music track
- Column (or attribute, field): one of possibly many elements of data corresponding to the object represented by the row
Column
Row
Table
- A table is defined as a set of rows that have the same columns
- A row usually represents an object and information about the object
- Objects are typically physical objects or concepts
- All the data referenced by a column are in the same domain and conform to the same constraints
SQL
- Structured Query Language: the language used to issue commands to the database
-
Create a table
-
Retrieve some data
-
Update data
- Delete data
Common Database Systems
- Oracle: large, commercial, enterprise-scale
- MySQL: simpler but very fast and scalable, commercial open source
- SqlServer: Microsoft
- Many other open source projects: SQLite, PostgreSQL
SELECT and FROM
- Two required ingredients in any SQL query:
- SELECT
- FROM
- They have to be in that order
- SELECT indicates which columns you'd like to view
- FROM identifies the table that they live in
SELECT and FROM
SELECT year, month, west
FROM tutorial.us_housing_units
SELECT *
FROM tutorial.us_housing_units
LIMIT
- Why limit results?
- simply be to see what a particular table looks like
- Use the LIMIT command
SELECT year, month, west
FROM tutorial.us_housing_units
LIMIT 100
WHERE
- WHERE: limit the results in all columns to rows that satisfy the condition
- Use the WHERE command
SELECT year, month, west
FROM tutorial.us_housing_units
WHERE month=1
Comparison Operators
SELECT *
FROM tutorial.us_housing_units
WHERE west > 30
SELECT *
FROM tutorial.us_housing_units
WHERE month_name != 'January'
Arithmetic in SQL
SELECT year, month, west, south, west+south AS west_plus+south
FROM tutorial.us_housing_units
Logical Operators
- LIKE: allows you to match similar values, instead of exact values
- IN: allows you to specify a list of values you'd like to include
- BETWEEN: allows you to select only rows within a certain range
- IS NULL: allows you to slect rows that contain no data ina given column
- AND: allows you to select only rows that satisfy two conditions
- OR: allows you to select rows that satisfy either of two conditions
- NOT: allows you to select rows that do not match a certain condition
LIKE
- _: used to substitute or an individual character
- %: used to represent any character or set of characters
- ILIKE: ignore case when you're matching values
In MySQL, LIKE is case insensitive, and there is no ILIKE
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE group LIKE 'Snoo%'
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE group ILIKE 'snoo%'
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE group ILIKE 'snoop_'
IN
- Allows you to specify a list of values that you'd like to include in the results
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year_rank IN (1,2,3)
BETWEEN
- Allows you to select only rows that are within a specific range
- Has to be paired with the AND operator
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year_rank BETWEEN 5 AND 10
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year_rank >= 5 AND year_rank <= 10
IS NULL
- Allows you to exclude rows with missing data from your results
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE artist IS NULL
AND
- Allows you to select only rows that satisfy two conditions
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year = 2012 AND year_rank <=10
OR
- Allows you to select rows that satisfy either of two conditions
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year_rank = 5 OR artist = 'Gotye'
NOT
- Can be put before any conditional statement to select rows for which that statement is false
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year_rank NOT BETWEEN 5 AND 10
ORDER BY
- Allows you to reorder your results based on the data in one or more columns
SELECT *
FROM tutorial.billboard_top_100_year_end
ORDER BY year_rank
SELECT *
FROM tutorial.billboard_top_100_year_end
ORDER BY year DESC, year_rank
SQL Part 1
git pull
PHC7065-Spring2020-Lecture3
By Hui Hu
PHC7065-Spring2020-Lecture3
Slides for Lecture 3, Spring 2020, PHC7065 Critical Skills in Data Manipulation for Population Science
- 817