PHC7065 CRITICAL SKILLS IN DATA MANIPULATION FOR POPULATION SCIENCE

Basic Structured Query Language

Hui Hu Ph.D.

Department of Epidemiology

College of Public Health and Health Professions & College of Medicine

January 27, 2020

Relational Database and SQL

 

Lab: SQL Part 1

Relational Database and SQL

Relational Databases

  • Relational databases model data by storing rows and columns in tables
     
  • The power of the relational database lies in its ability to efficiently retrieve data from those tables
    -  in particular where there are multiple tables and relationships between those tables involved in the query

Terminology in Relational Database

  • Database: contains many tables
     
  • Table (or relation): contains rows and columns
     
  • Row (or tuple): a set of fields that generally represents an object like a person or a music track
     
  • Column (or attribute, field): one of possibly many elements of data corresponding to the object represented by the row

Column

Row

Table

  • A table is defined as a set of rows that have the same columns
  • A row usually represents an object and information about the object
  • Objects are typically physical objects or concepts
  • All the data referenced by a column are in the same domain and conform to the same constraints

SQL

  • Structured Query Language: the language used to issue commands to the database
     
  • Create a table
     
  • Retrieve some data
     
  • Update data
     
  • Delete data

Common Database Systems

  • Oracle: large, commercial, enterprise-scale
     
  • MySQL: simpler but very fast and scalable, commercial open source
     
  • SqlServer: Microsoft
     
  • Many other open source projects: SQLite, PostgreSQL

SELECT and FROM

  • Two required ingredients in any SQL query:
    -  SELECT
    -  FROM
     
  • They have to be in that order
    -  SELECT indicates which columns you'd like to view
    -  FROM identifies the table that they live in

SELECT and FROM

SELECT year, month, west
FROM tutorial.us_housing_units
SELECT *
FROM tutorial.us_housing_units

LIMIT

  • Why limit results?
    -  simply be to see what a particular table looks like
     
  • Use the LIMIT command
SELECT year, month, west
FROM tutorial.us_housing_units
LIMIT 100

WHERE

  • WHERE: limit the results in all columns to rows that satisfy the condition
     
  • Use the WHERE command
SELECT year, month, west
FROM tutorial.us_housing_units
WHERE month=1

Comparison Operators

SELECT *
FROM tutorial.us_housing_units
WHERE west > 30
SELECT *
FROM tutorial.us_housing_units
WHERE month_name != 'January'

Arithmetic in SQL

SELECT year, month, west, south, west+south AS west_plus+south
FROM tutorial.us_housing_units

Logical Operators

  • LIKE: allows you to match similar values, instead of exact values
  • IN: allows you to specify a list of values you'd like to include
  • BETWEEN: allows you to select only rows within a certain range
  • IS NULL: allows you to slect rows that contain no data ina given column
  • AND: allows you to select only rows that satisfy two conditions
  • OR: allows you to select rows that satisfy either of two conditions
  • NOT: allows you to select rows that do not match a certain condition

LIKE

  • _: used to substitute or an individual character
     
  • %: used to represent any character or set of characters
     
  • ILIKE: ignore case when you're matching values

In MySQL, LIKE is case insensitive, and there is no ILIKE

SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE group LIKE 'Snoo%'
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE group ILIKE 'snoo%'
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE group ILIKE 'snoop_'

IN

  • Allows you to specify a list of values that you'd like to include in the results
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year_rank IN (1,2,3)

BETWEEN

  • Allows you to select only rows that are within a specific range
  • Has to be paired with the AND operator
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year_rank BETWEEN 5 AND 10
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year_rank >= 5 AND year_rank <= 10

IS NULL

  • Allows you to exclude rows with missing data from your results
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE artist IS NULL

AND

  • Allows you to select only rows that satisfy two conditions
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year = 2012 AND year_rank <=10

OR

  • Allows you to select rows that satisfy either of two conditions
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year_rank = 5 OR artist = 'Gotye'

NOT

  • Can be put before any conditional statement to select rows for which that statement is false
SELECT *
FROM tutorial.billboard_top_100_year_end
WHERE year_rank NOT BETWEEN 5 AND 10

ORDER BY

  • Allows you to reorder your results based on the data in one or more columns
SELECT *
FROM tutorial.billboard_top_100_year_end
ORDER BY year_rank
SELECT *
FROM tutorial.billboard_top_100_year_end
ORDER BY year DESC, year_rank

SQL Part 1

git pull

PHC7065-Spring2020-Lecture3

By Hui Hu

PHC7065-Spring2020-Lecture3

Slides for Lecture 3, Spring 2020, PHC7065 Critical Skills in Data Manipulation for Population Science

  • 798