PHC7065 CRITICAL SKILLS IN DATA MANIPULATION FOR POPULATION SCIENCE

Basic Structured Query Language

Hui Hu Ph.D.

Department of Epidemiology

College of Public Health and Health Professions & College of Medicine

January 22, 2018

Basic SQL

 

Lab: SQL Part 1

Basic SQL

Relational Databases

  • Relational databases model data by storing rows and columns in tables
     
  • The power of the relational database lies in its ability to efficiently retrieve data from those tables
    -  in particular where there are multiple tables and relationships between those tables involved in the query

Terminology

  • Database: contains many tables
     
  • Relation (or table): contains tuples and attributes
     
  • Tuple (or row): a set of fields that generally represents an object like a person or a music track
     
  • Attribute (or column, field): one of possibly many elements of data corresponding to the object represented by the row

Attribute

Tuple

Relation

  • A relation is defined as a set of tuples that have the same attributes
  • A tuple usually represents an object and information about the object
  • Objects are typically physical objects or concepts
  • A relation is usually described as a table, which is organized into rows and columns
  • All the data referenced by an attribute are in the same domain and conform to the same constraints

SQL

  • Structured Query Language: the language used to issue commands to the database
     
  • Create a table
     
  • Retrieve some data
     
  • Update data
     
  • Delete data

Two Roles in Large Projects

  • Researcher / Developer: conduct scientific research or build the logic for applications
     
  • Database Administrator: monitors and adjusts the database as the program runs in production
     
  • Often both people participate in the building of the "Data model"

Large Project Structure

Application Software

Developer

Database Data Server

Database Tools

Database Administrator

SQL

SQL

Data Analysis Structure

Python Programs

Researcher

Database File

SQL

SQL

Input Files

OutputFiles

Database Model

  • A database model or database schema is the structure or format of a database, described in a formal language supported by the database management system
     
  • In other worlds, it is the application of a data model when used in conjunction with a database management system

Common Database Systems

  • Oracle: large, commercial, enterprise-scale
     
  • MySQL: simpler but very fast and scalable, commercial open source
     
  • SqlServer: Microsoft
     
  • Many other open source projects: SQLite, PostgreSQL

SELECT and FROM

  • Two required ingredients in any SQL query:
    -  SELECT
    -  FROM
     
  • They have to be in that order
    -  SELECT indicates which columns you'd like to view
    -  FROM identifies the table that they live in

SELECT and FROM

LIMIT

  • Why limit results?
    -  simply be to see what a particular table looks like
     
  • Use the LIMIT command

WHERE

  • WHERE: limit the results in all columns to rows that satisfy the condition
     
  • Use the WHERE command

Comparison Operators

Arithmetic in SQL

Logical Operators

  • LIKE: allows you to match similar values, instead of exact values
  • IN: allows you to specify a list of values you'd like to include
  • BETWEEN: allows you to select only rows within a certain range
  • IS NULL: allows you to slect rows that contain no data ina given column
  • AND: allows you to select only rows that satisfy two conditions
  • OR: allows you to select rows that satisfy either of two conditions
  • NOT: allows you to select rows that do not match a certain condition

LIKE

  • _: used to substitute or an individual character
     
  • %: used to represent any character or set of characters
     
  • ILIKE: ignore case when you're matching values

In MySQL, LIKE is case insensitive, and there is no ILIKE

IN

  • Allows you to specify a list of values that you'd like to include in the results

BETWEEN

  • Allows you to select only rows that are within a specific range
  • Has to be paired with the AND operator

IS NULL

  • Allows you to exclude rows with missing data from your results

AND

  • Allows you to select only rows that satisfy two conditions

OR

  • Allows you to select rows that satisfy either of two conditions

NOT

  • Can be put before any conditional statement to select rows for which that statement is false

ORDER BY

  • Allows you to reorder your results based on the data in one or more columns

SQL LIMIT

  • Why limit results?
    -  simply be to see what a particular table looks like
     
  • Use the LIMIT command

SQL Part 1

git pull

PHC7065-Spring2018-Lecture2

By Hui Hu

PHC7065-Spring2018-Lecture2

Slides for Lecture 2, Spring 2018, PHC7065 Critical Skills in Data Manipulation for Population Science

  • 580