Introduction to Relational Databases

A Brief History of Data Storage

Why does this matter?

Databases allow us to...

 

  1. Efficiently store data
  2. Quickly retrieve data

Relational Databases

Each row represents an "object" (e.g., person, country)

Each column provides information describing the "object"

Name City State Zipcode
Syracuse Syracuse NY 13244
UC Berkeley Berkeley CA 94720
Arizona State Tempe AZ 85001
Columbia New York NY 10025

Relational Database Example

Id
1
2
3
4
Zip Snow
13244 100
94720 1
85001 .5
04989 56
Degree_Id College_Id
1 1
2 1
1 2
2 3
1 4

M.A.

M.S.

M.A.

M.S.

M.A.

Working with Databases

Structured Query Language (SQL) is how we communicate with databases in order to (1) create, (2) update, (3) retrieve, and (4) delete data.

We retrieve data from a database using a query. A query is a written statement that uses key words to communicate with the database.

SELECT

FROM

WHERE

JOIN

Joining Data Tables

Your Turn

# Level - intro to databases - joins
# 10/10/2018

# load packages 
library(tidyverse)
library(RSQLite)
library(DBI)
library(knitr)

# create database on data science programs
source("create_db.R")

# let's take a look at our institutions table
tbl(ds_programs, "institutions")

## Exercise 1: Review how to select variables from a table ----

# use sql to build a query to select name and zipcode from the institutions table
dbGetQuery(conn = ds_programs, 
           "SELECT name, zipcode
             FROM institutions")

# use dplyr to build a query to select name and zipcode from the institutions table
ds_programs %>% 
  tbl("institutions") %>% 
    select(name, zipcode) 

## Exercise 2: Learn how to join two tables using a query ----

# use sql to select name and zipcode from the institutions table and to left_join all data from the weather table
dbGetQuery(conn = ds_programs, 
            "SELECT institutions.name, institutions.zipcode, weather.snow
             FROM institutions
             LEFT JOIN weather ON institutions.zipcode = weather.zip")

# use dplyr to select name and zipcode from the institutions table and to left_join all data from the weather table
ds_programs %>%
  tbl("institutions") %>%
    select(name, zipcode) %>%
    left_join(
      ds_programs %>% 
        tbl("weather") %>% 
        select(zip, snow),
      by = c("zipcode" = "zip")
  )

# close database connection to ds_programs database
dbDisconnect(ds_programs)
Made with Slides.com