SQL 

Relational Databases

Peter Compernolle

What's the point of a database?

  • I/O performance
    • Quickly insert/update data
    • Quickly retrieve data
    • Scalable
  • Ease of use
    • Retrieve data exactly as needed
    • Massage or process data on the fly
    • Scalable
  • Engineering costs
    • Ease of accessing data
    • Database maintenance
    • Scalable
  • Additional awesomeness
    • Data integrity
    • Redundancy
    • Scalable

Common Solutions:

Excel

Pros:

  • Easy to enter data
  • Easy to sort of filter data
  • Everyone can use it

Cons

  • Doesn't scale
  • Poor data integrity
  • Single user
  • Crashes all the time
 
  • Often guesses dates wrong
  • Changes strings "001" to be 1
  • Gives false sense of security with password protected files
  • Has a 63k row limit (≤Office 2003)
  • Assumes that people don't put commas anywhere in a file except to delimit fields
  • Stores all changes so files get huge with little benefit
  • Inefficient backups

Common Solutions:

$key => $value

Pros:

  • Crazy fast
  • Scalable
  • Unstructured data

Cons

  • Difficult to architect
  • Unstructured data
  • Requires manual data replication

Common Solutions:

The Old Fashioned

Pros:

  • Always with you
  • No computer skills needed
  • Extra awesome points!

Cons

  • I can't read my own writing
  • I can't find anything > 1wk old
  • Succeptable to water damage

RIP, outofpages.com :(

SQL + Relational Database

  • Fast
  • Very easy to 
    • use
    • develop
    • maintain
  • Scales reasonably well

Relational Data

users

id first_name last_name active created_at
1 Jack Donnaghy true 2001-03-06 12:34:56
2 Liz Lemon true 2001-03-08 09:43:01
3 Leo Spaceman false 2001-07-13 10:01:33
SELECT users.first_name , users.last_name
FROM users 
WHERE users.active = 1 
ORDER BY id ASC 
LIMIT 5

Relational Data

accounts

id user_id balance active created_at
1 1 56485156.24 true 2001-03-06 12:34:56
2 1 153438.45 true 2001-03-08 09:43:01
3 2 23.92 true 2001-07-13 10:01:33
SELECT accounts.id , accounts.balance
FROM accounts 
WHERE accounts.active = 1
ORDER BY accounts.amount DESC 

SQL (es-kyü-el)

Structured Querying Language

A Query's Components

  1. What you want

    SELECT the fields, aggregates, or other calculations

  2. Pool of data

    FROM, JOIN all the data required (usually far more)

  3. Conditions (on data pool)

    WHERE the pool of data satisfies certain conditions

  4. Combinators/filters

    GROUPed BY fields to combine rows returned, HAVING certain conditions applied after being grouped

  5. Presentation of results

    ORDERed BY field(s) and/or LIMITed to a specific number of records (starting at a position)

Pseudo-SQL

A list of five active users & net balance, whose net balance is less than $200, with the smallest balances first

What you want

User data & his/her net balance

SELECT users.name ,
       SUM(accounts.balance) AS total

Pseudo-SQL

A list of five active users & net balance, whose net balance is less than $200, with the smallest balances first

Pool of data

User & account data

FROM users
LEFT JOIN account ON accounts.user_id = users.id

Pseudo-SQL

A list of five active users & net balance, whose net balance is less than $200, with the smallest balances first

Conditions on data pool

Only active users

WHERE users.active = 1

Pseudo-SQL

A list of five active users & net balance, whose net balance is less than $200, with the smallest balances first

Combinators/filters

One row per user, and only for users with a net balance less than $200

GROUP BY users.id
HAVING total < 200

Pseudo-SQL

A list of five active users & net balance, whose net balance is less than $200, with the smallest balances first

Presentation

Only show the five smallest net balances (in order)

ORDER BY total ASC
LIMIT 5

The SQL

A list of five active users & net balance, whose net balance is less than $200, with the smallest balances first

SELECT users.name ,
       SUM(account.balance) AS total
FROM users
LEFT JOIN account ON accounts.user_id = users.id
WHERE users.active = 1
GROUP BY users.id
HAVING total < 200
ORDER BY total ASC
LIMIT 5

The Details

What about the details?!

  • Join type
  • Apply conditions to pool of data or after grouping?
  • or even to a join?
  • What is the main table and what is joined?
  • How do I group? By what?
  • What if some tables don't have related records?
  • What if I want to ignore some records?

It's in the data!

SQL itself is only a handful of building blocks.

How you use them depends entirely on the data you have, and what data you want out.

So let's look at the data...

Relationships

One to One

One driver's license belongs to one driver

One to Many

One person has many licenses

(but remember: one license belongs to only one driver)

Many to Many

Cars have many drivers, people have many cars
(notice the inconsistency between "people" and "drivers"?)

LEFT vs INNER vs RIGHT

Join types:

  • LEFT

    Records in the preceeding relevant table will be returned even if no records present in this table

  • INNER

    Only records will be returned if each relevant table has a record

  • RIGHT

    You're writing your query wrong. Flip table order and use LEFT

INNER

  • INNER is the default join

    SELECT *
    FROM a
    JOIN b ON a.field = b.field

  • A LEFT or RIGHT join doesn't really make sense if you have a condition on the joined table (so use INNER for clarity)

    SELECT *
    FROM a 
    LEFT JOIN  b ON a.field = b.field
    WHERE b.field IS NOT NULL

  • Performance

    If your left table is significantly larger than your right, a LEFT join will be a bit faster.

,

SELECT  *
FROM   a  
JOIN b ON  a .field = b .field
SELECT  *
FROM   a ,  b
WHERE   a .field =  b .field

Same, except...

  • No indexes
  • Faster to write
  • Slower to run

DB Schema

Person

id first_name last_name ...
1 Jack Donaghy
2 Liz Lemon

License

id person_id number expiration_dt ...
1 1 IL-C1234-56789 2015-05-05

One to One

Suppose we want to see every license
and its driver:

SELECT person.first_name , 
       person.last_name , 
       license.number
FROM license
LEFT JOIN person ON license.person_id = person.id
first_name last_name number
Jack Donaghy IL-C1234-56789

DB Schema

Person

id first_name last_name ...
1 Jack Donaghy
2 Liz Lemon

License

id person_id number expiration_dt ...
1 1 IL-C1234-56789 2015-05-05
1 1 TX-C5432-10987 2015-05-05

One to One

or none
or many

Suppose we want to see everyone's name
and license number:

SELECT person.first_name , 
       person.last_name , 
       license.number
FROM person
LEFT JOIN license ON license.person_id = person.id
first_name last_name number
Jack Donaghy IL-C1234-56789
Liz Lemon NULL
Jack Donaghy TX-C5432-10987

What Happened?

license
person

What Happened?

license
person
SELECT ...
FROM license
LEFT JOIN person ...

What Happened?

license
person
SELECT ...
FROM person
LEFT JOIN license ...

DB Schema (part ii)

Is there anything we can learn from just looking at the schema?

Person

id first_name last_name ...
1 Jack Donaghy
2 Liz Lemon

License

id person_id number expiration_dt ...
1 1 IL-C1234-56789 2015-05-05
2 1 TX-C5432-10987 2014-05-05

Eegads! References!

The relationship between licenses and people is merely

a foreign key reference on a license record pointing to a person's primary key

person.id license.person_id
  • The person table alone has no reference to licenses
  • The license table may contain zero, one, or many records referencing a single person
  • Apply constraints, but still be careful

Constraints

But what's the worry? Add constraints to person_id

  • Set person_id to NOT NULL
    This will assure that all licenses belong to a person
  • Create a UNIQUE INDEX on person_id
    This will assure that no person has more than one license
  • Use application logic to avoid putting bad data in

But!

You really should still write your queries correctly!

  • What if a person really does have more than one license?
  • Sometimes you may need to store a record with a null reference

Many-to-Many

(Reminder: A driver may drive many cars,
and each car can have many drivers)


Many-to-Many
=
One-to-many + One-to-many

Person

id first_name
1 Jack
2 Liz

Pivot!

(aka xref, cross-reference table)

person_id car_id
1 1
1 2
1 4

Car

id name
1 T
2 The Homer
3 Vespa
4 H2

Many-to-Many

SELECT person.first_name , 
       car.model 
FROM person
LEFT JOIN pivot ON pivot.person_id = person.id
LEFT JOIN car ON pivot.car_id = car.id
first_name model
Liz NULL
Jack T
Jack The Homer
Jack H2

Many-to-Many

SELECT person.first_name , 
       car.model 
FROM car
LEFT JOIN pivot ON pivot.car_id = car.id
LEFT JOIN person ON pivot.person_id = person.id
first_name model
Jack T
Jack The Homer
Jack H2
NULL Vespa

Many-to-Many

SELECT person.first_name , 
       car.model 
FROM pivot
LEFT JOIN person ON pivot.person_id = person.id
LEFT JOIN car ON pivot.car_id = car.id
first_name model
Jack T
Jack The Homer
Jack H2

Common Mistakes: Uniqueness

Suppose what makes a record unique changes...

Pivot!

person_id car_id
1 1
1 2
1 2
1 4
active ...
true
true
false
true

Common Mistakes: Uniqueness

SELECT 
       person.first_name , 
       car.model 
FROM person
LEFT JOIN pivot ON pivot.person_id = person.id
LEFT JOIN car ON pivot.car_id = car.id
first_name model
Liz NULL
Jack T
Jack The Homer
Jack The Homer
Jack H2

The wrong way (1):

SELECT DISTINCT
       person.first_name , 
       car.model 
FROM person
LEFT JOIN pivot ON pivot.person_id = person.id
LEFT JOIN car ON pivot.car_id = car.id
first_name model
Liz NULL
Jack T
Jack The Homer
Jack H2

The wrong way (2):

SELECT person.first_name , 
       car.model 
FROM person
LEFT JOIN pivot ON pivot.person_id = person.id
LEFT JOIN car ON pivot.car_id = car.id
GROUP BY person.id , car.model
first_name model
Liz NULL
Jack T
Jack The Homer
Jack H2

The wrong way (3):

SELECT person.first_name , 
       car.model 
FROM person
LEFT JOIN pivot ON pivot.person_id = person.id
LEFT JOIN car ON pivot.car_id = car.id
WHERE pivot.active = 1
first_name model
Jack T
Jack The Homer
Jack H2

The correct way:

SELECT person.first_name , 
       car.model 
FROM person
LEFT JOIN pivot ON pivot.person_id = person.id 
                AND pivot.active = 1
LEFT JOIN car ON pivot.car_id = car.id
first_name model
Liz NULL
Jack T
Jack The Homer
Jack H2

So. Good queries?

SQL is simple -- based on only a handful of keywords.

But minor nuances yield drastic differences.

  1. Know your data!
  2. Break the query into small pieces
  3. Consider what makes data unique
  4. Start with the data you want returned per row, and watch row count
  5. Remember: Your test data set isn't always complete

Questions?

Thanks!

This presentation is available at
slid.es/petercompernolle/sql

SQL + Relational Databases

By petercompernolle