Advanced SQL techniques

For simpler batch operations

About this presentation

  1. Tricks to optimize SQL batch operations
  2. Presentation of the Numbers table concept
  3. Presentation of the Calendar table concept

Misusage of computation power

  1. Get all data from database
  2. Crunch one entry at a time on local application server
  3. Apply the result back to the database

Why do a batch process unit by unit?

Problems with this approach?

  1. Getting all data and sending it to memory (copy)
  2. Slower to process each item at once, database is much faster
  3. Super long to process indexes one entry at a time when pushing results

First trick

The numbers table

What is a numbers table?

  • Simple table with one field
  • The only field and primary key is auto-incremented
  • Contains numbers from 0 to 100 000 or 1 to 100 000

What can you do with it?

  1. Find missing items in sequences
  2. Parse collated data into rows
  3. Duplicate data easily
  4. And much more

Creating the table

CREATE TABLE hlp_numbers (num INTEGER NOT NULL, PRIMARY KEY(num));
INSERT INTO hlp_numbers VALUES(1);
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers VALUES(0);
DELETE FROM hlp_numbers WHERE num > 100000;

Generating sequences

SELECT * FROM hlp_numbers 
WHERE num BETWEEN 1 AND 500 AND num % 5 = 0

Get the 5th numbers between 1 and 500

Yields

5, 10, 15, 20, 25, ...

Generating sequences

SELECT DATE('2015-02-05', '+' || num || ' days') FROM  hlp_numbers 
WHERE num < 365 AND num % 3 = 0

Get every 3rd day from now until next year

Yields

2015-02-05, 2015-02-08, 2015-02-11, ...

Finding missing items in sequences

SELECT num 
FROM hlp_numbers n 
    LEFT JOIN invoices i ON i.invoice_no = n.num 
WHERE i.invoice_no IS NULL
    AND n.num BETWEEN
        (SELECT MIN(invoice_no) FROM invoices) 
        AND (SELECT MAX(invoice_no) FROM invoices)

Join the numbers on existing data to find missing items:

Finding missing items in sequences

SELECT 
    DATE('2015-02-05', '+' || num || ' days') AS date, room_number 
FROM 
    rooms r, hlp_numbers n 
WHERE 
    DATE('2015-02-05', '+' || num || ' days') NOT IN (
        SELECT reservation_date FROM reservations 
        WHERE room_number = r.room_number
    ) 
    AND num BETWEEN 1 AND 5

Can you imagine a check for unbooked rooms in an hotel registry?

Parsing/splitting strings/rows

1,2,3,4,5
Mathieu,Eric,Valerie,Martin,John

Seen before, serialized data inside a database column that should have been in different rows or columns:

hello:9|world:17
i:6|love:4|programming:3

Parsing/splitting strings/rows

SELECT
    CASE
        WHEN INSTR(SUBSTR(d.textdata, n.num + 1), ',') = 0 
        THEN SUBSTR(SUBSTR(d.textdata, n.num + 1), 0)
        ELSE SUBSTR(SUBSTR(d.textdata, n.num + 1), 0, 
            INSTR(SUBSTR(d.textdata, n.num + 1), ','))
    END as data
FROM 
    csvdata AS d, hlp_numbers AS n
WHERE 
    SUBSTR(d.textdata, n.num, 1) = ',' num = 0
ORDER BY textdata, n.num

Split using numbers and intelligent substrings:

Original data:
    "1,2,3,4,5"

Result:
    "1"
    "2"
    "3"
    "4"
    "5"

Parsing/splitting strings/rows

SELECT
    textdata, num,
    SUBSTR(SUBSTR(d.textdata, n.num + 1), 0, INSTR(SUBSTR(d.textdata, n.num + 1), ':')) AS label,
    CASE
        WHEN INSTR(SUBSTR(d.textdata, n.num + 1), '|') = 0 
        THEN SUBSTR(SUBSTR(d.textdata, n.num + 1), INSTR(SUBSTR(d.textdata, n.num + 1), ':') + 1)
        ELSE SUBSTR(SUBSTR(d.textdata, n.num + 1), INSTR(SUBSTR(d.textdata, n.num + 1), ':') + 1, 
	INSTR(SUBSTR(d.textdata, n.num + 1), '|') - INSTR(SUBSTR(d.textdata, n.num + 1), ':') - 1)
    END as value
FROM 
    csvdata AS d, hlp_numbers AS n
WHERE 
    SUBSTR(d.textdata, n.num, 1) = '|' 
    OR num = 0
ORDER BY textdata, n.num

Multi-separator is more complex but works as well:

Original data:
    "hello:9|world:17"                  "i:6|love:4|programming:3"

Results in:
    hello          9     i              6
    world          17    love           4     programming    3

Duplicate data

INSERT INTO actions
SELECT action_fields, invoice_id
FROM actions, invoices
WHERE actions.id IN (1,2,3)
AND invoices.id BETWEEN 1 AND 10;

Seen before:

  1. Using unrelated tables = looking for trouble!
  2. You never know when table structure or content will change!

Duplicate data

INSERT INTO actions
SELECT action_fields, num
FROM actions, hlp_numbers
WHERE actions.id IN (1,2,3)
AND hlp_numbers.num BETWEEN 1 AND 10;

Better way:

  1. You know that hlp_numbers will always contain 1 to 100 000

Other uses?

  • Many things you'd loop or consider a batch processing can be used this way!
  • Remember, re-indexing one item at a time is long
  • Hard to get into, but great once you get the hang of it

Second trick

The calendar table

Calendar tables are...

  • Just like a numbers table
  • But with more information
  • Helps you abstract batch date processing

What can you do with calendars?

  • Compute or plan events
  • Find missing dates (Already demonstrated)
  • Compute streaks
  • Compute delivery dates or due dates
  • And much more

Creating the table

CREATE TABLE IF NOT EXISTS `hlp_calendar` (
    `dt` date NOT NULL,
    `isweekday` bit(1) DEFAULT NULL, `isholiday` bit(1) DEFAULT NULL,
    `yr` smallint(6) DEFAULT NULL, `qtr` tinyint(4) DEFAULT NULL,
    `mt` tinyint(4) DEFAULT NULL, `dy` tinyint(4) DEFAULT NULL,
    `dyw` tinyint(4) DEFAULT NULL, `wk` tinyint(4) DEFAULT NULL,
    `dywo` tinyint(4) DEFAULT NULL, PRIMARY KEY (`dt`)
);
CREATE INDEX `isweekday` ON hlp_calendar(`isweekday`);
CREATE INDEX `isholiday` ON hlp_calendar(`isholiday`);
CREATE INDEX `yr` ON hlp_calendar(`yr`);
CREATE INDEX `qtr` ON hlp_calendar(`qtr`);
CREATE INDEX `mt` ON hlp_calendar(`mt`);
CREATE INDEX `dy` ON hlp_calendar(`dy`);
CREATE INDEX `dyw` ON hlp_calendar(`dyw`);
CREATE INDEX `dywo` ON hlp_calendar(`dywo`);
CREATE INDEX `wk` ON hlp_calendar(`wk`);
  • It's not over, we need to fill it!
  • We'll need a numbers table
  • This example is over 50 years, but it's really fast to process up to 500 years and quite small

Creating the table

INSERT INTO hlp_calendar (dt) SELECT DATE('2000-01-01', '+' || num || ' days') 
    FROM hlp_numbers WHERE num BETWEEN 0 AND 365*50;

UPDATE hlp_calendar SET yr = STRFTIME('%Y', dt), mt = STRFTIME('%m', dt), 
    dy = STRFTIME('%d', dt), dyw = STRFTIME('%w', dt), 
    wk = STRFTIME('%W', dt), isweekday = 0, isholiday = 0;

UPDATE hlp_calendar SET isweekday = 1 WHERE dyw NOT IN (1, 7);

UPDATE hlp_calendar SET dywo = 1 WHERE dy BETWEEN 1 AND 7;
UPDATE hlp_calendar SET dywo = 2 WHERE dy BETWEEN 8 AND 14;
UPDATE hlp_calendar SET dywo = 3 WHERE dy BETWEEN 15 AND 21;
UPDATE hlp_calendar SET dywo = 4 WHERE dy BETWEEN 22 AND 28;
UPDATE hlp_calendar SET dywo = 5 WHERE dy BETWEEN 29 AND 31;

UPDATE hlp_calendar SET qtr = 1 WHERE mt IN (1,2,3);
UPDATE hlp_calendar SET qtr = 2 WHERE mt IN (4,5,6);
UPDATE hlp_calendar SET qtr = 3 WHERE mt IN (7,8,9);
UPDATE hlp_calendar SET qtr = 4 WHERE mt IN (10,11,12);
UPDATE hlp_calendar SET isholiday = 1 WHERE 
    (mt = 12 AND dy IN (25, 26)) OR (mt = 1 AND dy = 1) 
    OR (mt = 6 AND dy = 24);

Important information not covered in creation of calendars

  • Other useful columns: is_workday, open_hour, close_hour, total_work_time
  • Using multiple calendars can be useful in business scenarios, consider adding a calendar_id column

Calculating nth weekday

To setup holidays or special events, you may need to work with the "dywo" column:

SELECT dt FROM hlp_calendar 
WHERE dt BETWEEN '2015-09-01' AND '2015-09-30' 
AND dyw = 1 AND dywo = 1

Labor day!

What do we get with this sql?

Planning events

Some events are re-occuring in life such as birthdays or the dreaded "Get the kids from school" event!

SELECT
    firstname, birthdate, c2.dt AS next_birthday
FROM
    birthdays r join hlp_calendar c1 on c1.dt = r.birthdate, hlp_calendar c2
WHERE
    c1.mt = c2.mt AND c1.dy = c2.dy AND 
    c2.dt BETWEEN DATE('now') AND DATE('now', '+180 days')

Getting the upcoming birthdays of my relatives and friends:

firstname     birthdate     next_birthday
-----------------------------------------
John          2001-03-02    2015-03-02
Sue           2006-06-12    2015-06-12

Planning events

Planning some re-occuring events using an events table with typical re-occurence fields:

SELECT
    name, starting_date, ending_date, repeats_on, c2.dt AS event_date
FROM
    events e join hlp_calendar c1 ON c1.dt = e.starting_date, 
    hlp_calendar c2
WHERE 
    c2.dt between e.starting_date and e.ending_date
    AND (
        (repeats_on = 'daily')
        OR (repeats_on = 'weekly' AND c1.dyw = c2.dyw)
        OR (repeats_on = 'weekday' AND c2.isweekday = 1)
        OR (repeats_on = 'monthly' AND c1.dy = c2.dy)
    )

Calculating streaks

Calculating streaks require many sub-queries to generate internal data used to produce the end results:

SELECT *
FROM (
    SELECT
        c1.dt AS from_date,
        c2.dt AS to_date,
        (SELECT COUNT(amount) AS meet FROM sales WHERE 
            amount >= 4000 AND dt BETWEEN c1.dt AND c2.dt) AS meets,
        (SELECT COUNT(amount) AS meet FROM sales WHERE 
            amount < 4000 AND dt BETWEEN c1.dt AND c2.dt) AS fails
    FROM 
        sales i1 LEFT JOIN hlp_calendar c1 ON c1.dt = i1.dt, 
        sales i2 LEFT JOIN hlp_calendar c2 ON c2.dt = i2.dt
    WHERE
        i1.id <> i2.id
        AND i1.dt <= i2.dt
) as sourcedata
WHERE meets > 0 AND fails = 0
ORDER BY meets DESC

Calculating due dates by days

You need holidays, weekends and days off in a calendar to do this:

SELECT MIN(dt), days_expired
FROM (
    SELECT c2.dt, julianday(c2.dt) - julianday(c1.dt)
         - (
            SELECT COUNT(*) FROM hlp_calendar c3 
            WHERE c3.dt BETWEEN c1.dt AND c2.dt 
            AND c3.isworkday = 0
         ) AS days_expired
    FROM hlp_calendar c1, hlp_calendar c2
    WHERE  c1.dt = '2015-01-01' 
        AND c2.dt >= c1.dt AND c2.dt < '2015-02-01'
) a
WHERE days_expired >= 6
GROUP BY days_expired
ORDER BY days_expired

Calculating due dates by hours

You need holidays, weekends, days off but also the open_hour, closing_hour and total business hours to do this:

SELECT 
    DATETIME(dt, '+' || (openhour + (hours_expired - 46)) || ' hours') AS dt
FROM (
    SELECT c2.dt, c2.openhour, c2.totalworktime, 
    (
        SELECT SUM(totalworktime) 
        FROM hlp_calendar c3 
        WHERE c3.dt >= c1.dt AND c3.dt <= c2.dt 
            AND c3.isworkday = 1
    ) AS hours_expired
    FROM hlp_calendar c1, hlp_calendar c2
    WHERE  c1.dt = '2015-01-01' AND c2.dt >= c1.dt 
        AND c2.dt <= '2015-02-01' AND c2.isworkday = 1
) AS datasource
WHERE hours_expired >= 46
LIMIT 1

Other uses?

  • As soon as you need to compute something based on time spans, use a calendar table
  • Hard to get into, but great once you get the hang of it, it can be highly beneficial

Thank you

Questions?

Mathieu Dumoulin

Senior programmer-analyst consultant

 

        Twitter: @crazycoders

        LinkedIn: crazycoders

        Git-hub: crazycodr

        Website and email: crazyone@crazycoders.net

 

Active projects:

        Standard-Exceptions: GitHub

        Infect! (Board game): Facebook page

Advanced SQL techniques for simpler batch operations

By Mathieu Dumoulin

Advanced SQL techniques for simpler batch operations

Presents a few tricks to process data better in a batch in SQL: Numbers and calendars!

  • 999