Advanced SQL techniques
For simpler batch operations
About this presentation
- Tricks to optimize SQL batch operations
- Presentation of the Numbers table concept
- Presentation of the Calendar table concept
Misusage of computation power
- Get all data from database
- Crunch one entry at a time on local application server
- Apply the result back to the database
Why do a batch process unit by unit?
Problems with this approach?
- Getting all data and sending it to memory (copy)
- Slower to process each item at once, database is much faster
- Super long to process indexes one entry at a time when pushing results
First trick
The numbers table
What is a numbers table?
- Simple table with one field
- The only field and primary key is auto-incremented
- Contains numbers from 0 to 100 000 or 1 to 100 000
What can you do with it?
- Find missing items in sequences
- Parse collated data into rows
- Duplicate data easily
- And much more
Creating the table
CREATE TABLE hlp_numbers (num INTEGER NOT NULL, PRIMARY KEY(num));
INSERT INTO hlp_numbers VALUES(1);
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers SELECT num + (SELECT MAX(num) FROM hlp_numbers) FROM hlp_numbers;
INSERT INTO hlp_numbers VALUES(0);
DELETE FROM hlp_numbers WHERE num > 100000;
Generating sequences
SELECT * FROM hlp_numbers
WHERE num BETWEEN 1 AND 500 AND num % 5 = 0
Get the 5th numbers between 1 and 500
Yields
5, 10, 15, 20, 25, ...
Generating sequences
SELECT DATE('2015-02-05', '+' || num || ' days') FROM hlp_numbers
WHERE num < 365 AND num % 3 = 0
Get every 3rd day from now until next year
Yields
2015-02-05, 2015-02-08, 2015-02-11, ...
Finding missing items in sequences
SELECT num
FROM hlp_numbers n
LEFT JOIN invoices i ON i.invoice_no = n.num
WHERE i.invoice_no IS NULL
AND n.num BETWEEN
(SELECT MIN(invoice_no) FROM invoices)
AND (SELECT MAX(invoice_no) FROM invoices)
Join the numbers on existing data to find missing items:
Finding missing items in sequences
SELECT
DATE('2015-02-05', '+' || num || ' days') AS date, room_number
FROM
rooms r, hlp_numbers n
WHERE
DATE('2015-02-05', '+' || num || ' days') NOT IN (
SELECT reservation_date FROM reservations
WHERE room_number = r.room_number
)
AND num BETWEEN 1 AND 5
Can you imagine a check for unbooked rooms in an hotel registry?
Parsing/splitting strings/rows
1,2,3,4,5
Mathieu,Eric,Valerie,Martin,John
Seen before, serialized data inside a database column that should have been in different rows or columns:
hello:9|world:17
i:6|love:4|programming:3
Parsing/splitting strings/rows
SELECT
CASE
WHEN INSTR(SUBSTR(d.textdata, n.num + 1), ',') = 0
THEN SUBSTR(SUBSTR(d.textdata, n.num + 1), 0)
ELSE SUBSTR(SUBSTR(d.textdata, n.num + 1), 0,
INSTR(SUBSTR(d.textdata, n.num + 1), ','))
END as data
FROM
csvdata AS d, hlp_numbers AS n
WHERE
SUBSTR(d.textdata, n.num, 1) = ',' num = 0
ORDER BY textdata, n.num
Split using numbers and intelligent substrings:
Original data:
"1,2,3,4,5"
Result:
"1"
"2"
"3"
"4"
"5"
Parsing/splitting strings/rows
SELECT
textdata, num,
SUBSTR(SUBSTR(d.textdata, n.num + 1), 0, INSTR(SUBSTR(d.textdata, n.num + 1), ':')) AS label,
CASE
WHEN INSTR(SUBSTR(d.textdata, n.num + 1), '|') = 0
THEN SUBSTR(SUBSTR(d.textdata, n.num + 1), INSTR(SUBSTR(d.textdata, n.num + 1), ':') + 1)
ELSE SUBSTR(SUBSTR(d.textdata, n.num + 1), INSTR(SUBSTR(d.textdata, n.num + 1), ':') + 1,
INSTR(SUBSTR(d.textdata, n.num + 1), '|') - INSTR(SUBSTR(d.textdata, n.num + 1), ':') - 1)
END as value
FROM
csvdata AS d, hlp_numbers AS n
WHERE
SUBSTR(d.textdata, n.num, 1) = '|'
OR num = 0
ORDER BY textdata, n.num
Multi-separator is more complex but works as well:
Original data:
"hello:9|world:17" "i:6|love:4|programming:3"
Results in:
hello 9 i 6
world 17 love 4 programming 3
Duplicate data
INSERT INTO actions
SELECT action_fields, invoice_id
FROM actions, invoices
WHERE actions.id IN (1,2,3)
AND invoices.id BETWEEN 1 AND 10;
Seen before:
- Using unrelated tables = looking for trouble!
- You never know when table structure or content will change!
Duplicate data
INSERT INTO actions
SELECT action_fields, num
FROM actions, hlp_numbers
WHERE actions.id IN (1,2,3)
AND hlp_numbers.num BETWEEN 1 AND 10;
Better way:
- You know that hlp_numbers will always contain 1 to 100 000
Other uses?
- Many things you'd loop or consider a batch processing can be used this way!
- Remember, re-indexing one item at a time is long
- Hard to get into, but great once you get the hang of it
Second trick
The calendar table
Calendar tables are...
- Just like a numbers table
- But with more information
- Helps you abstract batch date processing
What can you do with calendars?
- Compute or plan events
- Find missing dates (Already demonstrated)
- Compute streaks
- Compute delivery dates or due dates
- And much more
Creating the table
CREATE TABLE IF NOT EXISTS `hlp_calendar` (
`dt` date NOT NULL,
`isweekday` bit(1) DEFAULT NULL, `isholiday` bit(1) DEFAULT NULL,
`yr` smallint(6) DEFAULT NULL, `qtr` tinyint(4) DEFAULT NULL,
`mt` tinyint(4) DEFAULT NULL, `dy` tinyint(4) DEFAULT NULL,
`dyw` tinyint(4) DEFAULT NULL, `wk` tinyint(4) DEFAULT NULL,
`dywo` tinyint(4) DEFAULT NULL, PRIMARY KEY (`dt`)
);
CREATE INDEX `isweekday` ON hlp_calendar(`isweekday`);
CREATE INDEX `isholiday` ON hlp_calendar(`isholiday`);
CREATE INDEX `yr` ON hlp_calendar(`yr`);
CREATE INDEX `qtr` ON hlp_calendar(`qtr`);
CREATE INDEX `mt` ON hlp_calendar(`mt`);
CREATE INDEX `dy` ON hlp_calendar(`dy`);
CREATE INDEX `dyw` ON hlp_calendar(`dyw`);
CREATE INDEX `dywo` ON hlp_calendar(`dywo`);
CREATE INDEX `wk` ON hlp_calendar(`wk`);
- It's not over, we need to fill it!
- We'll need a numbers table
- This example is over 50 years, but it's really fast to process up to 500 years and quite small
Creating the table
INSERT INTO hlp_calendar (dt) SELECT DATE('2000-01-01', '+' || num || ' days')
FROM hlp_numbers WHERE num BETWEEN 0 AND 365*50;
UPDATE hlp_calendar SET yr = STRFTIME('%Y', dt), mt = STRFTIME('%m', dt),
dy = STRFTIME('%d', dt), dyw = STRFTIME('%w', dt),
wk = STRFTIME('%W', dt), isweekday = 0, isholiday = 0;
UPDATE hlp_calendar SET isweekday = 1 WHERE dyw NOT IN (1, 7);
UPDATE hlp_calendar SET dywo = 1 WHERE dy BETWEEN 1 AND 7;
UPDATE hlp_calendar SET dywo = 2 WHERE dy BETWEEN 8 AND 14;
UPDATE hlp_calendar SET dywo = 3 WHERE dy BETWEEN 15 AND 21;
UPDATE hlp_calendar SET dywo = 4 WHERE dy BETWEEN 22 AND 28;
UPDATE hlp_calendar SET dywo = 5 WHERE dy BETWEEN 29 AND 31;
UPDATE hlp_calendar SET qtr = 1 WHERE mt IN (1,2,3);
UPDATE hlp_calendar SET qtr = 2 WHERE mt IN (4,5,6);
UPDATE hlp_calendar SET qtr = 3 WHERE mt IN (7,8,9);
UPDATE hlp_calendar SET qtr = 4 WHERE mt IN (10,11,12);
UPDATE hlp_calendar SET isholiday = 1 WHERE
(mt = 12 AND dy IN (25, 26)) OR (mt = 1 AND dy = 1)
OR (mt = 6 AND dy = 24);
Important information not covered in creation of calendars
- Other useful columns: is_workday, open_hour, close_hour, total_work_time
- Using multiple calendars can be useful in business scenarios, consider adding a calendar_id column
Calculating nth weekday
To setup holidays or special events, you may need to work with the "dywo" column:
SELECT dt FROM hlp_calendar
WHERE dt BETWEEN '2015-09-01' AND '2015-09-30'
AND dyw = 1 AND dywo = 1
Labor day!
What do we get with this sql?
Planning events
Some events are re-occuring in life such as birthdays or the dreaded "Get the kids from school" event!
SELECT
firstname, birthdate, c2.dt AS next_birthday
FROM
birthdays r join hlp_calendar c1 on c1.dt = r.birthdate, hlp_calendar c2
WHERE
c1.mt = c2.mt AND c1.dy = c2.dy AND
c2.dt BETWEEN DATE('now') AND DATE('now', '+180 days')
Getting the upcoming birthdays of my relatives and friends:
firstname birthdate next_birthday
-----------------------------------------
John 2001-03-02 2015-03-02
Sue 2006-06-12 2015-06-12
Planning events
Planning some re-occuring events using an events table with typical re-occurence fields:
SELECT
name, starting_date, ending_date, repeats_on, c2.dt AS event_date
FROM
events e join hlp_calendar c1 ON c1.dt = e.starting_date,
hlp_calendar c2
WHERE
c2.dt between e.starting_date and e.ending_date
AND (
(repeats_on = 'daily')
OR (repeats_on = 'weekly' AND c1.dyw = c2.dyw)
OR (repeats_on = 'weekday' AND c2.isweekday = 1)
OR (repeats_on = 'monthly' AND c1.dy = c2.dy)
)
Calculating streaks
Calculating streaks require many sub-queries to generate internal data used to produce the end results:
SELECT *
FROM (
SELECT
c1.dt AS from_date,
c2.dt AS to_date,
(SELECT COUNT(amount) AS meet FROM sales WHERE
amount >= 4000 AND dt BETWEEN c1.dt AND c2.dt) AS meets,
(SELECT COUNT(amount) AS meet FROM sales WHERE
amount < 4000 AND dt BETWEEN c1.dt AND c2.dt) AS fails
FROM
sales i1 LEFT JOIN hlp_calendar c1 ON c1.dt = i1.dt,
sales i2 LEFT JOIN hlp_calendar c2 ON c2.dt = i2.dt
WHERE
i1.id <> i2.id
AND i1.dt <= i2.dt
) as sourcedata
WHERE meets > 0 AND fails = 0
ORDER BY meets DESC
Calculating due dates by days
You need holidays, weekends and days off in a calendar to do this:
SELECT MIN(dt), days_expired
FROM (
SELECT c2.dt, julianday(c2.dt) - julianday(c1.dt)
- (
SELECT COUNT(*) FROM hlp_calendar c3
WHERE c3.dt BETWEEN c1.dt AND c2.dt
AND c3.isworkday = 0
) AS days_expired
FROM hlp_calendar c1, hlp_calendar c2
WHERE c1.dt = '2015-01-01'
AND c2.dt >= c1.dt AND c2.dt < '2015-02-01'
) a
WHERE days_expired >= 6
GROUP BY days_expired
ORDER BY days_expired
Calculating due dates by hours
You need holidays, weekends, days off but also the open_hour, closing_hour and total business hours to do this:
SELECT
DATETIME(dt, '+' || (openhour + (hours_expired - 46)) || ' hours') AS dt
FROM (
SELECT c2.dt, c2.openhour, c2.totalworktime,
(
SELECT SUM(totalworktime)
FROM hlp_calendar c3
WHERE c3.dt >= c1.dt AND c3.dt <= c2.dt
AND c3.isworkday = 1
) AS hours_expired
FROM hlp_calendar c1, hlp_calendar c2
WHERE c1.dt = '2015-01-01' AND c2.dt >= c1.dt
AND c2.dt <= '2015-02-01' AND c2.isworkday = 1
) AS datasource
WHERE hours_expired >= 46
LIMIT 1
Other uses?
- As soon as you need to compute something based on time spans, use a calendar table
- Hard to get into, but great once you get the hang of it, it can be highly beneficial
Thank you
Questions?
Mathieu Dumoulin
Senior programmer-analyst consultant
Twitter: @crazycoders
LinkedIn: crazycoders
Git-hub: crazycodr
Website and email: crazyone@crazycoders.net
Active projects:
Standard-Exceptions: GitHub
Infect! (Board game): Facebook page
Advanced SQL techniques for simpler batch operations
By Mathieu Dumoulin
Advanced SQL techniques for simpler batch operations
Presents a few tricks to process data better in a batch in SQL: Numbers and calendars!
- 999