Coping Without DB Normalization, the No-Win Scenario We All live With
I'm your host:
Tod Hansmann
Problem Solver, Horrible Person
Disclaimer, I'm a dev, not a DBA. I just pretend sometimes, and you can too!
Normalization, ACID, and you
Some History:
(in rant form)
Edgar Frank "Ted" Codd should be your favorite go-to when talking about databases.
Without it, your data isn't reliable, meaning it should not be trusted. ACID is not required, but you better know why it's not applicable to your domain without handwavy nonsense and laziness.
Many prominent figures in the dev world challenge normalization as even helpful, let alone needed. They are wrong, and they've been wrong since the 1970s
Your NoSQL is Cute
No really, it is and I love NoSQL tech's excitement, but it should be understood that NoSQL solutions tend to sacrifice ACID for the same reasons we've been touting for longer than most our careers, and the only reason it's viable today is we can throw more resources at it to mitigate issues we're expecting. This is akin to Google's planning for machines to fail. It really does have a place, but it is not the silver bullet many these days will tout.
Normalization
(third form is all you can really go for, unless you're full-time on this)
* Adapted from William Kent's descriptions from 1982
First Normal Form
Under first normal form, all occurrences of a record type must contain the same number of fields.
Translation: every row has to have the same number of fields. You can't have a person with two addresses in the same row, for instance.
2nd & 3rd Normal Form
Under second and third normal forms, a non-key field must provide a fact about the key, the whole key, and nothing but the key. In addition, the record must satisfy first normal form.
Translation: You have a key (not necessarily a primary key) and every field should be a description of just that key. So an address doesn't describe a person, unless you only have one person ever at an address. It's not unique to a person. A social security number is. A state, by extension, does not apply to a person, it applies to an address. This gets weird when you start thinking about names and stuff.
Who cares?
The three things we do.
Storage
While storage is cheap, storage ACCESS is not.
Embedded doesn't care about your resource waste.
Consistency
If your data is important, if it drives your business or your operations or even your application state, it better be consistent.
This has to do with verifiability as well, which is a deep subject we don't have time for, but suffice to say this is incredibly important in anything where audit trails help.
Querying
The one place devs seem to care about
Joins have to be thought out, relationships traversed etc, but the execution of the queries becomes more flexible, and a faster route getting ONLY the information you need can be found.
I can not stress enough that this has been studied to death. We don't HAVE to deal with it because we have resources to waste, but we're poor at our craft for it.
But Tod, My Database in Production Isn't Normalized!
A note of panic in his voice.
The two scenarios:
You have one codebase accessing the database
You have more than one codebase accessing the database
(whether you control them all or not)
The strategies:
Refactoring (big discussion!)
Deprecating, and all the horrors it entails, but it is often required for real life scenarios, especially in the field.
Triggers, Views, Materialized Views, Versioning, Comments Galore, and Abstractions
The Pains
- You will never be finished
- You will often have a lot of things that are just harder to think about when normalized. That's ok.
- The "normalize everything" dogma is just as bad as the "don't normalize anything" dogma. Sorry.
- Your database will make this harder, because you'll want to use X tool/method, and it won't have it
The Joys
- Your efforts have immediate rewards.
- Long term the investment also pays off.
- Performance Bottlenecks will be more likely to stay in code rather than the schema of your data.
- You gain a lot of knowledge of the data, something most teams will lack if new to the data (document things now!)
- Now things can be coupled to CONTEXT rather than SCHEMA of the data. Less coupling == less bugs
Always end on a high note.
Q&A
Because our vague discussion is never enough.
Further reading:
- "Refactoring Databases" - Scott J Ambler and Pramod J. Sadalage
- http://programmers.stackexchange.com/a/215694
- Any of Codd's papers (warning, 70s computer research, quite dry)
Coping Without DB Normalization, the No-Win Scenario We All live With
By Tod Hansmann
Coping Without DB Normalization, the No-Win Scenario We All live With
- 494