Marco Alabruzzo
Senior Developer @ Festicket
http://marcoala.com - marco.alabruzzo@gmail.com
London Django Meetup Group
11 April 2017
or
How I Learned to Stop Worrying
and Love Data Redundancy
2) Examples
3) Golden rules on how to make it right
1) What is denormalisation
Wikipedia
Denormalization is very similar to cache
The process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data or by grouping data.
Since when we normalize we're also organising the database, this means that when we denormalize we're also disorganising the data.
Denormalized data should be very organised.
A library app that list books and relative author
id | title | author |
---|---|---|
1 | Harry Potter and the Philosopher's Stone | J. K. Rowling |
2 | Harry Potter and the Chamber of Secrets | Joanne K. Rowling |
3 | Harry Potter and the Prisoner of Azkaban | Rowling, Joanne K. |
disorganised data
book table
id | title | author_id |
---|---|---|
1 | Harry Potter and the Philosopher's Stone | 1 |
2 | Harry Potter and the Chamber of Secrets | 1 |
3 | Harry Potter and the Prisoner of Azkaban | 1 |
normalised data
id | name |
---|---|
1 | J. K. Rowling |
book table
author table
id | title | author_id | cached_author_name |
---|---|---|---|
1 | Harry Potter and the Philosopher's Stone | 1 | J. K. Rowling |
2 | Harry Potter and the Chamber of Secrets | 1 | J. K. Rowling |
3 | Harry Potter and the Prisoner of Azkaban | 1 | J. K. Rowling |
denormalised data
id | name |
---|---|
1 | J. K. Rowling |
book table
author table
class Book(models.Model):
title = models.CharField()
author = models.ForeignKey(Author, reverse_name='books')
cached_author_name = models.CharField()
class Author(models.Model):
name = models.CharField()
def save(self, *args, **kwargs):
for book in self.books:
book.cached_author_name = self.name
book.save()
return super().save(*args, **kwargs)
An app to manage bank accounts and transactions
id | account_id | datetime | amount |
---|---|---|---|
1 | 1 | 2017-04-07 10:23:3 | +10000 |
2 | 1 | 2017-04-07 18:53:3 | -2000 |
3 | 1 | 2017-04-09 06:25:3 | -500 |
4 | 1 | 2017-04-09 10:34:3 | +600 |
normalised data
transaction table
get account balance
from django.db.models import F, Sum
def get_account_balance(account_id):
transactions = Transaction.objects.filter(account_id=account_id)
return transactions.aggregate(balance=Sum(F('amount'))
full balance
Fooclays bank
Welcome Jon Doe
Your current balance is £ 8.100,00
Date and time | Amount | Balance |
---|---|---|
2017-04-07 10:23:3 | £ 10.000,00 | £ 10.000,00 |
2017-04-07 18:53:3 | £ -2.000,00 | £ 8.000,00 |
2017-04-09 06:25:3 | £ -500,00 | £ 7.500,00 |
2017-04-09 10:34:3 | £ +600,00 | £ 8.100,00 |
Transactions for April 2017
Home
Statements
Logout
id | account_id | datetime | amount | cached_balance |
---|---|---|---|---|
1 | 1 | 2017-04-07 10:23:3 | +10000 | 10000 |
2 | 1 | 2017-04-07 18:53:3 | -2000 | 8000 |
3 | 1 | 2017-04-09 06:25:3 | -500 | 7500 |
4 | 1 | 2017-04-09 10:34:3 | +600 | 8100 |
denormalised data
transaction table
denormalized model
class Transaction(models.Model):
cached_balance = models.FloatField()
…
def calculate_balance(self):
transactions = Transaction.objects.filter(
account_id=self.account_id,
datetime_lte=self.datetime
)
return transactions.aggregate(balance=Sum(F('amount'))
def save(self, *args, **kwargs):
if self.balance is None:
self.balance = self.calculate_balance()
return super().save(*args, **kwargs)
Wikipedia
The process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data or by grouping data.
(again)
Denormalization helps when you have performance issues or you expect to have them soon (use the 10/100 rules)
Always define which is the original copy of the data, and what is the denormalized one (a prefix like "cached_" can help).
In this way you can easily amend the the denormalized version, and you avoid that other developers update the wrong copy of the data.
Create, read, update and delete.
Check if you’ve covered all these cases, unit test are good tool in this scenario since will also helps to avoid regression.
Denormalization is not bad, if you use when is necessary and if you implement it with a single source of truth and covering all the cases.
Thanks!
Questions?