Denormalization

Marco Alabruzzo

Senior Developer @ Festicket

http://marcoala.com - marco.alabruzzo@gmail.com

London Django Meetup Group

11 April 2017

How I Learned to Stop Worrying

and Love Data Redundancy

2) Examples

3) Golden rules on how to make it right

SUMMARY

1) What is denormalisation

DEFINITION

Wikipedia

Denormalization is very similar to cache

The process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data or by grouping data.

the big misundendestaning

Since when we normalize we're also organising the database, this means that when we denormalize we're also disorganising the data.

WRONG

Denormalized data should be very organised.

EXAMPLE: library app

A library app that list books and relative author

EXAMPLE: library app

id	title	author
1	Harry Potter and the Philosopher's Stone	J. K. Rowling
2	Harry Potter and the Chamber of Secrets	Joanne K. Rowling
3	Harry Potter and the Prisoner of Azkaban	Rowling, Joanne K.

disorganised data

book table

EXAMPLE: library app

id	title	author_id
1	Harry Potter and the Philosopher's Stone	1
2	Harry Potter and the Chamber of Secrets	1
3	Harry Potter and the Prisoner of Azkaban	1

normalised data

id	name
1	J. K. Rowling

book table

author table

EXAMPLE: library app

id	title	author_id	cached_author_name
1	Harry Potter and the Philosopher's Stone	1	J. K. Rowling
2	Harry Potter and the Chamber of Secrets	1	J. K. Rowling
3	Harry Potter and the Prisoner of Azkaban	1	J. K. Rowling

denormalised data

id	name
1	J. K. Rowling

book table

author table

EXAMPLE: library app

class Book(models.Model):
    title = models.CharField()
    author = models.ForeignKey(Author, reverse_name='books')
    cached_author_name = models.CharField()


class Author(models.Model):
    name = models.CharField()

   def save(self, *args, **kwargs):
       for book in self.books:
           book.cached_author_name = self.name
           book.save()
       return super().save(*args, **kwargs)

EXAMPLE 2: balance app

An app to manage bank accounts and transactions

EXAMPLE 2: balance app

id	account_id	datetime	amount
1	1	2017-04-07 10:23:3	+10000
2	1	2017-04-07 18:53:3	-2000
3	1	2017-04-09 06:25:3	-500
4	1	2017-04-09 10:34:3	+600

normalised data

transaction table

EXAMPLE 2: balance app

get account balance

from django.db.models import F, Sum

def get_account_balance(account_id):
    transactions = Transaction.objects.filter(account_id=account_id)
    return transactions.aggregate(balance=Sum(F('amount'))

EXAMPLE 2: balance app

full balance

Fooclays bank

Welcome Jon Doe

Your current balance is £ 8.100,00

Date and time	Amount	Balance
2017-04-07 10:23:3	£ 10.000,00	£ 10.000,00
2017-04-07 18:53:3	£ -2.000,00	£ 8.000,00
2017-04-09 06:25:3	£ -500,00	£ 7.500,00
2017-04-09 10:34:3	£ +600,00	£ 8.100,00

Transactions for April 2017

Home

Statements

Logout

EXAMPLE 2: balance app

id	account_id	datetime	amount	cached_balance
1	1	2017-04-07 10:23:3	+10000	10000
2	1	2017-04-07 18:53:3	-2000	8000
3	1	2017-04-09 06:25:3	-500	7500
4	1	2017-04-09 10:34:3	+600	8100

denormalised data

transaction table

EXAMPLE 2: balance app

denormalized model

class Transaction(models.Model):
    cached_balance = models.FloatField()
    …

    def calculate_balance(self):
        transactions = Transaction.objects.filter(
            account_id=self.account_id,
            datetime_lte=self.datetime
        )
        return transactions.aggregate(balance=Sum(F('amount'))

   def save(self, *args, **kwargs):
       if self.balance is None:
           self.balance = self.calculate_balance()
       return super().save(*args, **kwargs)

DEFINITION

Wikipedia

The process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data or by grouping data.

(again)

The 3 Golden rules

1) Never denormalize if you don't need to

Denormalization helps when you have performance issues or you expect to have them soon (use the 10/100 rules)

2) Define a single source of truth

Always define which is the original copy of the data, and what is the denormalized one (a prefix like "cached_" can help).

In this way you can easily amend the the denormalized version, and you avoid that other developers update the wrong copy of the data.

3) Be sure to cover all the CRUD methods

Create, read, update and delete.

Check if you’ve covered all these cases, unit test are good tool in this scenario since will also helps to avoid regression.

bottom line

Denormalization is not bad, if you use when is necessary and if you implement it with a single source of truth and covering all the cases.

Thanks!

Questions?