Denormalization

Marco Alabruzzo

Senior Developer @ Festicket

http://marcoala.com - marco.alabruzzo@gmail.com

London Django Meetup Group

11 April 2017

or

How I Learned to Stop Worrying

and Love Data Redundancy

2) Examples

3) Golden rules on how to make it right

SUMMARY

1) What is denormalisation

DEFINITION

Wikipedia

Denormalization is very similar to cache

The process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data or by grouping data.

the big misundendestaning

Since when we normalize we're also organising the database, this means that when we denormalize we're also disorganising the data.

WRONG

Denormalized data should be very organised.

EXAMPLE: library app

A library app that list books and relative author

EXAMPLE: library app

id title author
1 Harry Potter and the Philosopher's Stone J. K. Rowling
2 Harry Potter and the Chamber of Secrets Joanne K. Rowling
3 Harry Potter and the Prisoner of Azkaban Rowling, Joanne K.

disorganised data

book table

EXAMPLE: library app

id title author_id
1 Harry Potter and the Philosopher's Stone 1
2 Harry Potter and the Chamber of Secrets 1
3 Harry Potter and the Prisoner of Azkaban 1

normalised data

id name
1 J. K. Rowling

book table

author table

EXAMPLE: library app

id title author_id cached_author_name
1 Harry Potter and the Philosopher's Stone 1 J. K. Rowling
2 Harry Potter and the Chamber of Secrets 1 J. K. Rowling
3 Harry Potter and the Prisoner of Azkaban 1 J. K. Rowling

denormalised data

id name
1 J. K. Rowling

book table

author table

EXAMPLE: library app

class Book(models.Model):
    title = models.CharField()
    author = models.ForeignKey(Author, reverse_name='books')
    cached_author_name = models.CharField()


class Author(models.Model):
    name = models.CharField()

   def save(self, *args, **kwargs):
       for book in self.books:
           book.cached_author_name = self.name
           book.save()
       return super().save(*args, **kwargs)

EXAMPLE 2: balance app

An app to manage bank accounts and transactions

EXAMPLE 2: balance app

id account_id datetime amount
1 1 2017-04-07 10:23:3 +10000
2 1 2017-04-07 18:53:3 -2000
3 1 2017-04-09 06:25:3 -500
4 1 2017-04-09 10:34:3 +600

normalised data

transaction table

EXAMPLE 2: balance app

get account balance

from django.db.models import F, Sum

def get_account_balance(account_id):
    transactions = Transaction.objects.filter(account_id=account_id)
    return transactions.aggregate(balance=Sum(F('amount'))

EXAMPLE 2: balance app

full balance

Fooclays bank

Welcome Jon Doe

Your current balance is £ 8.100,00

Date and time Amount Balance
2017-04-07 10:23:3 £ 10.000,00 £ ​10.000,00
2017-04-07 18:53:3 £ -2.000,00 £ ​8.000,00
2017-04-09 06:25:3 £ -500,00 £ ​7.500,00
2017-04-09 10:34:3 £ +600,00 £ ​8.100,00

Transactions for April 2017

Home

Statements

Logout

EXAMPLE 2: balance app

id account_id datetime amount cached_balance
1 1 2017-04-07 10:23:3 +10000 10000
2 1 2017-04-07 18:53:3 -2000 8000
3 1 2017-04-09 06:25:3 -500 7500
4 1 2017-04-09 10:34:3 +600 8100

denormalised data

transaction table

EXAMPLE 2: balance app

denormalized model

class Transaction(models.Model):
    cached_balance = models.FloatField()
    …

    def calculate_balance(self):
        transactions = Transaction.objects.filter(
            account_id=self.account_id,
            datetime_lte=self.datetime
        )
        return transactions.aggregate(balance=Sum(F('amount'))

   def save(self, *args, **kwargs):
       if self.balance is None:
           self.balance = self.calculate_balance()
       return super().save(*args, **kwargs)

DEFINITION

Wikipedia

The process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data or by grouping data.

(again)

The 3 Golden rules

1) Never denormalize if you don't need to

Denormalization helps when you have performance issues or you expect to have them soon (use the 10/100 rules)

2) Define a single source of truth

Always define which is the original copy of the data, and what is the denormalized one (a prefix like "cached_" can help).

In this way you can easily amend the the denormalized version, and you avoid that other developers update the wrong copy of the data.

3) Be sure to cover all the CRUD methods

Create, read, update and delete.

Check if you’ve covered all these cases, unit test are good tool in this scenario since will also helps to avoid regression.

bottom line

Denormalization is not bad, if you use when is necessary and if you implement it with a single source of truth and covering all the cases.

Thanks!

Questions?

Denormalization

By Marco Alabruzzo

Denormalization

  • 1,651