Using PostgreSQL, not Rails, to make Rails faster

Potentially abusing PostgreSQL for fun and profit

Who am I?


Partner at DockYard
GitHub: danmcclain
Twitter: @_danmcclain


Ruby on Rails

  • Ruby MVC Framework
  • Gaining awesome PostgreSQL support
    • Aaron Patterson (@tenderlove) - Refactored the PostgreSQL adapter, and massive performance improvements
    • Many other contributors for Arrays, Hstore, Ranges


Ember.js

  • Javascript MVC Framework
  • Rich, client side applications
  • At DockYard, we use Rails for our API

PostgreSQL

  • Advanced Data Types
    • Arrays
    • Hstore (Hash Store, aka Key-Value store)
    • JSON
    • Ranges
  • Powerful Features
    • Windowing functions
    • Table inheritance
    • Schemas - Separate silos within a single database

What We'll Explore:
Arrays

  • PostgreSQL has built in support for arrays of any datatype
    • That includes user-defined types as well
  • Can be multi-dimensional
    • Child arrays must have the same number of elements as their siblings

What We'll Expore:

HStore

  • Hash Store
  • Key-Value store
  • Values are stored as strings, so plan on converting values if you want to store something else

What We'll Explore 

Rank and CTEs

  • PostgreSQL has window functions
    • perform a calculation over a set of records
  • rank - determines a record's rank over a specific order
    • takes ties into account
  • CTEs or Common Table Expressions allow you to declare a specific query as a temporary table to be used in a complex query

What We'll Explore

JSON Serialization for APIs

  • JSON functions
    • array_to_json & row_to_json
  • PostgreSQL serializes data to send to the client, why not just return JSON?

Patterns we'll examine: Tagging

Patterns we'll examine: Single Table Inheritance


Patterns we'll examine: Leaderboards

Tagging


  • We want to apply tags to notes entered into our system
  • We don't care about generating a tag cloud
  • We want to be able to find all notes with a given tag

Many to Many relation

  • Usually, we'll generate a many-to-many relation between notes and tags
  • We'll have a tag record for each tag applied to a note
  • We'll have to generate a new tag when a user wants to use one we have not previously encountered

Using Arrays for tags

  • PostgreSQL has native support for array columns
  • We can store the tags as an array column on our notes
  • We no longer have to worry about creating tag records when we encounter a new one

Adding our tags array to the model

class AddTagsToNotes < ActiveRecord::Migration
  def change
    add_column :notes, :tags, :string, array: true
  end
end

Finding Tagged Notes

  • Now that we are tagging our notes, we need a way of finding them
  • PostgreSQL has multiple ways of checking if an array contains a value
    • ANY(...)
    • @> - contains operator

ANY

  • We can use the ANY() function to see if a value is stored in any element of the array

SELECT notes.*
FROM notes
WHERE 'useful' = ANY(notes.tags)
  • This won't use any indexes 

@> - Contains

  • We can use the @> operator to see if an array on the left contains the array on the right of the operator (order matters)

SELECT notes.*
FROM notes
WHERE notes.tags @> '{useful}'
  • Without an index, this has the same performance as ANY()
  • When we index the array with GIN (Generalized Inverted Index), we can see up to ~50 X speed improvement

GIN indexing our Array

class AddTagsToNotes < ActiveRecord::Migration
  def change
    add_index :notes, :tags, using: :gin
  end
end

Some Benchmarking

  • 100,000 notes, each with 100 tags
  • 10,000 possible tags
  • Table was cached by scanning it once
  • Run on a MacBook Pro Retina with 16GB of RAM and SSD

Finding Via Has_Many

First Run - 370 ms

Nested Loop
   ->  Bitmap Heap Scan on post_tags
         Recheck Cond: (tag_id = 1)
         ->  Bitmap Index Scan on index_post_tags_on_tag_id
               Index Cond: (tag_id = 1)
   ->  Index Scan using posts_pkey on posts
         Index Cond: (id = post_tags.post_id) 
 Total runtime: 370.139 ms

Subsequent calls - 13 ms

Nested Loop
  ->  Bitmap Heap Scan on post_tags
        Recheck Cond: (tag_id = 2)
        ->  Bitmap Index Scan on index_post_tags_on_tag_id
              Index Cond: (tag_id = 2)
  ->  Index Scan using posts_pkey on posts
        Index Cond: (id = post_tags.post_id)
Total runtime: 13.511 ms

Array: Finding via ANY

First run: 503 ms

Aggregate  
  ->  Seq Scan on posts
        Filter: ('Tag #1'::text = ANY ((array_tags)::text[]))
        Rows Removed by Filter: 98989
Total runtime: 503.517 ms

Subsequent runs: 500 ms

Aggregate
  ->  Seq Scan on posts
        Filter: ('Tag #1'::text = ANY ((array_tags)::text[]))
        Rows Removed by Filter: 98989
Total runtime: 499.618 ms

Array: Finding via @>

First run - Pre index: 536 ms

Aggregate
  ->  Seq Scan on posts
        Filter: (array_tags @> '{"Tag #1"}'::character varying[])
        Rows Removed by Filter: 98989
Total runtime: 536.183 ms
First run - Post index: 8ms
 Aggregate
  ->  Bitmap Heap Scan on posts
       Recheck Cond: (gin_tags @> '{"Tag #1"}'::character varying[])
       ->  Bitmap Index Scan on index_posts_on_gin_tags
             Index Cond: (gin_tags @> '{"Tag #1"}'::character varying[])
 Total runtime: 8.056 ms

Results

Method Run 1 Run 2
has_many 370 ms 13 ms
Array: ANY 503 ms 500 ms
Array: @< (no index) 499 ms 500 ms
Array: @< (indexed) 8 ms 5 ms

Drawbacks

  • You lose the ability to efficiently create a list of tags
    • Getting a distinct list of tags from a series of arrays is slow
    • It's also fairly complex
  • GIN can be slow to update
    • Multiple inserts into index (one for each key extracted)
    • Can be mitigate somewhat using FASTUPDATE

Notes

  • Database-side Cache matters
    • You will see a performance increase on subsequent runs, as the results will be in memory
    • Results across the board were quicker after the first table scan, as the table was in memory after that

Single Table Inheritence

  • We are using single table inheritance to differentiate between multiple user types
  • We want to add a set of object specific attributes for each user type
  • Previously, we could move these attributes to a profile object and that profile would belong to a class

HStore

  • PostgreSQL has a Hstore or Hash Store column type
  • We can store a set of key-value properties on this column

Enabling Hstore

  • Hstore was added to PostgreSQL 9.2 as an extension
  • Rails 4.0 added enable_extension which will allow you to enable an extension via a migration
  • enable_extension 'hstore' will enable hstore in your database

An Example

  • Joe Him at DevMynd details using Hstore with Rails 3.2 via activerecord-postgres-hstore [1]
    • He extends ActiveRecord::Base to create an hstore_attribute for his models
  • We can apply the concepts to Rails 4




[1]: http://www.devmynd.com/blog/2013-3-single-table-inheritance-hstore-lovely-combination

    Our Member Class

    • The Member class will serve as the base class for our STI infrastructure
    
    

    class CreateMembers < ActiveRecord::Migration
      def change
        enable_extension 'hstore' # Let's enable Hstore
    
        create_table :members do |t|
          t.string :username
          t.string :type
          t.hstore :data # We'll store all our attributes in a column
                         # called data
          t.timestamps
        end
      end
    end

    Extending activeRecord

    module HstoreAccessor
        def hstore_accessor(hstore_attribute, *keys)
          Array(keys).flatten.each do |key|
            define_method("#{key}=") do |value|
              send("#{hstore_attribute}=", (send(hstore_attribute) || {}).merge(key.to_s => value))
              send("#{hstore_attribute}_will_change!")
            end
            define_method(key) do
              send(hstore_attribute) && send(hstore_attribute)[key.to_s]
            end
          end
        end
      end
    end
    ActiveRecord::Base.send(:extend, HstoreAccessor)

    • We define hstore_accessor, which will define a getter and setter against our hstore column

    Creating User class

    • Users have an email and age

    class User < Member
      hstore_accessor :data, :email, :age
    end
    • Now we can access User#email, and User#age
    • We can also validate these attributes
    class User < Member
      hstore_accessor :data, :email, :age
      validates :email, :age, presence: true
    end
    
    User.new.email #=> nil
    
    User.new.valid? #=> false
    

    Creating another class

    class TwitterUser < Member
      hstore_accessor :data, :twitter_handle
    end
    • Since we only set up twitter_handle, we don't have access to email, which we set up on our User model
    twitter_user = TwitterUser.new twitter_handle: '_danmcclain', username: 'dan'
    twitter_user.twitter_handle   #=> "_danmcclain"
    twitter_user.username         #=> "dan"
    twitter_user.email            #=> NoMethodError
    

    Limitations

    • HStore stores values as strings, so you have to convert values yourself
      • We could extend hstore_attribute so that it performs type casting
      • With 9.4, we could alternatively use JSONB for indexable, typed values

    Leaderboards


    • Need a leaderboard for each game
    • Need the ability to return a user's best score for a given game
    • Want to page through our scores, but maintain ranking

    Rank()

    • PostgreSQL provides the Rank windowing function
    • Takes ties into account when ranking
    SELECT rank() OVER(ORDER BY points DESC), username, points
    FROM scores
    WHERE game = 'Barking Bostons'
    • Rank will be scoped to the current query
    ┌──────┬──────────┬────────┐
    │ rank │ username │ points │
    ├──────┼──────────┼────────┤
    │    1 │ Brian    │    300 │
    │    2 │ Dan      │    200 │
    │    3 │ Dan      │    100 │
    │    4 │ Doug     │     30 │
    └──────┴──────────┴────────┘
    

    First attempt at getting a user's rank

    SELECT rank() OVER (ORDER BY scores.points DESC), username, points
    FROM scores
    WHERE game = 'Barking Bostons' AND username = 'Dan'
    • We get back the user's scores for the game
    • BUT the ranks are only for the user's scores, it does not take into account other users
    ┌──────┬──────────┬────────┐
    │ rank │ username │ points │
    ├──────┼──────────┼────────┤
    │    1 │ Dan      │    200 │
    │    2 │ Dan      │    100 │
    └──────┴──────────┴────────┘
    

    Using Common Table Expressions to get the ranks of a User's Scores

    • Common Table Expressions (CTEs) allow you to alias a subquery
    WITH top_scores_for_game AS (
      SELECT rank() OVER (ORDER BY points DESC), username, points
      FROM scores
      WHERE scores.game = 'Barking Bostons'
    )
    SELECT top_scores_for_game.*
    FROM top_scores_for_game
    • The above query will return all the results from the WITH statement

    Attempt #2

    • We'll use a CTE to get all the scores for a game, then find our user within those results

    WITH top_scores_for_game AS (
      SELECT rank() OVER (ORDER BY points DESC), username, points
      FROM scores
      WHERE scores.game = 'Barking Bostons'
    )
    SELECT top_scores_for_game.*
    FROM top_scores_for_game
    WHERE top_scores_for_game.username = 'Dan'
    ┌──────┬──────────┬────────┐
    │ rank │ username │ points │
    ├──────┼──────────┼────────┤
    │    2 │ Dan      │    200 │
    │    3 │ Dan      │    100 │
    └──────┴──────────┴────────┘
    

    Getting ranked results in ActiveRecord

    • We can drop this whole block of SQL into a find_by_sql call to get models back
    query = <<-SQL
    WITH top_scores_for_game AS (
      SELECT rank() OVER (ORDER BY points DESC), username, points
      FROM scores
      WHERE scores.game = 'Barking Bostons'
    )
    SELECT top_scores_for_game.*
    FROM top_scores_for_game
    WHERE top_scores_for_game.username = 'Dan'
    SQL
    
    Score.find_by_sql query
    => [#<Score ... >, #<Score ... >]
    

    Using postgres_ext for rank and CTE

    • postgres_ext has support for ranking results and querying against CTEs
    • Return the scores ranked over their points for the game named 'Barking Bostons'
    Score.ranked(points: :desc).where(game: 'Barking Bostons')

    • Now lets get the user's ranked scores for the game using a CTE

    Score.from_cte('top_scores_for_game',
      Score.ranked(points: :desc).where(game: 'Barking Bostons')).where(username: 'Dan')

    JSON Serialization

    Ember-Data accepts the following JSON format
    {
      posts: [
        {
          id: 1,
          title: 'My awesome blog post',
          content: 'Loren ipsum',
          tag_ids: [1,2]
        }
      ],
      tags: [
        {
          id: 1,
          name: 'Rails'
        },
        {
          id: 2,
          name: 'PostgreSQL'
        }
      ]
     }
    

    Warning

    These concepts have not been thoroughly tested in production. Performance gains highlighted were taken with naive benchmarks

    JSON Serialization

    Generating a JSON response for your API endpoint
    1. Rails receives request for JSON payload
    2. Rails uses ActiveRecord to generate query 
    3. ActiveRecord sends query to PostgreSQL
    4. PostgreSQL builds dataset, serializes it and sends to client
    5. ActiveRecord receives dataset
    6. ActiveRecord deserializes dataset
    7. ActiveRecord generates Ruby objects from dataset
    8. Rails converts Ruby objects to JSON string
    9. Rails sends JSON response to the browser

    JSON Serialization

    • ActiveRecord generates Ruby objects from dataset
      • Each row is one model object
      • Model objects have attribute object per column
        • Multiple objects created per attribute
    • Rails converts Ruby objects to JSON string
      • Iterates through Ruby objects and generates JSON string

    SLOW!!

    JSON Serialization

    Generating a JSON response for your API endpoint
    1. Rails receives request for JSON payload
    2. Rails uses ActiveRecord to generate query 
    3. ActiveRecord sends query to PostgreSQL
    4. PostgreSQL builds dataset, serializes it and sends to client
    5. ActiveRecord receives dataset
    6. ActiveRecord deserializes dataset
    7. ActiveRecord generates Ruby objects from dataset
    8. Rails converts Ruby objects to JSON string
    9. Rails sends JSON response to the browser

    JSON Serialization

    Tell PostgreSQL to return a JSON string instead of a set of rows
    WITH notes_ids AS (
      SELECT id 
      FROM "notes" 
    ), tag_ids_by_notes AS (
      SELECT "tags"."note_id", array_agg("tags"."id") AS tag_ids 
      FROM "tags" GROUP BY "tags"."note_id" 
      HAVING "tags"."note_id" IN (SELECT "notes_ids"."id" FROM "notes_ids")
    ), tags_ids AS (
      SELECT id   FROM "tags"
    ), tags_attributes_filter AS (
      SELECT "tags"."id", "tags"."name", "tags"."note_id" 
      FROM "tags" 
      WHERE "tags"."note_id" IN (SELECT "notes_ids"."id" FROM "notes_ids")
    ), tags_as_json_array AS (
      SELECT array_to_json(array_agg(row_to_json(tags_attributes_filter))) AS tags, 1 AS match
      FROM "tags_attributes_filter"
    ), notes_attributes_filter AS (
      SELECT "notes"."id", "notes"."title", "notes"."content", coalesce("tag_ids_by_notes"."tag_ids", '{}'::int[]) AS tag_ids 
      FROM "notes" 
      LEFT OUTER JOIN "tag_ids_by_notes" 
      ON "notes"."id" = "tag_ids_by_notes"."note_id"
    ), notes_as_json_array AS (
      SELECT array_to_json(array_agg(row_to_json(notes_attributes_filter))) AS notes, 1 AS match 
      FROM "notes_attributes_filter"
    ), jsons AS (
      SELECT "tags_as_json_array"."tags", "notes_as_json_array"."notes" 
      FROM "tags_as_json_array" 
      INNER JOIN "notes_as_json_array" 
      ON "tags_as_json_array"."match" = "notes_as_json_array"."match"
    )
    SELECT row_to_json(jsons) 
    FROM "jsons"
    

    Preparing the records we want

    WITH notes_ids AS (
      SELECT id 
      FROM "notes" 
      -- We'd filter our notes that we want back here
    ), tag_ids_by_notes AS (
      SELECT "tags"."note_id", array_agg("tags"."id") AS tag_ids 
      FROM "tags" GROUP BY "tags"."note_id" 
      HAVING "tags"."note_id" IN (SELECT "notes_ids"."id" FROM "notes_ids")
    ),
    

    Generating the proper Tag format

     tags_attributes_filter AS (
      SELECT "tags"."id", "tags"."name", "tags"."note_id" 
      FROM "tags" 
      WHERE "tags"."note_id" IN (SELECT "notes_ids"."id" FROM "notes_ids")
    ), tags_as_json_array AS (
      SELECT array_to_json(array_agg(row_to_json(tags_attributes_filter))) AS tags, 1 AS match
      FROM "tags_attributes_filter"
    )
    

    Generating the proper Note format

     notes_attributes_filter AS (
      SELECT "notes"."id", "notes"."title", "notes"."content", coalesce("tag_ids_by_notes"."tag_ids", '{}'::int[]) AS tag_ids 
      FROM "notes" 
      LEFT OUTER JOIN "tag_ids_by_notes" 
      ON "notes"."id" = "tag_ids_by_notes"."note_id"
    ), notes_as_json_array AS (
      SELECT array_to_json(array_agg(row_to_json(notes_attributes_filter))) AS notes, 1 AS match 
      FROM "notes_attributes_filter"
    )
    

    Combining our Notes and Tags

    WITH jsons AS (
      SELECT "tags_as_json_array"."tags", "notes_as_json_array"."notes" 
      FROM "tags_as_json_array" 
      INNER JOIN "notes_as_json_array" 
      ON "tags_as_json_array"."match" = "notes_as_json_array"."match"
    )
    SELECT row_to_json(jsons) 
    FROM "jsons"
    

    JSON Serialization

    • Lots of SQL to generate the proper payload
    • Let's benchmark the results

    (Naive) Benchmarks

    Ruby PostgreSQL
    1000 Notes, 10000 Tags 4407 ms 66 ms 67.2 X
    500 Notes, 5000 Tags
    2071 ms 26 ms 78.5 X
    100 Notes, 1000 Tags
    427 ms 12 ms 34.2 X
    10 Notes, 100 Tags
    46 ms 10 ms 4.6 X

    Concerns:

    • One huge query
      • Puts *all* the load on PostgreSQL
      • Could fall over if you have a model with multiple associations
      • Tested with a single association
      • Split queries and combine on the application side

    Next Steps

    • Unreasonable to generate the SQL command manually
    • Let's leverage ActiveModel::Serializers (AMS)
      • DSL for specifying JSON schema
      • We monkey patch this to generate the commands
    • PostgresExt-Serializers patch AMS to use PostgreSQL

    Wrapping Up

    • PostgreSQL has some really powerful features baked in
    • Your situation ultimately dictates the best approach
    • Play with different solutions to see what is the best performance/maintainability trade off for your specific use case
    • Don't just use it because "It's new/cool"

    Questions?

    Thank you!

    Links

    Made with Slides.com