Sql Performance for Rails

@h6165

Abhishek Yadav

ரூபீ ப்ரோக்ராமர்

Co-organizer: Chennai.rb

Topics covered

  • Indexing
  • N+1s
  • Joins
  • Pagination and Reports
  • Geoquerying
  • Serialization

Indexing

Indexing - the goal

The goal is the avoid full table scans

Indexing - the basics

  • An index is a on disk data structure
  • Like the index in a book, it can point to the row where our data sits
  • We have to create indexes manually, just like we create tables
  • An index is created for a table and a column or columns

Indexing - the basics

Example (Rails and mysql) -


-- Create an index

CREATE INDEX users_on_email ON users (email);


-- It is used for this query
select * from users where email="abhishek@example.com";

ActiveRecord::Migration.add_index :users, :email

User.where(email: "abhishek@example.com")

Indexing - the essentials

  • We need indexes for WHERE clause
  • We need indexes for ORDER BY clause
  • We need indexes for reference columns
  • We need compound indexes sometimes
  • We don't need low selectivity indexes
  • We need function indexes sometimes
  • We need partial indexes sometimes

Indexing - ORDER BY clause

A query requesting less data will also do a full table scan if the ORDER BY clause has no index


-- Full table scan if there is no index on created_at

EXPLAIN SELECT * from users ORDER BY created_at DESC LIMIT 10;

+----+-------------+-------+------------+------+---------------+------+---------+------+-------+----------+----------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra          |
+----+-------------+-------+------------+------+---------------+------+---------+------+-------+----------+----------------+
|  1 | SIMPLE      | users | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 50019 |   100.00 | Using filesort |
+----+-------------+-------+------------+------+---------------+------+---------+------+-------+----------+----------------+

Indexing - ORDER BY clause - composite index

A query with multiple columns in ORDER BY needs a composite index


-- Full table scan if there is no composite index

explain SELECT * from users ORDER BY created_at, first_name LIMIT 10;

+----+-------------+-------+------------+------+---------------+------+---------+------+-------+----------+----------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra          |
+----+-------------+-------+------------+------+---------------+------+---------+------+-------+----------+----------------+
|  1 | SIMPLE      | users | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 50019 |   100.00 | Using filesort |
+----+-------------+-------+------------+------+---------------+------+---------+------+-------+----------+----------------+

Indexing - ORDER BY clause - composite index

  • For a composite index sequence is also important
  • Best to keep high-cardinalty column first

-- create the compound index

CREATE INDEX users_on_created_at_first_name ON users (created_at, first_name);


-- Check explain

explain SELECT * from users ORDER BY created_at, first_name LIMIT 10;
+----+-------------+-------+------------+-------+---------------+--------------------------------+---------+------+------+----------+-------+
| id | select_type | table | partitions | type  | possible_keys | key                            | key_len | ref  | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+--------------------------------+---------+------+------+----------+-------+
|  1 | SIMPLE      | users | NULL       | index | NULL          | users_on_created_at_first_name | 773     | NULL |   10 |   100.00 | NULL  |
+----+-------------+-------+------------+-------+---------------+--------------------------------+---------+------+------+----------+-------+

Indexing - Pattern match 

  • A %abhishek% will not be able to use any index
  • A  abhishek% will be able to use an index

Indexing - Polymorphic associations - composite index

  • For a polymorphic association, we need a composite index on the (id,type) combination

Indexing - Reference IDs (foreign-keys)

  • The foreign-key ids can use indexes (belongs_to, habtm)
  • Need not add index if a foreign key constraint is present

Indexing - Function indexes

  • SQL functions applied on a column will not be able to use indexes
  • Mysql (pre 5.7) doesn't have function indexes
  • Postgresql has them

-- Does not use the index on email (lower is not needed in mysql)
select * from users where lower(email)="abhishek@example.com";

Indexing - Partial indexes

  • Partial indexes allow us to index part a subset of a table
  • Useful when we have a pseudo-delete implementation
  • Available in Postgres

Indexing - Low selectivity indexes

  • Indexes on columns that have few distinct values are useless
  • They also add to write cost
  • Example - gender

N+1s

Handling N+1s

  • Are the biggest cause of performance problems in Rails apps
  • Problem much bigger with APIs and reports
  • Easy to introduce, not as easy to fix

Handling N+1s - the basics

  • Happen when we run a query inside a loop
  • Can be fixed by eager-loading

# Leads to N+1
@users = User.featured.order(:rating).limit(20)

<% @users.each do |user| %>
  <%= user.name %> has <%= user.posts.size %> posts
<% end %>


# Eager loading
@users = User.featured.order(:rating).limit(20).includes(:posts)
@users = User.featured.order(:rating).limit(20).eager_load(:posts)

Handling N+1s - eager loading

  • includes fires another query, eager_load does an OUTER JOIN
  • includes generally sufficient 80% of times
  • eager_load can lead to a slow query

# Leads to N+1
@users = User.featured.order(:rating).limit(20)

<% @users.each do |user| %>
  <%= user.name %> has <%= user.posts.size %> posts
<% end %>


# Eager loading
@users = User.featured.order(:rating).limit(20).includes(:posts)
@users = User.featured.order(:rating).limit(20).eager_load(:posts)

Handling N+1s - unintended

  • Api building library generally require us to specify methods that get the needed data. And then the methods are run over a collection or records
  • If any of the methods contain a query, N+1s happen
  • Eager loading doesn't help
# This method will always fire a query. Eager loading or not
def ratings_count
  self.ratings.count
end

Handling N+1s - unintended

  • The only solution is to not use any query methods in such scenario (API building, reports etc)
  • Better to use Ruby code based alternatives in some places
  • Eg1 - count
# counting in Ruby is better than using SQL count

def ratings_count
  self.ratings.to_a.count
end

Handling N+1s - unintended

  • Eg2 - overuse of scopes

class User

  # Leads N+1s, despite the eager-loading,
  # because Rating.active is a scope that fires another query
  def active_ratings
    self.ratings.active
  end

  # A simple alternative, doesn't cause N+1
  def active_ratings2
    self.ratings{ |r| r.active? }
  end

end

Handling N+1s - unintended

  • Eg2 - avoid the pluck


# Pluck will always fire a query
def ratings_sum
  ratings.pluck(:stars).sum
end

# Here if ratings is eager-loaded, we avoid the extra query
def ratings_sum2
  ratings.select(&:stars).sum
end

Handling N+1s - eager-load overdone

  • Eager load calls can end-up loading too much data


# Restaurants in the city may be few
# but some of them may have too many comments or check-ins

Restaurant.in_city(city)
  .includes(:ratings => [:rater])
  .includes(:comments => [:commenter, :comment_ratings])
  .includes(:check_ins => [:customer])

Joins

Joins

  • Joins can easily become bottlenecks if not done right
  • Outer joins and inner joins work differently
  • Use of indexes is confusing
  • De-normalizing can help

Joins - the risk

  • The biggest risk is when we request a join of two large tables
  • Database may have to do MxN amount of work
  • Indexes can help somewhat, but at least one table will undergo full scan
  • With habtms, we might be joining three tables

# Get top twenty restaurants by ratings
# Doest show those without ratings

Restaurant.joins(:ratings)
    .select("restaurants.id, count(ratings.restaurant_id) as ratings_count")
    .group("ratings.restaurant_id")
    .order("ratings_count desc")
    .limit(20)

Joins - recommendations

  • Inner Joins work best when one set is small
  • We can unsure that by introducing some kind of scope, on at least one side

# Get top twenty Restaurants by recent ratings
# Doest show restaurants without ratings

Restaurant.joins(:ratings)
    .where("ratings.created_at > ?", 3.month_ago)
    .select("restaurants.id, count(ratings.restaurant_id) as ratings_count")
    .group("ratings.restaurant_id")
    .order("ratings_count desc")
    .limit(20)

Joins - sub-queries

  • Inner Joins can sometimes be avoided by sub-queries

# Get restaurants that were rated by the user'r friends

rating_ids = Rating.where(user_id: users.friend_ids)
                   .select(:restaurant_id)
Restaurant.where(id: rating_ids)

# The above goes as one query

Joins - denormalize

  • Add columns that keep the data we are getting from the joins

# Add a column ratings_count that keeps the count
# Can use Rails's counter-cache

User.order("ratings_count desc")
    .limit(20)

Pagination and reports

Pagination - exports

  • In some scenarios, like csv exports, pagination is not naturally required. This can lead to query performance degradation
  • Sometimes the degradation is noticed after a duration of use.
  • Also happened to me for non csv exports exports (API, we-report)

Pagination - exports

  • Solution-1: always have some kind of limits (like data from last one year). Product owners don't always agree to this, so having something arbitrary and sizeable is also ok.
  • Solution-2: do reports in background, send them via email

Pagination - pages

  • Page based pagination also suffers from performance problem on larger offsets (when page nos are high)
  • This is because the sql offset can't work without full table scan
  • When data-set grows in size, use alternatives like window-functions
  • Infinite scroll type pagination always performs fine, should be chosen when possible. 

Geo-quering

Geo-querying - closest to given location

  • 'Show me ten restaurants closest to my location' is a common query, specially in mobile applications
  • Simplest way of doing this is to calculate distance in a sql query
SELECT
restaurants.*,
  ROUND(
    3956 * 2 * ASIN ( SQRT (
      POWER( SIN((13.14 - restaurants.latitude)*pi()/180 / 2),2)
      + COS(orig_lat * pi()/180) 
      * COS(restaurants.latitude *pi()/180) 
      * POWER(SIN((80.12 - restaurants.longitude) *pi()/180 / 2), 2) 
    ) ),2) as distance
FROM
 restaurants;

Geo-querying - closest to given location

  • The calculation is slow on large datasets as it performs full table scan
  • Indexing is difficult, and database dependent
SELECT
restaurants.*,
  ROUND(
    3956 * 2 * ASIN ( SQRT (
      POWER( SIN((13.14 - restaurants.latitude)*pi()/180 / 2),2)
      + COS(orig_lat * pi()/180) 
      * COS(restaurants.latitude *pi()/180) 
      * POWER(SIN((80.12 - restaurants.longitude) *pi()/180 / 2), 2) 
    ) ),2) as distance
FROM
 restaurants;

Geo-querying - possible alternatives - 1

  • Add a max distance clause
SELECT
restaurants.*,
  ROUND(
    3956 * 2 * ASIN ( SQRT (
      POWER( SIN((13.14 - restaurants.latitude)*pi()/180 / 2),2)
      + COS(orig_lat * pi()/180) 
      * COS(restaurants.latitude *pi()/180) 
      * POWER(SIN((80.12 - restaurants.longitude) *pi()/180 / 2), 2) 
    ) ),2) as distance
FROM
 restaurants;
WHERE 
 distance < 100

Geo-querying - possible alternatives - 2

  • Use geo-spacial indexes
  • Postgres has them for a while
  • Mysql-5.7 has them

Geo-querying - possible alternatives - 3

  • Use Elasticsearch or similar

rails-sql-perf

By Abhishek Yadav

rails-sql-perf

  • 1,015