Sql Performance for Rails
@h6165
Abhishek Yadav
ரூபீ ப்ரோக்ராமர்
Co-organizer: Chennai.rb
Topics covered
- Indexing
- N+1s
- Joins
- Pagination and Reports
- Geoquerying
- Serialization
Indexing
Indexing - the goal
The goal is the avoid full table scans
Indexing - the basics
- An index is a on disk data structure
- Like the index in a book, it can point to the row where our data sits
- We have to create indexes manually, just like we create tables
- An index is created for a table and a column or columns
Indexing - the basics
Example (Rails and mysql) -
-- Create an index
CREATE INDEX users_on_email ON users (email);
-- It is used for this query
select * from users where email="abhishek@example.com";
ActiveRecord::Migration.add_index :users, :email
User.where(email: "abhishek@example.com")
Indexing - the essentials
- We need indexes for WHERE clause
- We need indexes for ORDER BY clause
- We need indexes for reference columns
- We need compound indexes sometimes
- We don't need low selectivity indexes
- We need function indexes sometimes
- We need partial indexes sometimes
Indexing - ORDER BY clause
A query requesting less data will also do a full table scan if the ORDER BY clause has no index
-- Full table scan if there is no index on created_at
EXPLAIN SELECT * from users ORDER BY created_at DESC LIMIT 10;
+----+-------------+-------+------------+------+---------------+------+---------+------+-------+----------+----------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+-------+----------+----------------+
| 1 | SIMPLE | users | NULL | ALL | NULL | NULL | NULL | NULL | 50019 | 100.00 | Using filesort |
+----+-------------+-------+------------+------+---------------+------+---------+------+-------+----------+----------------+
Indexing - ORDER BY clause - composite index
A query with multiple columns in ORDER BY needs a composite index
-- Full table scan if there is no composite index
explain SELECT * from users ORDER BY created_at, first_name LIMIT 10;
+----+-------------+-------+------------+------+---------------+------+---------+------+-------+----------+----------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+-------+----------+----------------+
| 1 | SIMPLE | users | NULL | ALL | NULL | NULL | NULL | NULL | 50019 | 100.00 | Using filesort |
+----+-------------+-------+------------+------+---------------+------+---------+------+-------+----------+----------------+
Indexing - ORDER BY clause - composite index
- For a composite index sequence is also important
- Best to keep high-cardinalty column first
-- create the compound index
CREATE INDEX users_on_created_at_first_name ON users (created_at, first_name);
-- Check explain
explain SELECT * from users ORDER BY created_at, first_name LIMIT 10;
+----+-------------+-------+------------+-------+---------------+--------------------------------+---------+------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+--------------------------------+---------+------+------+----------+-------+
| 1 | SIMPLE | users | NULL | index | NULL | users_on_created_at_first_name | 773 | NULL | 10 | 100.00 | NULL |
+----+-------------+-------+------------+-------+---------------+--------------------------------+---------+------+------+----------+-------+
Indexing - Pattern match
- A %abhishek% will not be able to use any index
- A abhishek% will be able to use an index
Indexing - Polymorphic associations - composite index
- For a polymorphic association, we need a composite index on the (id,type) combination
Indexing - Reference IDs (foreign-keys)
- The foreign-key ids can use indexes (belongs_to, habtm)
- Need not add index if a foreign key constraint is present
Indexing - Function indexes
- SQL functions applied on a column will not be able to use indexes
- Mysql (pre 5.7) doesn't have function indexes
- Postgresql has them
-- Does not use the index on email (lower is not needed in mysql)
select * from users where lower(email)="abhishek@example.com";
Indexing - Partial indexes
- Partial indexes allow us to index part a subset of a table
- Useful when we have a pseudo-delete implementation
- Available in Postgres
Indexing - Low selectivity indexes
- Indexes on columns that have few distinct values are useless
- They also add to write cost
- Example - gender
N+1s
Handling N+1s
- Are the biggest cause of performance problems in Rails apps
- Problem much bigger with APIs and reports
- Easy to introduce, not as easy to fix
Handling N+1s - the basics
- Happen when we run a query inside a loop
- Can be fixed by eager-loading
# Leads to N+1
@users = User.featured.order(:rating).limit(20)
<% @users.each do |user| %>
<%= user.name %> has <%= user.posts.size %> posts
<% end %>
# Eager loading
@users = User.featured.order(:rating).limit(20).includes(:posts)
@users = User.featured.order(:rating).limit(20).eager_load(:posts)
Handling N+1s - eager loading
- includes fires another query, eager_load does an OUTER JOIN
- includes generally sufficient 80% of times
- eager_load can lead to a slow query
# Leads to N+1
@users = User.featured.order(:rating).limit(20)
<% @users.each do |user| %>
<%= user.name %> has <%= user.posts.size %> posts
<% end %>
# Eager loading
@users = User.featured.order(:rating).limit(20).includes(:posts)
@users = User.featured.order(:rating).limit(20).eager_load(:posts)
Handling N+1s - unintended
- Api building library generally require us to specify methods that get the needed data. And then the methods are run over a collection or records
- If any of the methods contain a query, N+1s happen
- Eager loading doesn't help
# This method will always fire a query. Eager loading or not
def ratings_count
self.ratings.count
end
Handling N+1s - unintended
- The only solution is to not use any query methods in such scenario (API building, reports etc)
- Better to use Ruby code based alternatives in some places
- Eg1 - count
# counting in Ruby is better than using SQL count
def ratings_count
self.ratings.to_a.count
end
Handling N+1s - unintended
- Eg2 - overuse of scopes
class User
# Leads N+1s, despite the eager-loading,
# because Rating.active is a scope that fires another query
def active_ratings
self.ratings.active
end
# A simple alternative, doesn't cause N+1
def active_ratings2
self.ratings{ |r| r.active? }
end
end
Handling N+1s - unintended
- Eg2 - avoid the pluck
# Pluck will always fire a query
def ratings_sum
ratings.pluck(:stars).sum
end
# Here if ratings is eager-loaded, we avoid the extra query
def ratings_sum2
ratings.select(&:stars).sum
end
Handling N+1s - eager-load overdone
- Eager load calls can end-up loading too much data
# Restaurants in the city may be few
# but some of them may have too many comments or check-ins
Restaurant.in_city(city)
.includes(:ratings => [:rater])
.includes(:comments => [:commenter, :comment_ratings])
.includes(:check_ins => [:customer])
Joins
Joins
- Joins can easily become bottlenecks if not done right
- Outer joins and inner joins work differently
- Use of indexes is confusing
- De-normalizing can help
Joins - the risk
- The biggest risk is when we request a join of two large tables
- Database may have to do MxN amount of work
- Indexes can help somewhat, but at least one table will undergo full scan
- With habtms, we might be joining three tables
# Get top twenty restaurants by ratings
# Doest show those without ratings
Restaurant.joins(:ratings)
.select("restaurants.id, count(ratings.restaurant_id) as ratings_count")
.group("ratings.restaurant_id")
.order("ratings_count desc")
.limit(20)
Joins - recommendations
- Inner Joins work best when one set is small
- We can unsure that by introducing some kind of scope, on at least one side
# Get top twenty Restaurants by recent ratings
# Doest show restaurants without ratings
Restaurant.joins(:ratings)
.where("ratings.created_at > ?", 3.month_ago)
.select("restaurants.id, count(ratings.restaurant_id) as ratings_count")
.group("ratings.restaurant_id")
.order("ratings_count desc")
.limit(20)
Joins - sub-queries
- Inner Joins can sometimes be avoided by sub-queries
# Get restaurants that were rated by the user'r friends
rating_ids = Rating.where(user_id: users.friend_ids)
.select(:restaurant_id)
Restaurant.where(id: rating_ids)
# The above goes as one query
Joins - denormalize
- Add columns that keep the data we are getting from the joins
# Add a column ratings_count that keeps the count
# Can use Rails's counter-cache
User.order("ratings_count desc")
.limit(20)
Pagination and reports
Pagination - exports
- In some scenarios, like csv exports, pagination is not naturally required. This can lead to query performance degradation
- Sometimes the degradation is noticed after a duration of use.
- Also happened to me for non csv exports exports (API, we-report)
Pagination - exports
- Solution-1: always have some kind of limits (like data from last one year). Product owners don't always agree to this, so having something arbitrary and sizeable is also ok.
- Solution-2: do reports in background, send them via email
Pagination - pages
- Page based pagination also suffers from performance problem on larger offsets (when page nos are high)
- This is because the sql offset can't work without full table scan
- When data-set grows in size, use alternatives like window-functions
- Infinite scroll type pagination always performs fine, should be chosen when possible.
Geo-quering
Geo-querying - closest to given location
- 'Show me ten restaurants closest to my location' is a common query, specially in mobile applications
- Simplest way of doing this is to calculate distance in a sql query
SELECT
restaurants.*,
ROUND(
3956 * 2 * ASIN ( SQRT (
POWER( SIN((13.14 - restaurants.latitude)*pi()/180 / 2),2)
+ COS(orig_lat * pi()/180)
* COS(restaurants.latitude *pi()/180)
* POWER(SIN((80.12 - restaurants.longitude) *pi()/180 / 2), 2)
) ),2) as distance
FROM
restaurants;
Geo-querying - closest to given location
- The calculation is slow on large datasets as it performs full table scan
- Indexing is difficult, and database dependent
SELECT
restaurants.*,
ROUND(
3956 * 2 * ASIN ( SQRT (
POWER( SIN((13.14 - restaurants.latitude)*pi()/180 / 2),2)
+ COS(orig_lat * pi()/180)
* COS(restaurants.latitude *pi()/180)
* POWER(SIN((80.12 - restaurants.longitude) *pi()/180 / 2), 2)
) ),2) as distance
FROM
restaurants;
Geo-querying - possible alternatives - 1
- Add a max distance clause
SELECT
restaurants.*,
ROUND(
3956 * 2 * ASIN ( SQRT (
POWER( SIN((13.14 - restaurants.latitude)*pi()/180 / 2),2)
+ COS(orig_lat * pi()/180)
* COS(restaurants.latitude *pi()/180)
* POWER(SIN((80.12 - restaurants.longitude) *pi()/180 / 2), 2)
) ),2) as distance
FROM
restaurants;
WHERE
distance < 100
Geo-querying - possible alternatives - 2
- Use geo-spacial indexes
- Postgres has them for a while
- Mysql-5.7 has them
Geo-querying - possible alternatives - 3
- Use Elasticsearch or similar
rails-sql-perf
By Abhishek Yadav
rails-sql-perf
- 1,123