Migrating Search from MySQL to Elasticsearch

Context: Ad Search

  • Search for ads via UI
  • Search for ads via API
  • Enable Cassandra search

Legacy Search

  • 500+ sloc PHP function
    • query = string concatenation
    • multiple joins
  • 3 seconds+ response time

New Search

  • Clean code implementation
  • Extendable schema
  • Less than 1 second response time

Goals

  • Index 90MM+ objects in Elasticsearch

 

  • Sync 1.5MM changes per day

Solution

  • Index 90MM+ objects in Elasticsearch
    • Bulk Indexer Application

 

  • Sync 1.5MM changes per day
    • Live Indexer Application

Live Indexer

1. API: Emit change notification {CREATED, UPDATED, DELETED}

 

2. Kafka: stores change notifications

 

3. Live Indexer: Consume Change notifications

 

4. Live Indexer: Expands change notification to full object

 

5. Elasticsearch: Index object

Bulk Indexer

Cronjob that reads objects from MySQL using an unbuffered connection and streams inputs into Elasticsearch.

 

Indexing rate: ~9k/sec => ~3 hrs

Hardware/Software

  • Elasticsearch
    • 3 Nodes
    • 1 TB HDD
    • 60 Gb RAM
  • Kafka Topic
    • 3 partitions
    • 1 day retention
  • Live indexer/bulk indexer
    • Python, asyncio

Attempted solutions

Attempted solution #1: Logstash

Pros:

  • Minimal code set up
  • Plugin support
  • Builtin Kafka, MySQL, Elasticsearch

Cons:

  • HTTP plugin doesn't support token auth
    • Requires plugins written in Ruby
  • MySQL plugin doesn't support multiple queries

Logstash: Extract, Transform, Load tool

Attempted solution #2: FileBeats

Pros:

  • Minimal code set up
  • Minimal infrastructure

Cons:

  • Requires persisted volume mount on application

File Beat: Application log shipper

Attempted solution #3

Combine Bulk Indexer & Live Indexer

Attempted solution #3

Combine Bulk Indexer & Live Indexer

Attempted solution #3

Combine Bulk Indexer & Live Indexer

150k/sec

11k/sec

4k/sec

Attempted solution #3

Combine Bulk Indexer & Live Indexer

Approach Indexing Rate Index build time
Logstash ~1.2k/sec 21 hours
Live Indexer (python) ~4k/sec 6.25 hours
Bulk Indexer (unbuffered MySQL connection) ~9k/sec ~3 hours

Final solution

Bulk Indexer:

9k/sec => ~3hrs

 

Live Indexer:

max: 4k/sec

 

Index Cutover

Index Cutover

Index Cutover

Index Cutover

End

Search Project

By Simply Ahmazing

Search Project

  • 503