CMSC389L

Week 13

Search Engines w/ Elasticsearch

Friday, April 27, 2018

Demo Setup

First, let's create an Elasticsearch Domain. We'll need it later.

 

 

ter.ps/pccS18

Search Engines

Local Event App

Let's "build" an app to search for local events.

John Berryman: https://pyohio.org/schedule/presentation/258/

Why Search Engines

  • Databases are good for storing and retrieving data
    • but not searching

 

  • Want to find docs with specific terms and phrases?
  • Want to score and sort documents by relevance?
  • Want to perform complex query operations?

 

Then you need a search engine.

John Berryman: https://pyohio.org/schedule/presentation/258/
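To make the term-lookup and relevance-sorting points concrete, here is a minimal sketch of an inverted index, the core data structure behind full-text search. The event texts, the whitespace tokenizer, and the count-based scoring are all illustrative assumptions; real engines use proper analyzers and scoring functions such as TF-IDF or BM25.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real analyzers do much more."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, docs, query):
    """Naive relevance: count occurrences of the query terms in each document."""
    scores = defaultdict(int)
    for term in tokenize(query):
        for doc_id in index.get(term, ()):
            scores[doc_id] += tokenize(docs[doc_id]).count(term)
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical local events, keyed by document id.
events = {
    1: "jazz night at the college park pub",
    2: "college park farmers market",
    3: "late night jazz and more jazz downtown",
}
index = build_inverted_index(events)
results = search(index, events, "jazz night")
print(results)
```

Looking up a term is a dictionary access rather than a scan over every row, which is the main reason a search engine answers these queries faster than a general-purpose database.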

Search Engine Use Cases

  • Search Engines
    • Find all products for "running shoes".
  • Log Search/Analysis
    • Return all logs with user ID "12345" in them.
    • How many 500-errors in the past hour for that user?
  • Geo Search
    • Return all "Papa Johns" ordered by proximity to (38.989697, -76.937760).
  • Auto Completion
    • Auto complete "maryla..."
  • ...
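As a rough illustration, the use cases above could be expressed as Elasticsearch query DSL bodies like the ones below. The field names (`name`, `user_id`, `status`, `timestamp`, `location`) are assumptions made for this sketch, and the geo sort presumes `location` is mapped as a `geo_point`.

```python
import json

# Full-text search: products matching "running shoes".
product_search = {"query": {"match": {"name": "running shoes"}}}

# Log search: 500-errors for user 12345 within the past hour.
log_search = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"user_id": "12345"}},
                {"term": {"status": 500}},
                {"range": {"timestamp": {"gte": "now-1h"}}},
            ]
        }
    }
}

# Geo search: "Papa Johns" results ordered by distance from a point.
geo_search = {
    "query": {"match": {"name": "Papa Johns"}},
    "sort": [
        {
            "_geo_distance": {
                "location": {"lat": 38.989697, "lon": -76.937760},
                "order": "asc",
                "unit": "km",
            }
        }
    ],
}

# Auto completion: treat the last word typed as a prefix.
autocomplete = {"query": {"match_phrase_prefix": {"name": "maryla"}}}

# Each body would be sent as JSON to an index's _search endpoint.
for body in (product_search, log_search, geo_search, autocomplete):
    print(json.dumps(body))
```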

Elasticsearch

Elasticsearch at a High Level

How it works: Documents

  • Indexable content is stored as JSON documents
    • Arbitrary structure, no schema required
  • Each document consists of fields (key-value pairs)
  • Each document has an id
{
    "_id": "938hon049j4039f",
    "name": "John Dough",
    "birthday": "1970-07-01T11:50:16-05:00",
    "passions": [
        "water skiing",
        "coffee",
        "wood working"
    ],
    "address": {
        "line_1": "1 Margrove Rd.",
        "line_2": "",
        "city": "College Park",
        "country": "United States",
        "zip": 20742
    }
}
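A small sketch of how the fields of a document like this one can be enumerated. Queries usually refer to fields inside a nested object by dotted path, such as `address.city`. The helper `field_paths` is hypothetical and not part of any Elasticsearch client; the document is a shortened copy of the one above.

```python
def field_paths(doc, prefix=""):
    """List field names, using dotted paths for nested objects."""
    paths = []
    for key, value in doc.items():
        if key == "_id":
            continue  # the id is metadata, not a searchable field of the body
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            paths.extend(field_paths(value, prefix=path + "."))
        else:
            paths.append(path)
    return paths


person = {
    "_id": "938hon049j4039f",
    "name": "John Dough",
    "passions": ["water skiing", "coffee", "wood working"],
    "address": {"city": "College Park", "zip": 20742},
}
paths = field_paths(person)
print(paths)
```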

How it works: Types

  • Types:
    • Each document belongs to a type
    • A type can optionally declare a mapping
      • Good for performance
      • Specifies each field's type, analyzer, etc.
"mappings": {
   "people": {
      "properties": {
         "name": {
            "type": "string"
         },
         "address": {
            "type": "string"
         }
      }
   },
   "transactions": {
      "properties": {
         "timestamp": {
            "type": "date",
            "format": "strict_date_optional_time"
         },
         "message": {
            "type": "string"
         }
      }
   }
}

How it works: Indexes

  • Indexes are just namespaces for your types
    • Nothing to do with database indexes!
http://localhost:9200/<index>/<type>/<id>
http://localhost:9200/data/transactions/<id>
http://localhost:9200/data/products/<id>
http://localhost:9200/colink/tweets/<id>
http://localhost:9200/colink/pictures/<id>

http://localhost:9200/johndough/tweets/<id>
http://localhost:9200/johndough/pictures/<id>
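The URL scheme above can be sketched in a few lines of standard-library Python. The helper names `document_url` and `index_request` are made up for this example, and the request is only constructed, never sent.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def document_url(index, doc_type, doc_id, host="http://localhost:9200"):
    """Build <host>/<index>/<type>/<id>, escaping each path segment."""
    parts = [quote(part, safe="") for part in (index, doc_type, doc_id)]
    return f"{host}/" + "/".join(parts)


def index_request(index, doc_type, doc_id, body):
    """Prepare (but do not send) a PUT that stores `body` at the document URL."""
    return Request(
        document_url(index, doc_type, doc_id),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = index_request("johndough", "tweets", "1", {"text": "hello"})
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would actually send it, assuming a node is listening on that host.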

Low-Level Architecture

  • A large index may not fit on a single node
    • Instead, split the index into smaller pieces (shards)
    • Spread the shards across multiple nodes in a cluster
  • What happens if a node holding a shard crashes?
    • Replicate each shard on other nodes
  • Default: 5 primary shards, 1 replica of each
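A minimal sketch of the idea behind sharding: a document is assigned to a primary shard by hashing its routing key (by default its id) modulo the number of primary shards. Elasticsearch uses its own hash function; `zlib.crc32` is only a stand-in here, and both function names are hypothetical.

```python
import zlib


def pick_shard(routing_key, num_primary_shards=5):
    """Choose a primary shard as hash(routing_key) % num_primary_shards."""
    # crc32 is a stand-in hash for illustration, not what Elasticsearch uses.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


def total_shard_copies(num_primary_shards=5, num_replicas=1):
    """Each primary shard plus its replicas must be placed somewhere in the cluster."""
    return num_primary_shards * (1 + num_replicas)


shard = pick_shard("938hon049j4039f")
print(shard, total_shard_copies())
```

Because the shard is derived from the number of primary shards, that number is normally fixed when the index is created, while the replica count can be changed later.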

Apache Lucene

  • Apache Lucene (Java library)
    • High-performance, full-text search engine
    • Single index on a single node
  • So why Elasticsearch?
    • ES provides a management layer on top of Lucene
    • Provides:
      • Replication
      • Traffic distribution
      • Consensus + failover
      • Data sharding
      • Support for multiple indexes
      • HTTP API
      • ...

AWS Elasticsearch Service

Why AWS ES?

  • AWS ES handles cluster management for you
    • Detects and replaces failed nodes
    • Automatic cluster scaling
    • Data durability
    • Node monitoring
    • Integrations with other AWS services

That's about it...

Elasticsearch Demo!

Worksheet Tasks:

  1. What is the title of the movie (from /omdb) with id kP5uCGMBRZqaOuquTh81?
  2. How many movies in this dataset were released in 2008?
  3. What is the id of the movie with the title "Ghostbusters"?
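One possible shape for the three requests, sketched as (method, path, body) tuples that could be pasted into Postman. The field names `title` and `year` are guesses about how the dataset is mapped and may need adjusting (for example, different capitalization); the paths are relative to the Elasticsearch endpoint.

```python
import json

INDEX = "omdb"


def find_by_id(doc_id):
    """Search for a single document by its id."""
    return ("GET", f"/{INDEX}/_search", {"query": {"ids": {"values": [doc_id]}}})


def count_by_year(year, field="year"):
    """Count documents whose (assumed) year field equals `year`."""
    return ("GET", f"/{INDEX}/_count", {"query": {"term": {field: year}}})


def find_by_title(title, field="title"):
    """Search for documents whose (assumed) title field contains the phrase."""
    return ("GET", f"/{INDEX}/_search", {"query": {"match_phrase": {field: title}}})


requests_to_try = [
    find_by_id("kP5uCGMBRZqaOuquTh81"),
    count_by_year(2008),
    find_by_title("Ghostbusters"),
]
for method, path, body in requests_to_try:
    print(method, path, json.dumps(body))
```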

 

 

Postman collection: ter.ps/389lpostman

Elasticsearch Endpoint: (See Postman)

 

Submit a .txt file with your answers and queries to the submit server.

Wrapping Up

Codelabs:

  • DynamoDB out this weekend
  • ECS out next week (last one!)

Final Project:

  • Checkpoint #2 this Sunday

 

Feedback form: ter.ps/feedback13

CMSC389L Week 13

By Colin King
