elasticsearch

Elasticsearch

is real-time
is a distributed search and analytics engine
is a document store

Basics

Cluster

Node 1

Node 2

Node 3

Index A

Shard 1

Index A

Shard 2

Index A

Shard 3

Index A

Replica 2

Index A

Replica 3

Index A

Replica 1

Cluster 1

Index

Index with static size
- job, employee, candidate
Continuously growing index
- logs, transactioner etc
  - serilog-2018.01.01

Guidelines for indices

A shard should not be larger than 5Gb
- Defaults to 5 shards could often be reduced to 1
Number of replicas should be # of nodes -1

Data input

What's a document

{
    "name": "John Doe",
    "age": 42,
    "confirmed": true,
    "created": "2018-01-01T12:00:00",
    "adress": {
        "street": "Gatan 1",
        "zip": "12344",
        "city": "Farsta"
    },
    "tags": [
        { "type": "Category", "value": "IT" },
        { "type": "Employment period", "value": "Deltid" }
    ]
}

Field names can NOT include a .

Document metadata

_index: Name of the index the document lives in
_id: Unique id of a document
Settings
Analyzers
Mappings

Indexing a document

PUT /{index}/{_doc}/{id}
{
  "field": "value",
  ...
}

PUT /website/_doc/123
{
  "title": "My first blog entry",
  "text":  "Just trying this out...",
  "date":  "2014/01/01"
}

{
   "_index":    "website",
   "_type":     "blog",
   "_id":       "123",
   "_version":  1,
   "created":   true
}

Request

Response

DocumentId

Providing a document id
- ```
PUT /{index}/{_doc|_create}/{id}
```
Automatic documentId
- ```
POST /website/blog/
```
- 20 char long, URL-safe, Base64-enc. GUID strings

Concurrency control

Optimistic concurrency control

Each document has a version number
The versionnumber increases with 1 for every change
If no version number is provided an will add to latest
If you send the version nr then you send the last version nr:
```
PUT /website/blog/1?version=1
```
If the version number is to small a version conflict will occur

External versionsnummer

Optimistic concurrency control

0 based positivt number
When updating the version nr must be greater than last
Can also be set on create

PUT /website/blog/2?version=5&version_type=external

Large data volumes

When adding large volumes of data prefer to use Bulk api

{ action: { metadata }}\n
{ request body        }\n
{ action: { metadata }}\n
{ request body        }\n

POST /_bulk
{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }} 
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "My first blog post" }
{ "index":  { "_index": "website", "_type": "blog" }}
{ "title":    "My second blog post" }
{ "update": { "_index": "website", "_type": "blog", "_id": "123", "_retry_on_conflict" : 3} }
{ "doc" : {"title" : "My updated blog post"} }

Delete document

DELETE /website/blog/123

{
  "found" :    true,
  "_index" :   "website",
  "_type" :    "blog",
  "_id" :      "123",
  "_version" : 3
}

{
  "found" :    false,
  "_index" :   "website",
  "_type" :    "blog",
  "_id" :      "123",
  "_version" : 4
}

Request

OK Response (200)

Missing Response (404)

Data out

Get document by Id

GET /website/blog/123?pretty

{
  "_index" :   "website",
  "_type" :    "blog",
  "_id" :      "123",
  "_version" : 1,
  "found" :    true,
  "_source" :  {
      "title": "My first blog entry",
      "text":  "Just trying this out...",
      "date":  "2014/01/01"
  }
}

Request

Response

Get multiple document by Id

GET /website/blog/_mget
{
   "ids" : [ "2", "1" ]
}

{
  "docs" : [
    {
      "_index" :   "website",
      "_type" :    "blog",
      "_id" :      "2",
      "_version" : 10,
      "found" :    true,
      "_source" : { "title":   "My first external blog entry", "text":    "This is a piece of cake..."  }
    },
    {
      "_index" :   "website",
      "_type" :    "blog",
      "_id" :      "1",
      "found" :    false  
    }
  ]
}

Request

Response

Schemas

Different types of search

Boolean searches

Efficient
Match or no match
Like WHERE in sql
Does the data match?

Full text search

Slower then boolean searches (more efficient than % searches in sql)
Give result with a relevans to the search
How well does the data match

Combinations

Inverted index

How the data is stored in elastic explains searches

Given the following documents:

Den snabba bruna räven hoppar över den lata hunden
Snabba bruna rävar hoppar över lata hundar på sommaren

Inverted index

         |  1  |  2  |
----------------------
Den      |  x  |     |
---------|-----|-----|
snabba   |  x  |     |
---------|-----|-----|
bruna    |  x  |  x  |
---------|-----|-----|
räven    |  x  |     |
---------|-----|-----|
hoppar   |  x  |  x  |
---------|-----|-----|
över     |  x  |  x  |
---------|-----|-----|
den      |  x  |     |
---------|-----|-----|
lata     |  x  |  x  |
---------|-----|-----|
hunden   |  x  |     |
---------|-----|-----|
Snabba   |     |  x  |
---------|-----|-----|
rävar    |     |  x  |
---------|-----|-----|
hundar   |     |  x  |
---------|-----|-----|
på       |     |  x  |
---------|-----|-----|
sommaren |     |  x  |
----------------------

Query: snabba bruna

Index

Terms  |  1  |  2  |
--------------------
snabba |  x  |     |
-------|-----|-----|
bruna  |  x  |  x  |
--------------------
Total  |  2  |  1  |

Normalisering

         |  1  |  2  |
----------------------
den      |  x  |     |
---------|-----|-----|
snabb    |  x  |  x  |
---------|-----|-----|
bruna    |  x  |  x  |
---------|-----|-----|
räv      |  x  |  x  |
---------|-----|-----|
hoppa    |  x  |  x  |
---------|-----|-----|
över     |  x  |  x  |
---------|-----|-----|
lata     |  x  |  x  |
---------|-----|-----|
hund     |  x  |  x  |
---------|-----|-----|
på       |     |  x  |
---------|-----|-----|
sommaren |     |  x  |
----------------------

Query: snabba bruna

Index

Terms  |  1  |  2  |
--------------------
snabb  |  x  |  x  |
-------|-----|-----|
brun   |  x  |  x  |
--------------------
Total  |  2  |  2  |

Analysis

Character filters
- Per tecken tranformering
- rensa html, w -> v
Tokenizer
- Dela upp texten till ord
Token filters
- lowercase, synonyms, stemming etc

Standard analysers

Standard analyzer
- Word boundaries by unicode standard, erase most punctuations, lower case (generally best choice)
Simple analyzer
- Splits the text on anything that isn’t a letter, and lowercases the terms
Whitespace analyzer
- Split on whitespace, does not lowercase
Language analyzer
- Language specific analyzers

When are analyzers used?

On all full text fields

It is used when indexing and when searching on the search string

Testing analyzers (Analyze API)

GET _analyze
{
  "analyzer" : "standard",
  "text" : "this is a test"
}

GET _analyze
{
  "analyzer" : "standard",
  "text" : [
    "this is a test", 
    "the second text"
  ]
}

GET _analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "text" : "this is a test"
}

GET _analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "char_filter" : ["html_strip"],
  "text" : "this is a <b>test</b>"
}

Exempel

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"],
          "char_filter": ["html_strip"]
        }
      }
    }
  }
}

Mapping

{
    "name": "Maria Kihlgren",
    "birth_date": "1960-01_31",
    "adress": {
        "street": "Drevgatan 60, 6tr",
        "zipcode": 13500,
        "city": "Karlstad"
    },
    "contacts": {
        "home_phone": "015 – 15 15 15",
        "modile_phone": "070 – 15 15 16 ",
        "email": "maria@kihlgren.se"
    },
    "ambition": "Jag ser det om en utmaning att arbeta vidare 
med sådant som jag tycker är kul samtidigt som jag motiveras av
viljan att lära mig nya saker, att utvecklas och genom det 
bidra till att föra verksamheten vidare"    
}

Example

Mapping

{
    "name": string,
    "birth_date": date,
    "adress": {
        "street": text,
        "zipcode": number,
        "city": keyword
    },
    "contacts": {
        "home_phone": keyword,
        "modile_phone": keyword,
        "email": email
    },
    "ambition": text    
}

Types

Mapping

{
    "mappings": {
        "candidate": {
            "properties": {
                "name": { "type": "text" },
                "birth_date": { "type": "date" },
                "adress": {
                    "properties": {
                        "street": { "type": "text" },
                        "zipcode": { "type": "long" },
                        "city": { "type": "keyword" }
                    }
                },
                "contacts": {
                    "properties": {
                        "home_phone": { "type": "keyword" },
                        "modile_phone": { "type": "keyword" },
                        "email": { "type": "keyword" }
                    }
                },
                "ambition": {
                    "type": "text"
                }
            }
        }
    }
}

Index mapping

Queries

Match_All

POST candidates/_search

POST candidates/_search {}

POST candidates/_search
{
  "query": {
    "match_all": {}
  }
}

Match

// match on full text
POST candidates/candidate/_search
{
  "query": {
    "match": {
      "name": "Maria"
    }
  }
}

POST candidates/candidate/_search
{
  "query": {
    "match": {
      "adress.city": "Karlstad"
    }
  }
}

Term/Terms

// Ok
POST candidates/candidate/_search
{
  "query": { "term": { "name": "maria" } }
}

// Fail wrong casing
POST candidates/candidate/_search
{
  "query": { "term": { "name": "Maria" } }
}

// Ok
POST candidates/candidate/_search
{
  "query": { "term": { "adress.city": "Karlstad" } }
}

// Fail wrong casing
POST candidates/candidate/_search
{
  "query": { "term": { "adress.city": "karlstad" } }
}

Range

POST candidates/candidate/_search
{
  "query": { 
    "range": {
      "adress.zipcode": {
        "gte": 13501
      }
    }
  }
}

POST candidates/candidate/_search
{
  "query": { 
    "range": {
      "adress.zipcode": {
        "lte": 13500
      }
    }
  }
}

Bool

// AND
POST candidates/candidate/_search
{
  "query": { "bool": {
      "must": [
        { "range": { "adress.zipcode": { "lt": 13501 } } },
        { "term": { "name": "maria" } }
      ]
  }}
}
// OR
POST candidates/candidate/_search
{
  "query": { "bool": {
      "should": [
        { "range": { "adress.zipcode": { "lt": 13501 } } },
        { "term": { "name": "erik" } }
      ]
  }}
}
// NOT AND
POST candidates/candidate/_search
{
  "query": { "bool": {
      "must_not": [
        { "term": { "name": "erik" } }
      ]
  }}
}

Prefix

+ recall - precision

POST candidates/candidate/_search
{
  "query": {
    "prefix": {
      "name": {
        "value": "ma"
      }
    }
  }
}

Potentially slow query, has too loop through the inverted index to look for matches

Fuzzy

+ recall - precision


POST candidates/candidate/_search
{
    "query": {
       "fuzzy" : { "name" : "eric" }
    }
}

POST candidates/candidate/_search
{
    "query": {
       "fuzzy" : { 
          "name" : {
            "value": "marios",
            "fuzziness": 2
          }
       }
    }
}

Match phrase

+ recall - precision

POST candidates/candidate/_search
{
  "query": {
    "match_phrase" : {
      "ambition": {
        "query": "jag ser det som en"
      }
    }
  }
}

Aggregations

Types

Metrics
- Min, max, percentiles etc
Buckets
- Grupperingar så som: Terms, Histogram, Date histograms etc
Pipeline
- Nested aggs
Matrix

Aggregeringar together with search

Facetterad search
Combination of filters and search
Aggregations are used for filters
Most common aggregation is grouping of tokens (term agg)
Histogram (spread of numbers, ex age, salary or price)

Terms

GET /_search
{
    "aggs" : {
        "genres" : {
            "terms" : { "field" : "genre" }
        }
    }
}

Histogram

POST /sales/_search?size=0
{
    "aggs" : {
        "prices" : {
            "histogram" : {
                "field" : "price",
                "interval" : 50
            }
        }
    }
}

Sorting/Pagination

Sorting

GET /my_index/my_type/_search
{
    "sort" : [
        { "post_date" : {"order" : "asc"}},
        { "name" : "desc" }
    ],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

RELEVANS IS LOST!

Missing values

GET /_search
{
    "sort" : [
        { "price" : {"missing" : "_last"} }
    ],
    "query" : {
        "term" : { "product" : "chocolate" }
    }
}

Pagination

GET /_search
{
    "from" : 0, "size" : 10,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

First 10000 items

Elasticsearch

By fhelje

Elasticsearch

fhelje

fhelje

elasticsearch

Elasticsearch

More from fhelje