Elasticsearch
- is real-time
- is a distributed search and analytics engine
- is a document store
Basics
Cluster
Cluster 1
- Node 1: Index A shard 1, Index A replica 2
- Node 2: Index A shard 2, Index A replica 3
- Node 3: Index A shard 3, Index A replica 1
Index
- Index with static size
  - job, employee, candidate
- Continuously growing index
  - logs, transactions etc.
  - serilog-2018.01.01
Guidelines for indices
- A shard should not be larger than 5 GB
- The default of 5 primary shards can often be reduced to 1 (Elasticsearch 7.0 and later already default to 1)
- Number of replicas should be the number of nodes - 1
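As a rough illustration of the guidelines above, the sketch below derives index settings from a node count and builds a date-suffixed index name in the style of serilog-2018.01.01. The helper names `index_settings` and `daily_index_name` are hypothetical; `number_of_shards` and `number_of_replicas` are the Elasticsearch index settings the guidelines refer to.

```python
from datetime import date


def index_settings(node_count: int, shards: int = 1) -> dict:
    """Settings body following the guidelines: few shards, replicas = nodes - 1."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": node_count - 1,
        }
    }


def daily_index_name(prefix: str, day: date) -> str:
    """Name for a continuously growing index, assuming one index per day."""
    return f"{prefix}-{day:%Y.%m.%d}"


print(index_settings(3))
print(daily_index_name("serilog", date(2018, 1, 1)))
```

The settings dictionary could be sent as the body of an index-creation request; nothing is sent to a cluster here.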
Data input
What is a document?
{
  "name": "John Doe",
  "age": 42,
  "confirmed": true,
  "created": "2018-01-01T12:00:00",
  "adress": {
    "street": "Gatan 1",
    "zip": "12344",
    "city": "Farsta"
  },
  "tags": [
    { "type": "Category", "value": "IT" },
    { "type": "Employment period", "value": "Deltid" }
  ]
}
Field names should not include a . (since Elasticsearch 5.0 a dot in a field name is read as a path into an object, so "a.b" means field b inside object a)
Document metadata
- _index: Name of the index the document lives in
- _id: Unique id of a document
- Settings
- Analyzers
- Mappings
Indexing a document
PUT /{index}/_doc/{id}
{
  "field": "value",
  ...
}
Request
PUT /website/_doc/123
{
  "title": "My first blog entry",
  "text": "Just trying this out...",
  "date": "2014/01/01"
}
Response
{
  "_index": "website",
  "_type": "_doc",
  "_id": "123",
  "_version": 1,
  "result": "created"
}
Document ID
- Providing a document ID
  - PUT /{index}/{_doc|_create}/{id}
- Automatic document ID
  - POST /website/blog/
  - 20 characters long, URL-safe, Base64-encoded GUID strings
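A small sketch of the two ways of choosing an ID. The function name `index_request` is hypothetical and nothing is sent over the network; it only shows which HTTP method and path each case would use, assuming the `_doc` endpoint from the indexing example.

```python
import json
from typing import Optional, Tuple


def index_request(index: str, doc: dict, doc_id: Optional[str] = None) -> Tuple[str, str, str]:
    """Return (method, path, body) for indexing a document.

    With an explicit ID the request is a PUT to /{index}/_doc/{id};
    without one it is a POST and Elasticsearch generates the ID.
    """
    body = json.dumps(doc)
    if doc_id is None:
        return "POST", f"/{index}/_doc", body
    return "PUT", f"/{index}/_doc/{doc_id}", body


print(index_request("website", {"title": "My first blog entry"}, "123"))
print(index_request("website", {"title": "My second blog entry"}))
```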
Concurrency control
Optimistic concurrency control
- Each document has a version number
- The version number increases by 1 for every change
- If no version number is provided, the change is applied to the latest version
- If you send a version number, it must be the current (latest) version number:
  PUT /website/blog/1?version=1
- If the version number does not match the current one (for example, it is too small), a version conflict occurs
External version numbers
Optimistic concurrency control
- A non-negative integer (0 or greater)
- When updating, the version number must be greater than the current one
- Can also be set on create
PUT /website/blog/2?version=5&version_type=external
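The two sets of rules above can be modelled with a toy in-memory store. This is only a sketch of the version checks, not how Elasticsearch is implemented; `VersionedStore` and `VersionConflict` are hypothetical names.

```python
class VersionConflict(Exception):
    """Raised when a write carries a version the store cannot accept."""


class VersionedStore:
    """Toy in-memory model of the version checks (not Elasticsearch itself)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def put(self, doc_id, source, version=None, external=False):
        current = self._docs.get(doc_id, (None, None))[0]
        if external:
            # External versioning: the supplied number must exceed the stored one.
            if version is None:
                raise ValueError("external versioning needs an explicit version")
            if current is not None and version <= current:
                raise VersionConflict(f"{version} is not greater than {current}")
            new_version = version
        else:
            # Internal versioning: a supplied number must equal the stored one.
            if version is not None and version != current:
                raise VersionConflict(f"expected {version}, current is {current}")
            new_version = 1 if current is None else current + 1
        self._docs[doc_id] = (new_version, source)
        return new_version


store = VersionedStore()
store.put("1", {"title": "first"})              # creates version 1
store.put("1", {"title": "second"}, version=1)  # accepted, becomes version 2
try:
    store.put("1", {"title": "stale"}, version=1)
except VersionConflict as error:
    print("conflict:", error)
```

With `external=True` the caller owns the counter, which suits the case where another database is the source of truth and already tracks versions.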
Large data volumes
When adding large volumes of data, prefer the Bulk API
{ action: { metadata }}\n
{ request body }\n
{ action: { metadata }}\n
{ request body }\n
POST /_bulk
{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title": "My first blog post" }
{ "index": { "_index": "website", "_type": "blog" }}
{ "title": "My second blog post" }
{ "update": { "_index": "website", "_type": "blog", "_id": "123", "retry_on_conflict": 3 }}
{ "doc": { "title": "My updated blog post" }}
Delete document
Request
DELETE /website/blog/123
OK Response (200)
{
  "found": true,
  "_index": "website",
  "_type": "blog",
  "_id": "123",
  "_version": 3
}
Missing Response (404)
{
  "found": false,
  "_index": "website",
  "_type": "blog",
  "_id": "123",
  "_version": 4
}
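Returning to the Bulk API format under Large data volumes: the body is newline-delimited JSON, one action line optionally followed by a source line. A minimal sketch of assembling such a body follows; `bulk_body` is a hypothetical helper, the `_type` metadata is left out for brevity, and nothing is sent to a cluster.

```python
import json


def bulk_body(operations):
    """Build a newline-delimited bulk body.

    `operations` is a list of (action, metadata, source) tuples; `source`
    is None for actions without a request body, such as delete.
    """
    lines = []
    for action, metadata, source in operations:
        lines.append(json.dumps({action: metadata}))
        if source is not None:
            lines.append(json.dumps(source))
    # Every line, including the last one, must end with a newline.
    return "\n".join(lines) + "\n"


body = bulk_body([
    ("delete", {"_index": "website", "_id": "123"}, None),
    ("create", {"_index": "website", "_id": "123"}, {"title": "My first blog post"}),
    ("index", {"_index": "website"}, {"title": "My second blog post"}),
    ("update", {"_index": "website", "_id": "123"}, {"doc": {"title": "My updated blog post"}}),
])
print(body, end="")
```

Serialising each object with `json.dumps` keeps every action and source on a single line, which the format requires.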
Data out
Get document by ID
Request
GET /website/blog/123?pretty
Response
{
  "_index": "website",
  "_type": "blog",
  "_id": "123",
  "_version": 1,
  "found": true,
  "_source": {
    "title": "My first blog entry",
    "text": "Just trying this out...",
    "date": "2014/01/01"
  }
}
Get multiple documents by ID
Request
GET /website/blog/_mget
{
  "ids": [ "2", "1" ]
}
Response
{
  "docs": [
    {
      "_index": "website",
      "_type": "blog",
      "_id": "2",
      "_version": 10,
      "found": true,
      "_source": { "title": "My first external blog entry", "text": "This is a piece of cake..." }
    },
    {
      "_index": "website",
      "_type": "blog",
      "_id": "1",
      "found": false
    }
  ]
}
Schemas
Different types of search
Boolean searches
- Efficient
- Match or no match
- Like WHERE in SQL
- Does the data match?
Full text search
- Slower than boolean searches (but more efficient than LIKE '%term%' searches in SQL)
- Gives each result a relevance score for the search
- How well does the data match?
Combinations
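Inverted index
How data is stored in Elasticsearch explains how searches behave
Given the following documents (in Swedish):
- 1: Den snabba bruna räven hoppar över den lata hunden (The quick brown fox jumps over the lazy dog)
- 2: Snabba bruna rävar hoppar över lata hundar på sommaren (Quick brown foxes jump over lazy dogs in summer)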
Inverted index
Term     | Doc 1 | Doc 2
---------|-------|-------
Den      |   x   |
snabba   |   x   |
bruna    |   x   |   x
räven    |   x   |
hoppar   |   x   |   x
över     |   x   |   x
den      |   x   |
lata     |   x   |   x
hunden   |   x   |
Snabba   |       |   x
rävar    |       |   x
hundar   |       |   x
på       |       |   x
sommaren |       |   x
Query: snabba bruna
Index
Term     | Doc 1 | Doc 2
---------|-------|-------
snabba   |   x   |
bruna    |   x   |   x
Total    |   2   |   1
Normalization
Term     | Doc 1 | Doc 2
---------|-------|-------
den      |   x   |
snabb    |   x   |   x
brun     |   x   |   x
räv      |   x   |   x
hoppa    |   x   |   x
över     |   x   |   x
lata     |   x   |   x
hund     |   x   |   x
på       |       |   x
sommaren |       |   x
Query: snabba bruna
Index
Term     | Doc 1 | Doc 2
---------|-------|-------
snabb    |   x   |   x
brun     |   x   |   x
Total    |   2   |   2
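The effect of normalization on the query totals can be reproduced with a toy inverted index. This is a sketch only: the `STEMS` table is hand-written for these two sentences (a real language analyzer derives stems algorithmically), and the per-document total is a plain count of matching terms, not an Elasticsearch relevance score.

```python
from collections import defaultdict

DOCS = {
    1: "Den snabba bruna räven hoppar över den lata hunden",
    2: "Snabba bruna rävar hoppar över lata hundar på sommaren",
}

# Hypothetical stem table, just enough for the example documents.
STEMS = {
    "snabba": "snabb", "bruna": "brun", "räven": "räv", "rävar": "räv",
    "hoppar": "hoppa", "hunden": "hund", "hundar": "hund",
}


def raw(token):
    """No normalization: index the token exactly as written."""
    return token


def normalize(token):
    """Lowercase the token, then reduce it to its stem if one is known."""
    lowered = token.lower()
    return STEMS.get(lowered, lowered)


def build_index(docs, analyze):
    """Map every term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[analyze(token)].add(doc_id)
    return index


def search(index, query, analyze):
    """Count, per document, how many query terms it contains."""
    totals = defaultdict(int)
    for token in query.split():
        for doc_id in index.get(analyze(token), ()):
            totals[doc_id] += 1
    return dict(totals)


print(search(build_index(DOCS, raw), "snabba bruna", raw))
print(search(build_index(DOCS, normalize), "snabba bruna", normalize))
```

The same function is applied to the documents and to the query string, which is why document 2 only catches up once both sides are normalized.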
Analysis
- Character filters
  - Per-character transformation
  - Strip HTML, w -> v
- Tokenizer
  - Splits the text into words
- Token filters
  - lowercase, synonyms, stemming etc.
Built-in analyzers
- Standard analyzer
  - Splits on word boundaries as defined by the Unicode standard, removes most punctuation, lowercases (generally the best choice)
- Simple analyzer
  - Splits the text on anything that isn't a letter, and lowercases the terms
- Whitespace analyzer
  - Splits on whitespace, does not lowercase
- Language analyzers
  - Language-specific analyzers
When are analyzers used?
On all full-text fields
They are applied to the document text when indexing and to the search string when searching
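The three analysis stages can be pictured as a small pipeline. The functions below are simplified stand-ins written for illustration (a regex-based tag stripper, a word-character tokenizer and a lowercase filter); they are not the real Elasticsearch components and will differ on edge cases.

```python
import re


def html_strip(text):
    """Character filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def word_tokenizer(text):
    """Tokenizer: split the text into runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def analyze(text, char_filters=(html_strip,), tokenizer=word_tokenizer,
            token_filters=(lowercase,)):
    """Run character filters, then the tokenizer, then token filters, in that order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("This is a <b>Test</b>"))
```

Swapping any stage changes the terms that end up in the index, for example passing `token_filters=()` keeps the original casing.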
Testing analyzers (Analyze API)
GET _analyze
{
  "analyzer": "standard",
  "text": "this is a test"
}
GET _analyze
{
  "analyzer": "standard",
  "text": [
    "this is a test",
    "the second text"
  ]
}
GET _analyze
{
  "tokenizer": "keyword",
  "filter": ["lowercase"],
  "text": "this is a test"
}
GET _analyze
{
  "tokenizer": "keyword",
  "filter": ["lowercase"],
  "char_filter": ["html_strip"],
  "text": "this is a <b>test</b>"
}
Example
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"],
          "char_filter": ["html_strip"]
        }
      }
    }
  }
}
Mapping
Example
{
  "name": "Maria Kihlgren",
  "birth_date": "1960-01-31",
  "adress": {
    "street": "Drevgatan 60, 6tr",
    "zipcode": 13500,
    "city": "Karlstad"
  },
  "contacts": {
    "home_phone": "015 - 15 15 15",
    "mobile_phone": "070 - 15 15 16",
    "email": "maria@kihlgren.se"
  },
  "ambition": "Jag ser det som en utmaning att arbeta vidare med sådant som jag tycker är kul samtidigt som jag motiveras av viljan att lära mig nya saker, att utvecklas och genom det bidra till att föra verksamheten vidare"
}
Mapping
Types
{
  "name": string,
  "birth_date": date,
  "adress": {
    "street": text,
    "zipcode": number,
    "city": keyword
  },
  "contacts": {
    "home_phone": keyword,
    "mobile_phone": keyword,
    "email": email
  },
  "ambition": text
}
Mapping
Index mapping
{
  "mappings": {
    "candidate": {
      "properties": {
        "name": { "type": "text" },
        "birth_date": { "type": "date" },
        "adress": {
          "properties": {
            "street": { "type": "text" },
            "zipcode": { "type": "long" },
            "city": { "type": "keyword" }
          }
        },
        "contacts": {
          "properties": {
            "home_phone": { "type": "keyword" },
            "mobile_phone": { "type": "keyword" },
            "email": { "type": "keyword" }
          }
        },
        "ambition": {
          "type": "text"
        }
      }
    }
  }
}
Queries
Match_All
// All three requests are equivalent
POST candidates/_search
POST candidates/_search
{}
POST candidates/_search
{
  "query": {
    "match_all": {}
  }
}
Match
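// Match on a full-text field: the query string is analyzed first
POST candidates/candidate/_search
{
  "query": {
    "match": {
      "name": "Maria"
    }
  }
}
// Match on a keyword field: the value must match exactly
POST candidates/candidate/_search
{
  "query": {
    "match": {
      "adress.city": "Karlstad"
    }
  }
}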
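Term/Terms
// OK: "name" is a text field, so the indexed term is the lowercased "maria"
POST candidates/candidate/_search
{
  "query": { "term": { "name": "maria" } }
}
// No hits: a term query is not analyzed, and there is no term "Maria" in the index
POST candidates/candidate/_search
{
  "query": { "term": { "name": "Maria" } }
}
// OK: "adress.city" is a keyword field, so the value is indexed exactly as "Karlstad"
POST candidates/candidate/_search
{
  "query": { "term": { "adress.city": "Karlstad" } }
}
// No hits: wrong casing for a keyword field
POST candidates/candidate/_search
{
  "query": { "term": { "adress.city": "karlstad" } }
}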
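Why casing matters for term but not for match can be shown with a toy model of a single field. The `Field` class and `standard_analyze` function are hypothetical simplifications: the analyzer here only splits on whitespace and lowercases.

```python
def standard_analyze(text):
    """Very rough stand-in for the standard analyzer: split on whitespace, lowercase."""
    return [token.lower() for token in text.split()]


class Field:
    """Holds the terms indexed for one field of one document."""

    def __init__(self, value, field_type):
        self.field_type = field_type
        # text fields are analyzed; keyword fields are stored as one exact term
        self.terms = set(standard_analyze(value)) if field_type == "text" else {value}

    def term(self, value):
        """term query: look the value up exactly as given."""
        return value in self.terms

    def match(self, query):
        """match query: analyze the query like the field, then look the terms up."""
        wanted = standard_analyze(query) if self.field_type == "text" else [query]
        return any(token in self.terms for token in wanted)


name = Field("Maria Kihlgren", "text")
city = Field("Karlstad", "keyword")

print(name.term("maria"), name.term("Maria"), name.match("Maria"))
print(city.term("Karlstad"), city.term("karlstad"))
```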
Range
POST candidates/candidate/_search
{
  "query": {
    "range": {
      "adress.zipcode": {
        "gte": 13501
      }
    }
  }
}
POST candidates/candidate/_search
{
  "query": {
    "range": {
      "adress.zipcode": {
        "lte": 13500
      }
    }
  }
}
Bool
// AND: every must clause has to match
POST candidates/candidate/_search
{
  "query": {
    "bool": {
      "must": [
        { "range": { "adress.zipcode": { "lt": 13501 } } },
        { "term": { "name": "maria" } }
      ]
    }
  }
}
// OR: at least one should clause has to match
POST candidates/candidate/_search
{
  "query": {
    "bool": {
      "should": [
        { "range": { "adress.zipcode": { "lt": 13501 } } },
        { "term": { "name": "erik" } }
      ]
    }
  }
}
// NOT: no must_not clause may match
POST candidates/candidate/_search
{
  "query": {
    "bool": {
      "must_not": [
        { "term": { "name": "erik" } }
      ]
    }
  }
}
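The way the three clause types combine can be sketched with plain predicates. `bool_matches` is a hypothetical helper that only decides match or no match; it ignores scoring and the `filter` clause, and the sample documents are made up for the example.

```python
def bool_matches(doc, must=(), should=(), must_not=()):
    """Evaluate bool-style clauses, given as predicates, against one document.

    Simplification: when there is no must clause, at least one should
    clause has to match; otherwise should clauses are treated as optional.
    """
    if any(clause(doc) for clause in must_not):
        return False
    if not all(clause(doc) for clause in must):
        return False
    if should and not must:
        return any(clause(doc) for clause in should)
    return True


def zip_below(limit):
    return lambda doc: doc["zipcode"] < limit


def named(name):
    return lambda doc: doc["name"] == name


# Made-up sample documents.
maria = {"name": "maria", "zipcode": 13500}
erik = {"name": "erik", "zipcode": 13600}
anna = {"name": "anna", "zipcode": 14000}

for doc in (maria, erik, anna):
    print(
        doc["name"],
        bool_matches(doc, must=[zip_below(13501), named("maria")]),    # AND
        bool_matches(doc, should=[zip_below(13501), named("erik")]),   # OR
        bool_matches(doc, must_not=[named("erik")]),                   # NOT
    )
```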
Prefix
Higher recall, lower precision
POST candidates/candidate/_search
{
  "query": {
    "prefix": {
      "name": {
        "value": "ma"
      }
    }
  }
}
Potentially slow query: it has to scan the terms in the inverted index to find matches
Fuzzy
Higher recall, lower precision
POST candidates/candidate/_search
{
  "query": {
    "fuzzy": { "name": "eric" }
  }
}
POST candidates/candidate/_search
{
  "query": {
    "fuzzy": {
      "name": {
        "value": "marios",
        "fuzziness": 2
      }
    }
  }
}
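Fuzziness is measured in edits between the query and an indexed term. The sketch below uses plain Levenshtein distance to show why "eric" is one edit from "erik" and "marios" is two edits from "maria"; as far as I know Elasticsearch additionally counts a swap of two adjacent characters as a single edit by default, which this simplified version does not do. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    inserts, deletes and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]


def fuzzy_match(term, query, fuzziness):
    """True when the indexed term is within `fuzziness` edits of the query."""
    return edit_distance(term, query) <= fuzziness


print(edit_distance("erik", "eric"))
print(edit_distance("maria", "marios"))
print(fuzzy_match("maria", "marios", 2))
```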
Match phrase
Lower recall, higher precision (the terms must appear in the given order)
POST candidates/candidate/_search
{
  "query": {
    "match_phrase": {
      "ambition": {
        "query": "jag ser det som en"
      }
    }
  }
}
Aggregations
Types
- Metrics
  - Min, max, percentiles etc.
- Buckets
  - Groupings such as terms, histograms, date histograms etc.
- Pipeline
  - Nested aggs (work on the output of other aggregations)
- Matrix
Aggregations together with search
- Faceted search
- Combination of filters and search
- Aggregations are used to build the filters
- The most common aggregation is grouping by term (terms aggregation)
- Histogram (distribution of numbers, e.g. age, salary or price)
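What the two most common bucket aggregations compute can be sketched in a few lines. The functions `terms_agg` and `histogram_agg` and the sample `sales` data are hypothetical; the histogram key is the lower bound of each fixed-width interval, and only non-empty buckets are listed here.

```python
from collections import Counter


def terms_agg(docs, field):
    """Bucket documents by the exact value of a field and count each bucket."""
    return dict(Counter(doc[field] for doc in docs if field in doc))


def histogram_agg(docs, field, interval):
    """Bucket numeric values into fixed-width intervals keyed by the lower bound."""
    buckets = Counter()
    for doc in docs:
        if field in doc:
            buckets[(doc[field] // interval) * interval] += 1
    return dict(sorted(buckets.items()))


# Made-up sample data.
sales = [
    {"genre": "rock", "price": 30},
    {"genre": "rock", "price": 120},
    {"genre": "jazz", "price": 149},
    {"genre": "rock", "price": 150},
]

print(terms_agg(sales, "genre"))
print(histogram_agg(sales, "price", 50))
```

In a faceted search the bucket counts become the numbers shown next to each filter option.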


Terms
GET /_search
{
  "aggs": {
    "genres": {
      "terms": { "field": "genre" }
    }
  }
}
Histogram
POST /sales/_search?size=0
{
  "aggs": {
    "prices": {
      "histogram": {
        "field": "price",
        "interval": 50
      }
    }
  }
}
Sorting/Pagination
Sorting
GET /my_index/my_type/_search
{
  "sort": [
    { "post_date": { "order": "asc" } },
    { "name": "desc" }
  ],
  "query": {
    "term": { "user": "kimchy" }
  }
}
Relevance is lost: when sorting on a field, hits are no longer ordered by relevance score
Missing values
GET /_search
{
  "sort": [
    { "price": { "missing": "_last" } }
  ],
  "query": {
    "term": { "product": "chocolate" }
  }
}
Pagination
GET /_search
{
  "from": 0, "size": 10,
  "query": {
    "term": { "user": "kimchy" }
  }
}
from + size can only reach the first 10,000 hits (the default index.max_result_window)
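A minimal sketch of turning a page number into from/size while respecting the 10,000-hit window. `page_request` is a hypothetical helper and the constant mirrors the default of the `index.max_result_window` setting; an index configured differently would need another limit.

```python
MAX_RESULT_WINDOW = 10_000  # default value of the index.max_result_window setting


def page_request(page, size, query):
    """Search body for a zero-based page number, using from/size paging."""
    if page < 0 or size < 1:
        raise ValueError("page must be >= 0 and size >= 1")
    start = page * size
    if start + size > MAX_RESULT_WINDOW:
        raise ValueError("from + size exceeds the result window")
    return {"from": start, "size": size, "query": query}


query = {"term": {"user": "kimchy"}}
print(page_request(0, 10, query))
print(page_request(2, 10, query))
```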
Elasticsearch
By fhelje