{json} logging platform

at scale

AWS Athena vs Elasticsearch

Florian DAMBRINE - Senior DevOps Engineer

{"codes": [{"code": "SUCCESS", "ver": "3.0", "company": "gumgum-sports", "labels": [], "annotator": "machine", "server": "TeamClassifier", "date": "20170203125137", "tstamps": [], "id": "TeamClassifier_20170203125137"}], "url_original": "https://www.facebook.com/NYKnicks/photos/a.71815779615.98182.21410634615/10157985435429616/", "url": "https://sports-cdn.ggops.com/social-posts-images/5ffbd0bed8b1db279dbf8c257b28a4eb.jpg", "h": 300, "media_summary": [{"regions": [{"contour": [{"y": 0.744556, "x": 0.078593}, {"y": 0.668499, "x": 0.764749}, {"y": 0.287995, "x": 0.627788}, {"y": 0.207455, "x": 0.175434}], "sub_regions": null, "features": null, "props": [{"relationships": null, "confidence": 0.5766, "confidence_min": 0.4852, "ver": "3.0", "company": "gumgum", "value": "Dog", "server": "TeamClassifier", "footprint_id": "TeamClassifier_20170203125137", "fraction": 1, "module_id": 0, "property_type": "rightsholder_name", "value_verbose": "", "property_id": 0}]}], "t2": 0, "t1": 0}], "tracks_summary": null, "w": 8, "frames_annotation": [{"regions": [{"contour": [{"y": 0.639081, "x": 0.139355}, {"y": 0.642554, "x": 0.029847}, {"y": 0.502516, "x": 0.158739}, {"y": 0.670725, "x": 0.527886}], "sub_regions": null, "features": null, "props": [{"relationships": null, "confidence": 0.85, "confidence_min": 0.7792, "ver": "3.0", "company": "gumgum", "value": "Cat", "server": "TeamClassifier", "footprint_id": "TeamClassifier_20170203125137", "fraction": 1, "module_id": 0, "property_type": "rightsholder_name", "value_verbose": "", "property_id": 0}]}], "t": 0}]}

> Context

{
  "codes": [
    {
      "code": "SUCCESS",
      "ver": "sports-production",
      "company": "gumgum-sports",
      "labels": [
        "Cat",
        "Dog"
      ],
      "annotator": "unknown",
      "server": "HAM",
      "date": "20170626041248",
      "tstamps": [],
      "id": "HAM_20170626041248"
    }
  ],
  "url_original": "https://www.facebook.com/NYKnicks/photos/a.71815779615.98182.21410634615/10157985435429616/",
  "url": "https://sports-cdn.ggops.com/social-posts-images/5ffbd0bed8b1db279dbf8c257b28a4eb.jpg",
  "h": 400,
  "media_summary": [
    {
      "regions": [
        {
          "contour": [
            {
              "y": 0.502205,
              "x": 0.338184
            },
            {
              "y": 0.503508,
              "x": 0.714891
            },
            {
              "y": 0.607232,
              "x": 0.346239
            },
            {
              "y": 0.109788,
              "x": 0.467408
            }
          ],
          "sub_regions": null,
          "features": null,
          "props": [
            {
              "relationships": null,
              "confidence": 1,
              "confidence_min": 0,
              "ver": "sports-production",
              "company": "gumgum",
              "value": "Cat",
              "server": "HAM",
              "footprint_id": "HAM_20170626041248",
              "fraction": 1,
              "module_id": 0,
              "property_type": "rightsholder_name",
              "value_verbose": "",
              "property_id": 0
            }
          ]
        }
      ],
      "t2": 0,
      "t1": 0
    }
  ],
  "tracks_summary": null,
  "w": 88,
  "frames_annotation": [
    {
      "regions": [
        {
          "contour": [
            {
              "y": 0.241438,
              "x": 0.889341
            },
            {
              "y": 0.598113,
              "x": 0.217485
            },
            {
              "y": 0.948145,
              "x": 0.513462
            },
            {
              "y": 0.020427,
              "x": 0.908873
            }
          ],
          "sub_regions": null,
          "features": null,
          "props": [
            {
              "relationships": null,
              "confidence": 1,
              "confidence_min": 0,
              "ver": "sports-production",
              "company": "gumgum",
              "value": "Cat",
              "server": "HAM",
              "footprint_id": "HAM_20170626041248",
              "fraction": 1,
              "module_id": 0,
              "property_type": "rightsholder_name",
              "value_verbose": "",
              "property_id": 0
            }
          ]
        }
      ],
      "t": 0
    }
  ]
}
  • Complex nested JSON logs
  • Ability to run complex queries
  • Large data volume:
    • ~110 GB / day (now)
    • ~200 GB / day (in one year)
  • Long retention period (90 days)
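
To put the volume and retention figures together, here is a rough back-of-the-envelope sketch. It assumes the "G" figures are gigabytes and counts raw data only (no replicas, no compression, no index overhead); the function name is illustrative.

```python
# Rough steady-state storage estimate for the retention window.
# Assumption: raw log volume only, no replicas, compression or index overhead.
RETENTION_DAYS = 90


def retained_gb(daily_gb: float, retention_days: int = RETENTION_DAYS) -> float:
    """Raw data held once the retention window is full."""
    return daily_gb * retention_days


now_tb = retained_gb(110) / 1000        # decimal terabytes
next_year_tb = retained_gb(200) / 1000

print(f"now: ~{now_tb:.1f} TB, in one year: ~{next_year_tb:.1f} TB")
```

So the platform has to hold roughly 10 TB of raw logs today and close to twice that within a year, before any replication.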

> Experiments with Elasticsearch

  • Filebeat forwarders
  • Elasticsearch cluster
  • Analysis interfaces

> Conclusion with Elasticsearch

 

 

  • Straightforward data ingestion using Filebeat
  • Control over the mapping to handle data structure changes and dynamic fields
  • Possible use of tokenizers and analyzers
  • Very good for data exploration thanks to the design of the engine
GET filebeat-*/_search
{
  "query": {
    "nested": {
      "path": "frames_annotation",
      "query": {
        "nested": {
          "path": "frames_annotation.regions",
          "query": {
            "nested": {
              "path": "frames_annotation.regions.props",
              "query": {
                "bool": {
                  "must": [
                    {
                      "match": {
                        "frames_annotation.regions.props.property_type": "rig"
                      }
                    }
                  ]
                }
              }
            }
          }
        }
      }
    }
  }
}
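
A query like the one above needs one `nested` clause per level of the path, and it only works if each level is mapped with the `nested` type. The sketch below generates both from a dotted path. The helper names and the `keyword` type chosen for `property_type` are assumptions made for illustration, not the mapping actually used in the experiment.

```python
import json


def nested_query(path: str, leaf_query: dict) -> dict:
    """Wrap leaf_query in one `nested` clause per level of a dotted path."""
    parts = path.split(".")
    query = leaf_query
    # Build from the innermost level outwards.
    for depth in range(len(parts), 0, -1):
        query = {"nested": {"path": ".".join(parts[:depth]), "query": query}}
    return query


def nested_mapping(path: str, leaf_properties: dict) -> dict:
    """Return a `properties` fragment declaring every level as `nested`.

    `dynamic: False` tells Elasticsearch not to index unmapped fields
    at that level (they still remain in _source).
    """
    props = leaf_properties
    for name in reversed(path.split(".")):
        props = {name: {"type": "nested", "dynamic": False, "properties": props}}
    return props


PROPS = "frames_annotation.regions.props"
leaf = {"bool": {"must": [{"match": {PROPS + ".property_type": "rig"}}]}}

body = {"query": nested_query(PROPS, leaf)}
mapping = nested_mapping(PROPS, {"property_type": {"type": "keyword"}})

print(json.dumps(body, indent=2))
```

The printed body has the same shape as the query on the slide; the mapping fragment would go under the index's `properties` key.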

> Experiments with AWS Athena and Glue

> Conclusion with AWS Athena and Glue

  • Need to use Glue and run an ETL process to relationalize the data
  • Strong data typing decreases flexibility (confidence is split into two columns, one double and one int)
  • SQL denormalisation makes it harder to retrieve objects (results come back one per row instead of as a single object)
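
The typing issue can be illustrated in plain Python. In the logs, `confidence` is sometimes written as a float (`0.5766`) and sometimes as an integer (`1`), so a strictly typed schema ends up with one column per type. This is a minimal sketch of that effect, not Glue's own code; the column names `confidence_double` and `confidence_int` are hypothetical.

```python
def split_by_type(values):
    """Split a mixed int/float field into one column per concrete type."""
    rows = []
    for value in values:
        is_int = isinstance(value, int) and not isinstance(value, bool)
        rows.append({
            "confidence_double": value if isinstance(value, float) else None,
            "confidence_int": value if is_int else None,
        })
    return rows


# Confidence values as they appear in the sample logs.
rows = split_by_type([0.5766, 0.85, 1])
for row in rows:
    print(row)
```

Every query on confidence then has to look at both columns, which is the loss of flexibility noted above.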
SELECT *
FROM root, codes
WHERE root.codes = codes.id
AND codes.server = 'nbalogosdetector'
AND codes.ver = '3.1'
Writing to S3 bucket:  root_codes.val.labels
+---+-----+--------------------+
| id|index|codes.val.labels.val|
+---+-----+--------------------+
|  1|    0|                 Cat|
|  1|    1|                 Dog|
|  2|    0|       NewYorkKnicks|
|  2|    1|      ApparelSponsor|
|  2|    2|         Squarespace|
|  3|    0|                 Cat|
|  3|    1|                 Dog|
|  4| null|                    |
|  5| null|                    |
|  6| null|                    |
| 14|    0| GoldenStateWarriors|
| 14|    1|         JerseyPatch|
| 14|    2|         Squarespace|
+---+-----+--------------------+
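
To see where a table like this comes from, here is a minimal plain-Python sketch of the flattening idea: each nested array becomes its own table of (id, index, value) rows, and rebuilding the original object means grouping those rows again. It only illustrates the concept and does not use or reproduce the Glue API; the function names and the use of the record position as `id` are assumptions.

```python
from collections import defaultdict


def relationalize_labels(records):
    """Flatten codes[].labels[] into (id, index, value) rows."""
    rows = []
    for record_id, record in enumerate(records, start=1):
        values = [
            label
            for code in record.get("codes", [])
            for label in code.get("labels", [])
        ]
        if not values:
            # Records without labels still yield one row, with nulls.
            rows.append((record_id, None, None))
        for index, value in enumerate(values):
            rows.append((record_id, index, value))
    return rows


def regroup(rows):
    """Rebuild one label list per record id from the flat rows."""
    grouped = defaultdict(list)
    for record_id, index, value in rows:
        if index is not None:
            grouped[record_id].append(value)
        else:
            grouped[record_id]  # touch the key so empty records are kept
    return dict(grouped)


records = [
    {"codes": [{"labels": ["Cat", "Dog"]}]},
    {"codes": [{"labels": []}]},
]
rows = relationalize_labels(records)
print(rows)
print(regroup(rows))
```

The extra `regroup` step is the cost mentioned in the conclusion: the engine returns rows, and the client has to reassemble objects.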

> Conclusion

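+------------------------------+---------------------------+
| AWS Athena                   | Elasticsearch             |
+------------------------------+---------------------------+
| Not appropriate in this case | Better fits our needs     |
| Probably "cheaper"           | Probably "more" expensive |
| Self-managed                 | Self-managed or in-house  |
+------------------------------+---------------------------+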

> What's next...

Elasticsearch 6.x with Spotinst Elastigroup

  • Reading about running ES at scale
  • Reviewing ES automation for 6.x
  • Benchmarking to define important cluster settings and cluster size
  • Testing high availability of the cluster with Spotinst
