Monitoring & Logging

Research Results

What

  • Detect machines disappearing
  • Detect containers disappearing
  • Detect infrastructure resources disappearing
  • Detect all of the above becoming unhealthy
  • A record of what is "normal"
  • Investigate cause of failures
  • Correlate log messages

Why

  • Moving away from monolithic solutions to ones composed of cooperating services greatly increases operational complexity
  • Having customers discover that our systems are down before we do is unacceptable and damages our reputation
  • A solution is only as fast as its slowest component
  • Proactive response to stressed servers is more efficient than midnight firefighting
  • An Operations Database can help with sizing questions for both us and customers who want to self-host

How

  • Use monitoring as a service (MaaS) providers to quickly get us running
  • Use MaaS to monitor the boxes, including Docker daemons
  • Use MaaS to monitor shared infrastructure: RabbitMQ, MySQL, MongoDB, etc.
  • Use MaaS to monitor microservice health
  • Use logging as a service (LaaS) to aggregate logs

The Simulation

  • Virtual machine to run the Docker daemon
  • Docker containers for all microservices and infrastructure
  • Client that continually stimulates the cooperating set of services (a sketch of this client follows this list)
  • Enable/disable services and infrastructure to simulate network partitions and dead hardware
  • Use specially crafted messages from the client to introduce latency into the system and simulate ailing components
  • See how quickly failures are detected
  • See how quickly the system failure can be isolated
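
The client and its specially crafted messages are only sketched here; the real message contract lives in the services themselves. A minimal Python stand-in, assuming a hypothetical gateway URL and a made-up "command" field whose "slow" value tells a service to respond slowly:

# simulate_client.py - drives the cooperating services with a steady stream of calls
# NOTE: the endpoint URL and the "command" field are placeholders, not the real contract
import itertools
import json
import time
import urllib.request

GATEWAY = "http://localhost:8080/messages"        # hypothetical API gateway endpoint
MACHINES = ["machine-%d" % n for n in range(10)]  # ten simulated active machines

def send(machine, command):
    body = json.dumps({"sender": machine, "command": command}).encode("utf-8")
    request = urllib.request.Request(GATEWAY, data=body,
                                     headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(request, timeout=5) as response:
            response.read()
    except OSError as failure:
        # a failed call is useful data: note the time so it can be matched against alerts
        print(time.strftime("%H:%M:%S"), machine, "call failed:", failure)

for count in itertools.count():
    machine = MACHINES[count % len(MACHINES)]
    # every twentieth message asks the service to respond slowly, simulating an ailing component
    command = "slow" if count % 20 == 0 else "fast"
    send(machine, command)
    time.sleep(0.5)  # a call every half-second, matching the simulation environment

Simulation Client Sketch (Python)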

Simulated System

Simulation Environment

  • Amazon EC2 based
  • All components available on GitHub
  • All components Docker 1.8.2 based
  • All provisioning is automated via Ansible, Docker Compose and Terraform
  • t2.large instance
  • 2 CPUs
  • 8GB RAM
  • 8GB disk
  • relaxed network rules -- all traffic allowed
  • simulate 10 active machines making calls every half-second

How We Tested Monitoring

  • Install necessary monitoring software
  • Set alerts in console
  • Monitor both dashboard and e-mail for alerts
  • Pump requests through the system
  • Turn individual containers on and off to simulate network partitions and dead hardware (see the sketch after this list)
  • Note the latency between the actual event and when the alert is registered in the dashboard and in e-mail
  • Note how difficult it is to configure the components to be monitor-ready
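
The on/off step can be scripted; a rough sketch, assuming the Docker CLI is on the host and using a placeholder container name. The event time is printed so it can be compared by hand against the alert times seen in the dashboard and e-mail:

# failure_injection.py - stop a container, remember exactly when, then bring it back
# NOTE: the container name is a placeholder; substitute one from the simulated system
import datetime
import subprocess
import time

CONTAINER = "monitor-rabbitmq"

def note(event):
    # record the wall-clock time so it can be compared with alert times later
    print(datetime.datetime.utcnow().isoformat() + "Z", event)

note("stopping " + CONTAINER)
subprocess.run(["docker", "stop", CONTAINER], check=True)
time.sleep(5 * 60)               # leave it down long enough for the alerts to fire
note("starting " + CONTAINER)
subprocess.run(["docker", "start", CONTAINER], check=True)

Failure Injection Sketch (Python)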

The Candidates

Attribute                 New Relic   Datadog   Sysdig Cloud
Agent Based               [x]         [x]
Docker Support            [x]         [x]
Free Tier Available       [x] *       [x] ^
Infrastructure Plug-ins   [x]         [x]
Web Dashboard             [x]         [x]
Mobile Application        [x]         [  ]

* 24-hour data retention, unlimited hosts

^ 24-hour data retention, 5 hosts

Host Monitoring

Scenario          New Relic   Datadog
Dead Instance     [x]         [x]
90% CPU           [x]         [x]
90% Disk Space    [x]         [x]
90% RAM           [N/A]       [N/A]
90% Disk I/O      [N/A]       [N/A]
90% Network I/O   [N/A]       [N/A]

N/A: not attempted

New Relic Server

Datadog Server

Datadog Correlation

Container Monitoring

Scenario           New Relic   Datadog
CPU Usage          [x]         [x]
RAM Usage          [x]         [x]
Disk Usage         [  ]        [  ]
Network Usage      [  ]        [  ]
Container Events   [  ]        [x]

New Relic Docker

Application Monitoring

Scenario                New Relic   Datadog
Available               [x]         [x]
Getting Slow            [^]         [*]
Too Many Errors         [x]         [*]
Record Response Times   [^]         [*]

* possible but not tested

^ requires premium tier

New Relic Application

New Relic Application

Infrastructure Monitoring

Scenario          New Relic   Datadog
Redis down        [x]         [*]
MySQL down        [x]         [*]
PostgreSQL down   [x]         [*]
RabbitMQ down     [x]         [x]
Queue Growing     [  ]        [x]

* possible but not tested

Note: New Relic relies on our own health checks for the availability rows above
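
Those checks depend on each service exposing a health endpoint the monitors can poll. The real services are JVM based; the following is only a minimal Python stand-in, with a hypothetical /operations/health path, to show the shape of such a check:

# health.py - the kind of endpoint an availability check polls
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/operations/health":
            self.send_error(404)
            return
        # a real check would also verify downstream dependencies (database, broker, ...)
        body = json.dumps({"status": "UP"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()

Health Endpoint Sketch (Python)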

New Relic Notes

  • agent is easy to install
  • host monitoring is very good
  • Docker monitoring is very good
  • application monitoring requires extra library *
  • plug-ins are craptaculous
  • can write our own easy enough
  • not possible to alert on plug-in attributes
  • alerting options are very good
  • Android application helps with mobile monitoring
  • set up is done manually in the dashboard

* only tested JVM applications

Datadog Notes

  • agent is easy to install
  • host monitoring is very good
  • Docker monitoring is excellent
  • application monitoring requires extra library *
  • plug-ins are excellent
  • can alert on infrastructure attributes, like queue length
  • alerting options are very good
  • availability checks are done from the agent ^
  • set up can be automated via REST API (not verified; see the sketch below)

* only tested JVM applications

^ New Relic checks from their data center
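
The REST set-up was not verified, so the following is only a sketch of what automating an alert might look like, based on Datadog's documented monitor-creation endpoint; the keys, the query and the notification handle are placeholders:

# create_monitor.py - unverified sketch of automating alert set-up via Datadog's REST API
import json
import urllib.request

API_KEY = "REPLACE_ME"           # placeholder credentials
APP_KEY = "REPLACE_ME"

monitor = {
    "type": "metric alert",
    "query": "avg(last_5m):avg:system.cpu.user{*} by {host} > 90",
    "name": "Host CPU above 90%",
    "message": "CPU has been above 90% for five minutes. @ops-team",
}

request = urllib.request.Request(
    "https://api.datadoghq.com/api/v1/monitor",
    data=json.dumps(monitor).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "DD-API-KEY": API_KEY,
        "DD-APPLICATION-KEY": APP_KEY,
    },
)
with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))

Datadog Alert Automation Sketch (Python, unverified)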

Sysdig Notes

  • two forms: open source CLI tool and cloud monitoring
  • tracks relationships between components
  • appears to have deep insight into containers
  • appears to integrate with our technology stack
  • pricing is $20 per host per month
  • probably worth digging into at some point

Sysdig Historical

Sysdig Alerting

Monitoring Conclusion

  • if all you want is 'are you alive?', then either solution works
  • if you want to alert on infrastructure attributes, then Datadog is the choice
  • Datadog is $15/host, New Relic is $150/host (monthly)
  • New Relic has more features but is more expensive
  • Both systems feed into other systems, like PagerDuty and HipChat, so a hybrid solution is possible
  • Generating an Operations Database is a higher end feature and was not investigated
  • Datadog probably makes sense to focus on right now

How We Tested Logging

  • applications were configured to write to stdout and stderr
  • message format modified to emit JSON (see the sketch after this list)
  • Docker was instructed to forward console streams to the LaaS provider
  • messages were examined to see if they could be used to help troubleshoot down or dying servers
  • rudimentary searching capabilities were tested
  • message correlation searches
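
The services under test are JVM based, so the following is only a small Python illustration of the stdout/JSON idea; the logger name is hypothetical and the field names are loosely modeled on the normalized message format shown later:

# json_logging.py - emit one JSON document per log line on stdout, ready for Docker to forward
import json
import logging
import sys
import uuid

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "component": record.name,
            "correlation-id": getattr(record, "correlation_id", None),
        })

handler = logging.StreamHandler(sys.stdout)   # write to stdout only; Docker forwards the stream
handler.setFormatter(JsonFormatter())
log = logging.getLogger("example.inbound.MessageProcessor")
log.addHandler(handler)
log.setLevel(logging.INFO)

# every message belonging to one request carries the same id, which is what makes correlation searches work
correlation_id = str(uuid.uuid4())
log.info("Just processed message", extra={"correlation_id": correlation_id})

JSON Console Logging Sketch (Python)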

The Candidates

Attribute           Loggly   Found      ELK
Message Format      syslog   multiple   multiple
Free Tier           [x]      [  ]       [x]
Alerting            [*]      [^]        [^]
Automatic Parsing   [*]      [#]        [x]
Integrations        [*]      [#]        [#]

* available but not tested

^ coming soon

# hand configured

Loggly

  • containers configured to use syslog driver
  • rsyslog configuration updated via automation
  • messages began flowing almost immediately
  • both application and infrastructure messages sent
  • consumes JSON format extremely well
  • plenty of power available but requires investment to learn the system
  • logs on disk need to be accounted for on a case-by-case basis
  • logging via HTTP also supported (see the sketch after this list)
  • emission of JSON messages is highly recommended
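
For the HTTP path, a minimal sketch of shipping a single JSON event, assuming Loggly's HTTP/S event endpoint and a placeholder customer token:

# loggly_http.py - send one JSON event over HTTP instead of syslog
import json
import urllib.request

TOKEN = "REPLACE_WITH_CUSTOMER_TOKEN"   # placeholder Loggly customer token
event = {"level": "WARN", "message": "queue depth climbing", "service-code": "monitor-rabbitmq"}

request = urllib.request.Request(
    "https://logs-01.loggly.com/inputs/" + TOKEN + "/tag/http/",
    data=json.dumps(event).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(response.status, response.read().decode("utf-8"))

Loggly HTTP Logging Sketch (Python)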

Loggly Console

Loggly Console

Loggly Console (Java)

Loggly Console (Nginx)

Loggly Console (MySQL)

Loggly Correlation

Found

  • 14-day trial
  • documentation is poor
  • probably assumes expertise with Logstash
  • focus on availability (AWS-based?)

ELK

  • components are deployed via containers
  • requires investment to build out infrastructure
  • requires expertise with Logstash
  • multi-format handling is powerful
  • probably requires standardizing on message formats to ease configuration
  • Docker helps to normalize logging but services that log to disk need to be addressed individually
  • might provide a growth path to Found, if needed

{
  "timestamp": "2015-10-02T15:02:12.455+00:00",
  "message": "Just processed message with the command fast sent from monitor-api-gateway",
  "component": "org.kurron.example.rest.inbound.MessageProcessor",
  "level": "WARN",
  "service-code": "monitor-rabbitmq",
  "realm": "Nashua Endurance Lab",
  "service-instance": "1",
  "message-code": "2008",
  "correlation-id": "54356aa8-ea9a-4dc8-946c-73b3b26b33f6",
  "tags": [
    "QA"
  ]
}

 Normalized Message Format (JSON)

{
  "log": "2015-09-30 19:36:45 16 [Note] InnoDB: The InnoDB memory heap is disabled\n",
  "stream": "stderr",
  "time": "2015-09-30T19:36:45.033111964Z"
}

Docker JSON Log Format

{ 
    "timestamp": "2013-10-11T22:14:15.003123Z",
    "travel": {
        "airplane": "jumbo",
        "mileage": 2034
    }
}

Loggly JSON Log Format
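
When a service already emits JSON but its console output is captured by Docker's json-file driver, each application document arrives wrapped in the Docker JSON Log Format envelope shown above. A small sketch of unwrapping that envelope before shipping, assuming the default json-file location:

# unwrap.py - recover the application's own JSON documents from Docker's json-file envelope
import json
import sys

def unwrap(line):
    envelope = json.loads(line)            # {"log": ..., "stream": ..., "time": ...}
    payload = envelope["log"].strip()
    try:
        message = json.loads(payload)      # the service emitted JSON, keep its structure
    except json.JSONDecodeError:
        message = {"message": payload}     # plain text, wrap it minimally
    message.setdefault("timestamp", envelope["time"])
    message["stream"] = envelope["stream"]
    return message

# usage: python unwrap.py < /var/lib/docker/containers/<id>/<id>-json.log
for raw in sys.stdin:
    if raw.strip():
        print(json.dumps(unwrap(raw)))

Docker Log Unwrapping Sketch (Python)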

Logging Conclusion

  • all solutions require investment in configuration
  • normalizing application message formats is required to reduce the configuration burden
  • use of side-car containers can keep logging concerns out of the application/service container
  • all solutions have decent web consoles
  • Loggly ranges between $45 and $350 a month
  • Found ranges based on configuration, $85 a month appears to be the minimum
  • ELK is free but requires investment in personnel and resources
  • Loggly looks promising so we should focus our attention here and trial it on a project or two

By Ronald Kurr

Proposal for application monitoring and diagnostic solutions