Logging 103

Nir Cohen

VP Engineering@

@nir0s

Not logging 104

Logging Problems

Non-standardized formats

66.249.65.3 - - [06/Nov/2014:19:11:24 +0600] "GET /?q=w00t HTTP/1.1" 200 4223 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

 

PROBLEM

What's the problem here?

[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
2018/01/04 07:45:34 [DEBUG] memberlist: Stream connection from=172.16.26.33:39660
time="2018-01-31T09:31:56.358177267Z" level=error msg="Handler for POST /containers/cae096f6abdc4024c914c4e72763f905cd3039895c7d28454f69bbc73c72b507/stop returned error: Container cae096f6abdc4024c914c4e72763f905cd3039895c7d28454f69bbc73c72b507 is already stopped"
t=2018-01-16T00:21:44+0000 lvl=info msg="Request Completed" logger=context userId=0 orgId=0 uname= method=GET path=/ status=302 remote_addr=172.16.2.149 time_ms=0 size=29 referer=

 

[2018-02-21T17:29:48,099][INFO ][o.e.n.Node               ] [] initializing ...

Feb 01 11:54:39 ip-172-16-23-55 dockerd[1058]: ...
(DEBUG) April 27th 2015AC, Three and 39 minutes: message, key=value

JSON

(or msgpack, thrift, GELF)

SOLUTION

66.249.65.3 - - [06/Nov/2014:19:11:24 +0600] 
"GET /?q=w00t HTTP/1.1" 200 4223 "-" 
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
{
  "host": "66.249.65.3",
  "user": "null",
  "timestamp": "2014-11-06T19:11:24+06:00",
  "method": "GET",
  "request": "/?q=w00t",
  "protocol": "HTTP/1.1",
  "status": "200",
  "body_bytes_sent": "4223",
  "user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1)"
}
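One way to get output like the JSON above from Python's standard `logging` module is a custom formatter. This is a minimal sketch, not a reference implementation: the `JsonFormatter` class and the `access` logger name are made up for illustration, and it assumes structured fields are passed through the stdlib `extra=` argument.

```python
import json
import logging
from datetime import datetime, timezone

# Attributes every LogRecord carries; anything else was passed via `extra=`.
_STANDARD_ATTRS = set(logging.makeLogRecord({}).__dict__) | {"message", "asctime"}


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter: render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy caller-supplied fields (from `extra=`) into the JSON object.
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        # default=str keeps non-serializable values from raising.
        return json.dumps(payload, default=str)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("access")  # assumed logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Request completed", extra={"method": "GET", "status": 200})
```

Keeping numbers as numbers (rather than strings) in the payload makes range queries on fields like `status` easier downstream.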

Contextual Logging

PROBLEM

logger.info('Logging in...')
logger.debug('Creating session...')
...
logger.debug('Retrieving user token')
...
logger.debug('Authentication')
...

What's wrong with this?

Add Context

SOLUTION

logger.info('Logging in...', user_id=user.id, email=user.email)
logger.debug('Creating session...', user_id=user.id, sid=session.id)
...
logger.debug('Retrieving user token', user_id=user.id, sid=session.id)
...
logger.debug('Token retrieved!', user_id=user.id, sid=session.id, token=token.id)
logger.debug('Authenticating...', user_id=user.id, token=token.id)
...

Great!

What's wrong with this?

Context Repetition

PROBLEM

logger.info('Logging in...', user_id=user.id, email=user.email)
logger.debug('Creating session...', user_id=user.id, sid=session.id)  # <-- again?
...
logger.debug('Retrieving user token', user_id=user.id, sid=session.id)  # <-- AGAIN?!
...
logger.debug('Token retrieved!', user_id=user.id, sid=session.id)  # <-- AHHHHH!
logger.debug('Authenticating...', user_id=user.id, token=token.id)  # <-- Death.
...

Context binding

SOLUTION

logger.bind(user_id=user.id, email=user.email)
logger.info('Logging in...')
logger.debug('Creating session...')
...
logger.bind(sid=session.id)
logger.debug('Retrieving user token')
...
logger.bind(token=token.id)
logger.debug('Token retrieved!')
logger.debug('Authenticating...')

logger.unbind('token', 'sid', 'user_id', 'email')
...
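The `bind`/`unbind` calls above are not part of Python's standard `logging` module (libraries such as structlog offer similar ideas). As a rough sketch of the mechanism, assuming a hypothetical `BoundLogger` wrapper around a stdlib logger that forwards the merged context through `extra=`:

```python
import logging


class BoundLogger:
    """Hypothetical wrapper: carries key/value context across log calls."""

    def __init__(self, logger):
        self._logger = logger
        self._context = {}

    def bind(self, **fields):
        # Remember fields so later calls do not have to repeat them.
        self._context.update(fields)

    def unbind(self, *keys):
        # Unknown keys are ignored rather than raising.
        for key in keys:
            self._context.pop(key, None)

    def _log(self, level, message, **fields):
        # Per-call fields override bound ones with the same name.
        merged = {**self._context, **fields}
        self._logger.log(level, message, extra={"context": merged})

    def debug(self, message, **fields):
        self._log(logging.DEBUG, message, **fields)

    def info(self, message, **fields):
        self._log(logging.INFO, message, **fields)


base = logging.getLogger("auth")  # assumed logger name
base.setLevel(logging.DEBUG)
console = logging.StreamHandler()
console.setFormatter(logging.Formatter("%(levelname)s %(message)s %(context)s"))
base.addHandler(console)

logger = BoundLogger(base)
logger.bind(user_id=42, email="user@example.com")
logger.info("Logging in...")
logger.bind(token="t-1")
logger.debug("Token retrieved!")
logger.unbind("token", "user_id", "email")
```

The context here lives on the wrapper instance, so one instance per request (or per thread) is assumed; sharing a single instance across concurrent requests would mix their fields.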

Debug vs. Analysis*

PROBLEM

{
  "host": "66.249.65.3",
  "user": "null",
  "timestamp": "2014-11-06T19:11:24+06:00",
  "method": "GET",
  "request": "/?q=w00t",
  "protocol": "HTTP/1.1",
  "status": "200",
  ...
}

Uncomfortable to debug like this

Log to console also

SOLUTION

{
  "host": "66.249.65.3",
  "user": "null",
  "timestamp": "2014-11-06T19:11:24+06:00",
  "method": "GET",
  "request": "/?q=w00t",
  "protocol": "HTTP/1.1",
  "status": "200",
  ...
}
2014-11-06T19:11:24+06:00 - INFO - Process Request
    host=66.249.65.3
    user=null
    method=GET
    request="/?q=w00t"
    protocol=HTTP/1.1
    status=200
    ...

PERMISSIVE! No musts!
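A sketch of the two outputs side by side, using only the standard `logging` module: the same record goes to a JSON handler (for analysis) and to a human-readable console handler (for debugging). The formatter classes, `build_logger`, and the `extra={"fields": ...}` convention are assumptions made for this example.

```python
import json
import logging
import sys


def _fields(record):
    """Return structured fields attached through extra={"fields": {...}}."""
    return getattr(record, "fields", {})


class JsonLineFormatter(logging.Formatter):
    """Machine-readable output: one JSON object per record."""

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        payload.update(_fields(record))
        return json.dumps(payload, default=str)


class ConsoleFormatter(logging.Formatter):
    """Human-readable output: a header line plus indented key=value pairs."""

    def format(self, record):
        head = f"{self.formatTime(record)} - {record.levelname} - {record.getMessage()}"
        body = [f"    {key}={value}" for key, value in _fields(record).items()]
        return "\n".join([head, *body])


def build_logger(name, json_stream, console_stream=sys.stderr):
    """Attach both formatters to one logger, each on its own stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    for stream, formatter in (
        (json_stream, JsonLineFormatter()),
        (console_stream, ConsoleFormatter()),
    ):
        handler = logging.StreamHandler(stream)
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

In practice the JSON stream would typically be a file or a shipper socket; here it is left as a parameter so either output stays optional.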

User vs. System

PROBLEM

logger.info('Logging in...')
logger.debug('Creating session...')
...
logger.debug('Retrieving user token')
...
logger.debug('Authentication')
...

What's most interesting here?

Events vs. Logs

SOLUTION

logger.info('Logging in...')  # <-- EVENT!
logger.debug('Creating session...')  # <-- Contextual Log
...
logger.debug('Retrieving user token')  # <-- Contextual Log
...
logger.debug('Authentication')  # <-- Contextual Log
...
  • Differentiate events and logs
  • Analyze events all the time
  • Analyze logs when there's a problem or something to optimize (performance, etc.)
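One possible way to differentiate events from logs with the standard `logging` module is a dedicated level plus a filter, so events can be routed to their own destination. This is a sketch under assumptions: the `EVENT` level number (25), the `OnlyEvents` filter, and `configure` are invented for this example.

```python
import logging

EVENT = 25  # assumed value, between INFO (20) and WARNING (30)
logging.addLevelName(EVENT, "EVENT")


class OnlyEvents(logging.Filter):
    """Let through only records logged at the EVENT level."""

    def filter(self, record):
        return record.levelno == EVENT


def configure(logger, event_handler, log_handler):
    """Send events to event_handler and everything to log_handler."""
    logger.setLevel(logging.DEBUG)
    event_handler.addFilter(OnlyEvents())
    logger.addHandler(event_handler)  # analyzed all the time
    logger.addHandler(log_handler)    # read when there is a problem
```

Usage would look like `logger.log(EVENT, 'User logging in')` for the event and plain `logger.debug(...)` for the contextual logs around it.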

Errors that aren't errors

def do_something():
    logger.error('Everything is ok!')

PROBLEM

  • Log fatigue
  • No idea when something actually goes wrong
  • Can't debug because WHAT?!

Err, when err

def do_something():
    logger.error('THIS IS ACTIONABLE!')

SOLUTION

  • Only ACTIONABLE.
  • Only when it potentially affects the user
  • https://stackoverflow.com/questions/2031163/when-to-use-the-different-log-levels

Like so

def login(user):
    logger.event('User logging in', user_id=user)

    logger.debug('Requesting login token', user_id=user)
    try:
        token = request_token(user)
        logger.debug('Login token received', user_id=user)
        logger.error('Maybe token not received!')  # <-- NOT ERROR! WARNING AT BEST!
    except TokenFailure as e:
        logger.error('Failed to receive login token', user_id=user)
        raise ...

    db.register(user_id=user, token=token)
    logger.info('Login successful', user_id=user)

Library Abstraction

PROBLEM

  • Should we develop something ourselves?
  • Should we use open-source?
  • Should we abstract the logger?

Lowest layer over open-source

SOLUTION

  • Don't reinvent the wheel. Someone else already did it.
  • A mature open-source logger is usually better than what you would build yourself
  • Wrap it in the thinnest possible abstraction layer, so upgrading or swapping the library doesn't break your code
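As an illustration of the "thinnest possible layer" idea, the sketch below hides the backend behind a small facade. `AppLogger` and `get_logger` are hypothetical names, and the stdlib `logging` module stands in for whichever open-source library is actually chosen.

```python
import logging
from typing import Any


class AppLogger:
    """Hypothetical facade: the only logging surface application code imports."""

    def __init__(self, name: str) -> None:
        # The backend is an implementation detail; swap it here, not at call sites.
        self._backend = logging.getLogger(name)

    def _emit(self, level: int, message: str, fields: dict) -> None:
        self._backend.log(level, message, extra={"fields": fields})

    def debug(self, message: str, **fields: Any) -> None:
        self._emit(logging.DEBUG, message, fields)

    def info(self, message: str, **fields: Any) -> None:
        self._emit(logging.INFO, message, fields)

    def error(self, message: str, **fields: Any) -> None:
        self._emit(logging.ERROR, message, fields)


def get_logger(name: str) -> AppLogger:
    """Single entry point used by the rest of the codebase."""
    return AppLogger(name)
```

Because callers only ever see `get_logger` and three methods, a library upgrade that changes the backend's API touches this one module.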

Heavy Context Formatting

logger.debug('My Amazing Event', context=context)

PROBLEM

Lazy Formatting

SOLUTION

if is_debug:
    logger.debug('My Amazing Event', context=context)
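With Python's standard `logging` module, the guard above can be written with `isEnabledFor`, and formatting can also be deferred by passing arguments instead of a pre-built string. The `Lazy` helper and `expensive_context` below are illustrative assumptions, not part of any library.

```python
import logging

logger = logging.getLogger("lazy-demo")  # assumed logger name


class Lazy:
    """Defer an expensive computation until the message is actually rendered."""

    def __init__(self, func):
        self._func = func

    def __str__(self):
        return str(self._func())


def expensive_context():
    # Stand-in for costly work such as serializing a large object.
    return {"rows": list(range(3))}


# Option 1: explicit guard, so the context is only built when DEBUG is on.
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("My Amazing Event context=%s", expensive_context())

# Option 2: %-style arguments are interpolated only when the record is
# formatted, so Lazy.__str__ never runs if DEBUG is disabled.
logger.debug("My Amazing Event context=%s", Lazy(expensive_context))
```

Option 2 avoids the `if` at every call site, at the cost of a small wrapper object per call.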

*Local Disk Storage

PROBLEM

  • Log files grow
  • Disk I/O can block the application while it waits on writes
  • Application performance can degrade

Log File Rotation

SOLUTION

  • Use a rotating file appender
  • Rotate by size, not time
  • Preferably log to a different partition
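In Python, size-based rotation is available in the standard library through `logging.handlers.RotatingFileHandler`. A minimal sketch; the function name, the size limit, and the backup count are arbitrary assumptions:

```python
import logging
from logging.handlers import RotatingFileHandler


def build_rotating_logger(name, path, max_bytes=10 * 1024 * 1024, backups=5):
    """Return a logger that rolls `path` over once it reaches max_bytes.

    Old files are kept as path.1 ... path.<backups>; older ones are deleted.
    """
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```

Pointing `path` at a separate partition, as suggested above, keeps a runaway log from filling the disk the application itself depends on.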

Blocking Logger

PROBLEM

  • Logging is a non-functional concern (not part of what the service does for the user)
  • Logger is potentially network dependent
  • Can potentially degrade service

Async Appenders

SOLUTION

  • Can log in the background
  • Even if something goes wrong, who cares?*
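The standard library's `QueueHandler` and `QueueListener` give one way to log in the background: the application thread only puts records on a queue, and a separate thread hands them to the slow handlers. The `make_async` helper is an assumption for this sketch.

```python
import logging
import queue
from logging.handlers import QueueHandler, QueueListener


def make_async(logger, *slow_handlers):
    """Route the logger's records through a queue drained by a background thread."""
    log_queue = queue.Queue(-1)  # unbounded; a bounded queue would cap memory use
    logger.addHandler(QueueHandler(log_queue))
    listener = QueueListener(log_queue, *slow_handlers, respect_handler_level=True)
    listener.start()
    # Call listener.stop() at shutdown to flush whatever is still queued.
    return listener
```

If a slow handler (for example a network appender) stalls, only the listener thread waits; request handling continues, which is the "who cares?" trade-off from the slide.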

Too many log messages


logger.info('Starting...')
logger.debug('Doing...')
logger.debug('Continuing to do...')
logger.debug('Almost done...')
if err:
    logger.error('Error %s', err)

logger.info('Done')

PROBLEM

When do we need them?

Dynamic Levels

SOLUTION 1

  • Perform action
  • on-error, enable debug
  • retry
  • on-success, disable debug
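The four steps above might be sketched as follows with the standard `logging` module. `run_with_debug_on_error` is a hypothetical helper; it assumes the action is safe to retry and, as a simplification, restores the previous level after a single retry whether or not that retry succeeds.

```python
import logging


def run_with_debug_on_error(action, logger):
    """Run action(); if it fails, retry once with DEBUG logging enabled."""
    try:
        return action()
    except Exception:
        previous = logger.level
        logger.setLevel(logging.DEBUG)  # on-error: enable debug
        try:
            return action()             # retry, now with full detail
        finally:
            logger.setLevel(previous)   # back to the previous level afterwards
```

The quiet first attempt keeps normal operation cheap, and the verbose logs exist only for the run that actually needs explaining.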

Retroactive Logging

SOLUTION 2

  • Log to memory
  • on-transaction-success: flush to /dev/null
  • on-transaction-fail: write to appender
  • Useful when you can't do dynamic levels
  • *Notice memory limits
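A rough sketch of retroactive logging on top of the standard `logging` module: records are buffered in memory, then either dropped or forwarded depending on how the transaction ends. `RetroactiveHandler` and `run_transaction` are invented for this example, and one handler per transaction (or per thread) is assumed, since the buffer is not partitioned.

```python
import collections
import logging


class RetroactiveHandler(logging.Handler):
    """Hypothetical handler: hold records in memory until told what to do."""

    def __init__(self, target, capacity=1000):
        super().__init__()
        self._target = target
        # Bounded buffer: the oldest records are dropped once capacity is hit.
        self._buffer = collections.deque(maxlen=capacity)

    def emit(self, record):
        self._buffer.append(record)

    def discard(self):
        """Transaction succeeded: throw the buffered records away."""
        self._buffer.clear()

    def dump(self):
        """Transaction failed: forward everything to the real appender."""
        while self._buffer:
            self._target.handle(self._buffer.popleft())


def run_transaction(action, handler):
    """Run action(); keep its logs only if it raises."""
    try:
        result = action()
    except Exception:
        handler.dump()
        raise
    handler.discard()
    return result
```

The `capacity` argument is where the memory limit from the last bullet shows up: a long transaction loses its earliest records rather than growing without bound.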

Following an event


logger.event('User logging in...')

PROBLEM

How to analyze all logs contextual to this event?

Context ID

cid = logger.event('User logging in', user_id=user.id)
logger.bind(cid=cid)
logger.info('more things in the context of the request')
...

logger.unbind('cid')

SOLUTION
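With the standard `logging` module, a context ID can be attached to every record using `contextvars` and a filter, so that all logs following an event share its ID. The names below (`ContextIdFilter`, `start_event`, `end_event`) are assumptions for this sketch, not an existing API.

```python
import contextvars
import logging
import uuid

# Holds the current context ID; None means "not inside an event".
_cid = contextvars.ContextVar("cid", default=None)


class ContextIdFilter(logging.Filter):
    """Stamp every record with the context ID that is active right now."""

    def filter(self, record):
        record.cid = _cid.get()
        return True


def start_event(logger, message):
    """Generate a context ID, activate it, and log the event itself."""
    token = _cid.set(uuid.uuid4().hex)
    logger.info(message)
    return token


def end_event(token):
    """Deactivate the context ID (the counterpart of unbind('cid'))."""
    _cid.reset(token)
```

After `logger.addFilter(ContextIdFilter())`, a formatter can include `%(cid)s`, and filtering on that one field in the log store returns everything that happened in the context of the event.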

More

  • Logger performance
  • Infrastructure
  • Using logs in the application
  • Global vs. Local
  • Integrations

Some clients

  • Javascript: bunyan (+bunyan-debug-stream!), pino
  • Java: log4j2, logback
  • C#: serilog
  • Python: WRYTE! :) (structlog)

A summary of logging problems and solutions
