Logging 103
Nir Cohen
VP Engineering@
@nir0s
Not logging 104
Logging Problems
Non-standardized formats
66.249.65.3 - - [06/Nov/2014:19:11:24 +0600] "GET /?q=w00t HTTP/1.1" 200 4223 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
PROBLEM
What's the problem here?
[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
2018/01/04 07:45:34 [DEBUG] memberlist: Stream connection from=172.16.26.33:39660
time="2018-01-31T09:31:56.358177267Z" level=error msg="Handler for POST /containers/cae096f6abdc4024c914c4e72763f905cd3039895c7d28454f69bbc73c72b507/stop returned error: Container cae096f6abdc4024c914c4e72763f905cd3039895c7d28454f69bbc73c72b507 is already stopped"
t=2018-01-16T00:21:44+0000 lvl=info msg="Request Completed" logger=context userId=0 orgId=0 uname= method=GET path=/ status=302 remote_addr=172.16.2.149 time_ms=0 size=29 referer=
[2018-02-21T17:29:48,099][INFO ][o.e.n.Node ] [] initializing ...
Feb 01 11:54:39 ip-172-16-23-55 dockerd[1058]: ...
(DEBUG) April 27th 2015AC, Three and 39 minutes: message, key=value
JSON
(or msgpack, thrift, GELF)
SOLUTION
66.249.65.3 - - [06/Nov/2014:19:11:24 +0600]
"GET /?q=w00t HTTP/1.1" 200 4223 "-"
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
{
"host": "66.249.65.3",
"user": "null",
"timestamp": "2014-11-06T19:11:24Z0600",
"method": "GET",
"request": "/?q=w00t",
"protocol": "HTTP/1.1",
"status": "200",
"body_bytes_sent": "4223",
"user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1)"
}
Contextual Logging
PROBLEM
logger.info('Logging in...')
logger.debug('Creating session...')
...
logger.debug('Retrieving user token')
...
logger.debug('Authentication')
...
What's wrong with this?
Add Context
SOLUTION
logger.info('Logging in...', user_id=user.id, email=user.email)
logger.debug('Creating session...', user_id=user.id, sid=session.id)
...
logger.debug('Retrieving user token', user_id=user.id, sid=session.id)
...
logger.debug('Token retrieved!', user_id=user.id, sid=session.id, token=token.id)
logger.debug('Authenticating...', user_id=user.id, token=token.id)
...
Great!
What's wrong with this?
Context Repetition
PROBLEM
logger.info('Logging in...', user_id=user.id, email=user.email)
logger.debug('Creating session...', user_id=user.id, sid=session.id) <-- again?
...
logger.debug('Retrieving user token', user_id=user.id, sid=session.id) <-- AGAIN?!
...
logger.debug('Token retrieved!', user_id=user.id, sid=session.id) <-- AHHHHH!
logger.debug('Authenticating...', user_id=user.id, token=token.id) <-- Death.
...
Context binding
SOLUTION
logger.bind(user_id=user.id, email=user.email)
logger.info('Logging in...', user_id=user.id, email=user.email, sid=session.id)
logger.debug('Creating session...')
...
logger.debug('Retrieving user token')
...
logger.bind(token=token.id)
logger.debug('Token retrieved!')
logger.debug('Authenticating...')
logger.unbind('token', 'user_id', 'email')
...
Debug vs. Analysis*
PROBLEM
{
"host": "66.249.65.3",
"user": "null",
"timestamp": "2014-11-06T19:11:24Z0600",
"method": "GET",
"request": "/?q=w00t",
"protocol": "HTTP/1.1",
"status": "200",
...
}
Uncomfortable to debug like this
Log to console also
SOLUTION
{
"host": "66.249.65.3",
"user": "null",
"timestamp": "2014-11-06T19:11:24Z0600",
"method": "GET",
"request": "/?q=w00t",
"protocol": "HTTP/1.1",
"status": "200",
...
}
2014-11-06T19:11:24Z0600 - INFO - Process Request
host=66.249.65.3
user=null
method=GET
request="/?q=w00t"
protocol=HTTP/1.1
status=200
...
PERMISSIVE! No musts!
User vs. System
PROBLEM
logger.info('Logging in...')
logger.debug('Creating session...')
...
logger.debug('Retrieving user token')
...
logger.debug('Authentication')
...
What's most interesting here?
Events vs. Logs
SOLUTION
logger.info('Logging in...') <-- EVENT!
logger.debug('Creating session...') <-- Contextual Log
...
logger.debug('Retrieving user token') <-- Contextual Log
...
logger.debug('Authentication') <-- Contextual Log
...
- Differentiate events and logs
- Analyze events all the time
- Analyze logs when there's a problem or something to optimize (perf, etc..)
Errors that aren't errors
def do_something():
logger.error('Everything is ok!)
PROBLEM
- Log fatigue
- No idea when something actually goes wrong
- Can't debug because WHAT?!
Err, when err
def do_something():
logger.error('THIS IS ACTIONABLE!)
SOLUTION
- Only ACTIONABLE.
- Only when it potentially affects the user
- https://stackoverflow.com/questions/2031163/when-to-use-the-different-log-levels
Like so
def login(user):
logger.event('User logging in', user_id=user)
logger.debug('Requesting login token', user_id=user)
try:
token = request_token(user)
logger.debug('Login token received', user_id=user)
logger.error('Maybe token not received!') <-- NOT ERROR! WARNING AT BEST!
except TokenFailure as e:
logger.error('Failed to receive login token', user_id=user)
raise ...
db.register(user_id=user, token=token)
logger.info('Login successful', user_id=user)
Library Abstraction
PROBLEM
- Should we develop something ourselves?
- Should we use open-source?
- Should we abstract the logger?
Lowest layer over OS
SOLUTION
- Don't invent the wheel. Someone else already did it.
- Open-source is better than what you can do
- Create lowest layer of abstraction to not break things when updating a version
Heavy Context Formatting
logger.debug('My Amazing Event', context=context)
PROBLEM
Lazy Formatting
SOLUTION
if is_debug:
logger.debug('My Amazing Event', context=context)
*Local Disk Storage
PROBLEM
- Log files grow
- Disk can be a CPU blocker
- Application performance can degrade
Log File Rotation
SOLUTION
- Use a rotating file appender
- Rotate by size, not time
- Preferably log to a different partition
Blocking Logger
PROBLEM
- Logger is non-functional
- Logger is potentially network dependent
- Can potentially degrade service
Async Appenders
SOLUTION
- Can log in the background
- Even if something goes wrong, who cares?*
Too many log messages
logger.info('Starting...')
logger.debug('Doing...')
logger.debug('Continuing to do...')
logger.debug('Almost done...')
if err:
logger.error('Error %s', err)
logger.info('Done')
PROBLEM
When do we need them?
Dynamic Levels
SOLUTION 1
- Perform action
- on-error, enable debug
- retry
- on-success, disable debug
Retroactive Logging
SOLUTION 2
- Log to memory
- on-transaction-success: flush to /dev/null
- on-transaction-fail: write to appender
- Useful when you can't do dynamic levels
- *Notice memory limits
Following an event
logger.event('User logging in...')
PROBLEM
How to analyze all logs contextual to this event?
Context ID
cid = logger.event('User logging in', user_id=user.id)
logger.bind(cid)
logger.info('more things in the context of the request')
...
logger.unbind('cid')
SOLUTION
More
- Logger performance
- Infrastructure
- Using logs in the application
- Global vs. Local
- Integrations
Some clients
- Javascript: bunyan (+bunyan-debug-stream!), pino
- Java: log4j2, logback
- C#: serilog
- Python: WRYTE! :) (structlog)
Logging 103
By Nir Cohen
Logging 103
A summary of logging problems and solutions
- 1,361