ISO Emeritus, NTPSec
Senior Systems Analyst, IU CACR
@HedgeMage // http://security.engineering
“Never doubt that a small group of thoughtful, committed citizens can change the world. Indeed, it is the only thing that ever has.”
― Margaret Mead
What is NTP?
- Network Time Protocol: the primary way most computers throughout the world find out what time it is, and maintain synchronization with one another and the actual passage of time.
- The reference implementation, in software, of that protocol: both the server and client side, plus the algorithms that use that information to regulate system clocks.
In February 2015, NTP was also a gigantic mess.
Not yet C99 compliant.
Fragile build system.
Documentation between six and thirty years out of date.
Code locked up in a proprietary SCM system.
Technical debt dating back decades.
The Security Nightmare
NTP was Critical
- Cryptography & Authentication
- Logging: Systems Administration & Security
- Navigation & Location
NTP was Insecure
Vulnerability patches going public on a months-to-years response cycle.
Patches circulated in private were weaponized and used to exploit servers across the internet.
Lack of access to development history made it difficult to audit the code and/or take on improvements as a drive-by contributor.
The overall state of the software, build infrastructure, and community made NTP brittle, full of vulnerabilities, and difficult to improve.
“given enough eyeballs, all bugs are shallow”
I learned how deep the rabbit hole went...
No OSS gets broken to the point of crisis without a driving set of systemic social problems. If these are not addressed, any repair will be short-term, as the underlying cause of the original technical problems will continue to cause new technical problems.
To his credit...
NTP's maintainer asked for help.
In NTP's Case:
Poor resource allocation
Hostility to new contributors
- Clinging to broken process and tooling as a mechanism of control.
Bringing Order to Chaos
Decide you are going to be responsible.
Any Critical Software Rescue:
- Set a clear, concrete, finite scope.
- Expect drama. Forgive drama.
- Spend time with people -- split technical and social leadership positions if needed to make this investment possible.
- Keep perspective: the purpose of a rescue is long-term sustainability. Any other goal may be sacrificed to support this one.
How do you set a scope when you know there are unseen bugs lurking everywhere, and you are not deeply familiar with the code base?
The code's needs are the clearest part of the project scope.
Fixing bugs is temporary. More bugs are coming.
Long-term impact comes from making bugs easier to fix, and eliminating or preventing classes of bugs.
A good rescue results in a long tail of bug fixing.
High-Return Technical Improvements:
- Code Access
- Build Process
- Testing Infrastructure and Automation
Refactors that accomplish:
- Major code reduction
- Major improvements in internal compartmentation
- Major tightening of internal APIs
- Migration away from dangerous dependencies
- Bugs that are immediate security crises.
What this meant for the NTP rescue's technical goals:
- Migrate from BitKeeper to Git.
- Replace brittle build system with a modern, Waf-based build.
- Update documentation enough to start onboarding new developers.
- Fix as many security problems as possible before our time and money ran out.
- Repository & Access
- Build System
- Communication Channels
People, Drama, and Project Sustainability
...by which I mean that I know everybody, including people who are better coders than I, and better wielders-of-bureaucracy, and people who know people...
I was lucky.
We needed programmers
Familiar with ancient C code
Experienced in Linux/UNIX systems programming
Capable of working on highly critical code
With some idea how time works
Who care about open source and security
Who can spend a lot of time on this.
We also needed:
A way to keep those programmers fed
Help with documentation and toolchain work
Means to demonstrate to the existing NTP community that we weren't abandoning them
An understanding of the existing install base that we didn't have
The means to maintain the code, documentation, and community post-rescue
Some way to convince people to actually deploy the thing
Harlan Stenn - NTP Classic Maintainer
Adam Nuwer - Volunteer Sysadmin, Community Member
Von Welch - My Boss, CACR Director, CTSC PI
Anita Nikolich - NSF PM for CTSC
Members of the NTP Classic Community
Tim Minick - then of Gemini Observatory
Eric Raymond - (yes, that ESR) GPSd maintainer, Software Architect
Gary Miller -- GPSd Software Architect
Amar Takhar - former NTP Classic team member, build system geek
Leslee Cooper - CACR Admin Director, got me an awesome student intern (NaLette Brodnax) for docs work!
Many, many people who answered nosy questions about their NTP usage.
Mark Atwood -- Took the handoff as NTPSec Project Manager
Daniel Franke -- Took the handoff as NTPSec ISO
Many other people I've failed to name.
Two administrative staff.
2-4 semi-active community members.
Susan Sons, PM / ISO
Eric Raymond, lead dev
Gary Miller, developer
NaLette Brodnax, docs
Amar Takhar, tools dev
...and a handful of concerned community members.
Much to my personal disappointment...
...I didn't find myself writing code on this one.
It turned out that I had some great (read: better than me) systems programmers to hand, including a more experienced software architect.
I was able to help out with some specific information security concerns, in my role as Information Security Officer, and play Security Architect as needed, but my biggest impact was undoubtedly making the project run...
What it takes to manage a critical software rescue:
- Deep understanding of the problem domain, of software engineering process in general, and of people.
The worst mistake one can make is to misidentify the problem.
- Relationships: find the right people at the right time.
- A little resilience: stay calm, be ready to adapt, and stand between your team and as much of the chaos as possible.
- EITHER coding and software architecture expertise OR a close, long-standing working relationship with a coder and software architect who will be key to the rescue.
I can't teach you my whole process in this talk, but...
- When Sputnik crashes down on your head, resist the urge to react immediately, unless it's to prevent immediate loss of life. Gather information, start identifying the problem and scoping a response, and talk to people.
- Do not try to make a smooth-running project with no margin for error. Planning for drama and messiness and being able to absorb it is a winning strategy.
- Write. Write down your background planning, your thinking, your project scope. Then, communicate with people face-to-face (or by teleconference) and follow up in writing.
- Be kinder to everyone than you need to be; be empathetic even when people are wrong. Not because you're a sap, but because it's how you get people to do the things you want.
So, how did the story end?
As of October 2016...
NTPSec has a team of two senior developers, one experienced project manager, one junior developer, one information security officer, and one toolchain maintainer-slash-sysadmin, aided by about a dozen interested and engaged community members.
Thanks to a code reduction of over 2/3 (from 227 kLOC to 74 kLOC), NTPSec was immune to over 50% of the NTP Classic vulns from the last year BEFORE they were discovered.
NTPSec patches security vulnerabilities, on average, less than 12 hours after discovery. Note that publication is sometimes slowed to coordinate with NTP Classic releases.
- NTPSec's vulnerability response, along with funders threatening to pull out, has pressured NTP Classic to speed up their response from months-to-years to days-to-weeks.
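The code-reduction figure quoted above can be checked with a few lines of arithmetic. This is only a sanity check on the two numbers given on the slide (227 kLOC and 74 kLOC); the variable names are made up for illustration.

```python
# Figures quoted on the slide, in thousands of lines of code (kLOC).
classic_kloc = 227
ntpsec_kloc = 74

# How much code was removed, and what fraction of the original that is.
removed_kloc = classic_kloc - ntpsec_kloc
reduction = removed_kloc / classic_kloc

print(f"removed: {removed_kloc} kLOC")   # removed: 153 kLOC
print(f"reduction: {reduction:.1%}")     # reduction: 67.4%
print(reduction > 2 / 3)                 # True
```

So "over 2/3" holds, if only just: 153 of the original 227 kLOC went away.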
I moved on...
NTPSec's core team has been through a lot, but we still meet up about once a year and hang out, because it was a wild ride with good people. I was given an emeritus title when I stepped down last spring, in the hope that I'd remain "part of the family".
There is still so much vulnerable infrastructure software...
How many currently active committers account for >50% of the code base?
Breakdown by Dave Nalley:
Why does it matter?
- OpenSSL (think Heartbleed)
- Bash (think Shellshock)
- Costs of personnel turnover
- Costs of neglect
- Risk of malicious compromise
Active committers in widely used OSS projects:
Image credit: Dave Nalley
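The question asked above, how many active committers account for more than 50% of a code base, can be sketched as a small calculation. This is a minimal illustration, not the method behind Dave Nalley's breakdown: the function name, the idea of counting attributed lines per committer, and the sample numbers are all assumptions.

```python
def committers_for_majority(lines_by_committer, threshold=0.5):
    """Return the smallest number of committers whose combined share
    of the attributed lines exceeds `threshold` (0.5 means >50%)."""
    total = sum(lines_by_committer.values())
    if total <= 0:
        raise ValueError("no attributed lines")
    covered = 0
    # Taking the largest contributors first yields the minimum count.
    ranked = sorted(lines_by_committer.values(), reverse=True)
    for count, lines in enumerate(ranked, start=1):
        covered += lines
        if covered > threshold * total:
            return count
    # Threshold never exceeded (e.g. threshold=1.0): everyone is needed.
    return len(lines_by_committer)


# Made-up numbers for illustration only; not data from any real project.
example = {"alice": 52_000, "bob": 9_000, "carol": 7_500, "dave": 3_000}
print(committers_for_majority(example))
```

With these hypothetical numbers, one person holds most of the code, which is exactly the kind of project where turnover or neglect hurts most.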
The sky is falling...
...but it's going to be okay.
This is what I do.
What I want from you is a little bit of help.
Do something about crumbling, insecure internet infrastructure
This deck is at: https://slides.com/hedgemage/savingtime
To Wikimedia Foundation for their awesome library of freely reusable media, which spared you from my toddler-like drawing ability.
To Indiana University's Center for Applied Cybersecurity Research, and specifically the NSF-funded Center for Trustworthy Scientific Cyberinfrastructure, who funded the NTP Rescue project. Also to the Internet Civil Engineering Institute, who aided with organization and developer resources.
To O'Reilly, for bringing me here to tell you this story.
To the NTP Security Project team, who made sure the rescue effort didn't go to waste. NTPSec is poised to replace NTP Classic in the coming year in installations around the world.
To the countless individual humans along the way who did NOT say "this is somebody else's problem".
Using and Sharing This Work:
Saving Time by Susan Sons is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Permissions beyond the scope of this license may be available; send inquiries to firstname.lastname@example.org.
By Susan Sons