Building your own database


Steven Rich

The Washington Post



I would prefer not to have to give this talk

But not everything is already in a database or spreadsheet

Why should I build my own database?

There is nothing in this world as satisfying as building your own homemade, artisanal database

Also, you have a data set that is unique and newsworthy

Why shouldn't I build my own database?

  • The data already exists in a database format

  • Someone else has already done it

  • The work involved is insane and will kill you


Why do we need this session at all?

  • There is no one-size-fits-all solution

  • The work is often tedious and exacting

  • You need to have a full game plan when you start


Ways to build a database from scratch

  • Web scraping
  • Public records
  • News stories
  • By hand
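
As a rough illustration of the web scraping route, here is a minimal sketch that turns an HTML table into rows using only Python's standard library. The sample HTML, the class name TableRowParser and the column names are made up for this example; a real scraper would first download the page and would need to cope with messier markup.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical page fragment, standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>City</th></tr>
  <tr><td>Jane Doe</td><td>Baltimore</td></tr>
  <tr><td>John Roe</td><td>Richmond</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE_HTML)
header, *records = parser.rows  # first row is the header, the rest are data
```

The rows can then be written to a spreadsheet or loaded into whichever database tool the project uses.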

The most important step in this whole process is determining what fields will be kept

Utility vs. Feasibility

Early vs. Late Grouping


Scope vs. Time

Thinking about the story

There are a bunch of ways to do this from a technical standpoint

  • Microsoft Excel

  • Google Sheets

  • Google Forms

  • SQL databases

  • Django
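
To make the SQL database option concrete, here is a minimal sketch using SQLite through Python's built-in sqlite3 module. The table name and every field in it are hypothetical placeholders; the point is only that the fields you decided to keep get written down as a schema before any data entry starts.

```python
import sqlite3

# Hypothetical fields for illustration only; replace with the fields
# your own project decided to keep.
SCHEMA = """
CREATE TABLE IF NOT EXISTS incidents (
    id            INTEGER PRIMARY KEY,
    incident_date TEXT NOT NULL,
    state         TEXT NOT NULL CHECK (length(state) = 2),
    source_url    TEXT NOT NULL,
    notes         TEXT
);
"""


def create_database(path=":memory:"):
    """Open (or create) the database file and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = create_database()  # in-memory here; pass a filename to keep the data
conn.execute(
    "INSERT INTO incidents (incident_date, state, source_url) VALUES (?, ?, ?)",
    ("2015-03-07", "VA", "https://example.com/story"),
)
conn.commit()
row_count = conn.execute("SELECT COUNT(*) FROM incidents").fetchone()[0]
```

The NOT NULL and CHECK constraints mean the database itself rejects a row with a missing date or a state that is not two characters long.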

How do we build it?

A team approach

Create fail-safes

Create clear standards for data entry
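
One way a fail-safe and an entry standard can work together is a small check that every entered row passes through before it is accepted. The sketch below is only an illustration: the field names and the rules (a YYYY-MM-DD date, a two-letter state code, a required source) are assumptions, not standards from this talk.

```python
from datetime import datetime

# Hypothetical required fields; adjust to your own data entry standards.
REQUIRED_FIELDS = ("incident_date", "state", "source_url")


def validate_record(record):
    """Return a list of problems found in one entered row (empty if clean)."""
    problems = []

    # Every required field must be present and non-blank.
    for field in REQUIRED_FIELDS:
        if not (record.get(field) or "").strip():
            problems.append(f"{field} is missing")

    # Dates must parse as year-month-day.
    raw_date = record.get("incident_date") or ""
    if raw_date:
        try:
            datetime.strptime(raw_date, "%Y-%m-%d")
        except ValueError:
            problems.append(f"incident_date {raw_date!r} does not parse as YYYY-MM-DD")

    # States must be entered as two uppercase letters.
    state = record.get("state") or ""
    if state and not (len(state) == 2 and state.isalpha() and state.isupper()):
        problems.append(f"state {state!r} is not a two-letter code")

    return problems


good = {"incident_date": "2015-03-07", "state": "VA", "source_url": "https://example.com/a"}
bad = {"incident_date": "3/7/2015", "state": "Virginia", "source_url": ""}
good_problems = validate_record(good)
bad_problems = validate_record(bad)
```

Running the same check on every teammate's rows catches drift in how people type dates or places before it spreads through the data.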


A programmatic approach

Create fail-safes

Create clear standards for data entry


Ideal, if possible

May not be necessary

Basic strategies to get into a good place


convert -density 300 input.pdf -depth 8 -background white -alpha Off output.tiff


tesseract output.tiff input_ocr -l eng pdf
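
For more than a handful of files, the two steps above can be wrapped in a loop. This is a sketch under stated assumptions: it assumes the first flag list belongs to ImageMagick's convert and the second to tesseract, that both programs are installed and on the PATH, and the function names ocr_commands and ocr_folder and the "_ocr" output suffix are invented for the example.

```python
import shutil
import subprocess
from pathlib import Path


def ocr_commands(pdf_path):
    """Build the two command lines for one PDF, reusing the flags shown above."""
    pdf = Path(pdf_path)
    tiff = pdf.with_suffix(".tiff")
    # tesseract adds the .pdf extension to this base name itself.
    out_base = pdf.with_name(pdf.stem + "_ocr")
    convert_cmd = [
        "convert", "-density", "300", str(pdf), "-depth", "8",
        "-background", "white", "-alpha", "Off", str(tiff),
    ]
    tesseract_cmd = ["tesseract", str(tiff), str(out_base), "-l", "eng", "pdf"]
    return convert_cmd, tesseract_cmd


def ocr_folder(folder):
    """Run both steps on every PDF in a folder, stopping at the first failure."""
    for tool in ("convert", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} was not found on the PATH")
    for pdf in sorted(Path(folder).glob("*.pdf")):
        if pdf.stem.endswith("_ocr"):
            continue  # skip output from an earlier run
        for cmd in ocr_commands(pdf):
            subprocess.run(cmd, check=True)


# Building the commands does not require either tool to be installed.
convert_cmd, tesseract_cmd = ocr_commands("scans/report.pdf")
```

Calling ocr_folder("scans") would then leave a searchable report_ocr.pdf next to each original. Newer ImageMagick releases may expect the command to be called magick rather than convert.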





Huge investment

Could be competitive

It's a gamble

It's fragile

Do you continue to keep it?

Any questions?

