B.Y.O.D.

Building your own database

 

Steven Rich

The Washington Post

#NICAR19

bit.ly/byodatabase

I would prefer not to have to give this talk

But not everything is already in a database or spreadsheet

Why should I build my own database?

There is nothing in this world as satisfying as building your own homemade, artisanal database

Also, you have a data set that is unique and newsworthy

Why shouldn't I build my own database?

  • The database already exists in a database format

  • Someone else has already done it

  • The work involved is insane and will kill you

WARNINGS

  • There is no one-size-fits-all solution

  • The work is often tedious and exacting

  • You need to have a full game plan when you start

Why do we need this session at all?

Examples

Ways to build a database from scratch

  • Web scraping
  • Public records
  • News stories
  • By hand

The most important step in this whole process is determining what fields will be kept

Utility vs. Feasibility

Early vs. Late Grouping

Standardization
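One way to picture standardization is to pick a canonical form for each field and map every variant to it at entry time. The sketch below is only an illustration: the variant table, the accepted date formats, and the function names are hypothetical and would need to be adapted to the data set at hand.

```python
from datetime import datetime

# Hypothetical lookup of variants seen during data entry -> canonical value.
STATE_VARIANTS = {
    "md": "MD", "md.": "MD", "maryland": "MD",
    "va": "VA", "va.": "VA", "virginia": "VA",
}

# Date formats we are willing to accept (an assumption; extend as needed).
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y")


def standardize_state(raw):
    """Return the canonical state code, or raise if the value is unknown."""
    key = raw.strip().lower()
    if key not in STATE_VARIANTS:
        raise ValueError(f"Unrecognized state: {raw!r}")
    return STATE_VARIANTS[key]


def standardize_date(raw):
    """Return the date as an ISO 8601 string (YYYY-MM-DD)."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {raw!r}")
```

Raising on unknown values, rather than passing them through, forces a decision about each new variant instead of letting it slip into the data.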

Scope vs. Time

Thinking about the story

There are a bunch of ways to do this from a technical standpoint

  • Microsoft Excel

  • Google Sheets

  • Google Forms

  • SQL databases

  • Django

How do we build it?

A team approach

Create fail safes
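If the database lives in SQL, some fail-safes can be pushed into the schema so that bad rows are rejected at entry. This is a minimal sketch using Python's built-in `sqlite3` module; the table, the columns, and the limits are hypothetical and chosen only to show the idea.

```python
import sqlite3

# Hypothetical schema: the table and column names are illustrative only.
SCHEMA = """
CREATE TABLE incidents (
    id          INTEGER PRIMARY KEY,
    source_url  TEXT NOT NULL UNIQUE,
    state       TEXT NOT NULL CHECK (length(state) = 2),
    victim_age  INTEGER CHECK (victim_age BETWEEN 0 AND 120)
);
"""


def open_database(path=":memory:"):
    """Create the table in a new database and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_incident(conn, source_url, state, victim_age):
    """Insert one row; return True on success, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO incidents (source_url, state, victim_age) "
                "VALUES (?, ?, ?)",
                (source_url, state, victim_age),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Here the `UNIQUE` constraint blocks a record from being entered twice, and the `CHECK` constraints catch unabbreviated states and implausible ages.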

Creating clear standards for data entry

Spot-checking
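Spot-checking can be as simple as pulling a random sample of rows and re-verifying each against its source. A minimal sketch, assuming every row has an ID; the function name and the 5 percent default are hypothetical choices, not a recommended rate.

```python
import random


def pick_spot_checks(row_ids, fraction=0.05, seed=None):
    """Return a sorted random sample of row IDs to re-verify against the source."""
    ids = list(row_ids)
    if not ids:
        return []
    # Always check at least one row, and never more rows than exist.
    count = min(len(ids), max(1, round(len(ids) * fraction)))
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return sorted(rng.sample(ids, count))
```

Passing a fixed `seed` lets a second person reproduce the same sample and check the checker.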

A programmatic approach

Create fail safes

Creating clear standards for data entry

Spot-checking

Ideal, if possible

May not be necessary

Basic strategies to get into a good place

OCR

convert -density 300 input.pdf -depth 8 -background white -alpha Off output.tiff

tesseract output.tiff input_ocr -l eng pdf
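For more than a handful of files, the two OCR steps can be wrapped in a script. The sketch below assumes the flags on this slide belong to ImageMagick's `convert` and to the `tesseract` command-line tool, which the slide does not name; on ImageMagick 7 the command is typically `magick` instead. The function names and the output naming are hypothetical.

```python
import subprocess
from pathlib import Path


def build_ocr_commands(pdf_path):
    """Return two argv lists: PDF -> TIFF, then TIFF -> searchable PDF."""
    pdf = Path(pdf_path)
    tiff = pdf.with_suffix(".tiff")
    out_base = pdf.with_name(pdf.stem + "_ocr")  # Tesseract appends ".pdf"
    to_tiff = [
        "convert", "-density", "300", str(pdf),
        "-depth", "8", "-background", "white", "-alpha", "Off", str(tiff),
    ]
    ocr = ["tesseract", str(tiff), str(out_base), "-l", "eng", "pdf"]
    return [to_tiff, ocr]


def run_ocr(pdf_path):
    """Run both steps; requires ImageMagick and Tesseract on the PATH."""
    for argv in build_ocr_commands(pdf_path):
        subprocess.run(argv, check=True)  # stop at the first failing step
```

Calling `run_ocr("input.pdf")` would write `input.tiff` and then `input_ocr.pdf` next to the original, provided both tools are installed.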

Tabula

tabula.technology

Scripting

Pitfalls

Huge investment

Could be competitive

It's a gamble

It's fragile

Do you continue to keep it?

Any questions?
