B.Y.O.D.
Building your own database
Steven Rich
The Washington Post
#NICAR19
bit.ly/byodatabase
I would prefer not to have to give this talk
But not everything is already in a database or spreadsheet
Why should I build my own database?
There is nothing in this world as satisfying as building your own homemade, artisanal database
Also, you have a data set that is unique and newsworthy
Why shouldn't I build my own database?
- The data already exists in a database format
- Someone else has already done it
- The work involved is insane and will kill you
WARNINGS
- There is no one-size-fits-all solution
- The work is often tedious and exacting
- You need to have a full game plan when you start
Why do we need this session at all?
Examples
Ways to build a database from scratch
- Web scraping
- Public records
- News stories
- By hand
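For the web scraping route, a minimal sketch using only Python's standard library shows how the rows of an HTML table can be pulled into lists. The sample HTML and the class name `TableRows` are made up for illustration; a real page will likely need more handling (nested tables, pagination, rate limits).

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row being built, or None outside a <tr>
        self._cell = None   # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample page, not taken from any real site.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Date</th></tr>
  <tr><td>Jane Doe</td><td>2019-03-07</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE_HTML)
rows = parser.rows
```

Each inner list in `rows` is one table row, ready to be written out to a spreadsheet or database.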
The most important step in this whole process is determining what fields will be kept
Utility vs. Feasibility
Early vs. Late Grouping
Standardization
Scope vs. Time
Thinking about the story
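Standardization is the consideration above that lends itself most to code. One possible sketch, assuming a field where the same value shows up under several spellings: map every known variant to one canonical value and flag anything unknown for a human. The alias table and the function name `standardize_state` are hypothetical.

```python
# Hypothetical alias table: every spelling seen in the source maps to one code.
STATE_ALIASES = {
    "md": "MD", "md.": "MD", "maryland": "MD",
    "va": "VA", "va.": "VA", "virginia": "VA",
}


def standardize_state(raw):
    """Return the canonical code, or None so unknown values get reviewed by hand."""
    key = " ".join(raw.split()).lower()  # collapse whitespace, ignore case
    return STATE_ALIASES.get(key)
```

Returning `None` instead of guessing keeps misspellings visible rather than silently folding them into the wrong group.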
There are a bunch of ways to do this from a technical standpoint
- Microsoft Excel
- Google Sheets
- Google Forms
- SQL databases
- Django
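If the SQL route is chosen, the schema itself can enforce some of the decisions about fields. The following is only a sketch using Python's built-in `sqlite3` module; the table name `incidents` and its columns are invented for illustration, and the right constraints depend entirely on the data set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for the example
conn.execute(
    """
    CREATE TABLE incidents (
        id INTEGER PRIMARY KEY,
        incident_date TEXT NOT NULL
            CHECK (incident_date GLOB '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]'),
        state TEXT NOT NULL CHECK (length(state) = 2),
        source_url TEXT NOT NULL
    )
    """
)

insert_sql = "INSERT INTO incidents (incident_date, state, source_url) VALUES (?, ?, ?)"

# A row that follows the entry standards is accepted.
conn.execute(insert_sql, ("2019-03-07", "MD", "https://example.com/story"))

# A row that breaks them is rejected by the CHECK constraints.
try:
    conn.execute(insert_sql, ("3/7/19", "Maryland", "https://example.com/story"))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM incidents").fetchone()[0]
```

The pattern check only verifies the shape of the date, not that it is a real calendar date, so it complements spot-checking rather than replacing it.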
How do we build it?
A team approach
Create fail-safes
Creating clear standards for data entry
Spot-checking
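Spot-checking can be as simple as pulling a random share of records and re-verifying them against the original source. A minimal sketch, assuming records have numeric IDs; the function name `spot_check_sample` and the 5 percent share are placeholders, not a recommendation from the talk.

```python
import random


def spot_check_sample(record_ids, share=0.1, seed=None):
    """Pick a random share of records (at least one) to re-check against the source."""
    ids = list(record_ids)
    if not ids:
        return []
    k = max(1, round(len(ids) * share))
    # A fixed seed makes the sample reproducible, so a teammate can draw the same one.
    return sorted(random.Random(seed).sample(ids, k))


sample = spot_check_sample(range(1, 201), share=0.05, seed=19)
```

If the error rate in the sample is high, that is a signal to revisit the data entry standards before entering more records.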
A programmatic approach
Create fail-safes
Creating clear standards for data entry
Spot-checking
Ideal, if possible
May not be necessary
Basic strategies to get into a good place
OCR
convert -density 300 input.pdf -depth 8 -background white -alpha Off output.tiff
tesseract output.tiff input_ocr -l eng pdf
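To run the same two steps over many files, the commands can be wrapped in a short script. This sketch assumes the first command above is ImageMagick's `convert` and the second is Tesseract's `tesseract`, and that both are installed and on the PATH; the function names are hypothetical. Newer ImageMagick installs may expose the tool as `magick` instead.

```python
import subprocess
from pathlib import Path


def ocr_commands(pdf_path):
    """Build the two command lines for one PDF without running them."""
    pdf = Path(pdf_path)
    tiff = pdf.with_suffix(".tiff")
    out_base = pdf.with_name(pdf.stem + "_ocr")  # tesseract appends .pdf itself
    to_tiff = ["convert", "-density", "300", str(pdf), "-depth", "8",
               "-background", "white", "-alpha", "Off", str(tiff)]
    to_pdf = ["tesseract", str(tiff), str(out_base), "-l", "eng", "pdf"]
    return [to_tiff, to_pdf]


def ocr_pdf(pdf_path):
    """Run both steps; raises CalledProcessError if either tool fails."""
    for cmd in ocr_commands(pdf_path):
        subprocess.run(cmd, check=True)


commands = ocr_commands("input.pdf")
```

Calling `ocr_pdf` on each file in a folder would then produce a searchable `*_ocr.pdf` next to each original, assuming the tools behave as described.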
Tabula
tabula.technology
Scripting
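Scripting often means turning lines of extracted text into rows. The sketch below assumes a made-up line layout (date, name, dollar amount) and uses a regular expression to parse it; anything that does not match is set aside for manual review rather than dropped. The pattern, sample lines and names are all hypothetical.

```python
import csv
import io
import re

# Hypothetical layout: "MM/DD/YYYY  Name  $1,234.56"
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<name>.+?)\s+\$(?P<amount>[\d,]+\.\d{2})$"
)


def parse_lines(lines):
    """Split lines into parsed records and lines that need a human look."""
    records, unparsed = [], []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            rec = match.groupdict()
            rec["amount"] = rec["amount"].replace(",", "")  # drop thousands separators
            records.append(rec)
        elif line.strip():
            unparsed.append(line)
    return records, unparsed


SAMPLE = [
    "03/07/2019  Jane Doe  $1,250.00",
    "Page 1 of 3",
    "03/08/2019  John Q. Public  $75.50",
]
records, unparsed = parse_lines(SAMPLE)

# Write the parsed records as CSV text (a file object would work the same way).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["date", "name", "amount"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
```

Keeping the `unparsed` list is a cheap fail-safe: its length shows how much of the source the script did not understand.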
Pitfalls
Huge investment
Could be competitive
It's a gamble
It's fragile
Do you continue to keep it?
Any questions?