Investigative reporting with databases

By Luke Rosiak
The Daily Caller
Previously:  Washington Examiner, Washington Times, Sunlight Foundation


"You see Lucille there? You ought to see her when they play 'Tequila Makes Her Clothes Fall Off,'" Mr. Byron says.

Data can be your secret weapon to find stories others miss, and work faster than your competitors.

  • Data is just a window into the real world. Use it to generate story leads, and then report them out.
  • Newspaper articles aren't academic papers. Don't bog them down with numbers.
  • Your anecdotal leads will actually mean something, and your trend pieces will represent actual trends instead of just something a writer noticed.




I used math and GIS (spatial software) to identify the most-gerrymandered Congressional districts. Lets you be authoritative, not subjective--and lets you automate redundant tasks.

Challenging assumptions

  • Data can test conventional wisdom
  • Anyone can have an ideology-driven opinion, but data helps you get at the truth
  • Spokesmen and anyone else can manipulate traditional statements, but if they don't know how to make sense of data, they don't know what they should try to censor or hide from you.


Recognize patterns, create methodologies and apply them en masse

  • Data is especially helpful in writing at national publications with readers everywhere.
  • You write some formula that could show something interesting, run it on every member of a group at once (all members of Congress, all counties, etc.). 
  • The odds are good that at least some will be outliers, and you can write about them.
  • If you can make web apps, you can make something that allows readers across the country to localize your findings--essentially writing 50 or more stories at once.

Government salaries

mean bosses in congress

what computer-assisted reporting is 

  • It's been around for 30 years, and its output is often traditional print investigative or feature stories with quotes and all the rest, and maybe some charts
  • That meant you did a lot of work that never made it in the paper, so these days, robust "news apps" interactives are optional and can let you show your audience more. Before, you had to write for a general audience, but that lets you selectively display sub-stories based on niche audiences' interests
  • Right now, there are TONS of jobs for journalists with real data and programming skills

What it isn't

  • Being "good at Twitter and Facebook"
  • Knowing how to add links to stories with HTML
  • Fixing people's broken computers
  • Being a "web producer" who uses admin interfaces to post others' stories to the home page and add in links

more positively, what it isn't

  • Something you need a formal education to do.
  • Something you should be afraid of. There's a lot to learn, but you do it a little at a time, to solve immediate problems that you have in front of you.
  • The only thing you have to fear is fear itself. There is no excuse for the old saying, "I became a journalist because I can't do math." You can, and you must, or you are doing your readers a disservice.
  • Something you're behind on. Technology changes all the time, so people are always learning.
  • There is a great email list full of helpful people, called NICAR-L. You can sign up at ire.org.

things to learn: GETTING DATA

  • Create your own: Take notes in Excel or Google Spreadsheets, using columns appropriately, when information takes a certain format.
  • This can be information from interviews, web research, or documents. Staying organized and working in a structured way is important.
  •  Numbers in their own columns
  • Define and apply categories... Make sure the spelling is the exact same each time so you can group by them.

Applying your own "cat'(EGORY)

GETTING DATA: NATIONAL POLITICS

  • Legislation: Federal: govtrack.us State: openstates.org
  • Campaign finance and lobbying: Federal: OpenSecrets.org State: followthemoney.org
  • Federal contracts: USASpending.gov
  • Census: censusreporter.org
  • Nonprofits: CitizenAudit.org
  • Unions: UnionFacts.org
  • Farm subsidies: farm.ewg.org
  • Search for miscellaneous federal data: Data.gov


GETTING DATA: THROUGH FOIA

  • Format of data: No matter how you get data, the best format to get it in is the simplest, CSV, because it can be opened by any program. The WORST is PDFs.
  • FOIA requires government to give you data in any format they have it in. They definitely have it in structured format (SQL, CSV), but you really have to fight to make sure they give it to you that way, partly due to ignorance of FOIA officers.
  • If you find a blank government form that would contain interesting information, then you know there's a dataset somewhere.

things to learn: Databases. 


  • Excel can do simple tasks with single tables.
  • Databases--the SQL language--is the next step.
  • Databases can handle larger tables and more importantly, usually involve multiple inter-connected tables.
  • Microsoft Access is a visual version of SQL that you may have on your computer as part of Microsoft Office, and if you do, you can do 95% of computer-assisted reporting there.
  • Google Fusion Tables has some SQL features and also allows you to easily display info on the web.

OPTIONAL THINGS TO LEARN: PROGRAMMING

  • All you need is the Python language. learnpython.org
  • Automate redundant tasks. It's like having 100s of interns!
  • Programming primarily comes in handy to get data into a format good for databases.
  • "Scraping" means hitting different pages of a web site thousands of times and using patterns to extract the data from HTML and put it into fields.
  • You can use advanced algorithms to detect suspicious things.
  • You can also use it to power the "back end" of interactive features for newspaper web sites. Look into Django, which is a way to use Python, to create large data-driven web sites. 

Optional things to learn: FRONT-END WEB DESIGN

  • This means HTML and Javascript, often with the help of Javascript libraries like D3.js
  • Interactive graphics


SQL concepts

Regardless of whether you're writing SQL commands manually or using a tool like Microsoft Access, the commands are SELECT columns, filtering with WHERE, ORDERing, GROUPing and JOINING. 

Computing the honor roll:
  • SELECT firstname, lastname FROM students WHERE gpa>3 ORDER BY gpa DESC;

Wildcard "LIKE"s to show students whose names start with JA:
SELECT firstname, lastname FROM students WHERE name LIKE 'ja*';

USING 'GROUP BY" TO SHOW NEPOTISM

SELECT union_name, last_name, COUNT(*) 
FROM union_salaries 
GROUP BY union_name, last_name
ORDER BY COUNT(*);


  • Grouping by is also called 'aggregating.' You can do it in Excel with the Pivot Tables function, or in Google Fusion Tables with the Summarize function.
  • It means lumping things together into groups based on one or more values, and computing some number for the whole group, such as averaging or totaling a numeric value, or just giving you the number of rows in that category.
  • Big tables won't seem so overwhelming when you realize that you'll probably want to condense them into much smaller and more illuminating summary tables this way.


Finding fake mailing addresses used by hundreds of companies to gain preferences:

SELECT address, city, COUNT(*) FROM contracts GROUP BY address, city ORDER BY COUNT(*) DESC LIMIT 10;

JOINING

A ton of my best stories are literally just comparing two lists together and finding interesting relationships. For example, a list of votes where members of Congress were absent, and a list of fundraisers that they attended.

These lists can be millions of rows long, but you can do it in milliseconds. It's important to have a unique identifier for concepts... in other words, if both tables represent members of Congress, do they format the names the same way? Ideally they'll both use the same numerical ID.


SELECT member_id, member_name, date, vote_name, fundraiser_location
FROM missed_votes INNER JOIN fundraisers
ON (missedvotes.member_id=fundraisers.member_id AND missedvotes.date=fundraisers.date)

Or in Microsoft Access:

SQL specifics

  • Definitely use Microsoft Access if you have it. Import external data and drag the tables onto the pane, and go to the Design tab. Select your fields and put in WHERE and ORDER criteria at the bottom, and visually join fields on different tables together to do JOINs.
  • Otherwise, use SqLite, which you can download as a Mozilla Firefox plug-in and other easy ways.
  • There are also PostgreSQL and MySQL. The language for all of them is basically exactly the same. 
  • Do this tutorial: http://www.peteraldhous.com/CAR/Database_SQLite.pdf

with great power comes great responsibility

  • Data is often messy. You may have to clean it up. For example, phone company employees listing their jobs as AT&T, A T T, AT+T, A T and T. The computer will not know these are the same. Use the tool Open Refine to "standardize" them.
  • You need to learn about the topic the data is representing. That means asking questions, calling people, acquiring domain knowledge.
  • If something seems so extreme that it would make for a crazy story, there's a good chance it's some kind of error instead.

You can do this

  • Seriously, all you need to know is SQL. You can forget everything else.
  • Or if another component interests you, start there. Find a topic that interests you first, and then simply learn whatever you have to do to pursue that data story. Do that a bunch of times and you know everything!
  • You will be able to get lots of jobs if you have data and/or programming skills.
  • But that's not why you should do it, you should do it because you can root out fraud and send people to jail with it.

contact me

@lukerosiak
luke@rosiak.us
www.rosiak.co

This presentation: 
slides.com/lukerosiak/

Investigative reporting with databases

By Luke Rosiak

Investigative reporting with databases

  • 1,513