finding the bad guys

Tricks for joining lists of people



Adam Playford, Newsday (@adamplayford)

Michael LaForgia, Tampa Bay Times (@laforgia_)




NICAR 2013



Why?



  1. Gives your story sweep, precision
  2. Great stories built on this technique



Whiz-bang tech is cool, but we got into this
to hold people accountable


WHY HARD?


Lack of whiz-bang tech:
Existing tools are not great at this.

Two competing masters:
  • Don't want to miss great examples.
  • Also don't want to screw up.


What we want to do



Not going to teach you how to use everything

  • Access, Python classes start 9 a.m. tomorrow
  • IRE boot camps, codecademy.com

Let's talk about principles.


not a technology problem



a journalism problem

that requires technology




EXCEPT...



ONE TECHNICAL PROBLEM YOU CAN'T AVOID:

A JOIN CAN ALWAYS ADD ROWS.
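A small sketch of how this happens, using Python's built-in sqlite3 module. The table names, column names and records here are invented for illustration: one employee list, one convictions list where the same name appears twice.

```python
import sqlite3

# In-memory database with two made-up lists of people.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (first TEXT, last TEXT);
CREATE TABLE convictions (first TEXT, last TEXT, charge TEXT);
INSERT INTO employees VALUES ('John', 'Smith'), ('Ann', 'Lee');
INSERT INTO convictions VALUES
  ('John', 'Smith', 'fraud'),
  ('John', 'Smith', 'theft');
""")

# Count the employees before the join.
rows_before = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]

# Count the rows after joining on name. John Smith matches two
# conviction rows, so he now appears twice.
rows_after = conn.execute("""
    SELECT COUNT(*)
    FROM employees e
    LEFT JOIN convictions c
      ON e.first = c.first AND e.last = c.last
""").fetchone()[0]

print(rows_before, rows_after)
```

Two employees go in and three rows come out. Counting rows before and after every join is a cheap way to catch this.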




FIRST STEPS:

ZERO CLICKING



WHAT DOES EACH

ROW REPRESENT?




WHAT FIELDS DOES EACH ROW

CONTAIN?

UNIQUE ID


Social Security # (SSN)
Gov't employee ID


Great if you have it in both data sets.

But that never happens.

:(


FIRST/LAST


  • John Smith
  • Nicknames
  • Married women/Ron Artest


middle

  • So many wrinkles.
  • In Access/SQL, lose-lose.
  • In Excel, sometimes easier.


=IF(NOT(ISBLANK(B1)),B1=C1,"")
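For those working in a script rather than Excel, a rough Python equivalent might look like the sketch below. The function name is hypothetical, and two behaviors go beyond the formula above and are assumptions: it skips the comparison when either middle name is blank, and it treats an initial as agreeing with a full name that starts with the same letter.

```python
def middle_agrees(a, b):
    """Return True/False when both middles are present, None when either is blank."""
    a = (a or "").strip().rstrip(".").upper()
    b = (b or "").strip().rstrip(".").upper()
    if not a or not b:
        return None  # nothing to compare; do not count this as a mismatch
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]  # initial vs. full name: compare first letters only
    return a == b
```

Returning None for blanks keeps "no information" separate from "disagrees," which matters when you later decide which matches to check by hand.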


ADDRESS


  • 31 Main Avenue
  • 31 Main Ave
  • 31 Main Av Apt. 33
  • 31 Main Av.#33

  1. First digits.
  2. If you can script, cleaning program.
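The "first digits" idea can be sketched in a few lines of Python. The function name is made up for this example, and it assumes the house number is the first thing in the address field, which will not hold for every data set (P.O. boxes, for one).

```python
import re

def house_number(address):
    """Return the leading run of digits in an address, or "" if there is none."""
    match = re.match(r"\s*(\d+)", address or "")
    return match.group(1) if match else ""

# The four spellings from the list above all reduce to the same number.
addresses = ["31 Main Avenue", "31 Main Ave", "31 Main Av Apt. 33", "31 Main Av.#33"]
numbers = [house_number(a) for a in addresses]
print(numbers)
```

A fuller cleaning program would also standardize street suffixes and unit numbers; this only covers step 1.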


DOB


  • Great if you have it (move to Florida)
  • But not enough to avoid John Smith


race/sex


Not usually very useful

Except when it is.


Zip


Actually quite helpful.

Zip structure: More digits, more specific



combinations of fields



first, last, DOB


  • Handles most John Smiths but not all



first, last & address


  • Father/Son things.
  • This happens a LOT.



first, last, address & dob


  • The closest you get to perfect.
  • Still may be imperfect.
  • Only as good as your data.



partial matches

can be interesting, too



first, address & dob


  • Women who've changed names.



last, sex and county


  • Familial relations.


lots of combinations.


all do different things.


journalism problem.
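One way to try several combinations at once is to run the same matching loop over different sets of fields and label each hit with the combination that produced it. This is only a sketch: the field names, the PASSES table and the function names are all assumptions, and it expects each record to be a dict of already-cleaned strings.

```python
from collections import defaultdict

# Each pass is a label plus the fields that must all agree.
PASSES = {
    "first+last+dob": ("first", "last", "dob"),
    "first+last+address": ("first", "last", "address"),
    "first+address+dob": ("first", "address", "dob"),
}

def key_for(record, fields):
    """Build an uppercase match key, or None if any field is blank."""
    values = tuple((record.get(f) or "").strip().upper() for f in fields)
    return values if all(values) else None

def match_passes(list_a, list_b, passes=PASSES):
    """Return (label, record_a, record_b) for every hit in every pass."""
    results = []
    for label, fields in passes.items():
        # Index list_b by this pass's key so each lookup is fast.
        index = defaultdict(list)
        for b in list_b:
            k = key_for(b, fields)
            if k:
                index[k].append(b)
        for a in list_a:
            k = key_for(a, fields)
            for b in index.get(k, []):
                results.append((label, a, b))
    return results
```

Keeping the label on every hit lets you sort the results by how strong the match is before you start reporting them out.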







TWO tricks to try




address by digits & first-3 zip
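A minimal sketch of this trick in Python. The function name is hypothetical, and it assumes ZIP codes are stored as text (ZIPs stored as numbers lose their leading zeros and need fixing first).

```python
import re

def address_key(address, zip_code):
    """Return (house number, first three ZIP digits), or None if either is missing."""
    digits = re.match(r"\s*(\d+)", address or "")
    zip_digits = re.sub(r"\D", "", zip_code or "")  # drop hyphens, spaces
    if not digits or len(zip_digits) < 3:
        return None
    return (digits.group(1), zip_digits[:3])

# Different spellings and ZIP formats produce the same key.
print(address_key("31 Main Avenue", "11747"))
print(address_key("31 Main Av.#33", "11747-1234"))
```

The key is deliberately loose, so it works best combined with a name or DOB rather than on its own.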




name matches by commonness


Weight uncommon names
higher than common names.

  • Social Security Administration: First names [link]
  • Census: Last names [link]

We are surprisingly bad at guessing
whether a name is common.
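One possible way to turn this into a number is sketched below in Python. The counts are made-up placeholders, not real Social Security or Census figures; in practice you would load the real frequency tables. The function name, the per-100,000 scale and the log scoring are all assumptions chosen for illustration.

```python
import math

# MADE-UP numbers for illustration only: people per 100,000 with each name.
FIRST_PER_100K = {"JOHN": 3200, "ZEBULON": 1}
LAST_PER_100K = {"SMITH": 880, "QUACKENBUSH": 1}

def rarity_score(first, last, default_per_100k=1):
    """Higher score = rarer name. Names missing from the tables are treated as rare."""
    f = FIRST_PER_100K.get(first.strip().upper(), default_per_100k)
    l = LAST_PER_100K.get(last.strip().upper(), default_per_100k)
    # Sum the negative log of each frequency, so rare names score high.
    return -math.log10(f / 100_000) - math.log10(l / 100_000)

common = rarity_score("John", "Smith")
rare = rarity_score("Zebulon", "Quackenbush")
print(common < rare)
```

Sorting name-only matches by a score like this puts the ones most likely to be the same person at the top of the list to check.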



and then:

do reporting

inmate visitors



CANTEEN




letter to the judge


to the bar!



adam.playford@newsday.com
@adamplayford

mlaforgia@tampabay.com
@laforgia_

http://bit.ly/badguys2013
