New York City Directories, 1850-1890

Nicholas Wolf

New York University

NYPL Space/Time Directory

NYU/NYPL Team (Bert Spaan, Stephen Balogh, Nick Wolf)
Processing of 1849-1879 directories (largely Trow, Wilson):
- OCR (Tesseract 4)
- Conditional Random Fields supervised ML to parse subject/occupation/address fields
- Geocoded against Space/Time "addresses" dataset, street addresses derived from insurance addresses, 1850s
- 4.7 million entries, 1.3 million geocoded
- See http://spacetime.nypl.org/#data

NYU Refine and Expand

NYU Team (Nick Wolf, Wesley Chioh)
Seek better accuracy and cover 1850-1890
Results:
- 7.9 million entries
- 5.7 million entries (73%) with highest-score occupation and address
- 7.6 million (96%) of all occupations at highest score
- 7.8 million (78%) of all addresses at highest score
Flags for individuals marked "widow" (647,911) and "colored" (10,253)
Data available at http://hdl.handle.net/2451/61521

Better Column/Indent Detection

Old approach: detection by x-value only, using nearest Cartesian distance (+ k-means clustering)

New approach: x- and y-values together determine head line versus indent line (+ k-means clustering)
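
The clustering step shared by both approaches can be sketched as a one-dimensional k-means over the left-edge x positions of OCR'd lines. This is a minimal illustration, not the project's code: the function name `kmeans_1d`, the pixel values, and the choice of two clusters are all assumptions, and the y-value logic of the new approach is not specified on the slide, so it is not reproduced here.

```python
from statistics import mean


def kmeans_1d(values, k=2, iterations=20):
    """Cluster scalar values into k >= 2 groups; returns (centroids, labels)."""
    ordered = sorted(values)
    # Spread the initial centroids evenly across the sorted values.
    centroids = [ordered[(len(ordered) - 1) * i // (k - 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign every value to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return centroids, labels


# Hypothetical left-edge x positions (pixels) of lines in one directory column.
line_x = [102, 100, 131, 99, 103, 129, 101, 133]
centroids, labels = kmeans_1d(line_x, k=2)

# Assumption: the cluster with the smaller x centroid holds the head lines.
head_cluster = min(range(2), key=lambda c: centroids[c])
kinds = ["head" if lab == head_cluster else "indent" for lab in labels]
```

With clean input like this, an x-only split is enough. Skewed or warped page scans are presumably where the y-value becomes necessary.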

Postprocessing Using Bi-Gram Fingerprinting

"physician", Bi-gram >> "ancihyiaicphsiys"

"ph'ysician", Bi-gram >> "ancihyiaicphsiys"

{ "ancihyiaicphsiys": [
    "physician", … "physician" … "physician", x 9000
    "ph'ysician",
    "ph%ysician @",
  ]
}
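
A bi-gram fingerprint of this kind can be sketched as follows: normalize the string, collect its unique two-character sequences, sort them, and join them into a key. The function names and the normalization rule (lowercase, drop everything except letters and digits) are assumptions chosen to reproduce the keys shown on the slide. The project's actual cleaning rules may differ.

```python
import re
from collections import defaultdict


def bigram_fingerprint(text):
    """Return the sorted, de-duplicated bi-grams of the normalized text."""
    # Assumed normalization: lowercase, keep only letters and digits.
    cleaned = re.sub(r"[^a-z0-9]", "", text.lower())
    bigrams = {cleaned[i:i + 2] for i in range(len(cleaned) - 1)}
    return "".join(sorted(bigrams))


def cluster_by_fingerprint(values):
    """Group raw OCR strings that share the same fingerprint key."""
    clusters = defaultdict(list)
    for value in values:
        clusters[bigram_fingerprint(value)].append(value)
    return dict(clusters)


clusters = cluster_by_fingerprint(["physician", "ph'ysician", "ph%ysician @"])
# All three variants fall under the single key "ancihyiaicphsiys".
```

Because punctuation and spacing are stripped before the bi-grams are taken, stray OCR marks do not change the key, and the most frequent value in a cluster is a natural candidate for the corrected form.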

Iterative Levenshtein Matching

{ "ancihyiaicphsiys": [
    "physician",
    "ph'ysician",
    "ph%ysician @",
  ]
}

{ "anciiaiciiispisi": [
    "piisician",
    "_*pi isician",
    "&pii.sician",
  ]
}

"physician" vs. "piisician" (Levenshtein distance 2)
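
Matching across clusters can be sketched with a standard Levenshtein edit distance and a greedy pass from the most frequent value to the least. The slide does not say how the iteration proceeds, so `merge_clusters`, the `max_distance=2` threshold, and the example counts are all hypothetical.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def merge_clusters(counts, max_distance=2):
    """Map each cluster value to the most frequent value within max_distance."""
    accepted = []
    mapping = {}
    # Visit values from most to least frequent so common spellings win.
    for value in sorted(counts, key=counts.get, reverse=True):
        match = next(
            (a for a in accepted if levenshtein(value, a) <= max_distance), None
        )
        if match is None:
            accepted.append(value)
            mapping[value] = value
        else:
            mapping[value] = match
    return mapping


# Hypothetical frequencies for three cluster representatives.
counts = {"physician": 9000, "grocer": 4000, "piisician": 3}
mapping = merge_clusters(counts)
```

A fixed threshold is a blunt instrument on short strings, where two edits can turn one valid occupation into another, so a real pipeline would likely scale the cutoff with string length or review merges by hand.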

Assign Score

Score:

1 = occupation or address occurs only once across the whole corpus (very unlikely to be correct; almost certainly contains errors)

2 = occupation or address occurs only twice

15 = occupation or address occurs 15 or more times, and the final selected value for any swapped-in replacement could be confirmed by hand
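
A frequency-based score along these lines can be sketched by counting each value across the corpus and capping the count at 15. Two assumptions are built in: intermediate scores (3 through 14) are taken to equal the occurrence count, which the slide implies but does not state, and the by-hand confirmation required for a score of 15 is not modeled. The name `assign_scores` and the sample values are hypothetical.

```python
from collections import Counter

MAX_SCORE = 15  # the slide's top score: value occurs 15 or more times


def assign_scores(values):
    """Score each distinct value by its corpus frequency, capped at MAX_SCORE."""
    counts = Counter(values)
    # Assumption: scores between 2 and 15 simply equal the occurrence count.
    return {value: min(count, MAX_SCORE) for value, count in counts.items()}


# Hypothetical corpus of cleaned occupation strings.
corpus = ["physician"] * 9000 + ["grocer", "grocer"] + ["piisician"]
scores = assign_scores(corpus)
```

Under this reading, a value seen once scores 1 and is treated as a likely OCR error, while common values saturate at the cap.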