New York City Directories, 1850-1890
Nicholas Wolf
New York University
NYPL Space/Time Directory
- NYU/NYPL Team (Bert Spaan, Stephen Balogh, Nick Wolf)
- Processing of 1849-1879 directories (largely Trow, Wilson):
  - OCR (Tesseract 4)
  - Supervised machine learning (conditional random fields) to parse subject, occupation, and address fields
  - Geocoded against the Space/Time "addresses" dataset (street addresses derived from 1850s insurance atlases)
- 4.7 million entries, 1.3 million geocoded
- See http://spacetime.nypl.org/#data
NYU Refine and Expand
- NYU Team (Nick Wolf, Wesley Chioh)
- Goal: improve accuracy and extend coverage to 1850-1890
- Results:
  - 7.9 million entries
  - 5.7 million entries (73%) with highest-score occupation and address
  - 7.6 million/96% of all occupations at highest score
  - 7.8 million/78% of all addresses at highest score
  - Flags for individuals marked "widow" (647,911) and "colored" (10,253)
- Data available at http://hdl.handle.net/2451/61521
Better Column/Indent Detection
Old approach: detection from x-values only, using nearest Cartesian distance (+ k-means clustering)
New approach: x- and y-values together determine whether a line is a head line or an indent line (+ k-means clustering)
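As a rough illustration of the head line versus indent line idea, the sketch below clusters the left-edge x-values of OCR lines into two groups with a tiny one-dimensional k-means, then walks the lines top to bottom by y-value and attaches each indent line to the head line above it. The `(x, y, text)` input shape and the function names `kmeans_1d` and `merge_indented_lines` are assumptions made for this example. The slides do not show the project's actual code, so this is not that implementation.

```python
from statistics import mean


def kmeans_1d(values, k=2, iterations=20):
    """Tiny 1-D k-means (k >= 2); returns the cluster centers in ascending order."""
    ordered = sorted(values)
    # Seed the centers evenly across the sorted values.
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)


def merge_indented_lines(lines):
    """lines: (x, y, text) tuples for one column (hypothetical input shape).

    Lines whose x is closer to the larger cluster center are treated as
    indent (continuation) lines and appended to the previous head line.
    """
    head_x, indent_x = kmeans_1d([x for x, _, _ in lines], k=2)
    entries = []
    for x, _, text in sorted(lines, key=lambda line: line[1]):  # top to bottom
        is_indent = abs(x - indent_x) < abs(x - head_x)
        if is_indent and entries:
            entries[-1] += " " + text
        else:
            entries.append(text)
    return entries
```

Given one column where a single line starts noticeably further right than the others, that line is folded into the entry directly above it. A real page would first need its columns separated, which this sketch does not attempt.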
Postprocessing Using Bi-Gram Fingerprinting
"physician", Bi-gram >> "ancihyiaicphsiys"
"ph'ysician", Bi-gram >> "ancihyiaicphsiys"
{ "ancihyiaicphsiys": [
"physician", … "physician" … "physician", x 9000
"ph'ysician",
"ph%ysician @",
]
}
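The fingerprint keys above can be reproduced with a short sketch: lowercase the value, strip everything except letters and digits, collect the unique two-character sequences, sort them, and join them. The function names and the ASCII-only cleaning rule are assumptions made for illustration; the project's own normalization may differ.

```python
import re
from collections import defaultdict


def bigram_fingerprint(value):
    """Sorted, de-duplicated bi-grams of the cleaned value, joined into one key."""
    # Assumption: OCR noise is removed by keeping only a-z and 0-9.
    cleaned = re.sub(r"[^a-z0-9]", "", value.lower())
    bigrams = {cleaned[i:i + 2] for i in range(len(cleaned) - 1)}
    return "".join(sorted(bigrams))


def cluster_by_fingerprint(values):
    """Group raw OCR values under their shared bi-gram fingerprint."""
    clusters = defaultdict(list)
    for value in values:
        clusters[bigram_fingerprint(value)].append(value)
    return dict(clusters)
```

Because punctuation and spacing are dropped before the bi-grams are taken, "physician", "ph'ysician", and "ph%ysician @" all land under the same key, and the most frequent variant in the cluster can then be chosen as the replacement.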
Iterative Levenshtein Matching
{ "ancihyiaicphsiys": [
"physician",
"ph'ysician",
"ph%ysician @",
]
}
{ "anciiaiciiispisi": [
"piisician",
"_*pi isician",
"&pii.sician",
]
}
Compare cluster representatives: "physician" vs. "piisician" (Levenshtein distance 2)
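One way to sketch the iterative matching step is shown below: compute the Levenshtein distance between cluster representatives and repeatedly fold a smaller cluster into a larger one whenever the two representatives are close enough. Keying clusters by their representative value, the `max_distance=2` threshold, and the "larger cluster wins" rule are all assumptions made for this example, not details confirmed by the slides.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def merge_clusters(clusters, max_distance=2):
    """clusters: dict mapping a representative value to its raw variants.

    Repeats until no two representatives are within max_distance; the
    larger cluster absorbs the smaller one (assumed tie-breaking rule).
    """
    merged = {rep: list(variants) for rep, variants in clusters.items()}
    changed = True
    while changed:
        changed = False
        reps = sorted(merged, key=lambda r: len(merged[r]), reverse=True)
        for i, big in enumerate(reps):
            for small in reps[i + 1:]:
                if levenshtein(big, small) <= max_distance:
                    merged[big].extend(merged.pop(small))
                    changed = True
                    break
            if changed:
                break  # restart with the updated cluster sizes
    return merged
```

A fixed distance threshold is risky for short strings, where two edits can turn one valid occupation into another, so a production version would likely scale the threshold with string length or require a minimum cluster size before merging.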
Assign Score
Score
1 = occupation or address occurs only once across the whole corpus (very unlikely to be correct; almost certainly contains errors)
2 = occupation or address occurs only twice
3 through 14 = occupation or address occurs that many times
15 = occurs 15 or more times, and the final selected value for any swapped-in replacement could be confirmed by hand
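The scoring rule can be sketched as a frequency count capped at 15. The slide does not say what score a frequent swapped-in replacement receives when it could not be confirmed by hand, so holding it at 14 is purely an assumption of this example, as are the function and parameter names.

```python
from collections import Counter

MAX_SCORE = 15


def assign_scores(values, replaced=frozenset(), confirmed=frozenset()):
    """Map each distinct value to a score from 1 to 15 based on corpus frequency.

    replaced:  values that were swapped in during postprocessing.
    confirmed: replaced values that were verified by hand.
    """
    scores = {}
    for value, count in Counter(values).items():
        score = min(count, MAX_SCORE)
        # Assumption: an unconfirmed replacement cannot reach the top score.
        if score == MAX_SCORE and value in replaced and value not in confirmed:
            score = MAX_SCORE - 1
        scores[value] = score
    return scores
```

Scores would be computed separately for the occupation field and the address field, so that an entry can be filtered on either one or on both together.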