New York City Directories, 1850-1890

Nicholas Wolf

New York University

NYPL Space/Time Directory

  • NYU/NYPL Team (Bert Spaan, Stephen Balogh, Nick Wolf)
  • Processing of 1849-1879 directories (largely Trow, Wilson):
    • OCR (Tesseract 4)
    • Conditional Random Fields supervised ML to parse subject/occupation/address fields
    • Geocoded against the Space/Time "addresses" dataset (street addresses derived from 1850s insurance addresses)
    • 4.7 million entries, 1.3 million geocoded
    • See http://spacetime.nypl.org/#data

NYU Refine and Expand

  • NYU Team (Nick Wolf, Wesley Chioh)
  • Goals: improve accuracy and extend coverage to 1850-1890
  • Results:
    • 7.9 million entries
    • 5.7 million entries (73%) with highest-score occupation and address
    • 7.6 million/96% of all occupations at highest score
    • 7.8 million/78% of all addresses at highest score
  • Flags for individuals marked "widow" (647,911) and "colored" (10,253)
  • Data available at http://hdl.handle.net/2451/61521

Better Column/Indent Detection

Old approach: detection from x-values only, using nearest Cartesian distance (+ k-means clustering)

New approach: x- and y-values together determine whether a line is a head line or an indent line (+ k-means clustering)
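
As a rough illustration of the head line versus indent line idea, the sketch below clusters the left-edge x-values of OCR lines with a two-cluster 1-D k-means, then uses y-order to attach each indent line to the head line above it. The (x, y, text) line format, the function names, the sample lines, and the two-cluster setup are all assumptions made for illustration; the actual pipeline is not shown in this deck.

```python
def kmeans_1d(values, k=2, iters=50):
    """Plain 1-D k-means (k >= 2). Returns (centers, labels)."""
    lo, hi = min(values), max(values)
    # Spread the initial centers evenly between the min and max value.
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign every value to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


def group_entries(lines):
    """lines: list of (x, y, text) tuples. Returns merged entry strings."""
    _, labels = kmeans_1d([x for x, _, _ in lines])
    # Assumption: label 0 (leftmost cluster) = head line, label 1 = indent line.
    tagged = sorted(zip(lines, labels), key=lambda t: t[0][1])  # top to bottom
    entries = []
    for (_, _, text), label in tagged:
        if label == 1 and entries:
            entries[-1] += " " + text  # continuation of the previous head line
        else:
            entries.append(text)
    return entries


# Hypothetical OCR lines: (left-edge x, top y, text)
sample_lines = [
    (102, 10, "Smith John, physician, 12"),
    (131, 22, "Broadway"),
    (100, 34, "Smith Mary, wid. Peter, h 4 Ann"),
    (101, 46, "Smyth Thos. clerk, h 9 Pearl"),
]
entries = group_entries(sample_lines)
```

Here the x-values separate the two line types, while the y-values only supply reading order; a fuller version might also cluster on y to detect line spacing or column breaks.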

Postprocessing Using Bi-Gram Fingerprinting

"physician", Bi-gram >> "ancihyiaicphsiys"

"ph'ysician", Bi-gram >> "ancihyiaicphsiys"

 

 

{ "ancihyiaicphsiys": [
      "physician",       (x 9000)
      "ph'ysician",
      "ph%ysician @"
  ]
}
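
A minimal sketch of bi-gram fingerprinting, consistent with the "physician" example above: lowercase the string, keep letters only, collect the distinct two-letter sequences, sort them, and join. The function names and the exact normalization (dropping everything except a-z) are assumptions; the deck does not specify how punctuation and spaces are handled.

```python
import re
from collections import defaultdict


def bigram_fingerprint(text):
    """Lowercase, keep letters only, then join the sorted set of bi-grams."""
    letters = re.sub(r"[^a-z]", "", text.lower())
    bigrams = {letters[i:i + 2] for i in range(len(letters) - 1)}
    return "".join(sorted(bigrams))


def cluster_by_fingerprint(values):
    """Group raw OCR strings that share the same fingerprint."""
    clusters = defaultdict(list)
    for value in values:
        clusters[bigram_fingerprint(value)].append(value)
    return dict(clusters)


clusters = cluster_by_fingerprint(["physician", "ph'ysician", "ph%ysician @"])
```

Because stray punctuation and spacing are discarded before the bi-grams are taken, all three variants land under the single key "ancihyiaicphsiys".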

Iterative Levenshtein Matching

{ "ancihyiaicphsiys": [
      "physician",
      "ph'ysician",
      "ph%ysician @"
  ]
}

{ "anciiaiciiispisi": [
      "piisician",
      "_*pi isician",
      "&pii.sician"
  ]
}

physician  <->  piisician   (Levenshtein distance: 2)
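
The matching between the two clusters can be sketched as follows: compute the Levenshtein (edit) distance between a rare value and each common value, and swap in the nearest common value when the distance is small. This shows a single pass only; the iteration scheme, the thresholds `max_distance` and `min_common`, the function names, and the sample counts for "piisician" and "clerk" are assumptions for illustration.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def merge_rare_into_common(counts, max_distance=2, min_common=100):
    """counts: {value: frequency}. Map each rare value to its nearest common value."""
    common = [v for v, n in counts.items() if n >= min_common]
    replacements = {}
    for value, n in counts.items():
        if n >= min_common:
            continue
        best = min(common, key=lambda c: levenshtein(value, c), default=None)
        if best is not None and levenshtein(value, best) <= max_distance:
            replacements[value] = best
    return replacements


# Hypothetical frequencies for cleaned cluster values
counts = {"physician": 9000, "piisician": 3, "clerk": 5000}
replacements = merge_rare_into_common(counts)
```

Running the pass again with a looser threshold on whatever remains unmatched would be one way to make it iterative.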

Assign Score

Score

1 = occupation or address occurs only once across the whole corpus (very unlikely to be correct; almost certainly contains errors)

2 = occupation or address occurs only twice

3-14 = occupation or address occurs that many times

15 = occurs 15 or more times, and the final selected value for any swapped-in replacement could be confirmed by hand
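
A simplified sketch of the frequency-based score: count how often each value occurs across the corpus and cap the result at 15. The function name and sample values are hypothetical, and the hand-confirmation requirement attached to a score of 15 is not modeled here.

```python
from collections import Counter

MAX_SCORE = 15  # cap taken from the scoring scale above


def assign_scores(values):
    """Score each distinct value by its corpus frequency, capped at MAX_SCORE."""
    counts = Counter(values)
    return {value: min(n, MAX_SCORE) for value, n in counts.items()}


# Hypothetical corpus of occupation strings
scores = assign_scores(["physician"] * 20 + ["clerk"] * 2 + ["piisician"])
```

In this toy corpus "physician" reaches the cap, "clerk" scores 2, and the one-off "piisician" scores 1, marking it as the value most likely to contain an OCR error.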
