Nicholas Wolf
New York University
Old approach: detection from x-values only, using nearest Cartesian distance (+ k-means clustering)
New approach: x- and y-values used to distinguish head lines from indented lines (+ k-means clustering)
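As a rough illustration of the clustering step, the sketch below splits lines into "head" and "indent" groups with a basic two-cluster k-means. The slides do not give the actual features or implementation, so everything here is an assumption: the function name `two_means_1d`, the use of each OCR line's left-edge x position alone, and the sample coordinates are all hypothetical.

```python
def two_means_1d(values, iterations=20):
    """Split 1-D values into two clusters with a basic k-means loop."""
    lo, hi = min(values), max(values)  # initial centroids: the two extremes
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign each value to the nearer centroid (0 = left, 1 = right).
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        left = [v for v, lab in zip(values, labels) if lab == 0]
        right = [v for v, lab in zip(values, labels) if lab == 1]
        if not left or not right:
            break  # everything fell into one cluster; nothing to split
        new_lo, new_hi = sum(left) / len(left), sum(right) / len(right)
        if (new_lo, new_hi) == (lo, hi):
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    return labels


# Hypothetical left-edge x positions (pixels) of OCR lines in one column.
left_edges = [102, 100, 131, 99, 133, 130, 101]
labels = two_means_1d(left_edges)
# Assumption: the cluster further left holds head lines, the other indents.
line_types = ["head" if lab == 0 else "indent" for lab in labels]
print(line_types)
```

A version closer to the new approach would add a y-based feature (for example, the vertical gap to the previous line) before clustering; which feature the project used is not stated here.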
"physician", Bi-gram >> "ancihyiaicphsiys"
"ph'ysician", Bi-gram >> "ancihyiaicphsiys"
{ "ancihyiaicphsiys": [
"physician", … "physician" … "physician", x 9000
"ph'ysician",
"ph%ysician @",
]
}
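The key above can be reproduced with a short sketch, assuming it is built by lowercasing the string, dropping everything that is not a letter or digit, collecting the distinct two-character sequences, sorting them, and joining them. The function name `bigram_fingerprint` is a hypothetical label, not taken from the slides.

```python
import re


def bigram_fingerprint(text):
    """Return the sorted, de-duplicated bigrams of the cleaned string."""
    # Lowercase and strip punctuation, symbols, and whitespace.
    cleaned = re.sub(r"[^a-z0-9]", "", text.lower())
    # Collect each distinct pair of adjacent characters.
    bigrams = {cleaned[i:i + 2] for i in range(len(cleaned) - 1)}
    return "".join(sorted(bigrams))


variants = ["physician", "ph'ysician", "ph%ysician @"]
keys = {v: bigram_fingerprint(v) for v in variants}
print(keys)  # all three variants map to "ancihyiaicphsiys"
```

Because OCR noise such as stray apostrophes or symbols is removed before the bigrams are taken, all three variants collapse to the same key.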
{ "ancihyiaicphsiys": [
"physician",
"ph'ysician",
"ph%ysician @",
]
}
{ "anciiaiciiispisi": [
"piisician",
"_*pi isician",
"&pii.sician",
]
}
physician
piisician
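The two bare values above read as the representative chosen for each cluster. The slides do not say how that value is picked; the sketch below assumes the most frequent variant in a cluster is swapped in for the others. The helper names and the count for "piisician" are hypothetical.

```python
import re
from collections import Counter, defaultdict


def bigram_fingerprint(text):
    """Sorted, de-duplicated bigrams of the lowercased alphanumeric text."""
    cleaned = re.sub(r"[^a-z0-9]", "", text.lower())
    return "".join(sorted({cleaned[i:i + 2] for i in range(len(cleaned) - 1)}))


def cluster_by_fingerprint(values):
    """Group raw strings by fingerprint, counting each distinct variant."""
    clusters = defaultdict(Counter)
    for value in values:
        clusters[bigram_fingerprint(value)][value] += 1
    return clusters


def build_replacements(clusters):
    """Map every variant to the most frequent variant of its cluster."""
    replacements = {}
    for counts in clusters.values():
        canonical = counts.most_common(1)[0][0]  # assumption: majority wins
        for variant in counts:
            replacements[variant] = canonical
    return replacements


observed = (
    ["physician"] * 9000 + ["ph'ysician", "ph%ysician @"]
    + ["piisician"] * 3 + ["_*pi isician", "&pii.sician"]  # hypothetical counts
)
clusters = cluster_by_fingerprint(observed)
replacements = build_replacements(clusters)
print(replacements["ph%ysician @"], replacements["&pii.sician"])
```

Note that "physician" and "piisician" still land in separate clusters under this scheme; merging them would need a further step (for example, an edit-distance comparison between cluster representatives), which is not sketched here.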
Score
1 = occupation or address occurs only once across whole corpus (very unlikely, almost certainly has errors)
2 = occupation or address occurs only twice
3 to 14 = occurs that many times
15 = occurs 15 or more times, and final selected value for any swapped-in replacements could be confirmed by hand
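Read literally, the scale is the corpus-wide occurrence count capped at 15. A minimal sketch under that reading follows; the names `MAX_SCORE` and `confidence_score` and the sample values are hypothetical, and the hand-confirmation condition attached to the top score is not modeled.

```python
from collections import Counter

MAX_SCORE = 15  # top of the scale: 15 or more occurrences


def confidence_score(occurrences):
    """Score a value by how often it occurs across the corpus, capped at 15."""
    if occurrences < 1:
        raise ValueError("a scored value must occur at least once")
    return min(occurrences, MAX_SCORE)


# Hypothetical occupation values after cleaning.
values = ["physician"] * 20 + ["laborer"] * 2 + ["piisician"]
counts = Counter(values)
scores = {value: confidence_score(n) for value, n in counts.items()}
print(scores)
```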