Prof. Andrea Gallegati
Prof. Dario Abbondanza
Google still finds the right results.
This magic happens thanks to fuzzy search — a way to find things even when they don’t match exactly.
"Jon"
"John"
"Jonathan"
"Joan"
compare words based on similarity:
thefuzz Library
pip install thefuzz
from thefuzz import fuzz
similarity = fuzz.ratio("Jon", "John")
print("Similarity Score:", similarity)
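To make the score more concrete, the sketch below ranks the other spellings from the list against "Jon". It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in so that it runs without installing anything. That substitution and the helper name `simple_ratio` are assumptions made for illustration, and thefuzz may return slightly different numbers.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Return a 0-100 similarity score (hypothetical stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


query = "Jon"
candidates = ["John", "Jonathan", "Joan"]

# Score every candidate, then sort from most to least similar.
ranked = sorted(((simple_ratio(query, c), c) for c in candidates), reverse=True)
for score, name in ranked:
    print(f"{name}: {score}")
```

Short one-letter variations such as "John" and "Joan" land near the top, while "Jonathan" scores lower because most of its characters have no counterpart in "Jon".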
find_best_matches uses fuzzy search to find the best match between search_names and features.
Parameters: search_names, features, key_field, threshold, scorer.
It skips "n/a" (if present), compares the key_field values with search_names, and keeps matches with score > threshold
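The slides name the inputs of `find_best_matches` but do not show its body, so the following is only a guess at how such a function might look. The signature, the GeoJSON-like layout of `features`, the default threshold of 80, the direction of the lookup (search name to feature), and the stdlib fallback `default_scorer` are all assumptions. Any callable that returns a 0-100 score, such as `fuzz.token_set_ratio` from thefuzz, could be passed as `scorer` instead.

```python
from difflib import SequenceMatcher


def default_scorer(a, b):
    """Stand-in scorer (assumption): 0-100 similarity via difflib."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def find_best_matches(search_names, features, key_field="name",
                      threshold=80, scorer=default_scorer):
    """Map each search name to its best-scoring feature above the threshold."""
    # Skip the "n/a" placeholder if it is present.
    names = [n for n in search_names if n != "n/a"]
    matches = {}
    for name in names:
        best_feature, best_score = None, threshold
        for feature in features:
            value = feature.get("properties", {}).get(key_field)
            if not value:
                continue
            score = scorer(name, value)
            # Keep a candidate only if it beats the threshold and the best so far.
            if score > best_score:
                best_feature, best_score = feature, score
        if best_feature is not None:
            matches[name] = (best_feature, best_score)
    return matches


features = [
    {"properties": {"name": "Washington Sq. Park"}},
    {"properties": {"name": "Plaza Central"}},
]
result = find_best_matches(["Washington Square Park", "n/a"], features)
for name, (feature, score) in result.items():
    print(name, "->", feature["properties"]["name"], score)
```

Taking the scorer as a parameter keeps the matching loop independent of any one scoring method, which matters because the scorers behave quite differently on the same names.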
thefuzz provides multiple scoring methods

Scorer | How It Works | Weaknesses |
---|---|---|
ratio | Levenshtein Distance (simple character similarity) | Sensitive to order, whitespace, and minor typos. |
partial_ratio | Similarity between a substring and the full string | Can return high scores for misleading matches. |
token_sort_ratio | Sorts words alphabetically before comparison. | Sensitive to missing or extra words. |
token_set_ratio | Like above, but removes duplicate words. | Can overcorrect in some cases. |
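To show the idea behind each scorer in the table, here is a simplified, dependency-free sketch. It is not thefuzz's implementation: the character comparison uses `difflib.SequenceMatcher` from the standard library, the helper `_sorted_tokens` is hypothetical, and the real library applies extra preprocessing, so exact scores may differ.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Plain character similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio(a, b):
    """Best score of the shorter string against any same-length window of the longer."""
    short, long_ = sorted((a, b), key=len)
    if not short:
        return 0
    windows = (long_[i:i + len(short)] for i in range(len(long_) - len(short) + 1))
    return max(ratio(short, w) for w in windows)


def _sorted_tokens(text):
    """Lowercase the text, split it into words, and rejoin them in sorted order."""
    return " ".join(sorted(text.lower().split()))


def token_sort_ratio(a, b):
    """Sort the words of each string, then compare."""
    return ratio(_sorted_tokens(a), _sorted_tokens(b))


def token_set_ratio(a, b):
    """Compare the shared words against each string's leftover words."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    common = " ".join(sorted(ta & tb))
    with_a = (common + " " + " ".join(sorted(ta - tb))).strip()
    with_b = (common + " " + " ".join(sorted(tb - ta))).strip()
    return max(ratio(common, with_a), ratio(common, with_b), ratio(with_a, with_b))


print(ratio("Plaza Central", "Central Plaza"))
print(token_sort_ratio("Plaza Central", "Central Plaza"))
print(partial_ratio("Jon", "Jonathan"))
print(token_set_ratio("Central Plaza", "Central Plaza Hotel"))
```

The last call also hints at the weakness listed for `token_set_ratio`: when one name's words are a subset of the other's, the score is perfect even though the names differ.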
Use Case | Best Scorer | Reason |
---|---|---|
Simple typos and short names | ratio | Fast, general-purpose. |
Checking if a name is part of another | partial_ratio | Works well for abbreviations. |
Handling word order variations | token_sort_ratio | Fixes swapped words. |
Matching complex names (with extra/missing words) | token_set_ratio | Best for multi-word names. |
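find_best_matches?
import re

def normalize_text(text):
    # Keep only ASCII letters, digits, and spaces, then lowercase the result.
    return re.sub(r"[^a-zA-Z0-9 ]", "", text).lower()

normalized_name = normalize_text(feature_name)
This removes punctuation and lowercases the text, so names that differ only in punctuation or case compare as equal, which improves accuracy.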
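As a quick check of what the normalization does, the sketch below applies it to names of the kind used in the test cases. The function is copied from the slide; the sample list and the loop are added for illustration.

```python
import re


def normalize_text(text):
    """Drop everything except ASCII letters, digits, and spaces, then lowercase."""
    return re.sub(r"[^a-zA-Z0-9 ]", "", text).lower()


for raw in ["St. Michael's Cathedral", "Washington Sq. Park", "Café Central"]:
    print(f"{raw!r} -> {normalize_text(raw)!r}")
```

One caveat: the pattern keeps only ASCII letters, so accented characters are dropped as well ("Café" becomes "caf"). For place names in languages other than English, a Unicode-aware pattern may be the safer choice.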
from thefuzz import fuzz, process

# Test cases
search_names = ["St. Michael's Cathedral", "Washington Square Park", "Central Plaza"]
geojson_features = [
    {
        "properties": {"name": "Saint Michael Cathedral"},
        "geometry": {"coordinates": [10.0, 20.0]}
    },
    {
        "properties": {"name": "Washington Sq. Park"},
        "geometry": {"coordinates": [15.0, 25.0]}
    },
    {
        "properties": {"name": "Plaza Central"},
        "geometry": {"coordinates": [30.0, 40.0]}
    },
]

# Test different scorers
scorers = ["ratio", "partial_ratio", "token_sort_ratio", "token_set_ratio"]
for scorer in scorers:
    print(f"\n### Testing with {scorer} ###")
    for feature in geojson_features:
        feature_name = feature["properties"]["name"]
        # Look up the scoring function by name and find the closest search name.
        best_match, score = process.extractOne(feature_name, search_names,
                                               scorer=getattr(fuzz, scorer))
        print(f"Match for '{feature_name}': '{best_match}' → Score: {score}")
### Testing with ratio ###
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 87
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 88
Match for 'Plaza Central': 'Central Plaza' → Score: 54
### Testing with partial_ratio ###
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 90
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 86
Match for 'Plaza Central': 'Central Plaza' → Score: 70
### Testing with token_sort_ratio ###
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 89
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 90
Match for 'Plaza Central': 'Central Plaza' → Score: 100
### Testing with token_set_ratio ###
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 89
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 91
Match for 'Plaza Central': 'Central Plaza' → Score: 100
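To see how the choice of scorer interacts with a threshold, the sketch below applies a cutoff to the scores printed above. The scores are copied from that output; the cutoff of 80 and the variable names are assumptions chosen for illustration.

```python
# Scores copied from the output above, in the order:
# Saint Michael Cathedral, Washington Sq. Park, Plaza Central.
names = ["Saint Michael Cathedral", "Washington Sq. Park", "Plaza Central"]
scores = {
    "ratio": [87, 88, 54],
    "partial_ratio": [90, 86, 70],
    "token_sort_ratio": [89, 90, 100],
    "token_set_ratio": [89, 91, 100],
}

THRESHOLD = 80  # hypothetical cutoff, not taken from the slides

accepted_by_scorer = {}
for scorer, values in scores.items():
    # Keep only the names whose score is strictly above the cutoff.
    accepted = [name for name, score in zip(names, values) if score > THRESHOLD]
    accepted_by_scorer[scorer] = accepted
    print(f"{scorer}: {len(accepted)}/{len(names)} accepted -> {accepted}")
```

With this cutoff, "Plaza Central" is rejected by `ratio` and `partial_ratio` but accepted by the two token-based scorers, which ignore word order.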