Prof. Andrea Gallegati
Prof. Dario Abbondanza
Google still finds the right results.
This magic happens thanks to fuzzy search, a way to find things even when they don't match exactly.

"Jon", "John", "Jonathan", "Joan"
Fuzzy search algorithms compare words based on similarity:

thefuzz Library
pip install thefuzz
from thefuzz import fuzz
similarity = fuzz.ratio("Jon", "John")
print("Similarity Score:", similarity)

find_best_matches
fuzzy search to find the best match between
search_names →
features →
key_field →
threshold →
scorer →
"n/a" (if present).
key_field values
search_names
score > threshold.

thefuzz provides multiple scoring methods

| Scorer | How It Works | Weaknesses |
|---|---|---|
| ratio | Levenshtein distance (simple character similarity). | Sensitive to order, whitespace, and minor typos. |
| partial_ratio | Similarity between a substring and the full string. | Can return high scores for misleading matches. |
| token_sort_ratio | Sorts words alphabetically before comparison. | Sensitive to missing or extra words. |
| token_set_ratio | Like above, but removes duplicate words. | Can overcorrect in some cases. |
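
To make the "Levenshtein distance" entry concrete, here is a minimal pure-Python sketch of the edit distance itself. It is an illustration only, not thefuzz's implementation: thefuzz turns this kind of measurement into a 0-100 score using its own formula, and the function name `levenshtein` is just a label chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character edits needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


query = "Jon"
candidates = ["John", "Jonathan", "Joan"]
distances = {name: levenshtein(query, name) for name in candidates}
print(distances)  # {'John': 1, 'Jonathan': 5, 'Joan': 1}
```

A smaller distance means a closer match, which is why "John" and "Joan" rank ahead of "Jonathan" for the query "Jon".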
| Use Case | Best Scorer | Reason |
|---|---|---|
| Simple typos and short names | ratio | Fast, general-purpose. |
| Checking if a name is part of another | partial_ratio | Works well for abbreviations. |
| Handling word order variations | token_sort_ratio | Fixes swapped words. |
| Matching complex names (with extra/missing words) | token_set_ratio | Best for multi-word names. |
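
The "fixes swapped words" row can be illustrated with a rough sketch of the token-sort idea. This is an assumption-laden stand-in, not thefuzz's code: it uses the standard library's `difflib.SequenceMatcher` for the character comparison, and the helper names `simple_ratio`, `sort_tokens`, and `token_sort_score` are invented for this example.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """0-100 similarity from difflib; a stand-in for, not a copy of, fuzz.ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def sort_tokens(text: str) -> str:
    """Lowercase, split on whitespace, and rejoin the words in alphabetical order."""
    return " ".join(sorted(text.lower().split()))


def token_sort_score(a: str, b: str) -> int:
    """Compare two strings after sorting their words, so word order stops mattering."""
    return simple_ratio(sort_tokens(a), sort_tokens(b))


plain_score = simple_ratio("Plaza Central", "Central Plaza")
sorted_score = token_sort_score("Plaza Central", "Central Plaza")
print("plain:", plain_score)
print("token sort:", sorted_score)  # 100, both strings become "central plaza"
```

Sorting the words first turns both names into the same string, so the swapped order no longer lowers the score.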
find_best_matches?

import re
def normalize_text(text):
    return re.sub(r"[^a-zA-Z0-9 ]", "", text).lower()
normalized_name = normalize_text(feature_name)

This removes punctuation, lowercases text, and improves accuracy.
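
The slides do not show the body of find_best_matches, so the following is only a hypothetical reconstruction of how normalization, the threshold, and a pluggable scorer could fit together. Assumptions: key_field lives under each feature's "properties", "n/a" entries are skipped on both sides, the default threshold of 80 is arbitrary, and `difflib_ratio` is a standard-library stand-in so the sketch runs without thefuzz installed.

```python
import re
from difflib import SequenceMatcher


def normalize_text(text):
    return re.sub(r"[^a-zA-Z0-9 ]", "", text).lower()


def difflib_ratio(a, b):
    """Stand-in scorer (0-100); not the same formula as thefuzz's fuzz.ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def find_best_matches(search_names, features, key_field, threshold=80, scorer=difflib_ratio):
    """Hypothetical sketch: map each search name to its best feature value and score."""
    matches = {}
    for search_name in search_names:
        if search_name == "n/a":
            continue  # assumption: placeholder search names are ignored
        best = None
        for feature in features:
            value = feature["properties"].get(key_field)
            if not value or value == "n/a":
                continue  # assumption: missing or placeholder values are ignored
            score = scorer(normalize_text(search_name), normalize_text(value))
            # Keep only scores strictly above the threshold, preferring the highest.
            if score > threshold and (best is None or score > best[1]):
                best = (value, score)
        if best is not None:
            matches[search_name] = best
    return matches


features = [
    {"properties": {"name": "Washington Sq. Park"}},
    {"properties": {"name": "n/a"}},
]
matches = find_best_matches(["Washington Square Park", "n/a"], features, "name", threshold=70)
print(matches)
```

With thefuzz installed, a function such as fuzz.token_set_ratio could be passed as the scorer argument instead of the stand-in.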
from thefuzz import fuzz, process

# Test cases
search_names = ["St. Michael's Cathedral", "Washington Square Park", "Central Plaza"]
geojson_features = [
    {
        "properties": {"name": "Saint Michael Cathedral"},
        "geometry": {"coordinates": [10.0, 20.0]}
    },
    {
        "properties": {"name": "Washington Sq. Park"},
        "geometry": {"coordinates": [15.0, 25.0]}
    },
    {
        "properties": {"name": "Plaza Central"},
        "geometry": {"coordinates": [30.0, 40.0]}
    },
]

# Test different scorers
scorers = ["ratio", "partial_ratio", "token_sort_ratio", "token_set_ratio"]
for scorer in scorers:
    print(f"\n### Testing with {scorer} ###")
    for feature in geojson_features:
        feature_name = feature["properties"]["name"]
        best_match, score = process.extractOne(feature_name, search_names,
                                               scorer=getattr(fuzz, scorer))
        print(f"Match for '{feature_name}': '{best_match}' → Score: {score}")

### Testing with ratio ###
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 87
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 88
Match for 'Plaza Central': 'Central Plaza' → Score: 54
### Testing with partial_ratio ###
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 90
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 86
Match for 'Plaza Central': 'Central Plaza' → Score: 70
### Testing with token_sort_ratio ###
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 89
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 90
Match for 'Plaza Central': 'Central Plaza' → Score: 100
### Testing with token_set_ratio ###
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 89
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 91
Match for 'Plaza Central': 'Central Plaza' → Score: 100

ratio
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 87
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 88
Match for 'Plaza Central': 'Central Plaza' → Score: 54
✅ Strengths
❌ Weaknesses

partial_ratio
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 90
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 86
Match for 'Plaza Central': 'Central Plaza' → Score: 70
✅ Strengths
❌ Weaknesses

token_sort_ratio
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 89
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 90
Match for 'Plaza Central': 'Central Plaza' → Score: 100
✅ Strengths
❌ Weaknesses

token_set_ratio
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 89
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 91
Match for 'Plaza Central': 'Central Plaza' → Score: 100
✅ Strengths
❌ Weaknesses
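
To see how the scorer choice interacts with the score > threshold rule, the sketch below filters the scores reported in the test output by a cutoff. The scores are copied from that output; the threshold of 80 is a hypothetical value chosen only for illustration.

```python
# Scores copied from the test output: feature name -> score, per scorer.
reported_scores = {
    "ratio": {"Saint Michael Cathedral": 87, "Washington Sq. Park": 88, "Plaza Central": 54},
    "partial_ratio": {"Saint Michael Cathedral": 90, "Washington Sq. Park": 86, "Plaza Central": 70},
    "token_sort_ratio": {"Saint Michael Cathedral": 89, "Washington Sq. Park": 90, "Plaza Central": 100},
    "token_set_ratio": {"Saint Michael Cathedral": 89, "Washington Sq. Park": 91, "Plaza Central": 100},
}

THRESHOLD = 80  # hypothetical cutoff, not a value given in the slides

# Keep only the features whose score is strictly above the threshold.
accepted = {
    scorer: [name for name, score in scores.items() if score > THRESHOLD]
    for scorer, scores in reported_scores.items()
}

for scorer, names in accepted.items():
    print(f"{scorer}: {len(names)} of 3 accepted -> {names}")
```

Under this cutoff, "Plaza Central" is rejected by ratio and partial_ratio but accepted by the two token-based scorers, which is the word-order effect the comparison tables describe.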