INTRO TO
PROBLEM SOLVING AND
PROGRAMMING IN PYTHON

Prof. Andrea Gallegati
Prof. Dario Abbondanza

CIS 1051 - FUZZY SEARCH

Fuzzy Search in Python
Understanding Fuzzy Search: Finding Things Even When They're Misspelled
Even when a query is misspelled, Google still finds the right results.
This magic happens thanks to fuzzy search: a way to find things even when they don't match exactly.
Example: Fuzzy Searching for "Jon" might also find:
- "John"
- "Jonathan"
- "Joan"
How Does Fuzzy Search Work?
Fuzzy search compares words based on similarity:
- Levenshtein Distance (counts how many edits turn one word into another).
- Approximate String Matching (finds words that are close enough).
- Token-Based Matching (matches words even if order varies).
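The Levenshtein idea can be sketched in a few lines of plain Python. This is a minimal dynamic-programming version for illustration, not thefuzz's optimized implementation:

```python
# Minimal Levenshtein distance: counts the insertions, deletions, and
# substitutions needed to turn string a into string b.
def levenshtein(a: str, b: str) -> int:
    # prev holds edit distances from the current prefix of a to every prefix of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Jon", "John"))  # → 1 (insert the "h")
```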
Using Python's thefuzz Library
First, install it with:
pip install thefuzz
Example:
Comparing Two Strings
Let’s see if "Jon" and "John" are similar.
from thefuzz import fuzz
similarity = fuzz.ratio("Jon", "John")
print("Similarity Score:", similarity)
- The score ranges from 0 to 100.
- 100 means an exact match.
- A score around 85 means the strings are quite similar.
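If thefuzz is not installed, a comparable score can be computed with the standard library's difflib; note that its ratio() is on a 0–1 scale rather than 0–100:

```python
from difflib import SequenceMatcher

# SequenceMatcher.ratio() returns 2*M/T, where M is the number of matching
# characters and T is the total length of both strings (0.0–1.0 scale).
similarity = SequenceMatcher(None, "Jon", "John").ratio()
print(f"Similarity: {similarity:.2f}")  # → 0.86 (2*3 matches / 7 chars)
```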
In-Depth Analysis of find_best_matches
This function uses fuzzy search to find the best match between:
- a list of search names, and
- a set of GeoJSON features.
Key Parameters
- search_names → list of names to search for.
- features → list of GeoJSON features to match against.
- key_field → key field in the feature's properties to compare.
- threshold → minimum similarity score.
- scorer → fuzzy matching function.
How It Works
- Removes "n/a" entries (if present).
- Resolves the scorer function dynamically.
- Iterates over features: extracts key_field values and compares them to search_names.
- Stores results with score > threshold.
- Finds the highest-scoring match.
- Extracts coordinates.
- Returns the best match.
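The steps above can be sketched as follows. This is a hypothetical reconstruction, not the original function: it uses difflib as a dependency-free stand-in scorer, whereas with thefuzz installed you would resolve the scorer dynamically with getattr(fuzz, scorer_name):

```python
from difflib import SequenceMatcher

# Stand-in scorer on a 0–100 scale, like thefuzz; swap in
# getattr(fuzz, scorer_name) when thefuzz is available.
def _ratio(a, b):
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

# Hypothetical reconstruction of find_best_matches, following the steps above.
def find_best_matches(search_names, features, key_field="name",
                      threshold=80, scorer=_ratio):
    search_names = [n for n in search_names if n != "n/a"]   # drop "n/a" entries
    candidates = []
    for feature in features:                                 # iterate over features
        value = feature["properties"].get(key_field, "")     # extract key_field value
        for name in search_names:                            # compare to search names
            score = scorer(name, value)
            if score > threshold:                            # keep scores > threshold
                candidates.append((score, name, feature))
    if not candidates:
        return None
    score, name, feature = max(candidates, key=lambda c: c[0])  # highest score wins
    coords = feature["geometry"]["coordinates"]              # extract coordinates
    return {"search_name": name,
            "matched": feature["properties"][key_field],
            "score": score,
            "coordinates": coords}
```

For example, searching for "Washington Square Park" against a feature named "Washington Sq. Park" returns that feature with its coordinates, since the two names score above the threshold.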
Available Scorers
thefuzz provides multiple scoring methods:

| Scorer | How It Works | Weaknesses |
| --- | --- | --- |
| ratio | Levenshtein Distance (simple character similarity). | Sensitive to order, whitespace, and minor typos. |
| partial_ratio | Similarity between a substring and the full string. | Can return high scores for misleading matches. |
| token_sort_ratio | Sorts words alphabetically before comparison. | Sensitive to missing or extra words. |
| token_set_ratio | Like above, but removes duplicate words. | Can overcorrect in some cases. |
Choosing the Right scorer

| Use Case | Best Scorer | Reason |
| --- | --- | --- |
| Simple typos and short names | ratio | Fast, general-purpose. |
| Checking if a name is part of another | partial_ratio | Works well for abbreviations. |
| Handling word order variations | token_sort_ratio | Fixes swapped words. |
| Matching complex names (with extra/missing words) | token_set_ratio | Best for multi-word names. |
How to Improve find_best_matches?
Normalize Text for Better Matching
import re
def normalize_text(text):
    return re.sub(r"[^a-zA-Z0-9 ]", "", text).lower()
normalized_name = normalize_text(feature_name)
This removes punctuation, lowercases the text, and improves matching accuracy.
Benchmarking Different scorers
Let's benchmark the different scorers to understand their strengths and weaknesses.
from thefuzz import fuzz, process

# Test cases
search_names = ["St. Michael's Cathedral", "Washington Square Park", "Central Plaza"]
geojson_features = [
    {
        "properties": {"name": "Saint Michael Cathedral"},
        "geometry": {"coordinates": [10.0, 20.0]}
    },
    {
        "properties": {"name": "Washington Sq. Park"},
        "geometry": {"coordinates": [15.0, 25.0]}
    },
    {
        "properties": {"name": "Plaza Central"},
        "geometry": {"coordinates": [30.0, 40.0]}
    },
]

# Test different scorers
scorers = ["ratio", "partial_ratio", "token_sort_ratio", "token_set_ratio"]
for scorer in scorers:
    print(f"\n### Testing with {scorer} ###")
    for feature in geojson_features:
        feature_name = feature["properties"]["name"]
        best_match, score = process.extractOne(feature_name, search_names,
                                               scorer=getattr(fuzz, scorer))
        print(f"Match for '{feature_name}': '{best_match}' → Score: {score}")
### Testing with ratio ###
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 87
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 88
Match for 'Plaza Central': 'Central Plaza' → Score: 54
### Testing with partial_ratio ###
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 90
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 86
Match for 'Plaza Central': 'Central Plaza' → Score: 70
### Testing with token_sort_ratio ###
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 89
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 90
Match for 'Plaza Central': 'Central Plaza' → Score: 100
### Testing with token_set_ratio ###
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 89
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 91
Match for 'Plaza Central': 'Central Plaza' → Score: 100
1️⃣ ratio
(Simple Character Similarity)
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 87
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 88
Match for 'Plaza Central': 'Central Plaza' → Score: 54
✅ Strengths
- Works well when names are mostly similar.
❌ Weaknesses
- Abbreviations reduce similarity.
- Swapping words reduces similarity even further!
2️⃣ partial_ratio
(Substring Matching)
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 90
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 86
Match for 'Plaza Central': 'Central Plaza' → Score: 70
✅ Strengths
- Performs better for abbreviations.
- Handles minor differences well.
❌ Weaknesses
- Can overestimate matches.
3️⃣ token_sort_ratio
(Word Order Independence)
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 89
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 90
Match for 'Plaza Central': 'Central Plaza' → Score: 100
✅ Strengths
- Handles word reordering perfectly.
❌ Weaknesses
- Sensitive to minor character differences.
4️⃣ token_set_ratio
(Ignores Duplicates & Order)
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 89
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 91
Match for 'Plaza Central': 'Central Plaza' → Score: 100
✅ Strengths
- Best for long, unordered names.
❌ Weaknesses
- Might overestimate similarity for words that look alike but are unrelated.
This presentation was crafted with reveal.js, a framework created by Hakim El Hattab and contributors to make stunning HTML presentations.