INTRO

PROBLEM SOLVING AND

PROGRAMMING IN PYTHON

 

(use the Space key to navigate through all slides)

Prof. Andrea Gallegati
tuj81353@temple.edu

Prof. Dario Abbondanza
tuk96119@temple.edu

CIS 1051 - FUZZY SEARCH

Fuzzy Search

in Python

Understanding Fuzzy Search

Finding Things Even When Misspelled

Even when your search query is misspelled, Google still finds the right results.

 

This magic happens thanks to fuzzy search — a way to find things even when they don’t match exactly.

 

Example:
A fuzzy search for "Jon"
might also find

  • "John"
  • "Jonathan"
  • "Joan"

How Does Fuzzy Search Work?

Fuzzy search algorithms compare words based on similarity, using techniques such as:

 

  • Levenshtein Distance 
    (counts how many edits turn one word into another).
     
  • Approximate String Matching 
    (finds words that are close enough).
     
  • Token-Based Matching 
    (matches words even if order varies).
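To make the first idea concrete, here is a minimal sketch of Levenshtein distance as a dynamic-programming function. The `levenshtein` helper below is illustrative, not part of thefuzz:

```python
def levenshtein(a, b):
    """Count the minimum number of single-character edits
    (insertions, deletions, substitutions) turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

print(levenshtein("Jon", "John"))        # 1: insert the missing "h"
print(levenshtein("kitten", "sitting"))  # 3: the classic textbook example
```

A distance of 0 means the words are identical; larger distances mean more edits, and libraries like thefuzz turn this into a 0-100 similarity score.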

Using Python's
 thefuzz Library

First, install it with:

pip install thefuzz

Example:
Comparing Two Strings

Let’s see if "Jon" and "John" are similar.

from thefuzz import fuzz

similarity = fuzz.ratio("Jon", "John")
print("Similarity Score:", similarity)
  • The score ranges from 0 to 100.
  • 100 means an exact match.
  • A score of 85 means the strings are quite similar.

In-Depth Analysis of find_best_matches

This function uses fuzzy search to find the best match between

  • a list of search names 
  • a set of GeoJSON features.

Key Parameters

  • search_names →
    list of names to search for.
  • features →
    list of GeoJSON features to match against.
  • key_field →
    key field in the feature’s properties to compare.
  • threshold →
    minimum similarity score.
  • scorer →
    fuzzy matching function.

How It Works

  1. Removes "n/a" placeholder entries from the search names (if present).
  2. Resolves the scorer function dynamically.
  3. Iterates over features:
    - extracts key_field values
    - compares them to search_names
  4. Stores results with
    score > threshold.
  5. Finds the highest-scoring match.
  6. Extracts coordinates.
  7. Returns the best match.
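The steps above can be sketched roughly like this. This is a hypothetical reconstruction, not the actual course code: the signature and the difflib-based fallback scorer are assumptions chosen so the sketch runs without thefuzz installed (in practice you would pass fuzz.ratio or another thefuzz scorer).

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    """Stand-in scorer returning 0-100; thefuzz's fuzz.ratio plays this role."""
    return int(SequenceMatcher(None, a, b).ratio() * 100)

def find_best_matches(search_names, features, key_field="name",
                      threshold=80, scorer=simple_ratio):
    # 1. Drop "n/a" placeholder entries
    names = [n for n in search_names if n.lower() != "n/a"]
    # 2. (The real code resolves the scorer by name; here it is passed directly.)
    candidates = []
    # 3. Iterate over features, comparing each key_field value to every name
    for feature in features:
        value = feature["properties"].get(key_field, "")
        for name in names:
            score = scorer(name, value)
            # 4. Keep only results above the threshold
            if score > threshold:
                candidates.append((score, name, feature))
    if not candidates:
        return None
    # 5. Pick the highest-scoring match and 6. extract its coordinates
    score, name, feature = max(candidates, key=lambda c: c[0])
    coords = feature["geometry"]["coordinates"]
    # 7. Return the best match
    return {"search_name": name, "score": score, "coordinates": coords}
```

Calling it with a list of names and GeoJSON-style features returns a dict with the matched name, its score, and the feature's coordinates, or None when nothing clears the threshold.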

Available Scorers

thefuzz provides multiple scoring methods:

  • ratio: Levenshtein distance (simple character similarity).
    Weakness: sensitive to word order, whitespace, and minor typos.
  • partial_ratio: similarity between a substring and the full string.
    Weakness: can return high scores for misleading matches.
  • token_sort_ratio: sorts words alphabetically before comparison.
    Weakness: sensitive to missing or extra words.
  • token_set_ratio: like token_sort_ratio, but removes duplicate words.
    Weakness: can overcorrect in some cases.
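To build intuition for the token-based scorers, here is a rough, stdlib-only sketch of what sorting and de-duplicating tokens does before comparison. It mimics the idea, not thefuzz's exact implementation:

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """Plain character-level similarity, 0-100 (stand-in for fuzz.ratio)."""
    return int(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)

def token_sort(text):
    # token_sort_ratio idea: compare with words in alphabetical order
    return " ".join(sorted(text.lower().split()))

def token_set(text):
    # token_set_ratio idea: also drop duplicate words
    return " ".join(sorted(set(text.lower().split())))

a, b = "Plaza Central", "Central Plaza"
print(ratio(a, b))                          # lower: characters are shuffled
print(ratio(token_sort(a), token_sort(b)))  # 100: same words once sorted
```

Sorting the tokens makes word order irrelevant, and the set step additionally forgives repeated words.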

Choosing the Right scorer

  • Simple typos and short names → ratio (fast, general-purpose).
  • Checking if a name is part of another → partial_ratio (works well for abbreviations).
  • Handling word order variations → token_sort_ratio (fixes swapped words).
  • Matching complex names (with extra/missing words) → token_set_ratio (best for multi-word names).

How to Improve find_best_matches?

import re

def normalize_text(text):
    return re.sub(r"[^a-zA-Z0-9 ]", "", text).lower()

normalized_name = normalize_text(feature_name)

This removes punctuation, lowercases text, and improves accuracy.
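For example, normalization strips exactly the punctuation and case differences that appear in the place names used later (the sample strings are illustrative):

```python
import re

def normalize_text(text):
    # Keep only letters, digits, and spaces, then lowercase
    return re.sub(r"[^a-zA-Z0-9 ]", "", text).lower()

print(normalize_text("St. Michael's Cathedral"))  # st michaels cathedral
print(normalize_text("Washington Sq. Park"))      # washington sq park
```

Comparing the normalized forms instead of the raw strings removes a common source of spuriously low scores.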

Benchmarking Different scorers

Let's run each scorer on the same data to understand their strengths and weaknesses.

from thefuzz import fuzz, process

# Test cases
search_names = ["St. Michael's Cathedral", "Washington Square Park", "Central Plaza"]
geojson_features = [
    {
      "properties": {"name": "Saint Michael Cathedral"}, 
     "geometry": {"coordinates": [10.0, 20.0]}
    },
    {
      "properties": {"name": "Washington Sq. Park"}, 
     "geometry": {"coordinates": [15.0, 25.0]}
    },
    {"properties": {"name": "Plaza Central"}, 
     "geometry": {"coordinates": [30.0, 40.0]}
    },
]

# Test different scorers
scorers = ["ratio", "partial_ratio", "token_sort_ratio", "token_set_ratio"]

for scorer in scorers:
    print(f"\n### Testing with {scorer} ###")
    
    for feature in geojson_features:
        feature_name = feature["properties"]["name"]
        best_match, score = process.extractOne(feature_name, search_names, 
                                               scorer=getattr(fuzz, scorer))
        
        print(f"Match for '{feature_name}': '{best_match}' → Score: {score}")
Output:

### Testing with ratio ###
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 87
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 88
Match for 'Plaza Central': 'Central Plaza' → Score: 54

### Testing with partial_ratio ###
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 90
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 86
Match for 'Plaza Central': 'Central Plaza' → Score: 70

### Testing with token_sort_ratio ###
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 89
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 90
Match for 'Plaza Central': 'Central Plaza' → Score: 100

### Testing with token_set_ratio ###
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 89
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 91
Match for 'Plaza Central': 'Central Plaza' → Score: 100

1️⃣ ratio 

(Simple Character Similarity)

Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 87
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 88
Match for 'Plaza Central': 'Central Plaza' → Score: 54

Strengths

  • Works well when names are mostly similar.

Weaknesses

  • Abbreviations reduce the similarity score.
  • Swapping words reduces it even further!

2️⃣ partial_ratio

(Substring Matching)

Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 90
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 86
Match for 'Plaza Central': 'Central Plaza' → Score: 70

Strengths

  • Performs better for abbreviations.
  • Handles minor differences well.

Weaknesses

  • Can overestimate matches.
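A rough stdlib-only sketch of the substring idea shows how partial matching can overestimate: the needle below is a perfect window of a completely different place name (conceptual sketch, not thefuzz's exact algorithm; the example names are made up):

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(needle, haystack):
    """Best score of the shorter string against any
    same-length window of the longer one, 0-100."""
    short, long_ = sorted((needle.lower(), haystack.lower()), key=len)
    best = 0.0
    for i in range(len(long_) - len(short) + 1):
        window = long_[i:i + len(short)]
        best = max(best, SequenceMatcher(None, short, window).ratio())
    return int(best * 100)

# "Central" is a perfect substring of an unrelated place name,
# so the partial score is a misleading 100.
print(partial_ratio_sketch("Central", "Central Station Hotel"))  # 100
```

This is why a high partial score alone is not proof that two names refer to the same place.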

3️⃣ token_sort_ratio 

(Word Order Independence)

Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 89
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 90
Match for 'Plaza Central': 'Central Plaza' → Score: 100

Strengths

  • Handles word reordering perfectly.

Weaknesses

  • Sensitive to minor character differences.

4️⃣ token_set_ratio 

(Ignores Duplicates & Order)

Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 89
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 91
Match for 'Plaza Central': 'Central Plaza' → Score: 100

Strengths

  • Best for long, unordered names.

Weaknesses

  • Can overestimate similarity between strings that share words but are otherwise unrelated.


This presentation was crafted with reveal.js,
a framework created by Hakim El Hattab and contributors
to make stunning HTML presentations.

Made with Slides.com