INTRO

PROBLEM SOLVING AND

PROGRAMMING IN PYTHON

 

(use the Space key to navigate through all slides)

Prof. Andrea Gallegati
tuj81353@temple.edu

Prof. Dario Abbondanza
tuk96119@temple.edu

CIS 1051 - FUZZY SEARCH

Fuzzy Search

in Python

Understanding Fuzzy Search

Finding Things Even When Misspelled

Even when your search query is misspelled, Google still finds the right results.

 

This magic happens thanks to fuzzy search — a way to find things even when they don’t match exactly.

 

Example:
A fuzzy search for "Jon"
might also find

  • "John"
  • "Jonathan"
  • "Joan"

How Does Fuzzy Search Work?

Fuzzy search algorithms compare words based on similarity, using techniques such as:

 

  • Levenshtein Distance 
    (counts how many edits turn one word into another).
     
  • Approximate String Matching 
    (finds words that are close enough).
     
  • Token-Based Matching 
    (matches words even if order varies).
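To make the first idea concrete, here is a minimal sketch of Levenshtein distance as a dynamic-programming function. The `levenshtein` helper below is illustrative, not part of thefuzz:

```python
def levenshtein(a, b):
    """Count the minimum number of single-character edits
    (insertions, deletions, substitutions) turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

print(levenshtein("Jon", "John"))        # 1: insert the missing "h"
print(levenshtein("kitten", "sitting"))  # 3: the classic textbook example
```

A distance of 0 means the words are identical; larger distances mean more edits, and libraries like thefuzz turn this into a 0-100 similarity score.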

Using Python's
 thefuzz Library

First, install it with:

pip install thefuzz

Example:
Comparing Two Strings

Let’s see if "Jon" and "John" are similar.

from thefuzz import fuzz

similarity = fuzz.ratio("Jon", "John")
print("Similarity Score:", similarity)
  • The score ranges from 0 to 100.
  • 100 means an exact match.
  • A score of 85 means the strings are quite similar.

In-Depth Analysis of find_best_matches

This function uses fuzzy search to find the best match between

  • a list of search names 
  • a set of GeoJSON features.

Key Parameters

  • search_names →
    list of names to search for.
  • features →
    list of GeoJSON features to match against.
  • key_field →
    key field in the feature’s properties to compare.
  • threshold →
    minimum similarity score.
  • scorer →
    fuzzy matching function.

How It Works

  1. Removes "n/a" placeholder entries from the search names (if present).
  2. Resolves the scorer function dynamically.
  3. Iterates over features:
    - extracts key_field values
    - compares them to search_names
  4. Stores results with
    score > threshold.
  5. Finds the highest-scoring match.
  6. Extracts coordinates.
  7. Returns the best match.
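The steps above can be sketched roughly like this. This is a hypothetical reconstruction, not the actual course code: the signature and the difflib-based fallback scorer are assumptions chosen so the sketch runs without thefuzz installed (in practice you would pass fuzz.ratio or another thefuzz scorer).

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    """Stand-in scorer returning 0-100; thefuzz's fuzz.ratio plays this role."""
    return int(SequenceMatcher(None, a, b).ratio() * 100)

def find_best_matches(search_names, features, key_field="name",
                      threshold=80, scorer=simple_ratio):
    # 1. Drop "n/a" placeholder entries
    names = [n for n in search_names if n.lower() != "n/a"]
    # 2. (The real code resolves the scorer by name; here it is passed directly.)
    candidates = []
    # 3. Iterate over features, comparing each key_field value to every name
    for feature in features:
        value = feature["properties"].get(key_field, "")
        for name in names:
            score = scorer(name, value)
            # 4. Keep only results above the threshold
            if score > threshold:
                candidates.append((score, name, feature))
    if not candidates:
        return None
    # 5. Pick the highest-scoring match and 6. extract its coordinates
    score, name, feature = max(candidates, key=lambda c: c[0])
    coords = feature["geometry"]["coordinates"]
    # 7. Return the best match
    return {"search_name": name, "score": score, "coordinates": coords}
```

Calling it with a list of names and GeoJSON-style features returns a dict with the matched name, its score, and the feature's coordinates, or None when nothing clears the threshold.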

Available Scorers

thefuzz provides multiple scoring methods:

  • ratio: Levenshtein distance (simple character similarity).
    Weakness: sensitive to word order, whitespace, and minor typos.
  • partial_ratio: similarity between a substring and the full string.
    Weakness: can return high scores for misleading matches.
  • token_sort_ratio: sorts words alphabetically before comparison.
    Weakness: sensitive to missing or extra words.
  • token_set_ratio: like token_sort_ratio, but removes duplicate words.
    Weakness: can overcorrect in some cases.
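To build intuition for the token-based scorers, here is a rough, stdlib-only sketch of what sorting and de-duplicating tokens does before comparison. It mimics the idea, not thefuzz's exact implementation:

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """Plain character-level similarity, 0-100 (stand-in for fuzz.ratio)."""
    return int(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)

def token_sort(text):
    # token_sort_ratio idea: compare with words in alphabetical order
    return " ".join(sorted(text.lower().split()))

def token_set(text):
    # token_set_ratio idea: also drop duplicate words
    return " ".join(sorted(set(text.lower().split())))

a, b = "Plaza Central", "Central Plaza"
print(ratio(a, b))                          # lower: characters are shuffled
print(ratio(token_sort(a), token_sort(b)))  # 100: same words once sorted
```

Sorting the tokens makes word order irrelevant, and the set step additionally forgives repeated words.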

Choosing the Right scorer

  • Simple typos and short names → ratio (fast, general-purpose).
  • Checking if a name is part of another → partial_ratio (works well for abbreviations).
  • Handling word order variations → token_sort_ratio (fixes swapped words).
  • Matching complex names (with extra/missing words) → token_set_ratio (best for multi-word names).

How to Improve find_best_matches?

import re

def normalize_text(text):
    return re.sub(r"[^a-zA-Z0-9 ]", "", text).lower()

normalized_name = normalize_text(feature_name)

This removes punctuation, lowercases text, and improves accuracy.
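For example, normalization strips exactly the punctuation and case differences that appear in the place names used later (the sample strings are illustrative):

```python
import re

def normalize_text(text):
    # Keep only letters, digits, and spaces, then lowercase
    return re.sub(r"[^a-zA-Z0-9 ]", "", text).lower()

print(normalize_text("St. Michael's Cathedral"))  # st michaels cathedral
print(normalize_text("Washington Sq. Park"))      # washington sq park
```

Comparing the normalized forms instead of the raw strings removes a common source of spuriously low scores.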

Benchmarking Different scorers

Let's run each scorer on the same data to understand their strengths and weaknesses.

from thefuzz import fuzz, process

# Test cases
search_names = ["St. Michael's Cathedral", "Washington Square Park", "Central Plaza"]
geojson_features = [
    {
      "properties": {"name": "Saint Michael Cathedral"}, 
     "geometry": {"coordinates": [10.0, 20.0]}
    },
    {
      "properties": {"name": "Washington Sq. Park"}, 
     "geometry": {"coordinates": [15.0, 25.0]}
    },
    {"properties": {"name": "Plaza Central"}, 
     "geometry": {"coordinates": [30.0, 40.0]}
    },
]

# Test different scorers
scorers = ["ratio", "partial_ratio", "token_sort_ratio", "token_set_ratio"]

for scorer in scorers:
    print(f"\n### Testing with {scorer} ###")
    
    for feature in geojson_features:
        feature_name = feature["properties"]["name"]
        best_match, score = process.extractOne(feature_name, search_names, 
                                               scorer=getattr(fuzz, scorer))
        
        print(f"Match for '{feature_name}': '{best_match}' → Score: {score}")
Output:

### Testing with ratio ###
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 87
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 88
Match for 'Plaza Central': 'Central Plaza' → Score: 54

### Testing with partial_ratio ###
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 90
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 86
Match for 'Plaza Central': 'Central Plaza' → Score: 70

### Testing with token_sort_ratio ###
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 89
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 90
Match for 'Plaza Central': 'Central Plaza' → Score: 100

### Testing with token_set_ratio ###
Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 89
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 91
Match for 'Plaza Central': 'Central Plaza' → Score: 100

1️⃣ ratio 

(Simple Character Similarity)

Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 87
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 88
Match for 'Plaza Central': 'Central Plaza' → Score: 54

Strengths

  • Works well when names are mostly similar.

Weaknesses

  • Abbreviations reduce the similarity score.
  • Swapping words reduces it even further!

2️⃣ partial_ratio

(Substring Matching)

Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 90
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 86
Match for 'Plaza Central': 'Central Plaza' → Score: 70

Strengths

  • Performs better for abbreviations.
  • Handles minor differences well.

Weaknesses

  • Can overestimate matches.
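A rough stdlib-only sketch of the substring idea shows how partial matching can overestimate: the needle below is a perfect window of a completely different place name (conceptual sketch, not thefuzz's exact algorithm; the example names are made up):

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(needle, haystack):
    """Best score of the shorter string against any
    same-length window of the longer one, 0-100."""
    short, long_ = sorted((needle.lower(), haystack.lower()), key=len)
    best = 0.0
    for i in range(len(long_) - len(short) + 1):
        window = long_[i:i + len(short)]
        best = max(best, SequenceMatcher(None, short, window).ratio())
    return int(best * 100)

# "Central" is a perfect substring of an unrelated place name,
# so the partial score is a misleading 100.
print(partial_ratio_sketch("Central", "Central Station Hotel"))  # 100
```

This is why a high partial score alone is not proof that two names refer to the same place.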

3️⃣ token_sort_ratio 

(Word Order Independence)

Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 89
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 90
Match for 'Plaza Central': 'Central Plaza' → Score: 100

Strengths

  • Handles word reordering perfectly.

Weaknesses

  • Sensitive to minor character differences.

4️⃣ token_set_ratio 

(Ignores Duplicates & Order)

Match for 'Saint Michael Cathedral': 'St. Michael's Cathedral' → Score: 89
Match for 'Washington Sq. Park': 'Washington Square Park' → Score: 91
Match for 'Plaza Central': 'Central Plaza' → Score: 100

Strengths

  • Best for long, unordered names.

Weaknesses

  • Can overestimate similarity between strings that share words but are otherwise unrelated.


This presentation was crafted with reveal.js,
a framework created by Hakim El Hattab and contributors
to make stunning HTML presentations.

Made with Slides.com