Text Embeddings for Entity Resolution and Name Disambiguation in the Library Catalog

Network of interconnected nodes and lines on a gradient background that transitions from a warm golden hue at the bottom to a clear blue at the top. The nodes vary in size and are connected by thin lines, creating a web-like structure. Some nodes appear as filled circles while others are outlined, and the density of the network varies across the image, with some areas more clustered and others more sparse.

What's in a Name?

A Name without Life Dates in Any Other Record Would Be Just as Ambiguous


Tim Thompson

Manager, Metadata Services Unit

Yale Library

timothy.thompson@yale.edu

www.linkedin.com/in/timathompson

 

 

Code4Lib 2025

Princeton, New Jersey

March 10, 2025

Attribution & Acknowledgments

License
Creative Commons Attribution 4.0 International (CC-BY).
Credits
  • Consultants:
    • Gavin Mendel-Gleason
    • Maren van Otterdijk
  • Catherine Kwon (Yale, B.S. in Statistics & Data Science and Art)
  • Anisia Hassan Ferreira Evangelista (Yale, B.A. in Economics, Statistics & Data Science, and Energy Studies)
  • Yale Technical Services, Library IT, and Cultural Heritage IT (CHIT) Initiative
Badge representing the Creative Commons Attribution 4.0 International License (CC-BY).

Authority Control

Who's really in control here?

Authority Control in the Catalog

  • Users depend on consistent access points to find what they’re looking for.
  • Access points for people, groups, subjects (and to some extent works) are meant to be controlled and “authorized” against master files.
  • In practice, authority control is messy because data is denormalized and entities are identified by strings rather than unique identifiers.
Catalog card taken from the digital version of the book Documentation made easy.


Screenshot of a MARC Bibliographic record in the Voyager Cataloging Module.

This Heading Is Valid?

Screenshot of the Library of Congress Linked Data Service page for Schubert, Franz.
Homepage of the LUX: Yale Collections Discovery Search Portal
Screenshot of search results in the LUX platform for the name Franz Schubert.
Screenshot of a single cropped search result in the LUX platform for Franz Schubert.
Cropped screenshot of occupation metadata from the Library of Congress for a different Franz Schubert.

Schubert, Franz

Will the Real Franz Schubert Please Stand Up?

Scooby Doo unmasking meme for Schubert, Franz.
  • The lack of consistent, comprehensive authority control is a form of technical debt that impacts both users and library staff.
  • Systems like Alma support integrated authority control processes, but we need to improve the quality of our data to use this functionality effectively.

Problem Statement

Entity Resolution

With some help from text embeddings?

1

2022–2023

Student employees (CHIT and library funded) used an Excel plugin to perform manual entity resolution and disambiguation for common names in the catalog.

2

2023

Research and testing with TerminusDB (open-source graph database)

3

November 2023

Pilot project proposal

4

February–March 2024

Pilot project with consultants

5

September 2024–present

  • Hiring students
  • Following up with consultants on bug fixes and optimizations
  • Carrying out further testing
  • Developing a proof of concept with Claude 3.7 Sonnet

Screenshot of a Microsoft Excel tool called Linked Data Types.

Initial Steps

Training Data

Network graph showing 74 distinct identity clusters.
  • 2,354 names + records
  • 74 identity clusters with at least two representative records

Key Concepts

Embeddings

Vector databases

HNSW, ANN...

Vector Databases

  • Embedding models transform text into a numeric representation.

  • Given a word or string of words, embedding models output a high-dimensional vector that captures the text's semantic meaning, syntactic structure, and contextual nuance.

Credit: Catherine Kwon

[0.004, -0.005, ..., -0.017, 0.005]

“Today is a great day!”

string of text

high-dimensional vector

Text Embeddings

EMBEDDING

MODEL

Credit: Catherine Kwon

  • Once you have a set of text embeddings, what do you do with them?
  • Store them in a specialized index!
  • Vector databases use indexing algorithms such as Hierarchical Navigable Small World (HNSW) graphs for Approximate Nearest Neighbor (ANN) search.
  • As search engines, vector databases make it possible to return results based on user intent, rather than simple keyword matching.
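
The search step can be illustrated with an exact, brute-force nearest-neighbor scan. This is only a sketch: a vector database replaces the linear scan with an approximate index such as HNSW, and the function names, record ids, and 3-dimensional toy vectors below are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, index, k=2):
    """Return the k ids whose vectors are most similar to the query.

    This is an exact linear scan; an ANN index (e.g. HNSW) trades a little
    accuracy for much faster lookups on large collections.
    """
    scored = [(cosine_similarity(query, vec), key) for key, vec in index.items()]
    scored.sort(reverse=True)
    return [key for _, key in scored[:k]]

# Toy 3-dimensional "embeddings"; real models output far more dimensions.
index = {
    "rec-a": [0.9, 0.1, 0.0],
    "rec-b": [0.8, 0.2, 0.1],
    "rec-c": [0.0, 0.1, 0.9],
}
neighbors = nearest_neighbors([1.0, 0.0, 0.0], index, k=2)
print(neighbors)
```

The two records pointing in roughly the same direction as the query are returned; the third, which points elsewhere, is not.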

Vector Databases

Embedding Strings

Contributor: Schubert, Franz
Title: The art of photography: instructions in the art of producing photographic pictures
Attribution: by G.C. Hermann Halleur ; with practical hints on the locale best suited for photographic operations, and on the proper posture, attitude, and dress, for portraiture by F. Schubert and an appendix ; translated from the German by G.L. Strauss
Subjects: Photography
Provision information: London: J. Weale, 1854
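
An embedding string like the one above could be assembled from record fields along these lines. This is a minimal sketch: the dictionary keys and the `build_embedding_string` helper are hypothetical, and the title is shortened for the example.

```python
def build_embedding_string(record):
    """Join the non-empty fields of a record into one labeled text block."""
    labels = [
        ("contributor", "Contributor"),
        ("title", "Title"),
        ("attribution", "Attribution"),
        ("subjects", "Subjects"),
        ("provision", "Provision information"),
    ]
    lines = [f"{label}: {record[key]}" for key, label in labels if record.get(key)]
    return "\n".join(lines)

record = {
    "contributor": "Schubert, Franz",
    "title": "The art of photography",  # shortened for the example
    "attribution": None,  # empty fields are skipped
    "subjects": "Photography",
    "provision": "London: J. Weale, 1854",
}
embedding_string = build_embedding_string(record)
print(embedding_string)
```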

[0.007, -0.008, ..., -0.019, 0.001]

Contributor: Schubert, Franz...

high-dimensional vector

EMBEDDING

MODEL

Credit: Catherine Kwon

  • Once embeddings are indexed, the vector database calculates the “neighborhood” of each node based on a similarity threshold.
  • The results can be converted to an edge list for graph visualization.
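
A minimal sketch of the neighborhood-to-edge-list step, assuming the threshold is a cosine distance cutoff (the slides do not say whether 0.16 is a distance or a similarity). The toy 2-dimensional vectors, record ids, and helper names are hypothetical; clusters are taken to be the connected components of the resulting graph.

```python
import math
from itertools import combinations

def cosine_distance(a, b):
    """1 minus cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def edge_list(vectors, threshold):
    """Connect every pair of records closer than the distance threshold."""
    return [
        (i, j)
        for i, j in combinations(sorted(vectors), 2)
        if cosine_distance(vectors[i], vectors[j]) <= threshold
    ]

def connected_components(nodes, edges):
    """Group nodes into clusters by following edges (simple union-find)."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for i, j in edges:
        parent[find(i)] = find(j)
    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())

vectors = {
    "r1": [1.0, 0.0],
    "r2": [0.95, 0.05],
    "r3": [0.0, 1.0],
    "r4": [0.1, 0.9],
}
edges = edge_list(vectors, threshold=0.16)
clusters = connected_components(vectors, edges)
print(edges)
print(clusters)
```

The edge list can be handed to any graph visualization tool; nodes with no edges show up as singleton clusters.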

“Schubert, Franz”

  • 88 records
  • All 4 identities correctly clustered
  • 4 unconnected nodes
  • ~95% accuracy
  • Threshold: 0.16
Network graph of identities associated with the name Schubert, Franz, numbered 1 to 4 and color-coded by community.

1

2

3

4

  • Every person is different! Difficult to set a uniform threshold for similarity. 
  • Difficult to curate data and embedding strings.
  • Trying to model a person’s identity (or identities) through their works is at best an approximation.
  • Need a more nuanced approach.

Challenges and Limitations

Refining the Approach

Entity Resolution Pipeline with Logistic Regression

Training a Classifier with Text Embeddings

See “Leveraging LLMs and Machine Learning for Record Matching,” blog post by Gavin Mendel-Gleason.

  • Supervised machine learning method.
  • Predicts a binary outcome.
  • Outputs a probability.
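
As a rough illustration of how logistic regression turns a weighted sum into a probability, the sketch below passes a score through the sigmoid function. The weights, bias, and feature values are made up for the example and are not taken from the project.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_match_probability(features, weights, bias):
    """Logistic regression: weighted sum of features passed through the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights: larger distances push the match probability down.
weights = [-4.0, -2.0]
bias = 2.0
p = predict_match_probability([0.1, 0.2], weights, bias)
print(round(p, 3))
```

A binary decision follows by comparing the probability to a cutoff such as 0.5.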

Logistic Regression

Graph showing the characteristic S-curve of the sigmoid function.

Instead of just indexing a single embedding string, we need an entity resolution pipeline:

  • preprocessing and deduplication
  • embedding
  • indexing
  • imputation (vector “hot deck”)
  • feature engineering (with interaction features)
  • classification
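
The slides do not spell out the imputation step. In general, hot-deck imputation fills a missing value with the value from a similar "donor" record. The sketch below assumes the donor is the candidate whose composite vector is closest to the target's; the record layout and the `impute_field` helper are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def impute_field(target, candidates, field):
    """Borrow a missing field vector from the most similar donor record.

    Similarity is measured on the 'composite' vector, which every record
    is assumed to have.
    """
    if target.get(field) is not None:
        return target[field]
    donors = [c for c in candidates if c.get(field) is not None]
    if not donors:
        return None
    best = max(
        donors,
        key=lambda c: cosine_similarity(target["composite"], c["composite"]),
    )
    return best[field]

target = {"composite": [1.0, 0.0], "subjects": None}
candidates = [
    {"composite": [0.9, 0.1], "subjects": [0.2, 0.8]},
    {"composite": [0.0, 1.0], "subjects": [0.7, 0.3]},
    {"composite": [1.0, 0.0], "subjects": None},  # cannot donate
]
imputed = impute_field(target, candidates, "subjects")
print(imputed)
```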

Logistic Regression Pipeline

Split the data into separate fields:

  • composite (combined string)
  • person name
  • title
  • provision information
  • subjects
  • ...

Logistic Regression Pipeline

  • Vectorize each field separately.
  • Choose a field as a key to filter and index based on similarity (e.g., person).
  • Within each neighborhood, compare the entries and create a feature vector containing the distance between vectors for each field:

Logistic Regression Pipeline

\vec{x}_0 = d(e1_{title}, e2_{title})
  • Use gradient descent to train the model on labeled data.
  • Output weights corresponding to each independent variable.
  • Classify pairs of records as matching or not.
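
The steps above can be sketched end to end in plain Python. This is an illustration under stated assumptions, not the project's implementation: the distance function is cosine distance, the per-field vectors and labeled pairs are toy values, and the learning rate and epoch count are arbitrary.

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def feature_vector(rec1, rec2, fields):
    """One distance per field, e.g. x_0 = d(e1_title, e2_title)."""
    return [cosine_distance(rec1[f], rec2[f]) for f in fields]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b):
    """Probability that a pair of records refers to the same person."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

def train(X, y, lr=1.0, epochs=5000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict(xi, w, b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Toy per-field vectors for two records (hypothetical values).
rec1 = {"title": [1.0, 0.0], "person": [0.6, 0.8]}
rec2 = {"title": [0.0, 1.0], "person": [0.6, 0.8]}
x_example = feature_vector(rec1, rec2, ["title", "person"])

# Toy labeled pairs: [title distance, person distance] -> 1 (match) or 0.
X = [[0.05, 0.02], [0.10, 0.05], [0.08, 0.01],
     [0.70, 0.40], [0.85, 0.30], [0.60, 0.55]]
y = [1, 1, 1, 0, 0, 0]

w, b = train(X, y)
is_match = predict([0.07, 0.03], w, b) >= 0.5
is_non_match = predict([0.72, 0.42], w, b) < 0.5
print(w, b, is_match, is_non_match)
```

The learned weights show how much each field's distance contributes to the decision, which is what makes the classifier interpretable.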

Logistic Regression Pipeline

Why not just use an LLM?

Entity resolution prompt and partial answer from Claude 3.7 Sonnet.

Response from Claude Sonnet 4

Classifier

  • Reproducible
  • Interpretable
  • Scalable
  • Configurable
  • Affordable

LLM

  • Black box
  • Expensive
  • Wasteful
  • Smaller models are not as accurate.

LLM Cost Estimates

                                       Claude 3.5 Haiku    Claude 3.7 Sonnet
Expected cost (1 input/output)         $0.00029            $0.00403
Unique name strings in Yale catalog    4,984,022           4,984,022
Total                                  $1,445.37           $20,085.61
  • Embedding model cost (for example, OpenAI text-embedding-3-small) is trivial by comparison.
  • Rough estimate: about $60.00
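
The LLM totals follow from multiplying the expected cost of one input/output exchange by the number of unique name strings, assuming one call per name string. A short check of that arithmetic:

```python
UNIQUE_NAME_STRINGS = 4_984_022  # unique name strings in the Yale catalog

# Expected cost of one input/output exchange, in US dollars.
COST_PER_CALL = {
    "Claude 3.5 Haiku": 0.00029,
    "Claude 3.7 Sonnet": 0.00403,
}

totals = {
    model: round(UNIQUE_NAME_STRINGS * cost, 2)
    for model, cost in COST_PER_CALL.items()
}
for model, total in totals.items():
    print(f"{model}: ${total:,.2f}")
```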

Why not just use an LLM to build a logistic regression pipeline?

Proof-of-Concept Entity Resolution Pipeline

by

Claude 3.7 Sonnet

Featuring Weaviate

 

https://bit.ly/er-pipeline

Results

Test Results Summary:
Total test instances: 23268
Correct predictions: 22102 (94.99%)
Incorrect predictions: 1166 (5.01%)

Results

Performance Metrics

Metric       Value
Precision    1.0000
Recall       0.9006
F1           0.9477
Accuracy     0.9500
ROC AUC      0.9999

Confusion Matrix

                   Predicted Negative    Predicted Positive
Actual Negative    11559                 0
Actual Positive    1164                  10545
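
Precision, recall, F1, and accuracy can be recomputed from the confusion matrix alone using the standard definitions (ROC AUC cannot, since it needs the predicted probabilities). A short sketch:

```python
# Counts taken from the confusion matrix.
tn, fp = 11559, 0      # actual negatives
fn, tp = 1164, 10545   # actual positives

precision = tp / (tp + fp)                          # share of predicted matches that are correct
recall = tp / (tp + fn)                             # share of true matches that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)

for name, value in [("Precision", precision), ("Recall", recall),
                    ("F1", f1), ("Accuracy", accuracy)]:
    print(f"{name}: {value:.4f}")
```

With zero false positives, every error is a missed match, so recall is the metric with room to improve.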
Chart showing a distribution by true/false class of a feature called composite_cosine.

Composite Vector Cosine Similarity

Chart showing a distribution by true/false class of a feature called person_title_harmonic.

Person + Title Vector Harmonic Mean
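
The slides do not define `person_title_harmonic` exactly. One plausible reading, sketched below as an assumption, is the harmonic mean of the person-field and title-field cosine similarities. The input values are made up.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative scores; 0 if either is 0."""
    if a <= 0 or b <= 0:
        return 0.0
    return 2 * a * b / (a + b)

# Hypothetical similarities for one pair of records.
person_similarity = 0.9
title_similarity = 0.3
person_title_harmonic = harmonic_mean(person_similarity, title_similarity)
print(person_title_harmonic)
```

The harmonic mean is pulled toward the smaller of its inputs, so an interaction feature of this kind is high only when both fields agree.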

Black and white photograph of the construction of Yale University's Beinecke Library.

[Photograph of Beinecke Construction]. n.d. https://collections.library.yale.edu/catalog/2037090.

Work in Progress

Thank you!

Tim Thompson

1

2022–2023

Student employees (CHIT and library funded) used an Excel plugin to perform manual entity resolution and disambiguation for common names in the catalog.

2

2023

TerminusDB (graph database) and VectorLink: research and testing

3

November 2023

Proposal to Marty Kurth and ITSC

4

February–March 2024

Consultant project with DataChemist (TerminusDB and VectorLink developers)

5

September–October 2024

  • Hiring students
  • Following up with DataChemist on bug fixes and optimizations
  • Carrying out further testing

Consultant Project

Credit: Catherine Kwon

Dimensionality Reduction

Principal Component Analysis (PCA)

Ground Transportation

Credit: Catherine Kwon

Project Outcomes

Pipeline

1

Extract

  • Export from Voyager

2

Transform

  • Transform to BIBFRAME
  • Generate embedding strings

3

Load

  • Embed using OpenAI
  • Index vectors in VectorLink

Status

  • Currently focused on people as contributors (vs. subjects).
  • Testing with the curated benchmark dataset produced in 2022–2023.
  • 1,274 entries across 54 name/identity clusters.

Embedding Objects

{
  "op": "Inserted",
  "string": "Contributor: Schubert, Franz\n Title: The art of photography: instructions in the art of producing photographic pictures in any color, and on any material : for the use of beginners : and also of persons who have already attained some proficiency in the art : and of engravers on copper, stone, wood, etc.\n Subjects: Photography\n",
  "marcKey": "7001 $aSchubert, Franz.",
  "person": "Schubert, Franz",
  "roles": "Contributor",
  "title": "The art of photography: instructions in the art of producing photographic pictures in any color, and on any material : for the use of beginners : and also of persons who have already attained some proficiency in the art : and of engravers on copper, stone, wood, etc.",
  "variant_titles": null,
  "hub_title": null,
  "subjects": "Photography",
  "genres": null,
  "record": "14703468",
  "id": "14703468#Agent700-23"
}
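
An object like this could be split into the per-field texts that get vectorized separately. The sketch below is an assumption about how that might look: the choice of fields in `FIELDS_TO_EMBED` and the helper name are hypothetical, and the title is shortened.

```python
import json

# Hypothetical choice of fields to vectorize; null fields are skipped.
FIELDS_TO_EMBED = ["string", "person", "title", "subjects", "genres"]

raw = """
{
  "string": "Contributor: Schubert, Franz\\n Title: The art of photography\\n Subjects: Photography\\n",
  "person": "Schubert, Franz",
  "title": "The art of photography",
  "subjects": "Photography",
  "genres": null,
  "record": "14703468",
  "id": "14703468#Agent700-23"
}
"""

def texts_to_embed(obj):
    """Map each non-null field to the text that would be sent to the embedding model."""
    return {f: obj[f].strip() for f in FIELDS_TO_EMBED if obj.get(f)}

obj = json.loads(raw)
fields = texts_to_embed(obj)
print(sorted(fields))
```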

“Thayer, Nathaniel”

  • 26 records
  • 3 (4?) identities correctly clustered
  • Some fragmentation of the main cluster
  • Threshold: 0.16

1

2

3

4?

1a

1b

Redundant Data

Contributor: Halleur, G. C. Hermann
 Title: The art of photography: instructions in the art of producing photographic pictures in any color, and on any material : for the use of beginners : and also of persons who have already attained some proficiency in the art : and of engravers on copper, stone, wood, etc.
 Subjects: Photography

Contributor: Schubert, Franz
 Title: The art of photography: instructions in the art of producing photographic pictures in any color, and on any material : for the use of beginners : and also of persons who have already attained some proficiency in the art : and of engravers on copper, stone, wood, etc.
 Subjects: Photography

Library of Congress Names?

Name: Schubert, Franz, 1876-
 Variant names: Schubert, Franz, b. 1876
 Sources: Grundzüge der Pastoraltheologie, 1922 (Dr. Franz Schubert, o. ö. Professor an der Universität Breslau); DNB in VIAF, Oct. 7, 2011 (hdg.: Schubert, Franz, 1876-; German theologian, professor of pastoral theology); Deutsche Biographie, viewed 28 September 2022 (Franz Schubert; born in Bistrai-Bielitz (Austrian Silesia) in 1876, died in Breslau in 1937; Catholic theologian specializing in pastoral theology)

Contributor: Schubert, Franz
 Title: Liturgische Zeitschrift
 Subjects: Catholic Church--Liturgy--Periodicals.

Similarity: 0.23 😿

Bonus!

Simple RAG demo
