Text Embeddings for Entity Resolution and Name Disambiguation in the Library Catalog

Network of interconnected nodes and lines on a gradient background that transitions from a warm golden hue at the bottom to a clear blue at the top. The nodes vary in size and are connected by thin lines, creating a web-like structure. Some nodes appear as filled circles while others are outlined, and the density of the network varies across the image, with some areas more clustered and others more sparse.

Tim Thompson

Librarian for Applied Metadata Research

Yale University Library

timothy.thompson@yale.edu

www.linkedin.com/in/timathompson

@timathom@indieweb.social

 

Yale Library AI Community of Interest

October 9, 2024

Attribution & Acknowledgments

License
Creative Commons Attribution 4.0 International (CC-BY).
Credits
  • Catherine Kwon (Yale, B.S. in Statistics & Data Science and Art)
  • DataChemist
    • Gavin Mendel-Gleason
    • Maren van Otterdijk
  • Daniel Lovins
  • Marty Kurth
  • IT Steering Committee
  • Library IT
  • Cultural Heritage IT (CHIT) Initiative
Badge representing the Creative Commons Attribution 4.0 International License (CC-BY).

Overview

Issue
Problem statement and key concepts
Action
Project scope and evolution
Impact

Initial results, challenges, potential next steps 

[Photograph of Beinecke Construction]. 0AD. https://collections.library.yale.edu/catalog/2037090.

Work in Progress

Authority Control

Who’s really in control here?

  • Users depend on consistent access points to find what they’re looking for.
  • Access points for people, groups, subjects (and to some extent works) are meant to be controlled and “authorized” against master files.
  • In practice, authority control is messy because data is denormalized and entities are identified by strings rather than unique identifiers.

Authority Control in the Catalog

Catalog card taken from the digital version of the book Documentation made easy.

This Heading Is Valid?

Will the Real Schubert Please Stand Up?

  • The lack of consistent, comprehensive authority control in the catalog is a form of technical debt that impacts both users and staff.
  • The migration from Voyager to Alma gives us an opportunity to address the problem.
  • Alma supports integrated authority control processes, but we need to improve the quality of our data to utilize this functionality effectively.

Problem Statement

Entity Resolution

With some help from text embeddings?

1

2022–2023

Student employees (CHIT and library funded) used an Excel plugin to perform manual entity resolution and disambiguation for common names in the catalog

2

2023

TerminusDB (graph database) and VectorLink: research and testing

3

November 2023

Proposal to Marty Kurth and ITSC

4

February–March 2024

Consultant project with DataChemist (TerminusDB and VectorLink developers)

5

September–October 2024

  • Hiring students
  • Following up with DataChemist on bug fixes and optimizations
  • Carrying out further testing

Initial Steps

Consultant Project

Key Concepts

Text embeddings, vector databases, HNSW, ANN...

  • Embedding models transform text into a numeric representation.

  • Given a word or string of words, embedding models output a high-dimensional vector that captures the text’s semantic meaning, syntactic structure, and contextual nuance.
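
One common way to compare two embedding vectors is cosine similarity. The sketch below is illustrative only: the 4-dimensional vectors and their variable names are hypothetical stand-ins, not output from a real embedding model, which would return hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, chosen by hand for illustration.
sunny_day = [0.9, 0.1, 0.0, 0.2]
great_day = [0.8, 0.2, 0.1, 0.3]
tax_form = [0.0, 0.1, 0.9, 0.1]

print(cosine_similarity(sunny_day, great_day))  # close to 1: similar direction
print(cosine_similarity(sunny_day, tax_form))   # close to 0: nearly orthogonal
```

Texts with related meaning should land near each other in the vector space, so their cosine similarity is high; unrelated texts point in different directions.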

Text Embeddings

Credit: Catherine Kwon

“Today is a great day!” (string of text) → EMBEDDING MODEL → [0.004, -0.005, ..., -0.017, 0.005] (high-dimensional vector)

Text Embeddings

Credit: Catherine Kwon

Dimensionality Reduction

Principal Component Analysis (PCA)

Ground Transportation

Credit: Catherine Kwon
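
As a rough illustration of what PCA does, the sketch below finds only the first principal component of a small dataset using power iteration on the covariance matrix. It is a simplified stand-in, not the method used to produce the plot: the function name, the sample points, and the starting vector are all assumptions, and a real workflow would more likely use a numerical library and keep two or three components.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, projected scores) for the top principal component."""
    n, dim = len(rows), len(rows[0])
    # Center each column so the component describes variance around the mean.
    means = [sum(row[j] for row in rows) / n for j in range(dim)]
    centered = [[row[j] - means[j] for j in range(dim)] for row in rows]
    # Sample covariance matrix (dim x dim).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]
    # Arbitrary non-zero starting vector (assumption); power iteration fails
    # only if it is exactly orthogonal to the top component.
    v = [1.0 / (j + 1) for j in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # all points identical: no variance to explain
        v = [x / norm for x in w]
    # Project each centered point onto the component: 3-D in, 1-D out.
    scores = [sum(c[j] * v[j] for j in range(dim)) for c in centered]
    return v, scores


# Hypothetical 3-D points that vary mostly along the (1, 1, 0) direction.
points = [[1.0, 1.0, 0.0], [2.0, 2.0, 0.1], [3.0, 3.0, 0.0], [4.0, 4.0, 0.1]]
direction, scores = first_principal_component(points)
print(direction)
print(scores)
```

The same idea, applied to embedding vectors with more components kept, is what allows high-dimensional vectors to be drawn on a two-dimensional slide.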

  • Once you have a set of text embeddings, what do you do with them?
  • Store them in a specialized index!
  • Vector databases use indexing algorithms such as Hierarchical Navigable Small World (HNSW) graphs for Approximate Nearest Neighbor (ANN) search.
  • As search engines, vector databases make it possible to return results based on user intent, rather than simple keyword matching.
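
For contrast with ANN search, here is a minimal sketch of exact nearest-neighbor search, which scans every stored vector. This is not HNSW: an HNSW index approximates the same ranking while visiting far fewer vectors, which matters at the scale of millions of records. The record keys and vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 minus cosine similarity, so 0.0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def nearest_neighbors(query, index, k=2):
    """Exact k-nearest-neighbor search by comparing the query to every vector."""
    scored = [(key, cosine_distance(query, vec)) for key, vec in index.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Hypothetical index of three tiny vectors keyed by record identifier.
index = {
    "rec-1": [0.9, 0.1, 0.0],
    "rec-2": [0.8, 0.3, 0.1],
    "rec-3": [0.0, 0.2, 0.9],
}
results = nearest_neighbors([1.0, 0.2, 0.0], index, k=2)
print(results)
```

The linear scan costs one distance computation per stored vector for every query, which is the cost that graph-based indexes such as HNSW are designed to avoid.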

Vector Databases

Credit: Catherine Kwon

Project Outcomes

Pipeline

1

Extract

  • Export from Voyager

2

Transform

  • Transform to BIBFRAME
  • Generate embedding strings

3

Load

  • Embed using OpenAI
  • Index vectors in VectorLink

Status

  • Currently focused on people as contributors (vs. subjects).
  • Testing with the curated benchmark dataset produced in 2022–2023.
  • 1,274 entries across 54 name/identity clusters.

Embedding Strings

Contributor: Schubert, Franz
Title: The art of photography: instructions in the art of producing photographic pictures in any color, and on any material : for the use of beginners : and also of persons who have already attained some proficiency in the art : and of engravers on copper, stone, wood, etc.
Subjects: Photography

Embedding Objects

{
  "op": "Inserted",
  "string": "Contributor: Schubert, Franz\n Title: The art of photography: instructions in the art of producing photographic pictures in any color, and on any material : for the use of beginners : and also of persons who have already attained some proficiency in the art : and of engravers on copper, stone, wood, etc.\n Subjects: Photography\n",
  "marcKey": "7001 $aSchubert, Franz.",
  "person": "Schubert, Franz",
  "roles": "Contributor",
  "title": "The art of photography: instructions in the art of producing photographic pictures in any color, and on any material : for the use of beginners : and also of persons who have already attained some proficiency in the art : and of engravers on copper, stone, wood, etc.",
  "variant_titles": null,
  "hub_title": null,
  "subjects": "Photography",
  "genres": null,
  "record": "14703468",
  "id": "14703468#Agent700-23"
}
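
The "string" field in the object above appears to be assembled from the other fields. The sketch below shows one way that could work; the function name and the rule of skipping null fields are assumptions, not the project's actual transformation code, and only the fields visible in the example string are handled.

```python
def build_embedding_string(record):
    """Assemble the text to embed from a record's role, person, title, and subjects."""
    parts = [f"{record['roles']}: {record['person']}"]
    # Assumed rule: skip fields whose value is null (None) or empty.
    for label, field in (("Title", "title"), ("Subjects", "subjects")):
        value = record.get(field)
        if value:
            parts.append(f"{label}: {value}")
    # Lines are joined with "\n " and the string ends with "\n",
    # matching the layout of the "string" field in the example object.
    return "\n ".join(parts) + "\n"


record = {
    "person": "Schubert, Franz",
    "roles": "Contributor",
    "title": "Liturgische Zeitschrift",
    "subjects": "Catholic Church--Liturgy--Periodicals.",
}
print(build_embedding_string(record))
```

Keeping the string template in one function makes it easier to experiment with which fields are included, since the choice of fields directly shapes the resulting vectors.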

VectorLink

  • Once embeddings are indexed, VectorLink calculates the “neighborhood” of each node based on a similarity threshold.
  • The results can be converted to an edge list for graph visualization.
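
A minimal sketch of the neighborhood idea: compare every pair of vectors and keep the pairs that fall within a threshold as edges. This is not VectorLink's implementation. It assumes the threshold is a maximum cosine distance (lower means more similar), and the node identifiers and vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Return 1 minus cosine similarity, so 0.0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def edge_list(vectors, threshold=0.16):
    """Return (id_a, id_b, distance) for every pair within the threshold."""
    edges = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(vectors.items()), 2):
        distance = cosine_distance(vec_a, vec_b)
        # Assumption: the threshold is an upper bound on cosine distance.
        if distance <= threshold:
            edges.append((id_a, id_b, round(distance, 4)))
    return edges


# Hypothetical 2-D vectors: "a" and "b" point the same way, "c" does not.
vectors = {"a": [1.0, 0.0], "b": [0.95, 0.1], "c": [0.0, 1.0]}
edges = edge_list(vectors)
print(edges)
```

The resulting list of pairs can be loaded into most graph visualization tools, with the distance available as an edge weight.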

“Schubert, Franz”

  • 88 records
  • All 4 identities correctly clustered
  • 4 unconnected nodes
  • ~95% accuracy
  • Threshold: 0.16
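
One way to turn an edge list into clusters and unconnected nodes is to compute connected components. The sketch below assumes that approach; the record identifiers and edges are hypothetical and much smaller than the 88-record Schubert set.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes into clusters; nodes with no edges become singletons."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, clusters = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from start.
        stack, cluster = [start], set()
        while stack:
            node = stack.pop()
            if node in cluster:
                continue
            cluster.add(node)
            stack.extend(adjacency[node] - cluster)
        seen |= cluster
        clusters.append(sorted(cluster))
    return clusters


# Hypothetical records: two clusters plus one unconnected node.
nodes = ["r1", "r2", "r3", "r4", "r5", "r6"]
edges = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
clusters = connected_components(nodes, edges)
print(clusters)
```

Under this reading, each multi-node component is a candidate identity, and singleton components correspond to the unconnected nodes that still need review.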

[Graph visualization with clusters labeled 1 through 4]

“Thayer, Nathaniel”

  • 26 records
  • 3 (4?) identities correctly clustered
  • Some fragmentation of the main cluster
  • Threshold: 0.16

[Graph visualization with clusters labeled 1, 2, 3, and 4?; the main cluster is split into 1a and 1b]

  • Every person is different! Difficult to set a uniform threshold for similarity. 
  • Trying to model a person’s identity (or identities) through their works is at best an approximation.
  • Difficult to “curate” benchmark data and embedding strings.
  • Pipeline is not yet optimized for production.
    • Embeddings are cheap (~$15.00 to embed 13 million strings using text-embedding-3-small).
    • Computation and storage are expensive (7.5 GB of embedding strings becomes 75 GB of vectors).
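
A back-of-the-envelope check of the storage figure, under two assumptions not stated on the slide: that the 75 GB of vectors corresponds to the same 13 million strings, and that each vector is stored as 32-bit floats. The default output size of text-embedding-3-small is 1,536 dimensions.

```python
def vector_storage_bytes(count, dimensions, bytes_per_value=4):
    """Raw size of `count` dense vectors, ignoring index overhead."""
    return count * dimensions * bytes_per_value


strings = 13_000_000   # number of embedded strings, from the slide
dimensions = 1536      # default dimensionality of text-embedding-3-small
total = vector_storage_bytes(strings, dimensions)  # assumes float32 values

print(total)                      # raw bytes
print(round(total / 1e9, 1))      # decimal gigabytes
print(round(total / 2**30, 1))    # binary gibibytes
```

Each vector takes 1,536 × 4 = 6,144 bytes, so 13 million vectors come to roughly 80 GB (about 74 GiB) before any index overhead, which is in line with the figure quoted above.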

Challenges and Limitations

Redundant Data

Contributor: Halleur, G. C. Hermann
 Title: The art of photography: instructions in the art of producing photographic pictures in any color, and on any material : for the use of beginners : and also of persons who have already attained some proficiency in the art : and of engravers on copper, stone, wood, etc.
 Subjects: Photography

Contributor: Schubert, Franz
 Title: The art of photography: instructions in the art of producing photographic pictures in any color, and on any material : for the use of beginners : and also of persons who have already attained some proficiency in the art : and of engravers on copper, stone, wood, etc.
 Subjects: Photography

Library of Congress Names?

Name: Schubert, Franz, 1876-
 Variant names: Schubert, Franz, b. 1876
 Sources: Grundzüge der Pastoraltheologie, 1922 (Dr. Franz Schubert, o. ö. Professor an der Universität Breslau); DNB in VIAF, Oct. 7, 2011 (hdg.: Schubert, Franz, 1876-; German theologian, professor of pastoral theology); Deutsche Biographie, viewed 28 September 2022 (Franz Schubert; born in Bistrai-Bielitz (Austrian Silesia) in 1876, died in Breslau in 1937; Catholic theologian specializing in pastoral theology)

Contributor: Schubert, Franz
 Title: Liturgische Zeitschrift
 Subjects: Catholic Church--Liturgy--Periodicals.

Similarity: 0.23 😿

Bonus!

Simple RAG demo

Thank you!

Tim Thompson