Text Embeddings for Entity Resolution and Name Disambiguation in the Library Catalog

Network of interconnected nodes and lines on a gradient background that transitions from a warm golden hue at the bottom to a clear blue at the top. The nodes vary in size and are connected by thin lines, creating a web-like structure. Some nodes appear as filled circles while others are outlined, and the density of the network varies across the image, with some areas more clustered and others more sparse.

Tim Thompson

Manager, Metadata Services Unit

Yale Library

timothy.thompson@yale.edu

www.linkedin.com/in/timathompson

 

Generative AI at Yale Library: A Symposium for Yale Library Staff

January 7, 2024

Attribution & Acknowledgments

License
Creative Commons Attribution 4.0 International (CC-BY).
Credits
  • VectorLink (formerly DataChemist)
    • Gavin Mendel-Gleason
    • Maren van Otterdijk
  • Catherine Kwon (Yale, B.S. in Statistics & Data Science and Art)
  • Daniel Lovins
  • Marty Kurth
  • IT Steering Committee
  • Library IT
  • Cultural Heritage IT (CHIT) Initiative
Badge representing the Creative Commons Attribution 4.0 International License (CC-BY).

Authority Control

Who’s really in control here?

  • Users depend on consistent access points to find what they’re looking for.
  • Access points for people, groups, subjects (and to some extent works) are meant to be controlled and “authorized” against master files.
  • In practice, authority control is messy because data is denormalized and entities are identified by strings rather than unique identifiers.

Authority Control in the Catalog

Catalog card taken from the digital version of the book Documentation made easy.

This Heading Is Valid?

Overview

Issue
Problem statement and key concepts
Action
Project scope and evolution
Impact

Initial results, challenges, potential next steps 

  • The lack of consistent, comprehensive authority control in the catalog is a form of technical debt that impacts both users and staff.
  • The migration from Voyager to Alma gives us an opportunity to address the problem.
  • Alma supports integrated authority control processes, but we need to improve the quality of our data to utilize this functionality effectively.

Problem Statement

Entity Resolution

With some help from text embeddings?

Initial Steps

Identity Clusters

1

2022–2023

Student employees (CHIT- and library-funded) used an Excel plugin to perform manual entity resolution and disambiguation for common names in the catalog

2

2023

TerminusDB (graph database) and VectorLink: research and testing

3

November 2023

Proposal to Marty Kurth and ITSC

4

February–March 2024

Consultant project with DataChemist (TerminusDB and VectorLink developers)

5

September–October 2024

  • Hiring students
  • Following up with DataChemist on bug fixes and optimizations
  • Carrying out further testing

Key Concepts

Text embeddings, vector databases, HNSW, ANN...

  • Embedding models transform text into a numeric representation.

  • Given a word or string of words, embedding models output a high-dimensional vector that captures the text’s semantic meaning, syntactic structure, and contextual nuance.
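
To make the text-in, vector-out interface concrete, here is a minimal sketch. It is not a real embedding model: `toy_embed` is a hypothetical stand-in that hashes character trigrams into a short fixed-length vector, so it captures surface spelling only, not meaning. The dimension of 16 is likewise an assumption chosen for readability; real models output hundreds or thousands of dimensions.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 16) -> list[float]:
    """Map text to a fixed-length vector by hashing character trigrams.

    Illustrative stand-in only: a real embedding model learns its vectors,
    which is what lets them reflect meaning rather than spelling.
    """
    vec = [0.0] * dim
    padded = f"  {text.lower()} "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        digest = hashlib.sha256(trigram.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim  # which dimension
        sign = 1.0 if digest[4] % 2 == 0 else -1.0        # add or subtract
        vec[bucket] += sign
    # Scale to unit length so vectors are comparable by angle.
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


vector = toy_embed("Today is a great day!")
print(len(vector), [round(v, 3) for v in vector[:4]])
```

The same input always yields the same vector, and every input yields a vector of the same length, which is what makes embeddings indexable.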

Text Embeddings

Credit: Catherine Kwon

“Today is a great day!” (string of text)

[0.004, -0.005, ..., -0.017, 0.005] (high-dimensional vector)

Text Embeddings

EMBEDDING MODEL

Credit: Catherine Kwon

  • Once you have a set of text embeddings, what do you do with them?
  • Store them in a specialized index!
  • Vector databases use indexing algorithms such as Hierarchical Navigable Small World (HNSW) graphs for Approximate Nearest Neighbor (ANN) search.
  • As search engines, vector databases make it possible to return results based on user intent, rather than simple keyword matching.
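
As a point of comparison, the sketch below performs exact nearest-neighbor search by comparing the query against every stored vector. This is the brute-force baseline that ANN indexes such as HNSW approximate in order to avoid scanning the whole collection; it is not an HNSW implementation. The record IDs, the three-dimensional vectors, and the choice of cosine distance are assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_neighbors(query, index, k=2):
    """Return the k closest (id, distance) pairs by exhaustive comparison."""
    scored = [(item_id, cosine_distance(query, vec)) for item_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hypothetical toy index: id -> embedding vector.
index = {
    "rec-a": [1.0, 0.0, 0.0],
    "rec-b": [0.9, 0.1, 0.0],
    "rec-c": [0.0, 1.0, 0.0],
}
neighbors = nearest_neighbors([1.0, 0.05, 0.0], index, k=2)
print(neighbors)
```

Exhaustive search costs one comparison per stored vector per query, which is why large collections rely on an approximate index instead.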

Vector Databases

Credit: Catherine Kwon

Embedding Strings

Contributor: Schubert, Franz
Title: The art of photography: instructions in the art of producing photographic pictures
Attribution: by G.C. Hermann Halleur ; with practical hints on the locale best suited for photographic operations, and on the proper posture, attitude, and dress, for portraiture by F. Schubert and an appendix ; translated from the German by G.L. Strauss
Subjects: Photography
Provision information: London: J. Weale, 1854
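
An embedding string like the one above could be assembled from record fields roughly as follows. This is a sketch, not the project's actual code: the function name, the dictionary keys, and the rule of skipping empty fields are all assumptions.

```python
def build_embedding_string(record: dict) -> str:
    """Join labeled fields into one string, one field per line."""
    # (hypothetical dict key, label as it appears in the embedding string)
    labels = [
        ("contributor", "Contributor"),
        ("title", "Title"),
        ("attribution", "Attribution"),
        ("subjects", "Subjects"),
        ("provision", "Provision information"),
    ]
    # Fields that are missing or empty are left out entirely.
    lines = [f"{label}: {record[key]}" for key, label in labels if record.get(key)]
    return "\n".join(lines)


record = {
    "contributor": "Schubert, Franz",
    "title": "The art of photography: instructions in the art of producing photographic pictures",
    "subjects": "Photography",
    "provision": "London: J. Weale, 1854",
}
embedding_string = build_embedding_string(record)
print(embedding_string)
```

Which fields go into the string, and in what order, directly shapes the resulting vector, so the field list is a design decision rather than a formatting detail.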

  • Once embeddings are indexed, the vector database calculates the “neighborhood” of each node based on a similarity threshold.
  • The results can be converted to an edge list for graph visualization.
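
The thresholding and edge-list steps can be sketched as below. Several things here are assumptions: the threshold is treated as a distance (pairs at or below it are linked), the pairwise distances are invented toy values, and reading clusters off the edge list as connected components is one possible approach, not necessarily what the vector database does internally.

```python
from collections import defaultdict


def edge_list(distances, threshold=0.16):
    """Keep only the pairs whose distance is within the threshold."""
    return [(a, b) for (a, b), d in distances.items() if d <= threshold]


def clusters(nodes, edges):
    """Group nodes into connected components of the thresholded graph."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, groups = set(), []
    for node in nodes:
        if node in seen:
            continue
        stack, group = [node], set()
        while stack:  # depth-first walk from this node
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(adjacency[current] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Hypothetical pairwise distances between six records.
distances = {
    ("r1", "r2"): 0.05,
    ("r2", "r3"): 0.12,
    ("r3", "r4"): 0.40,
    ("r4", "r5"): 0.10,
}
nodes = ["r1", "r2", "r3", "r4", "r5", "r6"]
edges = edge_list(distances)
groups = clusters(nodes, edges)
print(edges)
print(groups)
```

In this toy data, `r6` has no edge within the threshold and ends up as an unconnected node, the same kind of outcome reported for some records in the results that follow.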

“Schubert, Franz”

  • 88 records
  • All 4 identities correctly clustered
  • 4 unconnected nodes
  • ~95% accuracy
  • Threshold: 0.16

Graph visualization with identity clusters labeled 1, 2, 3, and 4.

  • Every person is different! Difficult to set a uniform threshold for similarity. 
  • Difficult to “curate” benchmark data and embedding strings.
  • Trying to model a person’s identity (or identities) through their works is at best an approximation.
  • Need a more nuanced approach.

Challenges and Limitations

  • Supervised machine learning method.
  • Predicts a binary outcome based on training data.
  • Outputs a probability.
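
For reference, the prediction step of a logistic regression model looks like this in miniature. The weights, bias, feature values, and the 0.5 decision cutoff are invented for illustration; in practice the weights come from training on labeled data.

```python
import math


def predict_match_probability(features, weights, bias):
    """Weighted sum of the features, squashed into (0, 1) by the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical inputs: two distance features for one pair of records.
features = [0.05, 0.30]
weights = [-8.0, -4.0]  # negative: larger distances lower the probability
bias = 2.0

probability = predict_match_probability(features, weights, bias)
is_match = probability >= 0.5  # binary outcome from the probability
print(round(probability, 3), is_match)
```

Because the output is a probability rather than a hard yes or no, the cutoff can be tuned, or uncertain pairs can be routed to a person for review.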

Solution? Logistic Regression

  • Instead of indexing a single embedding string, split the process into two stages:
    • filtering
    • classification
  • Split the data into separate fields:
    • record (combined string)
    • person
    • title
    • attribution
    • subjects
    • ...
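
Splitting an entry into separately embeddable fields might look like the sketch below. The key names (`string` for the combined record string, plus one key per field) and the rule of skipping empty fields are assumptions; the sample values are taken from the “Liturgische Zeitschrift” example elsewhere in this deck.

```python
# Hypothetical field names, following the list above; adjust to the real data.
FIELDS = ("person", "title", "attribution", "subjects")


def split_fields(entry: dict) -> dict:
    """Return the text to embed for each field, skipping empty values."""
    texts = {"record": entry["string"]}  # assumed key for the combined string
    for name in FIELDS:
        value = entry.get(name)
        if value:  # null or missing fields are not embedded
            texts[name] = value
    return texts


entry = {
    "string": "Contributor: Schubert, Franz\nTitle: Liturgische Zeitschrift\n"
              "Subjects: Catholic Church--Liturgy--Periodicals.",
    "person": "Schubert, Franz",
    "title": "Liturgische Zeitschrift",
    "attribution": None,
    "subjects": "Catholic Church--Liturgy--Periodicals.",
}
texts = split_fields(entry)
print(sorted(texts))
```

Each value in `texts` would then be embedded on its own, giving one vector per field instead of a single vector per record.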

Solution? Logistic Regression

  • Vectorize each field separately.
  • Choose a field as a key to filter and index based on similarity (e.g., person).
  • Within each neighborhood, compare the entries and create a feature vector containing the distance between vectors for each field:

Solution? Logistic Regression

\vec{x}_0 = d(e1_{title}, e2_{title})
  • Use gradient descent to train the model on labeled data.
  • Output weights corresponding to each independent variable.
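
The feature vector and training steps can be sketched together as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the field subset, the use of cosine distance for d, the toy embeddings, the labeled examples, and the learning rate and epoch count are all invented.

```python
import math

FIELDS = ("title", "subjects")  # hypothetical subset of fields


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def feature_vector(e1, e2):
    """One distance per field, e.g. x_0 = d(e1_title, e2_title)."""
    return [cosine_distance(e1[f], e2[f]) for f in FIELDS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the logistic (cross-entropy) loss."""
    weights, bias, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, label in zip(X, y):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - label
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy per-field embeddings for two entries (same title, different subjects).
e1 = {"title": [1.0, 0.0], "subjects": [0.0, 1.0]}
e2 = {"title": [1.0, 0.0], "subjects": [1.0, 0.0]}
x = feature_vector(e1, e2)

# Invented labeled pairs: feature vectors and 1 = same person, 0 = different.
X = [[0.05, 0.10], [0.08, 0.20], [0.60, 0.70], [0.55, 0.90]]
y = [1, 1, 0, 0]
weights, bias = train(X, y)
print(x, [round(w, 2) for w in weights], round(bias, 2))
```

The learned weights indicate how much each field's distance counts toward the decision, which is what lets the model go beyond a single uniform threshold.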

Solution? Logistic Regression

[Photograph of Beinecke Construction]. 0AD. https://collections.library.yale.edu/catalog/2037090.

Work in Progress

Thank you!

Tim Thompson

Will the Real Schubert Please Stand Up?

Consultant Project

Credit: Catherine Kwon

Dimensionality Reduction

Principal Component Analysis (PCA)

Ground Transportation

Credit: Catherine Kwon

Project Outcomes

Pipeline

1

Extract

  • Export from Voyager

2

Transform

  • Transform to BIBFRAME
  • Generate embedding strings

3

Load

  • Embed using OpenAI
  • Index vectors in VectorLink

Status

  • Currently focused on people as contributors (vs. subjects).
  • Testing with the curated benchmark dataset produced in 2022–2023.
  • 1,274 entries across 54 name/identity clusters.

Embedding Objects

{
  "op": "Inserted",
  "string": "Contributor: Schubert, Franz\n Title: The art of photography: instructions in the art of producing photographic pictures in any color, and on any material : for the use of beginners : and also of persons who have already attained some proficiency in the art : and of engravers on copper, stone, wood, etc.\n Subjects: Photography\n",
  "marcKey": "7001 $aSchubert, Franz.",
  "person": "Schubert, Franz",
  "roles": "Contributor",
  "title": "The art of photography: instructions in the art of producing photographic pictures in any color, and on any material : for the use of beginners : and also of persons who have already attained some proficiency in the art : and of engravers on copper, stone, wood, etc.",
  "variant_titles": null,
  "hub_title": null,
  "subjects": "Photography",
  "genres": null,
  "record": "14703468",
  "id": "14703468#Agent700-23"
}

“Thayer, Nathaniel”

  • 26 records
  • 3 (4?) identities correctly clustered
  • Some fragmentation of the main cluster
  • Threshold: 0.16

Graph visualization with identity clusters labeled 1 (fragmented into 1a and 1b), 2, 3, and 4?.

Redundant Data

Contributor: Halleur, G. C. Hermann
 Title: The art of photography: instructions in the art of producing photographic pictures in any color, and on any material : for the use of beginners : and also of persons who have already attained some proficiency in the art : and of engravers on copper, stone, wood, etc.
 Subjects: Photography

Contributor: Schubert, Franz
 Title: The art of photography: instructions in the art of producing photographic pictures in any color, and on any material : for the use of beginners : and also of persons who have already attained some proficiency in the art : and of engravers on copper, stone, wood, etc.
 Subjects: Photography

Library of Congress Names?

Name: Schubert, Franz, 1876-
 Variant names: Schubert, Franz, b. 1876
 Sources: Grundzüge der Pastoraltheologie, 1922 (Dr. Franz Schubert, o. ö. Professor an der Universität Breslau); DNB in VIAF, Oct. 7, 2011 (hdg.: Schubert, Franz, 1876-; German theologian, professor of pastoral theology); Deutsche Biographie, viewed 28 September 2022 (Franz Schubert; born in Bistrai-Bielitz (Austrian Silesia) in 1876, died in Breslau in 1937; Catholic theologian specializing in pastoral theology)

Contributor: Schubert, Franz
 Title: Liturgische Zeitschrift
 Subjects: Catholic Church--Liturgy--Periodicals.

Similarity: 0.23 😿

Bonus!

Simple RAG demo