Auto-cataloging Research Materials with "Small" Vision-Language Models 

Andrew Janco, Princeton University

Ann Farnsworth-Alvear, University of Pennsylvania

Outline

  • HTR project centered on local-history researchers
  • HTR + case summaries (metadata)
  • Shift to VLMs
  • Model fine-tuning (for period, domain, manuscript damage)
  • Segmentation and "Chunking"
  • Evaluation of model outputs
  • Next Steps

Thank you to many collaborators:


Daniel Varela, Daniel Tubb, Kelly López Roldán, Ann Farnsworth-Alvear, Yuri Romaña Rivas, María Fernanda Parra Rodríguez, Sergio Mosquera, Andrew Janco, Cynthia Heider, Brie Gettleson, Eliecer Angulo Castro, Angélica Aqualimpia Copete, Laura Caicedo, Milagros Gonzalez, Javier Hurtado Ibargüen, Ernestina Lemos Rentería, Yusleyda Perea Cuesta, Jhon Leison Rivas Rodríguez, Nallely Taborda Castañeda, Yeison Vente
 

Photographs: Yeison Vente, Daniel Varela, María Fernanda Parra Ramírez. Persons shown lower left, clockwise from left: Jhon Leison Rivas, Yusleyda Perea Cuesta, Ernestina Lemos Rentería (2022)

Circuit Court of Istmina, Chocó
(Project with Daniel Varela (University of Michigan)
and Fundación Muntú Bantú)

https://eap.bl.uk/project/EAP1477

 

Location of the Chocó

Map credit: Instituto Geográfico Agustín Codazzi (IGAC)

Pacific-coast rainforest region with a long history of mining and impoverishment and a majority Afro-Colombian population. Persons shown: Javier Hurtado, Angélica Aqualimpia, Yusleyda Perea Cuesta (2022). Photos: Ann Farnsworth-Alvear

Colombia

In Istmina, 2022

The work of photographing and digitizing the archive was done by
Angélica Aqualimpia, Yeison Vente, and Daniel Varela.
In photo: Yeison Vente of @formatonegro

Semillero (in Quibdó, 2022-23)

The work of human cataloging and the creation of youth-authored essays was done in collaboration with
Fundación Muntú Bantú and Kelly López Roldán, University of Pennsylvania

Fichero, 2023-2025

Fichero transcribes and auto-catalogs historical archives using vision-language models, running locally or in the cloud. It is a collaboration with anthropologist Daniel Tubb, University of New Brunswick.

model                                      accuracy
HTR-Araucania_XIX                          90.4%
Fmb-best                                   91.8%
Sergio diary (single handwriting style)    97.5%
Se trata de una diligencia judicial para constatar la realidad y magnitud de los 

Chat-format interleaved vision-language (VL) data

{
  "messages": [
    {
      "content": "<image>extract text",
      "role": "user"
    },
    {
      "content": "plotar dichas minas en la forma que lo",
      "role": "assistant"
    }
  ],
  "images": [
    "fmb_images/055e42a3-a5c9-40f8-9bf6-3cd5e82dcc1c.png"
  ]
}
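A minimal sketch of how line-level image-text pairs could be turned into records of the shape shown above. The helper names `make_record` and `build_dataset` and the example file names are hypothetical. The field layout simply mirrors the sample record, and whether a given training framework expects exactly this schema is an assumption.

```python
import json


def make_record(image_path: str, transcription: str, prompt: str = "extract text") -> dict:
    """Build one chat-format record for a single line image (hypothetical helper)."""
    return {
        "messages": [
            # The "<image>" placeholder marks where the image is interleaved with the prompt.
            {"content": f"<image>{prompt}", "role": "user"},
            {"content": transcription, "role": "assistant"},
        ],
        "images": [image_path],
    }


def build_dataset(pairs):
    """Convert (image_path, transcription) pairs to records, skipping empty transcriptions."""
    return [make_record(path, text) for path, text in pairs if text.strip()]


# Hypothetical file names, for illustration only.
pairs = [
    ("fmb_images/line_0001.png", "plotar dichas minas en la forma que lo"),
    ("fmb_images/line_0002.png", "   "),  # blank transcription, dropped
]
dataset = build_dataset(pairs)
print(json.dumps(dataset, ensure_ascii=False, indent=2))
```

Skipping blank transcriptions is a design choice in this sketch, not something the slides state. It avoids training the model to emit empty output for lines that were segmented but never transcribed.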

Fine-tuning "small" VLMs on line-level text-image pairs from eScriptorium

Character error rate (CER) per model: minimum, quartiles (q1, q2, q3), and maximum

model          param    min    q1       q2       q3      max
Qwen2VL        2B       0.0    0.076    0.168    0.478   3.979
FMB-2b         2B       0.0    0.0426   0.086    0.211   1.062
Qwen2VL        7B       0.0    0.059    0.105    0.303   3.609
FMB            7B       0.0    0.058    0.087    0.353   0.923
Qwen-VL-Plus   72B      0.0    0.057    0.180    0.417   1.125
GPT-4o         200B+    0.0    0.029    0.0883   0.408   1.125

38 test images
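A minimal sketch of how per-image CER and a min/q1/q2/q3/max summary of this kind could be computed. It assumes CER is the character-level edit distance divided by the reference length, which is the usual definition. The function names are hypothetical, and the slides do not say which tool or quantile method produced the reported figures.

```python
import statistics


def levenshtein(ref: str, hyp: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


def five_number_summary(scores):
    """Minimum, quartiles, and maximum of a list of at least two CER scores."""
    q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {"min": min(scores), "q1": q1, "q2": q2, "q3": q3, "max": max(scores)}


print(cer("minas", "minas"))
print(cer("mina", "mina de oro y plata"))
print(five_number_summary([0.0, 0.1, 0.2, 0.3, 0.4]))
```

In the second call the hypothesis adds 15 characters to a 4-character reference, so the CER is 3.75. CER is not bounded by 1, which is how a maximum above 1.0 can arise when a model inserts text that is not on the page.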

CNNs vs. VLMs

               CNNs                         "small" VLMs (7B)                "big" VLMs (70B+)
fine-tuning    easy to fine-tune            fine-tuning                      usually none
segmentation   requires line segmentation   work well with chunks            work with full pages
hardware       no GPU, M1                   1 x A10G GPU; train: 4 x L40S    many GPUs or API
pos/neg        char errors                  fewer hallucinations             hallucinations
capability     HTR only                     NER, case-level summarization    collection-level metadata

Segmentation

Baseline segmenter (kraken.blla)

Chunking
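One way chunking could work, sketched under stated assumptions: the input is a list of line bounding boxes in reading order (for example, derived from a line segmenter's output), and consecutive lines are grouped into fixed-size chunks whose padded bounding box is cropped from the page. The function name `chunk_lines`, the chunk size, and the padding are hypothetical, and the slides do not specify the project's actual chunking rule.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def chunk_lines(line_boxes: List[Box],
                page_size: Tuple[int, int],
                lines_per_chunk: int = 5,
                pad: int = 10) -> List[Box]:
    """Group consecutive line boxes and return one padded crop box per group."""
    if lines_per_chunk < 1:
        raise ValueError("lines_per_chunk must be at least 1")
    width, height = page_size
    chunks = []
    for start in range(0, len(line_boxes), lines_per_chunk):
        group = line_boxes[start:start + lines_per_chunk]
        # Union of the group's boxes, padded and clipped to the page.
        left = max(0, min(b[0] for b in group) - pad)
        top = max(0, min(b[1] for b in group) - pad)
        right = min(width, max(b[2] for b in group) + pad)
        bottom = min(height, max(b[3] for b in group) + pad)
        chunks.append((left, top, right, bottom))
    return chunks


# Twelve synthetic lines, 50 px tall and 60 px apart, on a 1000 x 900 page.
lines = [(50, 100 + 60 * i, 950, 150 + 60 * i) for i in range(12)]
chunks = chunk_lines(lines, page_size=(1000, 900), lines_per_chunk=5)
for box in chunks:
    print(box)
```

Each returned box is a (left, upper, right, lower) tuple, the form accepted by Pillow's `Image.crop`, so the chunks could be cut from a page image and sent to the model one at a time.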

Processing for readability

Output formats: Word image-text pairs

1930 Luis Enrique Bernal Contra Nicanor Córdoba y Juan Francisco Moreno

Daniel Tubb, Fichero collaborator

Output formats: Case-summaries with entities in context

Next Steps

  • Evaluation with other "small" VLMs
    • Party
    • SAIL-VL
    • InternVL
    • Kimi-VL
  • Packaging as a Mac and Windows app
  • Run on HPC using Blackfish
  • Ongoing evaluation of synthetic data with community partners: young people in Chocó will read each summary alongside document images
  • Workshop in Andagoya, for young people to present the research to elders in the community, using a "mango" set-up to compensate for limited internet
  • Producing print material for local circulation

Thank you!

2023 workshop
