Auto-cataloging Research Materials with "Small" Vision-Language Models
Andrew Janco, Princeton University
Ann Farnsworth-Alvear, University of Pennsylvania
Outline
- HTR project centered on local-history researchers
- HTR + case summaries (metadata)
- Shift to VLMs
- Model fine-tuning (for period, domain, manuscript damage)
- Segmentation and "Chunking"
- Evaluation of model outputs
- Next Steps
Thank you to many collaborators:
Daniel Varela, Daniel Tubb, Kelly López Roldán, Ann Farnsworth-Alvear, Yuri Romaña Rivas, María Fernanda Parra Rodríguez, Sergio Mosquera, Andrew Janco, Cynthia Heider, Brie Gettleson, Eliecer Angulo Castro, Angélica Aqualimpia Copete, Laura Caicedo, Milagros Gonzalez, Javier Hurtado Ibargüen, Ernestina Lemos Rentería, Yusleyda Perea Cuesta, Jhon Leison Rivas Rodríguez, Nallely Taborda Castañeda, Yeison Vente

Photographs: Yeison Vente, Daniel Varela, María Fernanda Parra Ramírez. Persons shown lower left, clockwise from left: Jhon Leison Rivas, Yusleyda Perea Cuesta, Ernestina Lemos Rentería (2022)

Circuit Court of Istmina, Chocó
(Project with Daniel Varela (University of Michigan)
and Fundación Muntú Bantú)
https://eap.bl.uk/project/EAP1477


Location of the Chocó
Map credit: Instituto Geográfico Agustín Codazzi (IGAC)
Pacific-coast rainforest region: long history of mining and impoverishment, majority Afro-Colombian population. Persons shown: Javier Hurtado, Angélica Aqualimpia, Yusleyda Perea Cuesta (2022). Photos: Ann Farnsworth-Alvear
Colombia


In Istmina, 2022
the work of photographing and digitizing the archive was done by
Angélica Aqualimpia, Yeison Vente, and Daniel Varela
in photo: Yeison Vente of @formatonegro



Semillero (in Quibdó, 2022-23)
the work of human cataloging and the creation of youth-authored essays was done in collaboration with
Fundación Muntú Bantú and Kelly López Roldán, University of Pennsylvania

Fichero, 2023-2025
transcribes and auto-catalogues historical archives using vision-language models, running locally or in the cloud; a collaboration with anthropologist Daniel Tubb, University of New Brunswick



model | accuracy |
---|---|
HTR-Araucania_XIX | 90.4% |
Fmb-best | 91.8% |
Sergio diary (single handwriting style) | 97.5% |




Se trata de una diligencia judicial para constatar la realidad y magnitud de los |
chat interleaved VL data
{
  "messages": [
    {
      "content": "<image>extract text",
      "role": "user"
    },
    {
      "content": "plotar dichas minas en la forma que lo",
      "role": "assistant"
    }
  ],
  "images": [
    "fmb_images/055e42a3-a5c9-40f8-9bf6-3cd5e82dcc1c.png"
  ]
},
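
The record above pairs one line image with its transcription in a chat layout. A minimal sketch of how such records might be assembled from line-level (image path, transcription) pairs follows. The function names, the second image path, and the rule of skipping empty transcriptions are assumptions for illustration, not the project's actual export code.

```python
import json

def make_record(image_path: str, transcription: str,
                prompt: str = "<image>extract text") -> dict:
    """Build one chat-style record for a single line image.

    The layout mirrors the sample record on the slide; the default
    prompt is copied from that sample.
    """
    return {
        "messages": [
            {"content": prompt, "role": "user"},
            {"content": transcription, "role": "assistant"},
        ],
        "images": [image_path],
    }

def build_dataset(pairs):
    """Turn (image_path, transcription) pairs into a list of records.

    Assumption: lines with an empty transcription are skipped.
    """
    return [make_record(img, text) for img, text in pairs if text.strip()]

# The first pair is the sample from the slide; the second path is hypothetical.
pairs = [
    ("fmb_images/055e42a3-a5c9-40f8-9bf6-3cd5e82dcc1c.png",
     "plotar dichas minas en la forma que lo"),
    ("fmb_images/hypothetical-empty-line.png", "   "),
]
records = build_dataset(pairs)

# ensure_ascii=False keeps accented Spanish characters readable in the file.
dataset_json = json.dumps(records, ensure_ascii=False, indent=2)
```

Writing `dataset_json` to a file gives a JSON array whose elements have the same shape as the record shown on the slide.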

Fine-tuning "small" VLMs on line-level text-image pairs from eScriptorium

model | param | CER min | CER q1 | CER q2 | CER q3 | CER max |
---|---|---|---|---|---|---|
Qwen2VL | 2B | 0.0 | 0.076 | 0.168 | 0.478 | 3.979 |
FMB-2b | 2B | 0.0 | 0.0426 | 0.086 | 0.211 | 1.062 |
Qwen2VL | 7B | 0.0 | 0.059 | 0.105 | 0.303 | 3.609 |
FMB | 7B | 0.0 | 0.058 | 0.087 | 0.353 | 0.923 |
Qwen-VL-Plus | 72B | 0.0 | 0.057 | 0.180 | 0.417 | 1.125 |
GPT-4o | 200B+ | 0.0 | 0.029 | 0.0883 | 0.408 | 1.125 |
38 test images
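
The CER table reports a minimum, three quartiles, and a maximum per model. A minimal sketch of how such figures might be computed is below, assuming CER is the character-level edit distance divided by the length of the reference transcription. The slides do not state the exact formula, normalization, or quantile method used, so those choices and all function names here are assumptions.

```python
import statistics

def levenshtein(ref: str, hyp: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over reference length."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return levenshtein(ref, hyp) / len(ref)

def five_number_summary(values):
    """Min, quartiles, and max of per-image CER values."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "q2": q2, "q3": q3, "max": max(values)}

# Illustrative strings only, not project data.
perfect = cer("minas", "minas")
one_deletion = cer("minas", "mina")
runaway = cer("oro", "oro y plata de las minas")
```

Under this definition CER is not capped at 1: when the output is much longer than the reference, as in the `runaway` case, the inserted characters alone outnumber the reference. That would be consistent with maxima above 1 in the table.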
CNNs vs. VLMs
aspect | CNNs | "small" VLMs (7B) | "big" VLMs (70B+) |
---|---|---|---|
fine-tuning | easy to fine-tune | fine-tuning | usually none |
segmentation | requires line segmentation | work well with chunks | work with full pages |
hardware | no GPU, M1 | 1 x A10G GPU; train: 4 x L40S | many GPUs or API |
pos/neg | char errors | fewer hallucinations | hallucinations |
capability | HTR only | NER, case-level summarization | collection-level metadata |
Segmentation

Baseline Segmenter (kraken.blla)
Chunking
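
The slides do not spell out how chunks are formed. One plausible reading, sketched below, groups consecutive line bounding boxes from a segmenter into multi-line regions that can be cropped and passed to a "small" VLM. The box format, the default chunk size, and the optional overlap are all assumptions for illustration, not the project's actual procedure.

```python
from typing import List, Tuple

# Hypothetical box format: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]

def chunk_lines(line_boxes: List[Box], lines_per_chunk: int = 5,
                overlap: int = 0) -> List[Box]:
    """Group consecutive text lines into chunks and return one box per chunk.

    Lines are ordered top to bottom; each chunk's box is the union of the
    boxes of the lines it contains. `overlap` repeats that many lines at
    the start of the next chunk.
    """
    if lines_per_chunk < 1:
        raise ValueError("lines_per_chunk must be at least 1")
    if not 0 <= overlap < lines_per_chunk:
        raise ValueError("overlap must be in [0, lines_per_chunk)")

    ordered = sorted(line_boxes, key=lambda b: b[1])  # sort by top edge
    step = lines_per_chunk - overlap
    chunks: List[Box] = []
    for start in range(0, len(ordered), step):
        group = ordered[start:start + lines_per_chunk]
        chunks.append((min(b[0] for b in group), min(b[1] for b in group),
                       max(b[2] for b in group), max(b[3] for b in group)))
        if start + lines_per_chunk >= len(ordered):
            break  # the last line is already covered
    return chunks

# Seven synthetic lines, each 80 px tall and spaced 100 px apart.
boxes = [(10, 100 * i, 500, 100 * i + 80) for i in range(7)]
chunks = chunk_lines(boxes, lines_per_chunk=3)
```

Each returned box could then be used to crop the page image (for example with an imaging library's crop call) before transcription.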







Processing for readability
Output formats: Word image-text pairs


1930 Luis Enrique Bernal Contra Nicanor Córdoba y Juan Francisco Moreno

Daniel Tubb,
fichero collaborator
Output formats: Case-summaries with entities in context
Next Steps
- Evaluation with other "small" VLMs
  - Party
  - SAIL-VL
  - InternVL
  - Kimi-VL
- Packaging as a Mac and Windows app
- Run on HPC using Blackfish
- Ongoing evaluation of synthetic data with community partners: young people in Chocó will read each summary alongside document images
- Workshop in Andagoya, for young people to present the research to elders in the community, using a "mango" set-up to compensate for limited internet
- Producing print material for local circulation
Thank you!

2023 workshop