Vision + Language

10/2020

Vision + Language

08/2024

Why VLMs?

The language model provides better contextual understanding for OCR/HTR
Feels "good enough" to researchers
Retain formatting as markdown
Entities with context
Tables as HTML or md
Form data with 🗹
Many sizes (2B, 8B, 30B, 280B)
Lots of fine-tuned variants (nanoNets)
Possible to fine-tune for domain and task

Limited by the languages of the base model
- Gemini 3, 140 languages, struggles with regional variations and dialects
- Qwen3-VL, 32 languages
'Hiccups' given max tokens to generate tokens to generate tokens to generate tokens to generate
Errors are less offensive but not less frequent
Need more data to fine-tune than CNN/RNN
Require GPUs for training and inference
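
A minimal sketch of how such 'hiccups' might be caught after generation, assuming they show up as a phrase repeated back-to-back until the token limit is reached. The function names, the word-level comparison, and the thresholds (`max_ngram`, `min_repeats`) are illustrative assumptions, not part of any particular VLM toolkit.

```python
def trailing_repeat_count(text: str, max_ngram: int = 8) -> int:
    """Count how often the final word n-gram repeats back-to-back at the end of text.

    Tries every n-gram length from 1 to max_ngram and returns the highest
    repeat count found (1 means the ending is not repeated at all).
    """
    words = text.split()
    best = 1
    for n in range(1, max_ngram + 1):
        if len(words) < 2 * n:
            break  # not enough words for even one repetition of this length
        tail = words[-n:]
        count = 1
        pos = len(words) - 2 * n
        # Walk backwards in steps of n while the previous chunk equals the tail.
        while pos >= 0 and words[pos:pos + n] == tail:
            count += 1
            pos -= n
        best = max(best, count)
    return best


def looks_like_hiccup(text: str, min_repeats: int = 3) -> bool:
    """Flag output whose ending repeats at least min_repeats times (assumed threshold)."""
    return trailing_repeat_count(text) >= min_repeats


if __name__ == "__main__":
    sample = "max tokens to generate tokens to generate tokens to generate"
    print(looks_like_hiccup(sample))
    print(looks_like_hiccup("a short, clean transcription line"))
```

Flagged pages could then be re-run with a lower token limit or sent for manual review; which response fits depends on the workflow.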

But...

User Needs

Read handwritten manuscripts
Search handwritten text
Create research data/metadata
Translation of full book-length texts
Text to speech for accessibility

Alt-text without Metadata

An aged, sepia-toned family portrait featuring three individuals: an older man on the left with a serious
expression, a young child in the center wearing a textured jacket, and a woman on the right with dark hair and a
gentle smile. The photo shows signs of wear — creases, discoloration, and a noticeable vertical rainbow glare
across the center — suggesting it’s an old, possibly cherished photograph. The image is framed by a decorative
border with repeating patterns. All three subjects are posed formally, looking directly at the camera, evoking a
sense of nostalgia and historical intimacy.

Alt-text with Metadata

Vintage sepia-toned family portrait from Moldova, Europe, showing a man, a woman, and a young child
seated side by side. The photograph shows signs of age — faded edges, creases, and a visible vertical light flare
or scratch running through the center. The man on the left wears a dark jacket over a turtleneck; the child in the
middle is dressed in a textured coat with a high collar; the woman on the right has dark hair styled away from her
face and wears a dark garment with a patterned scarf or collar. All three look directly at the camera with solemn
expressions. The photo is framed with an ornate border featuring repeating decorative motifs. Digitised by the
Institute of the Cultural Heritage of the Academy of Sciences of Moldova as part of the Endangered Archives
Programme.
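
The difference between the two alt-texts above comes from what the model is told alongside the image. A minimal sketch of how catalogue metadata might be folded into the prompt follows. The function name, the prompt wording, and the metadata field names (`country`, `digitised_by`, `programme`) are hypothetical. The actual call to a VLM is omitted because it depends on the model and API used.

```python
from typing import Optional


def build_alt_text_prompt(metadata: Optional[dict] = None) -> str:
    """Build an alt-text instruction, optionally grounded in catalogue metadata.

    `metadata` is a plain dict of field name -> value; empty values are skipped.
    The field names are illustrative and not tied to any specific catalogue schema.
    """
    base = (
        "Write alt-text for the attached image for a screen-reader user. "
        "Describe only what is visible in the image."
    )
    if not metadata:
        return base

    # Keep only populated fields so the prompt carries no blank entries.
    lines = [f"- {key}: {value}" for key, value in metadata.items() if value]
    if not lines:
        return base

    return (
        base
        + "\nUse the following catalogue metadata for context, "
        + "and do not invent details beyond it:\n"
        + "\n".join(lines)
    )


if __name__ == "__main__":
    record = {
        "country": "Moldova",
        "digitised_by": (
            "Institute of the Cultural Heritage of the "
            "Academy of Sciences of Moldova"
        ),
        "programme": "Endangered Archives Programme",
    }
    print(build_alt_text_prompt())        # prompt without metadata
    print(build_alt_text_prompt(record))  # prompt with metadata
```

Passing the same image with each of the two prompts is one way to reproduce a with/without comparison like the one shown here.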