Vision + Language

10/2020

Vision + Language

08/2024

Why VLMs?

  • The language model provides better contextual understanding for OCR/HTR
  • Feels "good enough" to researchers
  • Retain formatting as markdown
  • Entities with context
  • Tables as HTML or md
  • Form data with 🗹
  • Many sizes (2B, 8B, 30B, 280B)
  • Lots of fine-tuned variants (nanoNets)
  • Possible to fine-tune for domain and task
  • Limited by the languages of the base model
    • Gemini 3: 140 languages, but struggles with regional variations and dialects
    • Qwen3-VL: 32 languages
  • 'Hiccups' given max tokens to generate
  • Errors are less offensive but not less frequent
  • Need more fine-tuning data than CNN/RNN models
  • Require GPUs for training and inference
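
One way to catch the 'hiccups' mentioned above is to check whether a generated transcription ends in the same phrase repeated back-to-back, which is a common sign that the model looped until it hit the token limit. The following is a minimal sketch under that assumption; the function names, thresholds, and sample string are hypothetical and not taken from any particular library or from the talk.

```python
def trailing_repeat_count(text: str, unit_len: int) -> int:
    """Count how many times the last `unit_len` characters repeat
    back-to-back at the end of `text`."""
    if unit_len <= 0 or len(text) < unit_len:
        return 0
    unit = text[-unit_len:]
    count = 0
    pos = len(text)
    # Walk backwards in steps of unit_len while each chunk matches the unit.
    while pos >= unit_len and text[pos - unit_len:pos] == unit:
        count += 1
        pos -= unit_len
    return count


def looks_like_hiccup(text: str, min_unit: int = 4, max_unit: int = 200,
                      min_repeats: int = 3) -> bool:
    """Return True if `text` ends with some phrase of min_unit..max_unit
    characters repeated at least `min_repeats` times in a row.
    The thresholds are illustrative defaults, not tuned values."""
    text = text.rstrip()
    longest = min(max_unit, len(text) // min_repeats)
    for unit_len in range(min_unit, longest + 1):
        if trailing_repeat_count(text, unit_len) >= min_repeats:
            return True
    return False


# Hypothetical model output that looped on its closing phrase.
sample = "Dear Sir, I remain" + " yours truly," * 4
print(looks_like_hiccup(sample))
print(looks_like_hiccup("Tables as HTML or md"))
```

A check like this only flags exact trailing repetition; near-duplicates with small variations would need a fuzzier comparison.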

But...

User Needs

  • Read handwritten manuscripts
  • Search handwritten text
  • Create research data/metadata
  • Translation of full book-length texts
  • Text to speech for accessibility 

Alt-text without Metadata

An aged, sepia-toned family portrait featuring three individuals: an older man on the left with a serious
expression, a young child in the center wearing a textured jacket, and a woman on the right with dark hair and a
gentle smile. The photo shows signs of wear — creases, discoloration, and a noticeable vertical rainbow glare
across the center — suggesting it’s an old, possibly cherished photograph. The image is framed by a decorative
border with repeating patterns. All three subjects are posed formally, looking directly at the camera, evoking a
sense of nostalgia and historical intimacy.

Alt-text with Metadata

Vintage sepia-toned family portrait from Moldova, Europe, showing a man, a woman, and a young child
seated side by side. The photograph shows signs of age — faded edges, creases, and a visible vertical light flare
or scratch running through the center. The man on the left wears a dark jacket over a turtleneck; the child in the
middle is dressed in a textured coat with a high collar; the woman on the right has dark hair styled away from her
face and wears a dark garment with a patterned scarf or collar. All three look directly at the camera with solemn
expressions. The photo is framed with an ornate border featuring repeating decorative motifs. Digitised by the
Institute of the Cultural Heritage of the Academy of Sciences of Moldova as part of the Endangered Archives
Programme.
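
The difference between the two descriptions above comes from what the model is given alongside the image. As a rough sketch, catalogue metadata can be appended to the instruction before it is sent to a VLM. The instruction wording, field names, and the `build_alt_text_prompt` helper below are assumptions made for illustration; the metadata values are the ones that appear in the second description.

```python
from typing import Mapping, Optional

# Hypothetical base instruction; the actual prompt used is not given in the source.
BASE_INSTRUCTION = (
    "Write concise alt-text describing this image for a screen-reader user."
)


def build_alt_text_prompt(metadata: Optional[Mapping[str, str]] = None) -> str:
    """Build the text prompt sent with the image.

    Without metadata, only the base instruction is returned. With metadata,
    each non-empty field is listed so the model can ground its description
    in catalogue facts instead of guessing.
    """
    if not metadata:
        return BASE_INSTRUCTION
    lines = [f"- {key}: {value}" for key, value in metadata.items() if value]
    if not lines:
        return BASE_INSTRUCTION
    return "\n".join([
        BASE_INSTRUCTION,
        "Use the following catalogue metadata where relevant; "
        "do not invent details beyond it:",
        *lines,
    ])


# Field names are illustrative; values are taken from the description above.
record = {
    "place": "Moldova, Europe",
    "digitised_by": "Institute of the Cultural Heritage of the "
                    "Academy of Sciences of Moldova",
    "programme": "Endangered Archives Programme",
}
print(build_alt_text_prompt(record))
```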

 

Visual Reasoning

Visual Tool Calling

Video Understanding

@rachelcoldicutt.bsky.social

Fantastic Futures Retro

By Andrew Janco
