New Tools for Old Documents

November 2

eScriptorium

A web interface to manage annotations and CNN model training with kraken

Overview

1. Information Retrieval

Technologies for transforming scans, images, PDFs, audio, and other media into machine-readable data.

2. Organization & Evaluation

Clean, evaluate, curate, and document.

3. Research Methods

Methods and research practices for the study of large document collections, and how ML can augment existing research practices.
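
One common way to make the "evaluate" step concrete for OCR and HTR output is the character error rate (CER): the edit distance between a model's transcription and a hand-corrected reference, divided by the length of the reference. The sketch below is a minimal, standard-library illustration. The function names and the sample strings are made up for this example and are not part of kraken or eScriptorium.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical ground truth and model output for a single line.
    reference = "old documents"
    hypothesis = "o1d docurnents"
    print(f"CER: {character_error_rate(reference, hypothesis):.3f}")
```

The same function can score the output of two different models against one shared set of corrected lines, which is one way to set up a comparison such as CNN versus ViT.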

Information Retrieval

Optical Character Recognition (OCR)

Handwritten Text Recognition (HTR)

Visual Document Understanding (VDU)

Base Model

Trained on large heterogeneous data
Learns a general model of language or materials
Works out of the box
Fine-tuning is always better (but not always possible)

Fine-Tuning

Builds on the base model's training and data, and inherits its biases
Focuses the model's "understanding" on a specific domain or set of materials
Can be tuned for various tasks (classification, text extraction...)

Open Source Models

CNN: Tesseract
CNN: Kraken
ViT: Nougat

Commercial Offerings

OCR+HTR: Vision
OCR: ABBYY
VDU: Document AI

Ideas / Experiments

#1 eScriptorium to fine-tune HTR, from scratch
#2 Vision to eScriptorium, just correct
#3 compare CNN vs ViT
#4 fine-tune Nougat
#5 LLMs to create research data
#6 your ideas!
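
For idea #2, a first step is getting the words and their positions out of a Vision response so they can be corrected elsewhere. The sketch below assumes the response has been saved as JSON in the shape of the Cloud Vision `fullTextAnnotation` (pages, blocks, paragraphs, words, symbols). The function name `words_from_vision` and the sample data are hypothetical, and converting the result into a format eScriptorium can import (such as ALTO or PAGE XML) is not shown.

```python
from __future__ import annotations

from typing import Any


def words_from_vision(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten a fullTextAnnotation into a list of words with bounding boxes."""
    words: list[dict[str, Any]] = []
    annotation = response.get("fullTextAnnotation", {})
    for page in annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    # A word is stored as a sequence of single-character symbols.
                    text = "".join(s.get("text", "") for s in word.get("symbols", []))
                    vertices = word.get("boundingBox", {}).get("vertices", [])
                    # Assumption: a missing x or y in the JSON means the value is 0.
                    box = [(v.get("x", 0), v.get("y", 0)) for v in vertices]
                    words.append({"text": text, "box": box})
    return words


# Hypothetical, hand-written response fragment used only for illustration.
sample = {
    "fullTextAnnotation": {
        "text": "Old",
        "pages": [{"blocks": [{"paragraphs": [{"words": [
            {
                "symbols": [{"text": "O"}, {"text": "l"}, {"text": "d"}],
                "boundingBox": {"vertices": [
                    {"y": 5}, {"x": 30, "y": 5}, {"x": 30, "y": 20}, {"y": 20},
                ]},
            },
        ]}]}]}],
    }
}

if __name__ == "__main__":
    for word in words_from_vision(sample):
        print(word["text"], word["box"])
```

Keeping the boxes alongside the text matters here because the "just correct" workflow depends on the transcription staying aligned with regions of the page image.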

Document Understanding
eScriptorium
Penn's eScriptorium
- https://escriptorium.pennds.org/
- invitado escriptorium
Google Vision notebook

Links now!