New Tools for Old Documents
New Tools for Old Documents
2 de noviembre
eScriptorium
A web interface to manage annotations and CNN model training with kraken
Overview
Information Retrieval
Technologies for transforming scans, images, PDFs, audio, and other media into machine readable data.
1.
Organization & Evaluation
Clean, evaluate, curate, and document.
2.
Research Methods
Methods and research practices for the study of large document collections. How ML can augment existing research practices.
3.
Information Retrieval
Base Model
- Trained on large heterogeneous data
- Learns a general model of language or materials
- Works out of the box
- Fine-tuning is always better (but not always possible)
Fine-Tuning
- Builds on training and data of base model and inherits bias
- Focuses the model's "understanding" to specific domain or materials
- Can tune for various tasks (classification, text extraction...
Open Source Models
CNN
Tesseract
CNN
Kraken
ViT
Nougat
Commercial Offerings
OCR+HTR
Vision
OCR
ABBYY
VDU
Document AI
Ideas / Experiments
- Document Understanding
- eScriptorium
-
Penn's eScriptorium
- https://escriptorium.pennds.org/
- invitado escriptorium
- Google Vision notebook
Links now!
Text
Documents
By Andrew Janco
Documents
- 284