New Tools for Old Documents

New Tools for Old Documents

2 de noviembre

eScriptorium

A web interface to manage annotations and CNN model training with kraken

Overview

Information Retrieval

Technologies for transforming scans, images, PDFs, audio, and other media into machine readable data.

1.

Organization & Evaluation

Clean, evaluate, curate, and document.

2.

Research Methods

Methods and research practices for the study of large document collections. How ML can augment existing research practices.

3.

Information Retrieval

Base Model

  • Trained on large heterogeneous data
  • Learns a general model of language or materials
  • Works out of the box
  • Fine-tuning is always better (but not always possible)

Fine-Tuning

  • Builds on training and data of base model and inherits bias
  • Focuses the model's "understanding" to specific domain or materials
  • Can tune for various tasks (classification, text extraction...

Open Source Models

CNN

Tesseract

CNN

Kraken

ViT

Nougat

Commercial Offerings

OCR+HTR

Vision

OCR

ABBYY

VDU

Document AI

Ideas / Experiments

Links now!

Text

Documents

By Andrew Janco

Documents

  • 136