Documents on Australian Foreign Policy
&
digital technologies

These slides

  • download documents from web
  • capture metadata
  • find and fix problems
  • extract and normalise dates
  • parse and link NAA references
  • enrich metadata
  • build new site
  • build search app and index
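The "extract and normalise dates" step could look something like the sketch below. It is not the project's actual code: it assumes dates appear in a "day month year" form inside a heading line, and the function name `normalise_date` and the sample heading are hypothetical.

```python
import re
from datetime import date

# Month names mapped to month numbers (1-12).
MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

# Assumed pattern: "3 September 1939" or "3rd September, 1939".
DATE_RE = re.compile(r"\b(\d{1,2})(?:st|nd|rd|th)?\s+([A-Za-z]+),?\s+(\d{4})\b")


def normalise_date(text):
    """Return the first valid 'day month year' date in text as ISO 8601, else None."""
    for match in DATE_RE.finditer(text):
        day, month_name, year = match.groups()
        month = MONTHS.get(month_name.lower())
        if month is None:
            continue  # not a full month name; leave for manual checking
        try:
            return date(int(year), month, int(day)).isoformat()
        except ValueError:
            continue  # impossible date such as 31 February
    return None


# Hypothetical heading text, for illustration only.
print(normalise_date("CANBERRA, 3 September 1939"))
```

Anything the pattern misses (abbreviated months, date ranges) returns `None`, so it can be queued for manual review rather than guessed.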

Processes

  • OCR errors
  • 57 missing documents (43 from vol 24)
  • documents out of order
  • documents munged together
  • encoding errors (�)
  • capitalisation of titles
  • formatting inconsistencies

Problems

NAA links

  • find references in documents
  • check missing references
  • fix obvious OCR errors
  • standardise known inconsistencies
  • look up in RecordSearch
  • save RS metadata
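As a rough illustration of "find references in documents", the sketch below pulls bracketed citations out of document text with a regular expression. It assumes every reference is wrapped in square brackets and starts with an upper-case prefix followed by a colon; the function name `find_references` and the sample sentence are hypothetical.

```python
import re

# Assumed shape: "[PREFIX: rest of citation]" with an upper-case prefix.
REF_RE = re.compile(r"\[\s*([A-Z]+\s*:[^\[\]]+?)\s*\]")


def find_references(text):
    """Return every bracketed citation in text, with whitespace collapsed."""
    return [" ".join(m.group(1).split()) for m in REF_RE.finditer(text)]


# Hypothetical document text, for illustration only.
sample = "Cablegram [sic] sent [AA : A981, CHINA 114, ix] and filed [FA: A3195, 1.3991]."
for ref in find_references(sample):
    print(ref)
```

Editorial brackets such as `[sic]` are skipped because they have no prefix and colon.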

& lots of manual checking!

Getting data from RecordSearch

[AA : A1066 H45/453/2]
[AA : A5954, BOX 577]
[AA : A981, CHINA 114, ix]
[NAA: A10463, 80111311111, iii]
[FA: A3195, 1.3991]
[DEFENCE:SPECIAL COLLECTION II, BUNDLE 5, STRATEGICAL POLICY-SWPA, FILE No. 3, 48/1942]
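The examples above vary in punctuation and structure, which is why parsing needs some care. A minimal sketch of splitting a citation into a series number, control symbol and optional trailing part follows. The field names, the `Citation` type and the assumption that a series is one to three letters followed by digits are all illustrative, not the project's actual rules.

```python
import re
from typing import NamedTuple, Optional


class Citation(NamedTuple):
    source: str                    # prefix before the colon, e.g. "AA"
    series: Optional[str]          # e.g. "A981"
    control_symbol: Optional[str]  # e.g. "CHINA 114"
    part: Optional[str]            # trailing roman numeral, e.g. "ix"


# Assumed series shape: 1-3 letters, digits, optional letter suffix,
# separated from the rest by a comma or a space.
SERIES_RE = re.compile(r"^([A-Z]{1,3}\d+[A-Z]*)[,\s]\s*(.+)$")
# A lower-case roman numeral after the final comma is treated as a part.
PART_RE = re.compile(r"^(.*\S)\s*,\s*([ivxlc]+)$")


def parse_citation(raw):
    """Split a bracketed citation into its components where possible."""
    body = raw.strip().strip("[]")
    source, _, rest = body.partition(":")
    source, rest = source.strip(), rest.strip()
    match = SERIES_RE.match(rest)
    if match is None:
        # No recognisable series number; leave for manual checking.
        return Citation(source, None, None, None)
    series, remainder = match.group(1), match.group(2).strip()
    part = None
    part_match = PART_RE.match(remainder)
    if part_match:
        remainder, part = part_match.group(1), part_match.group(2)
    return Citation(source, series, remainder, part)


print(parse_citation("[AA : A981, CHINA 114, ix]"))
```

A citation without a recognisable series number, like the DEFENCE example, comes back with empty fields so it can be flagged rather than mis-linked.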

Problems

& some just missing...

Results

  • 89% of NAA references linked to items in RecordSearch
  • 99% of NAA references linked to series in RecordSearch
  • 80% of documents linked to RecordSearch

8,753 references found!

NAA series in references

179 different series

Top twenty

2,722 files cited

5 closed

171 not yet examined (NYE)

676 digitised

20% of documents

The web site

  • main content is static HTML
  • lightweight search app linked to Elasticsearch index
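A search app of this kind mostly just turns a user's query into an Elasticsearch request body. The sketch below builds one using the standard `multi_match` query and `highlight` options. The field names `title` and `text` and the function name `build_search_body` are assumptions, not the site's actual index mapping.

```python
import json


def build_search_body(query, size=10):
    """Build an Elasticsearch request body for a simple full-text search."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": query,
                # Hypothetical fields; "^2" weights title matches more heavily.
                "fields": ["title^2", "text"],
            }
        },
        # Ask for highlighted snippets from the (assumed) text field.
        "highlight": {"fields": {"text": {}}},
    }


print(json.dumps(build_search_body("trade agreement"), indent=2))
```

The resulting JSON would be sent to the index's `_search` endpoint; keeping the app this thin is what lets the rest of the site stay static.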

Data

  • Document metadata (CSV)
  • References and NAA links (CSV)
  • Document texts (plain text and Markdown)
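The CSV files can be combined with nothing more than the standard library. The sketch below works out what share of documents have at least one reference linked to an item. The column names `doc_id` and `item_id` and the toy rows are invented for illustration; the real files may be structured differently.

```python
import csv
import io


def linked_share(docs_file, refs_file):
    """Fraction of documents with at least one reference linked to an item."""
    doc_ids = {row["doc_id"] for row in csv.DictReader(docs_file)}
    linked = {
        row["doc_id"]
        for row in csv.DictReader(refs_file)
        if row["item_id"].strip()  # an empty item_id means no link was found
    }
    if not doc_ids:
        return 0.0
    return len(linked & doc_ids) / len(doc_ids)


# Toy data, invented for illustration only.
docs = io.StringIO("doc_id,title\nd1,One\nd2,Two\nd3,Three\nd4,Four\n")
refs = io.StringIO(
    'doc_id,reference,item_id\n'
    'd1,"A981, CHINA 114, ix",1001\n'
    'd2,"A5954, BOX 577",\n'
    'd3,"A3195, 1.3991",1002\n'
)
print(linked_share(docs, refs))
```

With real files, the `io.StringIO` objects would be replaced by `open(path, newline="", encoding="utf-8")`.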

Documents per year

Documents per year & volume
