Web Archiving Update

MIT Libraries Collections Directorate Winter 2018 Meeting

Joe Carrano | Digital Archivist | IASC

2018-12-06

What is Web Archiving anyway?

the process of collecting portions of the World Wide Web, preserving the collections in an archival format, and then serving the archives for access and use.

~ International Internet Preservation Consortium

Archive-It

Subscription suite from the Internet Archive

Good for most websites, especially text-based ones

Can set up automated crawls

Webrecorder

Free tool from Rhizome

Better for capturing dynamic websites and social media

Most of the capture has to be done manually at this point

Where we were

Pilot project 2016-2017
Small number of seeds and individual captures
Had not begun systematic collecting

Where to start?

We need to prioritize
We need to know about websites to crawl
We need to appraise

We need to prioritize

Focus on collecting archival records, i.e., the records of the Institute
Most of these are found on the mit.edu domain
Begin in areas with existing archival collections
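
One way to apply the domain priority to a candidate list is sketched below. It is only an illustration: the function names and the example URLs are hypothetical, and it assumes the list holds full URLs that include a scheme such as https://.

```python
from urllib.parse import urlsplit


def is_in_domain(url: str, domain: str = "mit.edu") -> bool:
    """Return True if the URL's host is the domain itself or one of its subdomains."""
    # hostname is None when the URL has no scheme, so such entries count as out of scope.
    host = (urlsplit(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


def split_by_domain(urls, domain: str = "mit.edu"):
    """Split candidate URLs into (in_scope, out_of_scope) lists, preserving order."""
    in_scope = [u for u in urls if is_in_domain(u, domain)]
    out_of_scope = [u for u in urls if not is_in_domain(u, domain)]
    return in_scope, out_of_scope
```

Sites that fall outside the domain are not discarded here; they land in the second list so they can still be reviewed by hand.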

We need to know about websites to crawl

Got a list of websites!

There were 300+ of them

We need to appraise

Determine what the Internet Archive is doing already
Review the list to see which sites align with EDISJ values
Review the list to determine which sites represent unique information about activities at the Institute
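
A rough sketch of how the first check could be scripted, assuming the Internet Archive's public Wayback Machine availability endpoint and its JSON response shape as I understand them. The helper names are hypothetical, and this is not a description of the workflow actually used.

```python
from urllib.parse import urlencode

# Public Wayback Machine availability endpoint (assumed; confirm against current documentation).
WAYBACK_AVAILABILITY = "https://archive.org/wayback/available"


def availability_url(site: str) -> str:
    """Build the query URL that asks for the closest existing capture of a site."""
    return WAYBACK_AVAILABILITY + "?" + urlencode({"url": site})


def closest_snapshot(payload: dict):
    """Return (timestamp, snapshot_url) from a decoded response, or None if nothing is archived."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["timestamp"], closest["url"]
    return None
```

To use it, fetch availability_url(site) for each entry in the list (for example with urllib.request.urlopen), decode the body with json.load, and pass the result to closest_snapshot. A None result suggests the site has no general Wayback Machine capture yet.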

Where we are

61 active seeds, 926 GB since 2016 (315 GB since August 2018)
Finalizing metadata application profile

Where we're going

Describing seeds
Open public access in late spring
Start collecting sites from student groups
