Exploring the Internet Archive Blog

-david satten-lopez

Initial goals/brainstorm:

  • Build off previous research on IA's Labor Practices
  • How are "workers", "scan operators", "scribes", & the "Philippines" represented in the blog?
  • Thinking about absences in the blog
  • How much should scanners be represented?
  • How does Internet Archive portray itself?
  • Tools: Beautiful Soup, SpaCy, Mallet

Inspecting the IA Blog:

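import re
from bs4 import BeautifulSoup

# `html` is assumed to be a BeautifulSoup object for one blog page,
# e.g. html = BeautifulSoup(page_source, "html.parser") (page_source is hypothetical)
article_tags = html.find_all("div", class_="entry-content")
title_tags = html.find_all("h1", class_="entry-title")

# Flatten each article body onto one line
articles = []
for article in article_tags:
    article_text = article.text
    article_cleaned = re.sub("\n", " ", article_text)
    articles.append(article_cleaned)

# Strip stray newlines from each title
titles = []
for title in title_tags:
    title_text = title.text
    title_cleaned = re.sub("\n", "", title_text)
    titles.append(title_cleaned)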

Coding with BeautifulSoup:

The blog wasn't uniform...

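import pandas as pd

# scraped_articles (a list) and url (the page address) are assumed to be defined earlier in the scraping loop
# Some pages have fewer titles than articles; pad titles with a placeholder so the columns line up
# (a page with more titles than articles would still raise a ValueError)
while len(titles) < len(articles):
    titles.append("Same as above?")
scraped_articles.append(pd.DataFrame({"Title": titles, "Article": articles, "Url": url}))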

Cleaning the dataset:

Corpus overview

  • March 2004 to March 2023: 1228 articles
  • Average word count per article: 488 words
  • Median word count per article: 385 words
  • Largest article size: 2502 words
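Figures like these can be computed with a few lines of Python. The sketch below is only an illustration: it assumes articles are plain strings and that a "word" is any whitespace-separated token, which may not match how the counts above were produced. The function name corpus_overview and the toy data are hypothetical. With the scraped DataFrame, the "Article" column could be passed in as a list.

```python
from statistics import mean, median

def corpus_overview(articles):
    """Summarize word counts for a list of article strings."""
    # Assumption: a word is any whitespace-separated token
    counts = [len(text.split()) for text in articles]
    return {
        "articles": len(counts),
        "mean_words": mean(counts),
        "median_words": median(counts),
        "max_words": max(counts),
    }

# Toy data, not the real corpus
toy_articles = ["one two three", "four five", "six seven eight nine ten"]
overview = corpus_overview(toy_articles)
print(overview)
```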

Playing with SpaCy:

Top 10 Words Over Time:
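A rough sketch of how a "top words per year" table might be built. To stay dependency-free it swaps in a regex tokenizer and a tiny hand-written stop-word list in place of SpaCy's tokenizer and stop words, so it is not the pipeline used for the slides. The function name top_words_by_year, the stop-word set, and the sample data are all hypothetical.

```python
import re
from collections import Counter, defaultdict

# Assumption: a tiny illustrative stop-word list, far shorter than a real one
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with"}

def top_words_by_year(dated_articles, n=10):
    """dated_articles: iterable of (year, text) pairs. Returns {year: [top n words]}."""
    counts = defaultdict(Counter)
    for year, text in dated_articles:
        # Lowercase and keep runs of letters/apostrophes as tokens
        tokens = re.findall(r"[a-z']+", text.lower())
        counts[year].update(t for t in tokens if t not in STOP_WORDS)
    return {
        year: [word for word, _ in counter.most_common(n)]
        for year, counter in sorted(counts.items())
    }

# Toy data, not the real corpus
sample = [
    (2019, "The archive scanned books and the archive shared books"),
    (2019, "Books for the library"),
    (2020, "The library opened the library doors"),
]
top = top_words_by_year(sample, n=2)
print(top)
```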

Breaking Down the Corpus:

  • A book called “Graphis New Talent Annual 2016” had been scanned just the day before by Internet Archive’s scanning center in the Philippines.

  • Smith started collecting Ellington records in 78rpm format in high school and continued during World War II when he served in the Air Force stationed in various U.S. cities before being deployed to the Philippines and Japan.

  • After World War II, many military Jeeps were left in the Philippines by U.S. troops.

  • This collection showcases Jeepneys in the Philippines starting from the 1950s, exploring a visual history of this symbol of Filipino culture.

  • The Internet Archive partners with Innodata Knowledge Services, an organization focused on machine learning and digital data transformation, to complete the digitization process at their facilities in Cebu, Philippines.

  • Audio stations complete with turntables & recording equipment set up in Cebu, Philippines.

  • This summer it shipped most of the remaining volumes to be digitized by Internet Archive at its scanning facility in the Philippines.

  • As Barker awaits the return of the book collection from the Philippines, he is tracking the shipment (which went on two separate ships and was insured).
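Sentences like the ones above can be pulled out of an article with a keyword filter. This is a minimal sketch, assuming a naive split on sentence-ending punctuation; it would break on abbreviations such as "U.S.", so SpaCy's sentence segmentation (doc.sents) is likely the more robust choice. The function name sentences_mentioning and the sample text are hypothetical.

```python
import re

def sentences_mentioning(text, keyword):
    """Return the sentences in text that contain keyword as a whole word."""
    # Naive split: break after ., ! or ? when followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

# Toy text, loosely modeled on the blog excerpts
sample = (
    "The books were shipped to Cebu, Philippines. "
    "They came back a month later. "
    "Jeepneys are a symbol of Filipino culture in the Philippines."
)
hits = sentences_mentioning(sample, "Philippines")
for sentence in hits:
    print(sentence)
```

Running the same filter with "Cebu" instead of "Philippines" gives the kind of Cebu-only list shown on the next slide.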

Bonus! Only Cebu mentioned:

  • bash-3.2$ ia search "date:2000 AND mediatype:texts AND scanningcenter:cebu" --itemlist --sort=downloads\ desc | head -1000 | parallel --will-cite -j10 "curl -Ls https://archive.\
    org/download/{}/page/cover_.jpg?fail=soon.jpg\&cnt=0 >> ~/tmp/picture/{}.jpg"
  • Once coming in the door to be inventoried, secondly as we ship it out to be digitized in Hong Kong or Cebu, and thirdly coming back to us for long term storage.
  • We will have tours of our facilities, demonstration of how we get hundreds of thousands books a year digitized at our Hong Kong (and now Cebu) Super Centers and safely back again.

Visual Possibilities with SpaCy

Visualizations Possible with SpaCy

Cosine Similarity with SpaCy:

2019-2020 Words most associated with...

  • 'archive': ['Award', ',', 'cryptocurrency', 'or', 'a', 'of', 'truly', 'exploring', 'us', 'a']
  • 'library': ['In', 'a', 'Jesmyn', 'ComputerWorld', 'boxes', 'Taylor', '—', 'are', 'a', 'co']
  • 'workers': ['he', 'find', 'Memorial', 'which', 'not', 'support', 'faculty', 'Yesterday', 'Transparencies', 'more']

Philippine Corpus Words most associated with...

  • 'Philippines': ['Philippines', 'II', 'just', 'history', 'Force', 'format', 'in', '.', 'Jeepneys', 'the']
  • 'Cebu': ['Jeepneys', 'Philippines', 'the', 'learning', 'Philippines', 'history', '.', 'Force', 'in', 'format']

Corpus v. Corpus

"Philippines"
  • Philippine Corpus: 'Philippines', 'II', 'just', 'history', 'Force', 'format', 'in', '.', 'Jeepneys', 'the'
  • 2019-2020 Corpus: '’s', 'CatherNo', 'Van', '.', 'to', 'Jones', 'history', 'by', 'version', 'also'

"Cebu"
  • Philippine Corpus: 'Jeepneys', 'Philippines', 'the', 'learning', 'Philippines', 'history', '.', 'Force', 'in', 'format'
  • 2019-2020 Corpus: 'history', 'by', 'version', 'to', 'Jones', '’s', 'CatherNo', 'Van', '.', 'also'

Other Cosine Similarities:

Keyword: Similar Vectors

  • Worker: '-', 'e', 'lesson', 'NSF', 'continuous', 'worker', 'We', '—', 'not', 'worker'
  • Hong Kong: 'Donations', 'for', 'Wan', ',', 'his', 'wife', 'and', 'two', 'children', '.'
  • Operator: 'on', 'I', 'viral', 'ISSN', 'Scribe', '.', 'due', 'operator', 'Here', '-'
  • Scribe: '.', 'to', 'get', 'digitization', '-', 'if', 'does', 'If', 'digitization', 'with'

Other Other Cosine Similarities:

Keyword: Similar Vectors

  • Internet Archive: 'Internet', 'The', 'Red', 'Shoes', ',', 'both', 'of', 'which', 'you', 'can'
  • Brewster Kahle: 'Statement', 'of', 'Librarian', 'Digital', 'and', 'founder', ',', 'Kahle', 'Brewster', 'the'
  • Kahle: 'are', 'published', ',', ',', 'the', 'and', ',', 'Major', 'launch', '50,000'
  • Open Library: '📕', 'very', 'recent', 'resources', 'in', 'the', 'field', '.', '”', '“'
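For reference, the similarity rankings in the slides above rest on cosine similarity: the dot product of two vectors divided by the product of their lengths. The sketch below works through it in plain Python on made-up 2-dimensional vectors; the real vectors would come from SpaCy, and the names most_similar and toy_vectors are hypothetical. Excluding the query word from its own results is a choice made here, not necessarily what produced the lists above.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(keyword, vectors, n=10):
    """Rank every other word in `vectors` by cosine similarity to `keyword`."""
    target = vectors[keyword]
    scored = [
        (cosine_similarity(target, vec), word)
        for word, vec in vectors.items()
        if word != keyword  # assumption: skip the query word itself
    ]
    scored.sort(reverse=True)
    return [word for _, word in scored[:n]]

# Made-up vectors for illustration only
toy_vectors = {
    "archive": [1.0, 0.0],
    "library": [0.9, 0.1],
    "jeepney": [0.0, 1.0],
}
ranking = most_similar("archive", toy_vectors, n=2)
print(ranking)
```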

Playing with Mallet:

The Programming Historian 

Topic Modeling

Next Steps:

  • (Re)Presenting on Github Pages
  • Close reading with the data?
  • Regrouping with rest of team...

Thanks!

Copy of deck

By David S-L @ Rowan
