Exploring the Internet Archive Blog
-david satten-lopez
Initial goals/brainstorm:
- Build off previous research on IA's Labor Practices
- How are "workers", "scan operators", "scribes", & the "Philippines" represented in the blog?
- Thinking about absences in the blog
- How much should scanners be represented?
- How does Internet Archive portray itself?
- Tools: Beautiful Soup, SpaCy, Mallet
Inspecting the IA Blog:
import re

# `html` is the BeautifulSoup object for one page of the blog
article_tags = html.find_all("div", class_="entry-content")
title_tags = html.find_all("h1", class_="entry-title")

# Collect article bodies, replacing newlines with spaces
articles = []
for article in article_tags:
    article_text = article.text
    article_cleaned = re.sub("\n", " ", article_text)
    articles.append(article_cleaned)

# Collect titles, dropping newlines
titles = []
for title in title_tags:
    title_text = title.text
    title_cleaned = re.sub("\n", "", title_text)
    titles.append(title_cleaned)
# PRESENTING CODE
Coding with BeautifulSoup:
The blog wasn't uniform...
if len(titles) == len(articles):
    scraped_articles.append(pd.DataFrame({"Title": titles, "Article": articles, "Url": url}))
else:
    # Title and article counts differ on this page: add one placeholder title
    titles.append("Same as above?")
    scraped_articles.append(pd.DataFrame({"Title": titles, "Article": articles, "Url": url}))
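The branch above adds a single placeholder, which only balances the lists when a page is missing exactly one title. A minimal sketch of a more general approach follows. It is an assumption about how the mismatch could be handled, not the code used for this project: `pad_titles` is a hypothetical helper, and trimming surplus titles is a design choice made here for illustration.

```python
def pad_titles(titles, articles, placeholder="Same as above?"):
    """Return a copy of titles padded (or trimmed) to match len(articles)."""
    # Drop any surplus titles so the result is never longer than articles
    padded = list(titles[:len(articles)])
    # Fill the remaining slots with the placeholder (no-op when already equal)
    padded.extend([placeholder] * (len(articles) - len(padded)))
    return padded


# Made-up example page: one title found, three article bodies found
titles = ["Post one"]
articles = ["first body", "second body", "third body"]
rows = list(zip(pad_titles(titles, articles), articles))
print(rows)
```

With equal-length lists, the two columns can be passed to `pd.DataFrame` without a length error.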
Cleaning the dataset:
Corpus overview
- March 2004 to March 2023: 1228 articles
- Average word count per article: 488 words
- Median word count per article: 385 words
- Largest article size: 2502 words
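Figures like these can be computed with the standard library. The sketch below assumes words are counted by splitting on whitespace; the deck does not say how words were counted, so the numbers it produces may differ from those above. `corpus_overview` and the sample texts are hypothetical.

```python
import statistics


def corpus_overview(articles):
    """Summarize a list of article strings by whitespace-delimited word count."""
    counts = [len(text.split()) for text in articles]
    return {
        "articles": len(counts),
        "mean_words": statistics.mean(counts),
        "median_words": statistics.median(counts),
        "max_words": max(counts),
    }


# Made-up sample with 3, 2, and 7 words
sample = ["one two three", "one two", "one two three four five six seven"]
overview = corpus_overview(sample)
print(overview)
```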
Playing with SpaCy:
Top 10 Words Over Time:
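A rough sketch of how a per-year top-words list could be built. It swaps in a plain regex tokenizer and a tiny stopword set instead of SpaCy, so it is an illustration of the idea rather than the pipeline behind these slides; `top_words_by_year`, `STOPWORDS`, and the sample posts are all hypothetical.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list; a real run would use a fuller one
STOPWORDS = {"the", "a", "of", "and", "to", "in"}


def top_words_by_year(posts, n=10):
    """posts is an iterable of (year, text); return {year: [top n words]}."""
    counters = defaultdict(Counter)
    for year, text in posts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counters[year].update(t for t in tokens if t not in STOPWORDS)
    return {
        year: [word for word, _ in counter.most_common(n)]
        for year, counter in counters.items()
    }


# Made-up posts, not taken from the blog
posts = [
    (2004, "The archive and the library of the archive"),
    (2023, "Books books books in the library"),
]
top = top_words_by_year(posts, n=2)
print(top)
```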
Breaking Down the Corpus:
- A book called “Graphis New Talent Annual 2016” had been scanned just the day before by Internet Archive’s scanning center in the Philippines.
- Smith started collecting Ellington records in 78rpm format in high school and continued during World War II when he served in the Air Force stationed in various U.S. cities before being deployed to the Philippines and Japan.
- After World War II, many military Jeeps were left in the Philippines by U.S. troops.
- This collection showcases Jeepneys in the Philippines starting from the 1950s, exploring a visual history of this symbol of Filipino culture.
- The Internet Archive partners with Innodata Knowledge Services, an organization focused on machine learning and digital data transformation, to complete the digitization process at their facilities in Cebu, Philippines.
- Audio stations complete with turntables & recording equipment set up in Cebu, Philippines.
- This summer it shipped most of the remaining volumes to be digitized by Internet Archive at its scanning facility in the Philippines.
- As Barker awaits the return of the book collection from the Philippines, he is tracking the shipment (which went on two separate ships and was insured).
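Sentences like these can be pulled out of the corpus with a keyword filter. The sketch below uses a naive regex sentence splitter as a stand-in for SpaCy's sentence segmentation; it will split wrongly on abbreviations such as "U.S.", so treat it as an illustration only. `sentences_mentioning` and the sample text are hypothetical.

```python
import re


def sentences_mentioning(text, keyword):
    """Return the sentences in text that contain keyword as a whole word."""
    # Naive splitter: break after ., ! or ? when followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]


# Made-up text, not a quotation from the blog
text = "Books arrive by ship. They are scanned in Cebu, Philippines. Then they return."
hits = sentences_mentioning(text, "Philippines")
print(hits)
```

Running the same filter with "Cebu" and removing sentences that also match "Philippines" would give a Cebu-only list.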
Bonus! only Cebu Mentioned:
- bash-3.2$ ia search "date:2000 AND mediatype:texts AND scanningcenter:cebu" --itemlist --sort=downloads\ desc | head -1000 | parallel --will-cite -j10 "curl -Ls https://archive.org/download/{}/page/cover_.jpg?fail=soon.jpg\&cnt=0 >> ~/tmp/picture/{}.jpg"
- Once coming in the door to be inventoried, secondly as we ship it out to be digitized in Hong Kong or Cebu, and thirdly coming back to us for long term storage.
- We will have tours of our facilities, demonstration of how we get hundreds of thousands books a year digitized at our Hong Kong (and now Cebu) Super Centers and safely back again.
Visual Possibilities with SpaCy
Visualizations Possible with SpaCy
Cosine Similarity with SpaCy:
- 'archive': ['Award', ',', 'cryptocurrency', 'or', 'a', 'of', 'truly', 'exploring', 'us', 'a']
- 'library': ['In', 'a', 'Jesmyn', 'ComputerWorld', 'boxes', 'Taylor', '—', 'are', 'a', 'co']
- 'workers': ['he', 'find', 'Memorial', 'which', 'not', 'support', 'faculty', 'Yesterday', 'Transparencies', 'more']
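For reference, cosine similarity compares the direction of two vectors: the dot product divided by the product of their lengths. The sketch below computes it in plain Python over toy two-dimensional vectors. The vectors are invented for illustration and are not SpaCy embeddings; `cosine_similarity` and `most_similar` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def most_similar(target, vectors, n=10):
    """Rank the keys of vectors by similarity to vectors[target]."""
    ranked = sorted(
        vectors,
        key=lambda word: cosine_similarity(vectors[target], vectors[word]),
        reverse=True,
    )
    return ranked[:n]


# Toy vectors, invented for this example
toy = {"archive": [1.0, 0.0], "library": [0.9, 0.1], "jeepney": [0.0, 1.0]}
nearest = most_similar("archive", toy, n=2)
print(nearest)
```

The target ranks first in its own list because its similarity with itself is 1.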
2019-2020 Words most associated with...
- 'Philippines': ['Philippines', 'II', 'just', 'history', 'Force', 'format', 'in', '.', 'Jeepneys', 'the']
- 'Cebu': ['Jeepneys', 'Philippines', 'the', 'learning', 'Philippines', 'history', '.', 'Force', 'in', 'format']
Philippine Corpus Words most associated with...
Corpus v. Corpus
Keyword | Philippine Corpus | 2019-2020 Corpus |
---|---|---|
"Philippines" | 'Philippines', 'II', 'just', 'history', 'Force', 'format', 'in', '.', 'Jeepneys', 'the' | '’s', 'CatherNo', 'Van', '.', 'to', 'Jones', 'history', 'by', 'version', 'also' |
"Cebu" | 'Jeepneys', 'Philippines', 'the', 'learning', 'Philippines', 'history', '.', 'Force', 'in', 'format' | 'history', 'by', 'version', 'to', 'Jones', '’s', 'CatherNo', 'Van', '.', 'also' |
Other Cosine Similarities:
Keyword | Similar Vectors |
---|---|
Worker | '-', 'e', 'lesson', 'NSF', 'continuous', 'worker', 'We', '—', 'not', 'worker' |
Hong Kong | 'Donations', 'for', 'Wan', ',', 'his', 'wife', 'and', 'two', 'children', '.' |
Operator | 'on', 'I', 'viral', 'ISSN', 'Scribe', '.', 'due', 'operator', 'Here', '-' |
Scribe | '.', 'to', 'get', 'digitization', '-', 'if', 'does', 'If', 'digitization', 'with' |
Other Other Cosine Similarities:
Keyword | Similar Vectors |
---|---|
Internet Archive | 'Internet', 'The', 'Red', 'Shoes', ',', 'both', 'of', 'which', 'you', 'can' |
Brewster Kahle | 'Statement', 'of', 'Librarian', 'Digital', 'and', 'founder', ',', 'Kahle', 'Brewster', 'the' |
Kahle | 'are', 'published', ',', ',', 'the', 'and', ',', 'Major', 'launch', '50,000' |
Open Library | '📕', 'very', 'recent', 'resources', 'in', 'the', 'field', '.', '”', '“' |
Playing with Mallet:
The Programming Historian
Topic Modeling
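A sketch of the two Mallet steps (import a folder of text files, then train topics), with the commands assembled in Python. The flags are Mallet's own `import-dir` and `train-topics` options as used in the Programming Historian lesson; the folder name, output file names, topic count, and the `mallet_commands` helper are assumptions for illustration, not the settings used for this corpus.

```python
import shlex


def mallet_commands(corpus_dir, num_topics=20, mallet_bin="bin/mallet"):
    """Build (without running) the import and training commands for Mallet."""
    model_file = "corpus.mallet"  # Hypothetical output name
    import_cmd = [
        mallet_bin, "import-dir",
        "--input", corpus_dir,
        "--output", model_file,
        "--keep-sequence",
        "--remove-stopwords",
    ]
    train_cmd = [
        mallet_bin, "train-topics",
        "--input", model_file,
        "--num-topics", str(num_topics),
        "--output-state", "topic-state.gz",
        "--output-topic-keys", "topic_keys.txt",
        "--output-doc-topics", "doc_topics.txt",
    ]
    return import_cmd, train_cmd


# "ia_blog_txt" is a hypothetical folder with one .txt file per article
import_cmd, train_cmd = mallet_commands("ia_blog_txt", num_topics=20)
print(shlex.join(import_cmd))
print(shlex.join(train_cmd))
```

The printed lines can be pasted into a shell from the Mallet install directory, or each list passed to `subprocess.run`.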
Next Steps:
- (Re)Presenting on Github Pages
- Close reading with the data?
- Regrouping with rest of team...
Thanks!
By David S-L @ Rowan