Impresso

Media Monitoring of the Past

∨ Use the vertical arrows to navigate the slides
Introduction
Analyzing the Press and Radio with Impresso Tools
Here's how the slide deck is built
Introduction
Chapter 1
Chapter 2
Chapter 3
Chapter 4
Plan
You are here







Impresso
Simple search
Datalab
Perspectives










Plan
Use the vertical arrows to navigate
or the horizontal arrows to jump to the next chapter

The Impresso project
"Media Monitoring of the Past"
Chapter 1
The Impresso project


Impresso 1
2017-2020
Impresso 2
2023-2027
Developing new approaches and interfaces for the exploration and critical analysis of historical media archives.
Swiss and Luxembourg newspapers
Cutting-edge platform
New collections from across Europe
Radio broadcasts
Datalab
Impresso 2







Funded by:
Hosted by:
Team




Principal investigators
An interdisciplinary team spanning history, web development, linguistics, and digital humanities
Partners

A network of partner institutions that provide digitized newspaper and radio collections
Challenges
In short, the Impresso project is a series of challenges:
I.
The data
Multiple sources, multiple formats, massive data, and numerous legal issues
II.
The design
Make all of this accessible and useful to the public
III.
The analysis
Enrich data to create added value and develop relevant analysis tools
IV.
Writing history
How should all of this be interpreted in the context of historical research?
Challenge I.
The data





The collections are massive, but also very diverse:
- wide time spans
- different countries and languages
- different types of sources (newspapers, radio, typescripts)
- varying OCR quality
Press from 1732 to 2018

Radio programs in audio format with speech-to-text
Transcripts of radio news broadcasts
Challenge I.
The data
The collections come from different institutions:
- different jurisdictions, requirements, and policies
- different shared data, metadata, formats, and transcriptions
- different copyright regimes, approaches to managing the public domain, and gray areas
Consequences:
- Signing of multiple data-sharing agreements
- Identification of target audiences and differentiated access rights
- Identification of different types of access (view text excerpts, view facsimiles, export metadata or full text)


Impresso user plans
Challenge II.
The design
As many features as possible, but with the smoothest possible user experience
Creating an app that provides access to tens of millions of newspaper articles and radio broadcasts is one thing.
Making sure it’s usable by both the general public and advanced researchers is another.
Impresso web app


Challenge II.
The design
Multiple filters

date range
language
keywords
source
length
article type
and many more...
Challenge II.
The design
Advanced features




Named entity profiles
Corpus overview
Ngrams
Text reuse
Challenge II.
The design
Browse newspapers and radio broadcasts at the same time

Example of future integration of audio files into the Impresso app
Challenge III.
The analysis
It's not just about collecting, it's about enriching and connecting

Enriching and connecting historical sources by transforming noisy and heterogeneous sources into semantically enriched and structured data, ultimately connected in a common vector space.
Challenge III.
The analysis
It's not just about collecting, it's about enriching and connecting

Conducting transmedia and transnational historical research using semantically enriched historical media sources
Impresso API
to process other sources
via notebooks in a datalab
and compare them to our data
Challenge III.
The analysis
The Impresso datalab, a suite of analysis tools that leverage the API and enriched data from the press and radio
A Jupyter notebook is an interactive document containing Python code that allows users to retrieve Impresso data and perform operations on it



Challenge III.
The analysis
Impresso datalab
The goal is to spark users’ curiosity, make computational methods accessible to them, and encourage them to explore code without needing to master it.
Descriptive statistics for a collection




Network of entities
Map of locations
Semantic proximity
Challenge IV.
Writing history
Bringing a community together and conducting historical research

Impresso also organizes workshops and international conferences


Challenge IV.
Writing history
Bringing a community together and conducting historical research
Impresso Seminar
Building bridges with existing communities




Challenge IV.
Writing history
We develop case studies to examine how certain phenomena are covered in the media.





Challenge IV.
Writing history
This opens up new avenues for historiographical reflection
- How is the writing of history changing with these large-scale datasets and new methods?
- How should we interpret the results produced by these tools?




Next: Impresso app

In the next chapter, let’s see how to use the Impresso app to explore the relationship between print media and radio
Impresso app
What does the press have to say about radio?
Chapter 2
An example of how the tools developed by Impresso can be used to address a media history question


Radio Nations in Prangins in the 1930s
United Nations Archives Geneva (P132-01-007)
Newspaper stand at Zurich train station, 1962
ETH-Bibliothek Zürich (Com_C17-073-047-001-005)
What does the press have to say about radio?
Chapter 2


Let's browse the digitized press
Welcome to the Impresso web app




Before starting a search, log in to access more content and facsimiles.
Impresso app


Let's search for "radio" in article contents
Impresso app

This very broad search returns over a million results
Search result

But what are we looking at?
Search result


Total number of articles
Proportion (%)
For example: Is the result biased by the nature of the corpus, since some newspapers are only available during certain periods?

But what are we looking at?
Search result
Or: Does this keyword mainly return results in a specific language, a specific country, or from specific news outlets?




And is that even the right keyword?
Search result

This is a classic problem in searching large text corpora: keywords are an imperfect way to filter
- Is the keyword polysemous?
Radio = the device, the institution, the broadcast?
- Does it appear in any other word?
Radioactivity, Radiography, ...
- Is the concept we're looking for gradually being replaced by another one?
Radio -> Podcast
- Are there different ways to spell it?
(in the case of radio, no problem)
- Does this word often cause OCR issues?
radio, r4dio, radid, adio, 'radio, ...
- Should it be paired with other keywords?
Radio AND Broadcast, Radio OR Antenna
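Two of these pitfalls (the keyword hiding inside longer words, and OCR variants) can be illustrated with a few lines of Python. This is only a minimal sketch using the standard library, not how the Impresso search engine works: the function names and the 0.8 similarity threshold are assumptions chosen for the example.

```python
import re
from difflib import SequenceMatcher

# Whole-word pattern: matches "radio" but not "radioactivity" or "radiography".
WHOLE_WORD = re.compile(r"\bradio\b", re.IGNORECASE)


def is_whole_word_hit(text: str) -> bool:
    """Return True if the text contains "radio" as a standalone word."""
    return WHOLE_WORD.search(text) is not None


def is_probable_ocr_variant(token: str, keyword: str = "radio",
                            threshold: float = 0.8) -> bool:
    """Flag tokens that look like a misread keyword.

    The threshold is an arbitrary assumption for this sketch.
    """
    # Drop surrounding punctuation that OCR often glues to a word.
    cleaned = token.strip("'\".,;:!?()").lower()
    return SequenceMatcher(None, cleaned, keyword).ratio() >= threshold


for token in ["r4dio", "radid", "adio", "'radio", "radioactivity"]:
    print(token, is_probable_ocr_variant(token))
```

Fuzzy matching trades precision for recall: with this threshold, an unrelated word such as "audio" scores as high as "r4dio", so the hits still need to be checked by hand.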
By selecting a date range (in this case, a decade), we reduced the number of results by a factor of 10.


Filter by date

Let's narrow down the search using filters






Filter by source
Here, three French-language Swiss newspapers from different regions and with different political leanings




Remove articles that are less than 100 words long
Separate articles from advertisements
Keep only what is featured on the front page
Using other filters to narrow the results

Search filters
Search summary



Content item
The facsimile is visible only if you have access rights
Exploration

Each result is available as a facsimile and/or transcript
Exploration

Article in context, along with the entire issue


Block-by-block transcription, for close reading
Exploration

To preserve a corpus, it can be saved as a collection.


All articles are now tagged so they stand out in search results.
Collections

Each collection has its own page, complete with some descriptive statistics.
Collections

A collection can be exported as a CSV file.




Once exported, the CSV file can be downloaded and opened in a spreadsheet program.
It includes all the data that your plan grants you access to.
Collection export


Now that we’ve explored our collection, how is radio depicted in print media?
Radio in the press?


Category 1.1 Advertisements



At first glance, this may not be what we’re looking for, but radio is widely present in the press as a commodity.
Radio in the press?
Radio can also serve as a selling point for other products, such as cars.


Category 1.1 Advertisements
Radio in the press?


If we compare an opinion newspaper with a mainstream newspaper, the difference in the proportion of advertisements is clear.



Journal de Genève
8% of "radio" articles are ads
L'Express
31% of "radio" articles are ads
Category 1.1 Advertisements
Radio in the press?


Category 1.2
Job advertisements

Radio isn't just a device, it's also an employer. Job listings in the press document this professional reality.


Radio in the press?

La Liberté 1956
La Liberté 1978
Category 2
Radio program
From the very beginning, radio needed the press to publicize its weekly schedule.



Radio in the press?
Journal de Genève 1952

Feuille d'avis de Neuchâtel 1930
It is worth noting that even the dissemination of this very practical information creates synergies among the media
Category 2
Radio program


Radio in the press?
The opposite is also true: newspapers often include the schedules of many stations because they know this is an essential service for many of their readers.


Gazette de Lausanne 1986
Category 3
Radio as a source of information


Radio in the press?
More interesting than advertisements or radio programs, radio can also be cited as a source of information in news articles.
Feuille d'avis de Neuchâtel 1940


A search of the Impresso corpus shows that this was fairly rare in Swiss newspapers during the early decades (1920–1950), but it became more common thereafter.
Category 3
Radio as a source of information


Radio in the press?
An interesting observation: it is very often radio stations from communist countries that are mentioned in Swiss newspapers
Feuille d'avis de Neuchâtel 1950


Feuille d'avis de Neuchâtel 1960
The Swiss media do not have correspondents in these countries and therefore rely on reports from government radio stations
Category 3
Radio as a source of information


Radio in the press?
As the second half of the 20th century progresses, radio is cited more and more often as a source of information. It is increasingly becoming an integral part of the media landscape.
Feuille d'avis de Neuchâtel 1973


Category 4
Radio as a subject in the news


Radio in the press?
The “richest” occurrences in this corpus involve references to radio as the subject of a news article.
Feuille d'avis de Neuchâtel 1940

Here, the press is complaining about the poor language used by radio hosts!


As soon as radio became widely accessible, it became a topic of discussion in its own right for the press. For example, newspapers began to review the programs themselves.
Category 4
Radio as a subject in the news


Radio in the press?
Feuille d'avis de Neuchâtel 1960
"Miroir du monde" is the flagship international news program on the Swiss French-speaking radio, coming soon in the Impresso app!
The press is becoming a platform for reacting to what is said on the radio, a forum for discussing the new profession of radio journalism.




Feuille d'avis de Neuchâtel 1970
The "radio critique" section
Category 4
Radio as a subject in the news


Radio in the press?
Feuille d'avis de Neuchâtel 1940
The press is a valuable source for exploring the institutional history of radio.


Feuille d'avis de Neuchâtel 1970

Feuille d'avis de Neuchâtel 1980
Category 4
Radio as a subject in the news


Radio in the press?
Feuille d'avis de Neuchâtel 1940
The press also shows us when and how radio becomes part of people’s daily lives.



Feuille d'avis de Neuchâtel 1960

Category 4
Radio as a subject in the news


Radio in the press?
L'Impartial 1926
Finally, the press is also a source for studying the history of radio technology



Feuille d'avis de Neuchâtel 1940
Feuille d'avis de Neuchâtel 1960
Category 4
Radio as a subject in the news


Radio in the press?
Feuille d'avis de Neuchâtel 1960
Worth noting: While radio and TV magazines would be a major source for writing about the people behind the radio, the press is also very valuable.


Feuille d'avis de Neuchâtel 1960
These two articles, published in the same issue on March 30, 1960, offer two unique perspectives on the people behind the microphone:
On the left, a popular radio host “reveals” himself in a photo… for an advertisement for a brand of razors! On the right, a report states that a radio employee has been convicted for refusing to serve in the Swiss army due to his pacifist and religious beliefs.
Named entities
This leads us down an interesting path: can we study the people who appear in the selected articles?


In the Impresso app, “named entities” (people, places, organizations) mentioned in articles are automatically identified.
Named entities
This leads us down an interesting path: can we study the people who appear in the selected articles?

In a collection centered on the keyword “radio,” three categories of people emerge: composers, heads of state and prominent politicians, and… figures from the radio world.

In our case, two individuals emerge: the journalist René Payot (who moved from print media to radio), and the satirist Jack Rollan (who did the opposite).

Entities are linked to the Wikidata entries for the relevant individuals, if such entries exist.
Named entities
The platform allows users to search for all instances of a named entity


This provides insight into the career timeline of René Payot, a journalist for the Journal de Genève who went on to become a radio commentator on international news during World War II, before hosting widely listened-to international news programs for two decades.

Radio in Impresso
But very soon, there will also be radio collections available right in Impresso!
1. Radio in text format
Radio typescripts



We have received the transcripts from Swissinfo, the Swiss international radio station, covering the period of World War II, in French and German.
Radio in Impresso
But very soon, there will also be radio collections available right in Impresso!
2. Radio in audio format
Radio broadcasts
We are making major changes to the platform to enable it to support audio.

We have collections from RTS (the radio stations of French-speaking Switzerland) and from the INA (which archives French public radio), just waiting to be released.
Next: Impresso datalab
In the next chapter, let’s see how digital humanities tools allow us to go further with the data from Impresso.

Impresso datalab
Taking it further: press archives and quantitative methods
Chapter 3

The Impresso datalab
Welcome to the Impresso datalab


The Impresso datalab


First, you need to log in to the datalab so that you can create a temporary personal token.

The Impresso datalab

The Datalab contains instructions for programmatically accessing the Impresso API. But for the general public, it’s the notebooks that are of particular interest.

We have also developed additional notebooks that have not yet been released. We also invite colleagues outside the project to develop their own notebooks, which we could add here.
Jupyter notebooks

But what exactly is a notebook?

A Jupyter Notebook is an interactive, (often) web-based tool that lets you combine live code, visual graphics, mathematical equations, and narrative text into a single shareable document.

1. A notebook contains lines of code
2. It also includes comments explaining what we're doing
3. Each cell of code can be executed to see the result immediately
Notebook 1: "Inspecting" a collection

Let's use a very simple statistical analysis notebook


This notebook was created for educational purposes, to generate basic statistics based on an Impresso collection.
Notebook 1: "Inspecting" a collection

We open it in Google Colab, an interface that lets us work online with this Jupyter notebook



And we upload the CSV file of the collection we created during our previous exploration of radio coverage in the press.
Notebook 1: "Inspecting" a collection

Each line of code is commented to guide the user



Here, we load the dataset

Notebook 1: "Inspecting" a collection

The notebook guides us through the process of creating our first chart.



In this case, a histogram showing how many articles in our collection come from each newspaper.


Remember that our collection includes articles containing the keyword “radio” from three Swiss newspapers during the 1950s.
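To give an idea of what happens behind such a chart, here is a minimal sketch that loads a CSV and counts articles per newspaper. It uses only the Python standard library and is not the code of the Impresso notebook. The column names ("newspaper", "date", "word_count") and the sample rows are assumptions made for illustration; the real export may use a different schema.

```python
import csv
import io
from collections import Counter

# Tiny stand-in for the exported collection. Column names and values are
# hypothetical and do not reflect the real Impresso export schema.
SAMPLE_CSV = """newspaper,date,word_count
JDG,1950-03-12,240
JDG,1954-07-01,310
LLE,1951-11-23,620
LSE,1958-05-02,150
LSE,1959-09-17,180
"""


def load_rows(csv_file):
    """Read every row of the CSV into a list of dictionaries."""
    return list(csv.DictReader(csv_file))


def articles_per_newspaper(rows):
    """Count how many articles each newspaper contributes."""
    return Counter(row["newspaper"] for row in rows)


rows = load_rows(io.StringIO(SAMPLE_CSV))
counts = articles_per_newspaper(rows)

# Text-only bar chart: one "#" per article.
for newspaper, n in counts.most_common():
    print(f"{newspaper:<4} {'#' * n} ({n})")
```

With a real export, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened file, for example `open("collection.csv", newline="", encoding="utf-8")` (the file name is hypothetical).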
Notebook 1: "Inspecting" a collection

The notebook then generates a timeline of articles, organized by newspaper.



We can see that the number of articles in La Sentinelle (LSE) mentioning the radio increased toward the end of the decade
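A timeline of this kind boils down to counting articles per newspaper and per year. The sketch below shows that step with the standard library only; the (newspaper, date) pairs are invented sample data, not figures from the actual collection.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical (newspaper, ISO date) pairs standing in for the collection.
ARTICLES = [
    ("LSE", "1951-02-10"),
    ("LSE", "1958-05-02"),
    ("LSE", "1958-11-20"),
    ("LSE", "1959-09-17"),
    ("JDG", "1951-06-30"),
    ("JDG", "1958-01-15"),
]


def articles_per_year(articles):
    """Return {newspaper: Counter({year: number_of_articles})}."""
    timeline = defaultdict(Counter)
    for newspaper, iso_date in articles:
        year = date.fromisoformat(iso_date).year
        timeline[newspaper][year] += 1
    return timeline


timeline = articles_per_year(ARTICLES)
for newspaper in sorted(timeline):
    for year in sorted(timeline[newspaper]):
        print(newspaper, year, timeline[newspaper][year])
```

Raw counts like these depend on how many issues of each newspaper are available for each year, so a rise may reflect the corpus as much as the coverage.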


Notebook 1: "Inspecting" a collection

The notebook then guides us through the process of generating other exploratory statistics.



For example, an analysis of article length (as a reminder, when we created the collection, we excluded all articles with fewer than 100 words)


The length of articles varies from one newspaper to another: on average, articles about radio in La Liberté (LLE) are twice as long as those in Le Journal de Genève (JDG).
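The comparison of article lengths can be sketched as follows. This is an illustrative example with invented word counts, not the notebook's code; the function name and the `min_words` parameter (mirroring the 100-word filter used when building the collection) are assumptions.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (newspaper, word count) pairs standing in for the collection.
WORD_COUNTS = [
    ("LLE", 600), ("LLE", 800), ("LLE", 1000),
    ("JDG", 300), ("JDG", 400), ("JDG", 500),
]


def length_summary(records, min_words=100):
    """Summarize article length per newspaper, ignoring very short items."""
    by_newspaper = defaultdict(list)
    for newspaper, n_words in records:
        if n_words >= min_words:
            by_newspaper[newspaper].append(n_words)
    return {
        paper: {"n": len(values), "mean": mean(values), "median": median(values)}
        for paper, values in by_newspaper.items()
    }


summary = length_summary(WORD_COUNTS)
for paper, stats in summary.items():
    print(paper, stats)
```

Reporting the median next to the mean is a design choice: a handful of very long articles can pull the mean up without being typical of the newspaper.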
Notebook 2: Network of named entities




This second notebook helps highlight the main named entities that appear together in the articles
No need to create a collection beforehand; the notebook will connect to the Impresso data on its own.
So we can run a query directly in the notebook, for example to find the 100 people who appear most frequently in articles containing the word “radio”, in the same three newspapers and over the same period as in our previous analyses.
Notebook 2: Network of named entities

It generates a network that shows how often the most frequently mentioned people appear in the same press articles.
This network can then be exported and viewed more clearly in software such as Gephi
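The idea behind such a network can be sketched in a few lines: every pair of people mentioned in the same article becomes an edge, and the edge weight is the number of articles they share. The example below is an assumption-laden illustration, not the notebook's implementation. The article lists are invented, and the Source/Target/Weight header is the column convention we assume for importing an edge list into Gephi.

```python
import csv
import io
from collections import Counter
from itertools import combinations

# Hypothetical input: the persons detected in each article.
ARTICLE_PERSONS = [
    ["René Payot", "Jack Rollan"],
    ["René Payot", "Jack Rollan", "Ludwig van Beethoven"],
    ["Ludwig van Beethoven", "Wolfgang Amadeus Mozart"],
    ["René Payot"],
]


def cooccurrence_edges(article_persons):
    """Count how many articles each pair of persons shares."""
    edges = Counter()
    for persons in article_persons:
        # sorted(set(...)) removes duplicates and gives each pair a stable order.
        for a, b in combinations(sorted(set(persons)), 2):
            edges[(a, b)] += 1
    return edges


def edges_to_csv(edges):
    """Serialize the edges as a CSV edge list (Source, Target, Weight)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (a, b), weight in sorted(edges.items()):
        writer.writerow([a, b, weight])
    return buffer.getvalue()


edges = cooccurrence_edges(ARTICLE_PERSONS)
print(edges_to_csv(edges))
```

An article that mentions a single person produces no edge, which is why isolated names disappear from this kind of network.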


In the case of the "radio" keyword, the entities are very strongly influenced by the music programming. Composers (in white) are much more prominent than political figures (in blue) or media personalities (in red).
Notebook 3: Map of locations

This third notebook will retrieve the geographic coordinates of the locations mentioned in the articles and generate a map.

Just like in the previous notebook, we run the query directly here.
Next, the top 100 results are geolocated by querying Wikidata.
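As a rough illustration of this geolocation step, the sketch below builds a SPARQL query asking Wikidata for the "coordinate location" property (P625) of an entity, and parses the kind of "Point(longitude latitude)" literal that the query service returns. It performs no network call, and the function names are hypothetical; the notebook itself may proceed differently.

```python
import re

# Matches a WKT literal such as "Point(6.15 46.2)" (longitude first).
WKT_POINT = re.compile(
    r"^Point\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)$"
)


def coordinate_query(qid: str) -> str:
    """Build a SPARQL query for the coordinates (P625) of one entity."""
    if not re.fullmatch(r"Q\d+", qid):
        raise ValueError(f"not a Wikidata identifier: {qid!r}")
    return f"SELECT ?coord WHERE {{ wd:{qid} wdt:P625 ?coord . }}"


def parse_wkt_point(literal: str):
    """Convert "Point(lon lat)" into a (latitude, longitude) tuple."""
    match = WKT_POINT.match(literal.strip())
    if match is None:
        raise ValueError(f"unexpected coordinate literal: {literal!r}")
    longitude, latitude = float(match.group(1)), float(match.group(2))
    return latitude, longitude


print(coordinate_query("Q71"))              # Q71 is the identifier of Geneva
print(parse_wkt_point("Point(6.15 46.2)"))  # illustrative value, roughly Geneva
```

The order of the two numbers is the main trap: the literal lists longitude before latitude, whereas most mapping libraries expect latitude first.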
Notebook 3: Map of locations

It generates a map showing the cities (in black) and countries (in red) mentioned in the news articles.
In the case of a corpus built around the keyword “radio,” the results aren’t very interesting: the locations are generally the cities where the broadcasts or concerts take place, with a few capital cities occasionally mentioned in news stories. This approach is much more interesting when working with a more general-purpose corpus.



Beyond metadata analysis
Semantic analysis of the article contents using embeddings
What are embeddings?
Beyond metadata analysis

Animation: Coenen and Pearce, “Understanding UMAP”, https://pair-code.github.io/understanding-umap/
How can we interpret a UMAP, a cloud of dots that is a “flattened” version of our original multidimensional object (which is itself too complex to grasp)?
Visualizing the vector representations of articles in 2D means reducing their dimensionality (from hundreds of dimensions to just 2) while preserving as much as possible of the semantic distance between them.
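To make "semantic distance" more concrete, here is a minimal sketch of cosine similarity, a common way to compare two embedding vectors in the original high-dimensional space. We assume this measure for illustration only; the slides do not say which metric is used. The toy vectors have 4 dimensions and invented values, whereas real embeddings have hundreds.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy "embeddings" with invented values.
radio_programme_a = [0.9, 0.1, 0.0, 0.2]
radio_programme_b = [0.8, 0.2, 0.1, 0.1]
stock_market_news = [0.0, 0.1, 0.9, 0.3]

print(cosine_similarity(radio_programme_a, radio_programme_b))
print(cosine_similarity(radio_programme_a, stock_market_news))
```

A 2D projection can only approximate these pairwise similarities, which is why positions on the map should be read with caution.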
Interpreting semantic proximity maps (UMAPs)
Beyond metadata analysis

Animation: Michelet and Grandjean (2026) "Les embeddings, nouvel outil d’une histoire numérique des grands corpus de presse?"
A tool for exploring large press corpora
UMAP of all 60,000 articles published in La Liberté, the Journal de Genève and the Feuille d'avis de Neuchâtel in 1950.
The semantic space of a collection


Let's analyze our collection as a UMAP
This notebook, which is not yet available on the Datalab, retrieves the embeddings already calculated by Impresso or calculates them itself if they are not provided by the API.
We've tweaked the collection's filters slightly to include advertisements and short articles that aren't on the front page, which results in a larger and more diverse corpus. This corpus of 4,000 articles is processed in about 20 minutes.
The semantic space of a collection

La Liberté (Catholic, Fribourg)
Journal de Genève (Liberal conservative, Geneva)
La Sentinelle (Socialist, La Chaux-de-Fonds)
A map for exploring a corpus of press articles about radio
Each dot represents an article from 1950 that contains the keyword “radio” in the three selected newspapers.
This UMAP “map” is a simplification of the semantic distance between all these articles; it makes significant compromises to ensure it is readable in 2D.
Two articles being close to each other doesn't mean that they are the most similar; it means that they probably belong to the same semantic register (that they are about the same thing). That's why we'll focus mainly on groups of articles rather than on individual dots.
The semantic space of a collection

La Liberté (Catholic, Fribourg)
Journal de Genève (Liberal conservative, Geneva)
La Sentinelle (Socialist, La Chaux-de-Fonds)
What can we see on this map?
Stock market and financial news
(Radio corporation of America stocks)
International news
Religious news
"Faits divers"
(accidents, crimes, unexpected)
National/local news
Foreclosure auctions
(A radio device is included among the items sold)
Radio programs
(Each newspaper has a different style)
Religious calendar
Culture
(music, cinema, theatre)
Radio critique
Advertisement
The semantic space of a collection

La Liberté (Catholic, Fribourg)
Journal de Genève (Liberal conservative, Geneva)
La Sentinelle (Socialist, La Chaux-de-Fonds)

Focus on international news: Why is radio mentioned?
Korean War
(North Korean and Russian radio stations as sources)
United Kingdom
(In many cases, the government addressing the public via radio)
Belgium
(King Leopold III is facing criticism and addresses the nation on the radio)
Sweden
(The death of King Gustav V announced on Swedish radio)
Miscellaneous news
(Articles that contain a series of short news stories from around the world. Radio is often a source)
Various European news
Various Asian news
(China, Tibet, Indochina)
Since all the articles contain the word “radio,” this does not accurately reflect media coverage in 1950; it skews the results in favor of regions where news reaches Swiss newspapers via official (often communist) radio stations.
Comparing two collections

Now, we want to use embeddings to compare two collections, for example to study how radio news programs select information compared to the print media.
Here, we have selected a week's worth of news bulletins from Radio Suisse Internationale in June 1945.


To make them comparable to news articles, these bulletins must first be broken down into segments of information.
This is the text that was read on the air. There’s one every day in French, German, English, and sometimes Spanish.
Comparing two collections




One of the major news stories this week is the diplomatic progress made at the San Francisco conference among the “Big Five” (establishment of the future UN).

45 radio news
1674 press articles
Let's compare the news coverage in the press and on the radio for the week of June 4–10, 1945
All the articles (no ads) from the Journal de Genève, Gazette de Lausanne, La Sentinelle, Feuille d'Avis de Neuchâtel, L'Impartial and La Liberté.
Comparing two collections

News coverage in the press and on the radio
June 4–10, 1945
Press articles
Radio broadcast segment
Area of the semantic map that is not covered by the radio bulletins
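One possible way to quantify such "uncovered" areas, instead of reading them off the map, is to look for press articles that have no close neighbor among the radio segments in the embedding space. The sketch below is an assumption, not the method used for the slides: the identifiers, the 3-dimensional toy vectors and the 0.5 threshold are all invented for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def uncovered_articles(press, radio, threshold=0.5):
    """Return the ids of press articles with no similar radio segment.

    The threshold is an arbitrary assumption for this sketch.
    """
    if not radio:
        return list(press)
    uncovered = []
    for article_id, vector in press.items():
        best = max(cosine_similarity(vector, seg) for seg in radio.values())
        if best < threshold:
            uncovered.append(article_id)
    return uncovered


# Toy embeddings with invented values.
PRESS = {
    "san_francisco_conference": [0.9, 0.1, 0.0],
    "okinawa": [0.1, 0.9, 0.0],
    "classified_ads": [0.0, 0.1, 0.9],
}
RADIO = {
    "bulletin_segment_1": [0.8, 0.2, 0.1],
    "bulletin_segment_2": [0.2, 0.8, 0.0],
}

print(uncovered_articles(PRESS, RADIO))
```

Working in the original embedding space avoids the distortions of the 2D projection, at the price of having to choose a threshold.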
Comparing two collections

What is the order of priority in Radio Suisse Internationale's news bulletins?
June 4–10, 1945
Press articles
Radio broadcast segment
Crisis in the Levant
Syria, Lebanon, France
War in the Pacific
Invasion of Okinawa
Germany
Administering liberated Berlin
The 'Big Five'
San Francisco conference
Tito in Trieste
Bonomi resignation
Italy
Expulsions of Nazis
Release of prisoners
Humanitarian issues
Swiss parliament
Resignation of General Guisan
Economic news
INTERNATIONAL
SWITZERLAND
Obituaries
"Faits divers"
Radio programs
Culture
Sports
Swiss international relations
Classified ads
Commercial ads
Visual embeddings
1 slide to mention the potential of visual embeddings
Next: Perspectives
In the next chapter, we will conclude by discussing the limitations of these approaches and the future developments of Impresso

Perspectives
Critical approach and ongoing developments
Chapter 4
Perspectives
Digitizing collections isn’t exactly new, nor is finding quantitative methods to analyze historical data. But creating an interface that allows us to do both—on a large scale and using methods that weren’t yet available when the first project was launched—raises some questions.
In particular, this combination of data and methods bridges the gap between the computer science and history communities, so we must ensure that the democratization of these tools is accompanied by a critical perspective, regardless of which side one comes from.
A critical digital history

Perspectives
For several years now, we have been testing these tools in our classes. It’s always interesting to see how students—whether or not they have prior technical knowledge—use them to analyze these press archive collections.
Incorporating these tools into the historian's toolkit

EPFL MA Course 2025-2026




Examples of student work
Perspectives
Ongoing developments, new collections
To go further
List of links and resources
1 publication to highlight?