Impresso

Media Monitoring of the Past


Introduction

Analyzing the Press and Radio with Impresso Tools

Here's how the slide deck is built

Introduction

Chapter 1

Chapter 2

Chapter 3

Chapter 4

Plan

You are here

Impresso

Simple search

Datalab

Perspectives

Plan

Use the vertical arrows to navigate

or the horizontal arrows to jump to the next chapter

The Impresso project

"Media Monitoring of the Past"

Chapter 1


The Impresso project

Impresso 1

2017-2020

Impresso 2

2023-2027

Developing new approaches and interfaces for the exploration and critical analysis of historical media archives.

 

Swiss and Luxembourg newspapers

Cutting-edge platform

New collections from across Europe

Radio broadcasts

Datalab

Impresso 2

Funded by:

Hosted by:

Team

Principal investigators

An interdisciplinary team spanning history, web development, linguistics, and digital humanities

Partners

A network of partner institutions that provide digitized newspaper and radio collections

Challenges

In short, the Impresso project is a series of challenges: 

I.

The data

Multiple sources, multiple formats, massive data, and numerous legal issues

II.

The design

Make all of this accessible and useful to the public

III.

The analysis

Enrich data to create added value and develop relevant analysis tools

IV.

Writing history

How should all of this be interpreted in the context of historical research?

Challenge I.

The data

The collections are massive, but also very diverse: 

  • wide time spans
  • different countries and languages
  • different types of sources (newspapers, radio, typescripts)
  • varying OCR quality

Press from 1732 to 2018

Radio programs in audio format with speech-to-text

Transcripts of radio news broadcasts

Challenge I.

The data

The collections come from different institutions:

  • different jurisdictions, requirements, and policies
  • different shared data, metadata, formats, and transcriptions
  • different copyright regimes, approaches to managing the public domain, and gray areas

Consequences:

  • Signing of multiple data-sharing agreements
  • Identification of target audiences and differentiated access rights
  • Identification of different types of access (view text excerpts, view facsimiles, export metadata or full text)

Impresso user plans

Challenge II.

The design

As many features as possible, but with the smoothest possible user experience

Creating an app that provides access to tens of millions of newspaper articles and radio broadcasts is one thing.

Making sure it’s usable by both the general public and advanced researchers is another.

Impresso web app

Challenge II.

The design

Multiple filters

date range

language

keywords

source

length

article type

and many more...

Challenge II.

The design

Advanced features

Named entity profiles

Corpus overview

Ngrams

Text reuse

Challenge II.

The design

Browse newspapers and radio broadcasts at the same time

Example of future integration of audio files into the Impresso app

Challenge III.

The analysis

It's not just about collecting, it's about enriching and connecting

 

 

Enriching and connecting historical sources by transforming noisy and heterogeneous sources into semantically enriched and structured data, ultimately connected in a common vector space.

Challenge III.

The analysis


 

 

Conducting transmedia and transnational historical research using semantically enriched historical media sources

Impresso API

to process other sources

via notebooks in a datalab

and compare them to our data

Challenge III.

The analysis

The Impresso datalab, a suite of analysis tools that leverage the API and enriched data from the press and radio

A Jupyter notebook is an interactive document containing Python code that lets users retrieve Impresso data and run operations on it

Challenge III.

The analysis

Impresso datalab

The goal is to spark users’ curiosity, make computational methods accessible to them, and encourage them to explore code without needing to master it.

Descriptive statistics for a collection

Network of entities

Map of locations

Semantic proximity

Challenge IV.

Writing history

Bringing a community together and conducting historical research

Impresso also organizes workshops and international conferences

Challenge IV.

Writing history

Bringing a community together and conducting historical research

Impresso Seminar

Building bridges with existing communities

Challenge IV.

Writing history

We develop case studies to examine how certain phenomena are covered in the media.

Challenge IV.

Writing history

This opens up new avenues for historiographical reflection

  • How is the writing of history changing with these large-scale datasets and new methods?
  • How should we interpret the results produced by these tools?

Next: Impresso app

In the next chapter, let’s see how to use the Impresso app to explore the relationship between print media and radio

Impresso app

What does the press have to say about radio?

Chapter 2

An example of how the tools developed by Impresso can be used to address a media history question

Radio Nations in Prangins in the 1930s

United Nations Archives Geneva (P132-01-007)

Newspaper stand at Zurich train station, 1962

ETH-Bibliothek Zürich (Com_C17-073-047-001-005)

What does the press have to say about radio?

Chapter 2

Let's browse the digitized press

Welcome to the Impresso web app

Before starting a search, log in to access more content and facsimiles.

Impresso app

Let's search for "radio" in article contents

Impresso app

This very broad search returns over a million results

Search result

But what are we looking at?

Search result

Total number of articles

Proportion (%)

For example: Is the result biased by the nature of the corpus, since some newspapers are only available during certain periods?

But what are we looking at?

Search result

Or: Does this keyword mainly return results in a specific language, a specific country, or from specific news outlets?

And is that even the right keyword?

Search result

This is a classic problem in searching large text corpora: keywords are an imperfect way to filter

- Is the keyword polysemous?

Radio = the device, the institution, the broadcast?

- Does it appear in any other word?

Radioactivity, Radiography, ...

- Is the concept we're looking for gradually being replaced by another one?

Radio -> Podcast

- Are there different ways to spell it?

(in the case of radio, no problem)

- Does this word often cause OCR issues?

radio, r4dio, radid, adio, 'radio, ...

- Should it be paired with other keywords?

Radio AND Broadcast, Radio OR Antenna
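
To make two of these pitfalls concrete, here is a minimal sketch (not part of the Impresso app) that contrasts a whole-word match with a fuzzy match for OCR variants. The sample tokens come from the slide above, plus "ratio" as an extra test word; the function name `is_ocr_variant` and the 0.75 threshold are illustrative assumptions.

```python
import re
from difflib import SequenceMatcher

# Whole-word match: finds "radio" but not "Radioactivity" or "Radiography".
WHOLE_WORD = re.compile(r"\bradio\b", re.IGNORECASE)


def is_ocr_variant(token: str, keyword: str = "radio", threshold: float = 0.75) -> bool:
    """Return True if the token is close enough to the keyword to be a likely OCR error."""
    cleaned = token.strip("'\".,;:!?()").lower()
    # Only compare tokens of similar length, so longer words built on "radio" are rejected.
    if abs(len(cleaned) - len(keyword)) > 1:
        return False
    return SequenceMatcher(None, cleaned, keyword).ratio() >= threshold


samples = ["radio", "r4dio", "radid", "adio", "'radio", "Radioactivity", "Radiography", "ratio"]

exact = [s for s in samples if WHOLE_WORD.search(s)]
fuzzy = [s for s in samples if is_ocr_variant(s)]

print("whole-word matches:", exact)
print("fuzzy matches:", fuzzy)
```

The fuzzy match recovers the OCR variants, but it also accepts a genuine word such as "ratio": loosening a keyword filter always trades missed articles for false positives.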

By selecting a date range (in this case, a decade), we reduced the number of results by a factor of 10.

Filter by date

Let's narrow down the search using filters

Filter by source

Here, three French-language Swiss newspapers from different regions and with different political leanings

Remove articles that are less than 100 words long

Separate articles from advertisements

Keep only what is featured on the front page

Using other filters to narrow the results

Search filters

Search summary

Content item

The facsimile is visible only if you have access rights

Exploration

Each result is available as a facsimile and/or transcript

Exploration

Article in context, along with the entire issue

Block-by-block transcription, for close reading

Exploration

To preserve a corpus, it can be saved as a collection.

All articles are now tagged so they stand out in search results.

Collections

Each collection has its own page, complete with some descriptive statistics.

Collections

A collection can be exported as a CSV file.

Once exported, the CSV file can be downloaded and opened in a spreadsheet program.

It includes all the data that your plan grants you access to.

Collection export

Now that we’ve explored our collection, how is radio depicted in print media?

Radio in the press?

Category 1.1 Advertisements

At first glance, this may not be what we’re looking for, but radio is widely present in the press as a commodity.

Radio in the press?

Radio can also serve as a selling point for other products, such as cars.

Category 1.1 Advertisements

Radio in the press?

If we compare an opinion newspaper with a mainstream newspaper, the difference in the proportion of advertisements is clear.

Journal de Genève

8% of "radio" articles are ads

L'Express

31% of "radio" articles are ads

Category 1.1 Advertisements

Radio in the press?

Category 1.2

Job advertisements

Radio isn't just a device, it's also an employer. Job listings in the press allow for documentation of this professional reality.

Radio in the press?

La Liberté 1956

La Liberté 1978

Category 2

Radio program

From the very beginning, radio needed the press to publicize its weekly schedule.

Radio in the press?

Journal de Genève 1952

Feuille d'avis de Neuchâtel 1930

It is worth noting that even the dissemination of this very practical information creates synergies among the media

Category 2

Radio program

Radio in the press?

The opposite is also true: newspapers often include the schedules of many stations because they know this is an essential service for many of their readers. 

Gazette de Lausanne 1986

Category 3

Radio as a source of information

Radio in the press?

More interesting than advertisements or radio programs, radio can also be cited as a source of information in news articles.

Feuille d'avis de Neuchâtel 1940

A search of the Impresso corpus shows that this was fairly rare in Swiss newspapers during the early decades (1920–1950), but it became more common thereafter.

Category 3

Radio as a source of information

Radio in the press?

An interesting observation: it is very often radio stations from communist countries that are mentioned in Swiss newspapers

Feuille d'avis de Neuchâtel 1950

Feuille d'avis de Neuchâtel 1960

The Swiss media did not have correspondents in these countries and therefore relied on reports from government radio stations

Category 3

Radio as a source of information

Radio in the press?

As the second half of the 20th century progresses, radio is cited more and more often as a source of information. It is increasingly becoming an integral part of the media landscape.

Feuille d'avis de Neuchâtel 1973

Category 4

Radio as a subject in the news

Radio in the press?

The “richest” occurrences in this corpus involve references to radio as the subject of a news article.

Feuille d'avis de Neuchâtel 1940

Here, the press is complaining about the poor language used by radio hosts!

As soon as radio became widely accessible, it became a topic of discussion in its own right for the press. For example, newspapers began to review the programs themselves.

Category 4

Radio as a subject in the news

Radio in the press?

Feuille d'avis de Neuchâtel 1960

"Miroir du monde" is the flagship international news program on the Swiss French-speaking radio, coming soon in the Impresso app!

The press is becoming a platform for reacting to what is said on the radio, a forum for discussing the new profession of radio journalism.

Feuille d'avis de Neuchâtel 1970

The "radio critique" section

Category 4

Radio as a subject in the news

Radio in the press?

Feuille d'avis de Neuchâtel 1940

The press is a valuable source for exploring the institutional history of radio.

Feuille d'avis de Neuchâtel 1970

Feuille d'avis de Neuchâtel 1980

Category 4

Radio as a subject in the news

Radio in the press?

Feuille d'avis de Neuchâtel 1940

The press also shows us when and how radio becomes part of people’s daily lives.

Feuille d'avis de Neuchâtel 1960

Category 4

Radio as a subject in the news

Radio in the press?

L'Impartial 1926

Finally, the press is also a source for studying the history of radio technology

Feuille d'avis de Neuchâtel 1940

Feuille d'avis de Neuchâtel 1960

Category 4

Radio as a subject in the news

Radio in the press?

Feuille d'avis de Neuchâtel 1960

Worth noting: radio and TV magazines would be a major source for writing about the people behind the radio, but the press is also very valuable.

Feuille d'avis de Neuchâtel 1960

These two articles, published in the same issue on March 30, 1960, offer two unique perspectives on the people behind the microphone:

On the left, a popular radio host “reveals” himself in a photo… for an advertisement for a brand of razors! On the right, a report states that a radio employee has been convicted for refusing to serve in the Swiss army due to his pacifist and religious beliefs.

Named entities

This leads us down an interesting path: can we study the people who appear in the selected articles? 

In the Impresso app, “named entities” (people, places, organizations) mentioned in articles are automatically identified. 

Named entities


In a collection centered on the keyword “radio,” three categories of people emerge: composers, heads of state and prominent politicians, and… figures from the radio world.

In our case, two individuals emerge: the journalist René Payot (who moved from print media to radio), and the satirist Jack Rollan (who did the opposite).

Entities are linked to the Wikidata entries for the relevant individuals, if such entries exist.

Named entities

The platform allows users to search for all instances of a named entity

This provides insight into the career timeline of René Payot, a journalist for the Journal de Genève who became a radio commentator on international news during World War II and then hosted widely followed international news programs for two decades.

Radio in Impresso

But very soon, there will also be radio collections available right in Impresso! 

1. Radio in text format

Radio typescripts

We have received the transcripts from Swissinfo, the Swiss international radio station, covering the period of World War II, in French and German.

Radio in Impresso


2. Radio in audio format

Radio broadcasts

We are making major changes to the platform to enable it to support audio.

We have collections from RTS, radio stations in French-speaking Switzerland, and the INA, which archives French public radio, just waiting to be released.

Next: Impresso datalab

In the next chapter, let’s see how digital humanities tools allow us to go further with the data from Impresso.

Impresso datalab

Taking it further: press archives and quantitative methods

Chapter 3

The Impresso datalab

Welcome to the Impresso datalab

The Impresso datalab

First, log in to the datalab so that you can create a temporary personal token.

The Impresso datalab

The Datalab contains instructions for programmatically accessing the Impresso API. But for the general public, it’s the notebooks that are of particular interest. 

We have also developed additional notebooks that are not yet published here, and we invite colleagues outside the project to develop their own notebooks, which we could add here.

Jupyter notebooks

But what exactly is a notebook?

A Jupyter Notebook is an interactive, (often) web-based tool that lets you combine live code, visual graphics, mathematical equations, and narrative text into a single shareable document.

1. A notebook contains lines of code 

2. It also includes comments explaining what we're doing

3. Each cell of code can be executed to see the result immediately 

Notebook 1: "Inspecting" a collection

Let's use a very simple statistical analysis notebook

This notebook was created for educational purposes, to generate basic statistics based on an Impresso collection. 

Notebook 1: "Inspecting" a collection

We open it in Google Colab, an interface that lets us work online with this Jupyter notebook

And we upload the CSV file of the collection we created during our previous exploration of radio coverage in the press.

Notebook 1: "Inspecting" a collection

Each line of code is commented to guide the user

Here, we load the dataset

Notebook 1: "Inspecting" a collection

The notebook guides us through the process of creating our first chart.

In this case, a histogram showing how many articles in our collection come from each newspaper.

Remember that our collection includes articles containing the keyword “radio” from three Swiss newspapers during the 1950s.
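
The loading and counting steps can be sketched with the Python standard library alone. This is not the code of the Impresso notebook: the column names (`uid`, `newspaper`, `date`, `word_count`) and the four sample rows are hypothetical, and a real export may use other headers or another delimiter.

```python
import csv
import io
from collections import Counter

# Tiny stand-in for the exported CSV file; the column names and rows are invented.
SAMPLE_CSV = """uid,newspaper,date,word_count
a1,JDG,1952-03-01,240
a2,LLE,1956-07-14,510
a3,JDG,1958-11-02,180
a4,LSE,1959-05-20,320
"""


def load_collection(handle):
    """Read the export into a list of dictionaries, one per article."""
    return list(csv.DictReader(handle))


articles = load_collection(io.StringIO(SAMPLE_CSV))
per_newspaper = Counter(row["newspaper"] for row in articles)

# Text version of the chart: one bar per newspaper.
for name, count in per_newspaper.most_common():
    print(f"{name:<4} {'#' * count} {count}")
```

With a real export, `io.StringIO(SAMPLE_CSV)` would be replaced by the opened CSV file, for example `open("collection.csv", newline="", encoding="utf-8")`, where the file name is again only an example.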

Notebook 1: "Inspecting" a collection

The notebook then generates a timeline of articles, organized by newspaper.

We can see that the number of articles in La Sentinelle (LSE) mentioning the radio increased toward the end of the decade

Notebook 1: "Inspecting" a collection

The notebook then guides us through the process of generating other exploratory statistics.

For example, an analysis of article length (as a reminder, when we created the collection, we excluded all articles with fewer than 100 words) 

The length of articles varies from one newspaper to another: on average, articles about radio in La Liberté (LLE) are twice as long as those in Le Journal de Genève (JDG).
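
A comparison of this kind boils down to grouping word counts by newspaper and averaging them. The sketch below assumes hypothetical field names (`newspaper`, `word_count`) and uses invented numbers chosen only to make the arithmetic easy to follow; they are not the figures from the collection.

```python
from collections import defaultdict
from statistics import mean

# Invented rows; a real collection would be read from the exported CSV.
rows = [
    {"newspaper": "JDG", "word_count": "200"},
    {"newspaper": "JDG", "word_count": "300"},
    {"newspaper": "LLE", "word_count": "400"},
    {"newspaper": "LLE", "word_count": "600"},
    {"newspaper": "LLE", "word_count": "80"},  # below the 100-word filter
]

MIN_WORDS = 100  # same threshold as the filter used when building the collection

lengths = defaultdict(list)
for row in rows:
    words = int(row["word_count"])  # CSV values arrive as strings
    if words >= MIN_WORDS:
        lengths[row["newspaper"]].append(words)

mean_length = {name: mean(values) for name, values in lengths.items()}

for name, value in sorted(mean_length.items()):
    print(f"{name}: {value:.0f} words on average")
```

Because the mean is sensitive to a few very long articles, a median (`statistics.median`) is worth checking alongside it before drawing conclusions about a newspaper's style.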

Notebook 2: Network of named entities

This second notebook helps highlight the main named entities that appear together in the articles

No need to create a collection beforehand; the notebook will connect to the Impresso data on its own.

So we can run a query directly in the notebook, for example to find the 100 people who appear most frequently in articles containing the word “radio” in the three newspapers and the period covered by our previous analyses.

Notebook 2: Network of named entities

It generates a network that shows how often the most frequently mentioned people appear in the same press articles.

This network can then be exported and viewed more clearly in software such as Gephi

In the case of the "radio" keyword, the entities are very strongly influenced by the music programming. Composers (in white) are much more prominent than political figures (in blue) or media personalities (in red).
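
The principle behind such a network is a co-occurrence count: every pair of people mentioned in the same article adds one to the weight of the edge between them. Here is a minimal sketch under that assumption; it is not the notebook's code, the three "articles" are made up, and the `Source`, `Target`, `Weight` headers follow the column names that Gephi recognizes when importing an edge table.

```python
import csv
import io
from collections import Counter
from itertools import combinations

# Invented input: the people detected in each article.
article_entities = [
    ["René Payot", "Jack Rollan"],
    ["Mozart", "Beethoven", "René Payot"],
    ["Mozart", "Beethoven"],
]

edge_weights = Counter()
for people in article_entities:
    # sorted(set(...)) removes repeats and gives each unordered pair one canonical form.
    for first, second in combinations(sorted(set(people)), 2):
        edge_weights[(first, second)] += 1


def to_edge_list(weights) -> str:
    """Serialize the weighted edges as CSV text that can be saved for Gephi."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (first, second), weight in sorted(weights.items()):
        writer.writerow([first, second, weight])
    return buffer.getvalue()


print(to_edge_list(edge_weights))
```

Counting each pair once per article, rather than once per mention, keeps a long article that repeats the same two names from dominating the network.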

Notebook 3: Map of locations

This third notebook will retrieve the geographic coordinates of the locations mentioned in the articles and generate a map.

Just like in the previous notebook, we run the query directly here.

Next, the top 100 results are geolocated by querying Wikidata.
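
One possible way to do this geolocation step is to ask the public Wikidata SPARQL endpoint for the "coordinate location" property (P625) of each linked entity. The sketch below only builds the request and parses the answer format; it does not call the network, it is not the notebook's actual code, and the identifier and coordinates in the example are illustrative.

```python
import re
from urllib.parse import urlencode

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_coordinates_query(qids):
    """Build a SPARQL query returning the coordinate location (P625) of each item."""
    values = " ".join(f"wd:{qid}" for qid in qids)
    return (
        "SELECT ?item ?coord WHERE { "
        f"VALUES ?item {{ {values} }} "
        "?item wdt:P625 ?coord . }"
    )


def build_request_url(qids):
    """Return the GET URL that asks the endpoint for JSON results."""
    params = {"query": build_coordinates_query(qids), "format": "json"}
    return SPARQL_ENDPOINT + "?" + urlencode(params)


# Wikidata returns coordinates as WKT literals, longitude first: "Point(lon lat)".
POINT = re.compile(r"Point\(([-\d.]+) ([-\d.]+)\)")


def parse_point(wkt):
    """Convert 'Point(lon lat)' into a (latitude, longitude) tuple."""
    match = POINT.fullmatch(wkt.strip())
    if match is None:
        raise ValueError(f"not a WKT point: {wkt!r}")
    longitude, latitude = float(match.group(1)), float(match.group(2))
    return latitude, longitude


print(build_coordinates_query(["Q71"]))   # example identifier
print(parse_point("Point(6.15 46.2)"))    # illustrative value
```

The order of the two numbers is the classic trap here: mapping libraries usually expect latitude first, while the WKT literal gives longitude first.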

Notebook 3: Map of locations

It generates a map showing the cities (in black) and countries (in red) mentioned in the news articles.

In the case of a corpus built around the keyword “radio,” the results aren’t very interesting: the locations are generally the cities where the broadcasts or concerts take place, with a few capital cities occasionally mentioned in news stories. This approach is much more interesting when working with a more general-purpose corpus.

Beyond metadata analysis

Semantic analysis of the article contents using embeddings

What are embeddings?

Beyond metadata analysis

Animation: Coenen and Pearce, “Understanding UMAP”, https://pair-code.github.io/understanding-umap/ 

How can we interpret a UMAP, a cloud of dots that is a “flattened” version of our original multidimensional object (which is itself too complex to grasp)?

Visualizing the vector representations of articles in 2D means reducing their dimensionality (from hundreds of dimensions to just 2) while trying to preserve the semantic distances between them.
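
Before any projection, "semantic distance" is usually measured directly between the vectors, for instance with cosine similarity. The toy example below uses invented 3-dimensional vectors and made-up labels in place of real embeddings, which have hundreds of dimensions; it illustrates the measure, not the UMAP projection itself.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented vectors standing in for article embeddings.
toy_embeddings = {
    "radio programme listing": [0.9, 0.1, 0.0],
    "concert broadcast schedule": [0.8, 0.2, 0.1],
    "stock market report": [0.0, 0.1, 0.9],
}

listing = toy_embeddings["radio programme listing"]
schedule = toy_embeddings["concert broadcast schedule"]
stocks = toy_embeddings["stock market report"]

similar_pair = cosine_similarity(listing, schedule)
distant_pair = cosine_similarity(listing, stocks)

print(f"listing vs schedule: {similar_pair:.2f}")
print(f"listing vs stocks:   {distant_pair:.2f}")
```

A 2D map can only approximate these pairwise similarities, which is why positions on a UMAP should be read as neighbourhoods rather than exact distances.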

Interpreting semantic proximity maps (UMAPs)

Beyond metadata analysis

Animation: Michelet and Grandjean (2026) "Les embeddings, nouvel outil d’une histoire numérique des grands corpus de presse?"

A tool for exploring large press corpora

UMAP of all 60,000 articles published in La Liberté, the Journal de Genève and the Feuille d'avis de Neuchâtel in 1950.

The semantic space of a collection

Let's analyze our collection as a UMAP

This notebook, which is not yet available on the Datalab, retrieves the embeddings already calculated by Impresso or calculates them again if they are not provided by the API.

We've tweaked the collection's filters slightly to include advertisements and short articles that aren't on the front page, which results in a larger and more diverse corpus. This corpus of 4,000 articles is processed in about 20 minutes.

The semantic space of a collection

La Liberté (Catholic, Fribourg)

Journal de Genève (Liberal conservative, Geneva)

La Sentinelle (Socialist, La Chaux-de-Fonds)

A map for exploring a corpus of press articles about radio

Each dot represents an article from 1950 that contains the keyword “radio” in the three selected newspapers.

This UMAP “map” is a simplification of the semantic distance between all these articles; it makes significant compromises to ensure it is readable in 2D.

Two articles that are close to each other on the map are not necessarily the most similar ones; proximity means that they probably belong to the same semantic register (that they are about the same thing). That's why we'll focus mainly on groups rather than on individual points.

The semantic space of a collection

La Liberté (Catholic, Fribourg)

Journal de Genève (Liberal conservative, Geneva)

La Sentinelle (Socialist, La Chaux-de-Fonds)

What can we see on this map?

Stock market and financial news

(Radio corporation of America stocks)

International news

Religious news

"Faits divers"

(accidents, crimes, unexpected)

National/local news

Foreclosure auctions

(A radio device is included among the items sold)

Radio programs

(Each newspaper has a different style)

Religious calendar

Culture

(music, cinema, theatre)

Radio critique

Advertisement

The semantic space of a collection

La Liberté (Catholic, Fribourg)

Journal de Genève (Liberal conservative, Geneva)

La Sentinelle (Socialist, La Chaux-de-Fonds)

Focus on international news: Why is radio mentioned?

Korean War

(North Korean and Russian radio stations as sources)

United Kingdom

(In many cases, the government addressing the public via radio)

Belgium

(King Leopold III is facing criticism and addresses the nation on the radio)

Sweden

(The death of King Gustav V announced on Swedish radio)

Miscellaneous news

(Articles that contain a series of short news stories from around the world. Radio is often a source)

Various European news

Various Asian news

(China, Tibet, Indochina)

Since all the articles contain the word “radio,” this does not accurately reflect media coverage in 1950; it skews the results in favor of regions where news reaches Swiss newspapers via official (often communist) radio stations.

Comparing two collections

Now, we want to use embeddings to compare two collections, for example to study how radio news programs select information compared to the print media.

Here, we have selected a week's worth of news bulletins from Radio Suisse Internationale in June 1945.

To make them comparable to news articles, these bulletins must first be broken down into segments of information.

This is the text that was read on the air. There’s one every day in French, German, English, and sometimes Spanish.

Comparing two collections

One of the major news stories that week was the diplomatic progress made at the San Francisco conference among the “Big Five” (establishment of the future UN).

45 radio news segments

1674 press articles

Let's compare the news coverage in the press and on the radio for the week of June 4–10, 1945

All the articles (no ads) from the Journal de Genève, Gazette de Lausanne, La Sentinelle, Feuille d'Avis de Neuchâtel, L'Impartial and La Liberté.

Comparing two collections

News coverage in the press and on the radio

June 4–10, 1945

Press articles

Radio broadcast segment

Area of the semantic map that is not covered by the radio bulletins
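
One way to make "not covered" measurable is to look, for each press article, at its most similar radio segment and flag the article when even that best match is weak. The sketch below is an assumption about how such a check could be written, not the method used for the map: it works in the original embedding space rather than on the 2D projection, and the 2-dimensional vectors, labels, and 0.8 threshold are all invented.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def uncovered_articles(press, radio, threshold=0.8):
    """Return the press items whose best match among radio segments is below the threshold."""
    result = []
    for label, vector in press.items():
        best = max(cosine_similarity(vector, segment) for segment in radio.values())
        if best < threshold:
            result.append(label)
    return result


# Invented toy vectors: first axis = "diplomacy and war", second axis = "sports".
press_articles = {
    "San Francisco conference": [0.9, 0.1],
    "War in the Pacific": [0.8, 0.3],
    "Sports results": [0.1, 0.9],
}
radio_segments = {
    "Big Five talks": [1.0, 0.0],
}

not_covered = uncovered_articles(press_articles, radio_segments)
print(not_covered)
```

The threshold is a modelling choice: a stricter value enlarges the "uncovered" area, so it is worth testing several values before interpreting the gaps.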

Comparing two collections

What is the order of priority in Radio Suisse Internationale's news bulletins?

June 4–10, 1945

Press articles

Radio broadcast segment

Crisis in the Levant

Syria, Lebanon, France

War in the Pacific

Invasion of Okinawa

Germany

Administering liberated Berlin

The 'Big Five'

San Francisco conference

Tito in Trieste

Bonomi resignation

Italy

Expulsions of Nazis

Release of prisoners

Humanitarian issues

Swiss parliament

 

Resignation of General Guisan

Economic news

INTERNATIONAL

SWITZERLAND

Obituaries

"Faits divers"

Radio programs

Culture

Sports

Swiss international relations

Classified ads

Commercial ads

Visual embeddings

1 slide to mention the potential of visual embeddings

Next: Perspectives

In the next chapter, we will conclude by discussing the limitations of these approaches and the future developments of Impresso

Perspectives

Critical approach and ongoing developments

Chapter 4

Perspectives

Digitizing collections isn’t exactly new, nor is finding quantitative methods to analyze historical data. But creating an interface that allows us to do both—on a large scale and using methods that weren’t yet available when the first project was launched—raises some questions.

In particular, this combination of data and methods bridges the gap between the computer science and history communities, so we must ensure that the democratization of these tools is accompanied by a critical perspective, regardless of which side one comes from.

A critical digital history

Perspectives

For several years now, we have been testing these tools in our classes. It’s always interesting to see how students—whether or not they have prior technical knowledge—use them to analyze these press archive collections.

Incorporating these tools into the historian's toolkit

EPFL MA Course 2025-2026

Examples of student work

Perspectives

Ongoing developments, new collections

To go further

List of links and resources

 

1 publication to highlight?

Impresso – Media Monitoring of the Past

By Martin Grandjean


Presentation of the Impresso project, with a focus on the analysis of press and radio archives
