Computational approaches to opera libretti

An experiment on DraCor Corpora

Luca Giovannini

Daniil Skorinkin

Workshop version - 09.02.2023

Opera and libretti

  • A new, “artificial” genre born in the early 17th c. in Italy, and rapidly exported across Europe

  • Traditionally: focus on music more than words

  • Librettology: still an analogue discipline

  • Few computational investigations

Research questions

  • Is it possible to consider libretti a unitary genre with its own structural features?

  • Do libretti possess a peculiar "genre signal" which sets them apart from contemporary comedies and tragedies?

  • How did they structurally evolve in comparison to the other genres?

Paper walkthrough 🚀

  1. Data preparation

  2. Feature selection

  3. Data exploration

  4. Results and discussion

1. Data preparation

Corpus preparation

  1. Get metadata from GerDraCor 🇩🇪 & FreDraCor 🇫🇷 via the DraCor API (programmable corpora go brrr!)
  2. Investigate the 'libretto' column:
    • 55 True (explicitly marked libretti) for 🇩🇪
    • 58 True (explicitly marked libretti) for 🇫🇷
  3. Compare 'libretto' and 'normalized_genre' columns: 
    • For 🇩🇪 'libretto' and 'normalized_genre' were mutually exclusive (= 0 multi-label plays)
    • For 🇫🇷 16 multi-label plays
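
A minimal sketch of how such counts could be obtained, assuming the metadata table has already been loaded into a pandas DataFrame. The column names `libretto` and `normalized_genre` come from the slides; the helper name `libretto_overview` and the toy table are invented for illustration and are not the authors' code.

```python
import pandas as pd

def libretto_overview(meta: pd.DataFrame) -> dict:
    """Count explicitly marked libretti and plays carrying both labels."""
    is_libretto = meta["libretto"].fillna(False).astype(bool)
    has_genre = meta["normalized_genre"].notna()
    return {
        "libretti": int(is_libretto.sum()),
        # "multi-label" = marked as libretto AND given a normalized genre
        "multi_label": int((is_libretto & has_genre).sum()),
    }

# Toy stand-in for a corpus metadata table (invented values).
toy = pd.DataFrame({
    "name": ["play-a", "play-b", "play-c", "play-d"],
    "libretto": [True, True, False, False],
    "normalized_genre": ["Comedy", None, "Tragedy", None],
})
overview = libretto_overview(toy)
print(overview)
```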

Corpus preparation

  1. Create the new 'libretto or genre' column
  2. For the 16 multi-label 🇫🇷 plays prefer 'libretto' over 'normalized genre'
  3. Initial 'libretto or genre' stats:
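
The merged column and its counts could be derived along these lines. This is only a sketch under assumptions: the function name, the label string "Libretto" and the toy rows are hypothetical; only the rule "prefer 'libretto' over 'normalized genre'" is taken from the slide.

```python
import pandas as pd

def add_libretto_or_genre(meta: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a 'libretto or genre' column; libretto wins on conflict."""
    out = meta.copy()
    is_libretto = out["libretto"].fillna(False).astype(bool)
    # Keep the normalized genre where the play is NOT a libretto,
    # otherwise overwrite it with the label 'Libretto'.
    out["libretto or genre"] = out["normalized_genre"].where(~is_libretto, "Libretto")
    return out

# Toy rows (invented): one multi-label play, one plain libretto,
# one tragedy, and one play with no label at all.
toy = pd.DataFrame({
    "libretto": [True, True, False, False],
    "normalized_genre": ["Comedy", None, "Tragedy", None],
})
labelled = add_libretto_or_genre(toy)
print(labelled["libretto or genre"].value_counts(dropna=False))
```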

Corpus enrichment

  1. Retrieve all items with the 'subtitle' containing one of these labels for operatic subgenres:
    ballet de cour, ballet-héroïque, burlesque, comédie-ballet, divertissement, drame lyrique, entrée, grand opéra, intermède, Lehrstück, Liederspiel, Märchenoper, masque, Monodrama, opéra-ballet, opéra bouffon, opéra comique, opéra-féerie, pantomime, pastorale-héroïque, Posse, Schuldrama, Schuloper, Singspiel, Spieloper, tragédie en musique, vaudeville, Zauberoper, Zeitoper
  2. Check them manually and append to 'libretti'
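
The retrieval step might look roughly like the sketch below, which searches subtitles for a handful of the labels listed above. The helper name and the example subtitles are invented; a substring match of this kind only produces candidates, which is why the manual check in step 2 remains necessary.

```python
import re

# A few of the subgenre labels from the slide (not the full list).
SUBGENRE_LABELS = [
    "Singspiel", "opéra comique", "tragédie en musique", "vaudeville", "Zauberoper",
]

# One case-insensitive alternation; longer labels first so they win over overlaps.
_PATTERN = re.compile(
    "|".join(re.escape(label) for label in sorted(SUBGENRE_LABELS, key=len, reverse=True)),
    flags=re.IGNORECASE,
)

def find_subgenre(subtitle):
    """Return the first operatic subgenre label found in a subtitle, or None."""
    if not subtitle:
        return None
    match = _PATTERN.search(subtitle)
    return match.group(0) if match else None

# Invented subtitles for illustration.
candidates = {
    "play-a": "Ein Singspiel in drei Aufzügen",
    "play-b": "Comédie en cinq actes",
    "play-c": "Opéra comique en un acte",
}
hits = {pid: find_subgenre(sub) for pid, sub in candidates.items() if find_subgenre(sub)}
print(hits)
```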

Problem #1: blurred boundaries for the concept of opera

Corpus enrichment

  1. Retrieve Wikidata Genres through the plays' Wikidata IDs
  2. Map genres manually to one of our 5 categories (Comedy, Tragedy, Tragicomedy, Libretto, None)
  3. Add Wikidata genre to those which had neither normalized_genre nor libretto filled
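
Steps 2 and 3 can be sketched as a dictionary lookup followed by a fill of the empty cells. Everything here is an assumption made for illustration: the column `wikidata_genre`, the mapping entries and the toy rows are hypothetical, and the retrieval itself (for example via Wikidata's genre property, P136) is left out.

```python
import pandas as pd

# Hypothetical manual mapping from Wikidata genre labels to the slide's categories.
WIKIDATA_TO_CATEGORY = {
    "comedy": "Comedy",
    "tragedy": "Tragedy",
    "tragicomedy": "Tragicomedy",
    "opera": "Libretto",
}

def fill_from_wikidata(meta: pd.DataFrame) -> pd.DataFrame:
    """Fill missing 'libretto or genre' values from a mapped Wikidata genre."""
    out = meta.copy()
    mapped = out["wikidata_genre"].map(WIKIDATA_TO_CATEGORY)  # unmapped -> NaN
    # Existing labels are never overwritten; only gaps are filled.
    out["libretto or genre"] = out["libretto or genre"].fillna(mapped)
    return out

# Toy rows (invented values).
toy = pd.DataFrame({
    "libretto or genre": ["Comedy", None, None, None],
    "wikidata_genre": ["tragedy", "opera", "satire", None],
})
enriched = fill_from_wikidata(toy)
print(enriched)
```

Plays whose Wikidata genre has no mapping, or which have no Wikidata genre at all, stay unlabelled in this sketch.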

Problem #2: missing genres

Enrichment results

2. Feature selection

Feature selection

  1. Get numeric features from the metadata table
  2. Drop features deemed irrelevant to play structure (e.g. num_p, num_l, num_female_speakers)
  3. Look for highly correlated features (cf. Szemes & Vida 2023)
  4. Remove one feature from each highly correlated pair
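
Steps 3 and 4 could be implemented roughly as follows. This is a sketch, not the authors' procedure: it assumes Pearson correlation, uses an absolute threshold of 0.75, and always drops the later column of a correlated pair, whereas the slides do not say how the feature to remove was chosen. The toy columns `f1` to `f3` are invented.

```python
import numpy as np
import pandas as pd

def drop_correlated(features: pd.DataFrame, threshold: float = 0.75):
    """Drop the later column of every pair with |r| above the threshold."""
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop), to_drop

# Toy features: f2 is a linear function of f1, f3 is only weakly related.
toy = pd.DataFrame({
    "f1": [1, 2, 3, 4, 5, 6],
    "f2": [3, 5, 7, 9, 11, 13],
    "f3": [2, 5, 1, 6, 2, 5],
})
kept, dropped = drop_correlated(toy)
print(dropped, list(kept.columns))
```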

Correlation matrix

Correlation network

edge = high correlation (>0.75 or <-0.75)

dropped in 🇩🇪

  • average_path_length
  • diameter
  • max_degree
  • num_connected_components


dropped in 🇫🇷

  • num_of_segments
  • average_path_length
  • max_degree

3. Data exploration

Exploration attempt #1

  1. Split the corpora into roughly 50-year spans (3 for 🇩🇪, 5 for 🇫🇷) to closely follow the genre's evolution
  2. Apply dimensionality reduction methods (PCA, UMAP, t-SNE) to our "bag-of-features" plays
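
As an illustration of the two steps, the sketch below bins years into fixed 50-year spans and projects a feature matrix onto its first two principal components via SVD. Only PCA is shown. The standardisation step, the helper names and the random toy matrix are assumptions, not details from the slides.

```python
import numpy as np

def span_label(year, start, width=50):
    """Label the fixed-width span a year falls into, e.g. '1670-1719'."""
    lo = start + ((year - start) // width) * width
    return f"{lo}-{lo + width - 1}"

def pca_2d(X):
    """Project rows of X onto their first two principal components."""
    X = np.asarray(X, dtype=float)
    # Standardise so large-scale features (e.g. word counts) do not dominate.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # SVD of the standardised matrix: rows of Vt are the principal axes.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = (S ** 2) / (S ** 2).sum()
    return Z @ Vt[:2].T, explained[:2]

# Toy "bag-of-features" matrix: 20 plays x 4 features (random values).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
coords, explained = pca_2d(X)
print(span_label(1675, start=1670), coords.shape, explained)
```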


Distribution across timeframes

None of the dimensionality reduction methods worked well... with the possible exception of the 🇫🇷 1670-1719 segment

Rethinking libretti

  • We could not see any progressive "genrification" of libretti... at least with the collection of features we used
  • Let's take into account their generic alignment!

Exploration attempt #2

  1. Split the corpora into 50-year spans
  2. 🆕 Differentiate between comic and non-comic libretti using their subtitles (e.g. komische Oper)
  3. Apply dimensionality reduction methods (PCA, UMAP, t-SNE) to timeframes
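
The subtitle-based split in step 2 might be approximated with a small keyword rule like the one below. The cue words are an assumption (the slide gives only a single German example), as are the class labels and the function name, so this is a sketch rather than the rule actually used.

```python
import re

# Hypothetical cue words signalling a comic libretto in a subtitle.
COMIC_CUES = re.compile(r"komisch|comique|bouffon", flags=re.IGNORECASE)

def libretto_subclass(subtitle):
    """Classify a libretto as comic or non-comic from its subtitle."""
    if subtitle and COMIC_CUES.search(subtitle):
        return "comic libretto"
    # Anything without a comic cue (including a missing subtitle) is non-comic.
    return "non-comic libretto"

# Invented subtitles for illustration.
examples = ["Komische Oper in zwei Akten", "Tragédie en musique", None]
labels = [libretto_subclass(s) for s in examples]
print(labels)
```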

This is how the 🇫🇷 1670-1719 timeframe looks with the two libretti subclasses:

Plot annotations: comic space; tragic zone; non-comic libretti; autonomous region

PCA 🇩🇪 timespans

PCA 🇫🇷 timespans

Exploring features' significance #1

  • binary: libretti vs non-libretti (tragicomedies removed)


  1. Statistical significance tests (Shapiro-Wilk, Wilcoxon)
  2. Random forest classifier

Statistical testing

  • Feature-wise comparison of libretti against non-libretti with Wilcoxon Rank Sum test
  • Resulting p-values for each feature distribution:
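
A feature-wise rank-sum comparison could be run along these lines with SciPy's `ranksums`. The data layout (a dict of arrays per group), the function name and the randomly generated toy values are assumptions for illustration. The toy numbers are not results from the corpora.

```python
import numpy as np
from scipy.stats import ranksums

def ranksum_pvalues(libretti, others):
    """Wilcoxon rank-sum p-value per feature; inputs map feature name -> values."""
    return {
        name: float(ranksums(libretti[name], others[name]).pvalue)
        for name in libretti
        if name in others
    }

# Toy samples: one feature with a clear shift, one with none (invented).
rng = np.random.default_rng(42)
toy_libretti = {
    "word_count_sp": rng.normal(8000, 1500, size=40),
    "density": rng.normal(0.5, 0.1, size=40),
}
toy_others = {
    "word_count_sp": rng.normal(15000, 1500, size=60),
    "density": rng.normal(0.5, 0.1, size=60),
}
pvalues = ranksum_pvalues(toy_libretti, toy_others)
for name, p in pvalues.items():
    print(f"{name}: p = {p:.3g}")
```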

Boxplot distributions for 🇫🇷: word_count_stage and word_count_sp

Classifier training

  • Random Forest Classifier

  • 5-fold cross validation on all data

  • Iterative selection of the best n_estimators parameter (10-1000)

  • Looking at feature importances
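
The training loop can be sketched with scikit-learn as below. The synthetic dataset, the short candidate grid (the slides mention a range of 10-1000), the fixed random seed and the helper name are all assumptions chosen to keep the example small and quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def best_n_estimators(X, y, candidates=(10, 50, 100), folds=5, seed=0):
    """Pick the forest size with the highest mean cross-validated accuracy."""
    scores = {}
    for n in candidates:
        clf = RandomForestClassifier(n_estimators=n, random_state=seed)
        scores[n] = float(cross_val_score(clf, X, y, cv=folds).mean())
    return max(scores, key=scores.get), scores

# Synthetic binary data standing in for libretti vs non-libretti.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)
best_n, cv_scores = best_n_estimators(X, y)

# Refit on all data with the selected size and read off the importances.
final = RandomForestClassifier(n_estimators=best_n, random_state=0).fit(X, y)
importances = final.feature_importances_
print(best_n, cv_scores, importances.round(3))
```

Impurity-based importances of this kind are relative weights that sum to one, so they rank features within a model rather than measuring effect sizes.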

Classifier feature importances

Most relevant features, 🇫🇷

  1. word_count_stage
  2. word_count_sp
  3. num_connected_components
  4. density
  5. num_of_speakers
  6. diameter

Most relevant features, 🇩🇪

  1. word_count_sp
  2. num_of_person_groups
  3. average_degree

Exploring features' significance #2

  • four-class implementation

  • plotting each play individually

  • LOWESS-based smoothing curves to make trends visible
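
To show what the smoothing does, here is a simplified LOWESS written with NumPy: a tricube-weighted local linear fit around each point, without the robustness iterations of the full algorithm. It is a teaching sketch under those assumptions, not the implementation used for the plots, and the toy series is invented.

```python
import numpy as np

def lowess_simple(x, y, frac=0.5):
    """Locally weighted linear regression (LOWESS without robustness iterations)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    k = min(len(x), max(2, int(np.ceil(frac * len(x)))))  # neighbours per local fit
    fitted = np.empty_like(y)
    for i, x0 in enumerate(x):
        dist = np.abs(x - x0)
        idx = np.argsort(dist)[:k]            # indices of the k nearest points
        h = dist[idx].max() or 1.0            # bandwidth: distance to the farthest of them
        w = (1 - (dist[idx] / h) ** 3) ** 3   # tricube weights
        # Weighted least squares for a local line, centred at x0 for stability.
        A = np.column_stack([np.ones(len(idx)), x[idx] - x0])
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(A * sw[:, None], y[idx] * sw, rcond=None)
        fitted[i] = coef[0]                   # intercept = fitted value at x0
    return fitted

# Toy series: a noisy upward trend over publication years (invented).
rng = np.random.default_rng(1)
years = np.arange(1650, 1850, 10)
values = 0.01 * (years - 1650) + rng.normal(0, 0.3, size=years.size)
trend = lowess_simple(years, values, frac=0.5)
print(trend.round(2))
```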

4. Results and discussion

1. Empirical verification of literary criticism

Libretti have less spoken text and more stage directions


The trend is more prominent in French, but also visible in German

2. An interesting pattern: independence of non-comic libretti (as far as some structural features are concerned)

🇩🇪: num_groups / word_count_sp

🇫🇷: density / num_speakers

🇩🇪 4-class classifier, confusion matrix

it is easier to confuse comedies and comic libretti

🇫🇷 4-class classifier, confusion matrix

it is easier to confuse comedies and comic libretti

3. The French dramatic space is more formalised than the German one

  • Looking at the PCA clusterings, it seems easier to discriminate between different genres in 🇫🇷
  • Historical reasons (due to text availability):
    • 🇫🇷 corpus starts from an "age of normative aesthetics" (Boileau, d'Aubignac)
    • 🇩🇪 corpus starts from an age where deconstruction of French classical models was underway (Lessing)
    • Rough explanation!

the two types of French libretti are more distinct than the German ones

(some) limitations

  • Corpus size and markup quality
  • Comparative approach: lack of 🇮🇹
  • Difficulties in modelling relations between texts on the basis of formal features → need for better operationalisation and implementation?

Closing reflections 🤔

  • Individual structural features might be useful for distinguishing libretti from non-libretti (e.g. text length), or comedies from tragedies (e.g. density)

  • However, it is not easy to distinguish between plays once they are formalised as vectors of multiple features

  • Drama seems too homogeneous, in terms of structural properties, for discriminative clustering

  • Topic modelling actually seems to work better in distinguishing genres, as per Shaw's famous quote

Thanks for listening!
