Computational approaches to opera libretti

2nd Conference for Computational Literary Studies, Würzburg, 23.06.2023

Luca Giovannini — Daniil Skorinkin

University of Potsdam, Germany

Summary

  1. Research question
  2. Corpus
  3. Experiments
  4. Findings and discussion

This presentation: plu.sh/libretti

1. Research question

Libretto

  • A new, “artificial” genre born in the early 17th century in Italy and rapidly exported across Europe

  • Traditional scholarly focus on music more than words

  • Librettology: still largely an analogue discipline

  • Few computational investigations

Some questions

  • Is it possible to consider libretti a unitary genre with its own structural features?

  • Do libretti possess a peculiar "genre signal" which sets them apart from contemporary comedies and tragedies?

  • How did the structure of libretti evolve compared to the other genres?

2. Corpus

Starting point: DraCor corpora

(☞ Fischer et al. 2017, dracor.org)

Initial survey

  1. Get metadata from GerDraCor 🇩🇪 & FreDraCor 🇫🇷 via the DraCor API
  2. Investigate the 'libretto' column:
    • 55 libretti (as marked by DraCor) for 🇩🇪
    • 58 libretti (as marked by DraCor) for 🇫🇷
  3. Compare 'libretto' and 'normalized_genre' columns: 
    • For 🇩🇪 'libretto' and 'normalized_genre' were mutually exclusive (= 0 multi-label plays)
    • For 🇫🇷 16 multi-label plays
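The survey steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' script: it assumes the metadata table has already been loaded into a pandas DataFrame, that 'libretto' is a boolean column, and that an empty 'normalized_genre' is stored as a missing value. The toy rows are invented.

```python
import pandas as pd


def survey_genre_labels(meta: pd.DataFrame) -> dict:
    """Count libretti and plays that carry both a libretto flag and a genre label."""
    # Assumption: 'libretto' holds booleans; missing values count as False.
    is_libretto = meta["libretto"].eq(True)
    # Assumption: an absent genre is NaN/None or an empty string.
    genre = meta["normalized_genre"]
    has_genre = genre.notna() & (genre.astype(str).str.strip() != "")
    return {
        "libretti": int(is_libretto.sum()),
        "multi_label": int((is_libretto & has_genre).sum()),
    }


# Invented toy table standing in for a DraCor metadata export.
toy = pd.DataFrame(
    {
        "name": ["play-a", "play-b", "play-c", "play-d"],
        "libretto": [True, True, False, False],
        "normalized_genre": ["Comedy", None, "Tragedy", None],
    }
)
summary = survey_genre_labels(toy)
print(summary)
```

Running the same function on each corpus table would give the per-language counts of libretti and multi-label plays.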

Corpus preprocessing

  1. Normalisation of genre for the 16 multi-label 🇫🇷 plays: we preferred 'libretto' over 'normalized_genre'
  2. Initial hypothesis: the intended use of a libretto is more distinctive than its generic alignment

Corpus enrichment

  1. Retrieve all items with the 'subtitle' containing one of these labels for operatic subgenres:
    ballet de cour, ballet-héroïque, burlesque, comédie-ballet, divertissement, drame lyrique, entrée, grand opéra, intermède, Lehrstück, Liederspiel, Märchenoper, masque, Monodrama, opéra-ballet, opéra bouffon, opéra comique, opéra-féerie, pantomime, pastorale-héroïque, Posse, Schuldrama, Schuloper, Singspiel, Spieloper, tragédie en musique, vaudeville, Zauberoper, Zeitoper
  2. Qualitative check, then append to 'libretti' list
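A possible sketch of the subtitle lookup, under the assumption that matching is a case-insensitive substring search (the slides do not say how the retrieval was implemented). The function name is hypothetical. Short labels such as "Posse" or "masque" can also match inside longer words, which is one reason the qualitative check in step 2 matters.

```python
OPERATIC_LABELS = [
    "ballet de cour", "ballet-héroïque", "burlesque", "comédie-ballet",
    "divertissement", "drame lyrique", "entrée", "grand opéra", "intermède",
    "Lehrstück", "Liederspiel", "Märchenoper", "masque", "Monodrama",
    "opéra-ballet", "opéra bouffon", "opéra comique", "opéra-féerie",
    "pantomime", "pastorale-héroïque", "Posse", "Schuldrama", "Schuloper",
    "Singspiel", "Spieloper", "tragédie en musique", "vaudeville",
    "Zauberoper", "Zeitoper",
]


def find_operatic_label(subtitle):
    """Return the longest operatic label contained in the subtitle, or None."""
    if not isinstance(subtitle, str):
        return None  # missing subtitles cannot match
    lowered = subtitle.casefold()
    # Try longer labels first so the most specific one is reported.
    for label in sorted(OPERATIC_LABELS, key=len, reverse=True):
        if label.casefold() in lowered:
            return label
    return None


print(find_operatic_label("Ein Singspiel in drei Aufzügen"))
print(find_operatic_label("Trauerspiel in fünf Aufzügen"))
```

Candidates returned by such a lookup would still need the manual review before being appended to the 'libretti' list.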

Problem #1: blurred boundaries of the concept of libretto

Corpus enrichment

  1. Retrieve Wikidata genres through the plays' Wikidata IDs (in the TEI markup)
  2. Map genres manually to one of 5 categories (Comedy, Tragedy, Tragicomedy, Libretto, None)
  3. Add the Wikidata genre to those plays for which neither normalized_genre nor libretto was filled
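The fallback logic of these steps might look like the sketch below. The mapping dictionary is a hypothetical excerpt (the real mapping was compiled by hand and is not reproduced in the slides), and the column names 'wikidata_genre' and 'final_genre' are assumptions. The libretto flag is given priority, in line with the preprocessing choice described earlier.

```python
import pandas as pd

# Hypothetical excerpt of a manual Wikidata-label -> category mapping.
WIKIDATA_TO_CATEGORY = {
    "comedy": "Comedy",
    "tragedy": "Tragedy",
    "tragicomedy": "Tragicomedy",
    "opera": "Libretto",
}


def add_final_genre(meta: pd.DataFrame) -> pd.DataFrame:
    """Resolve one genre per play: libretto flag, then DraCor genre, then Wikidata."""
    out = meta.copy()
    is_libretto = out["libretto"].eq(True)
    # Labels missing from the mapping become NaN and fall through to "None".
    from_wikidata = out["wikidata_genre"].map(WIKIDATA_TO_CATEGORY)
    genre = out["normalized_genre"].where(~is_libretto, "Libretto")
    out["final_genre"] = genre.fillna(from_wikidata).fillna("None")
    return out


# Invented toy rows covering each case.
toy = pd.DataFrame(
    {
        "libretto": [True, False, False, False, False],
        "normalized_genre": ["Comedy", "Tragedy", None, None, None],
        "wikidata_genre": ["comedy", None, "opera", "unmapped label", None],
    }
)
resolved = add_final_genre(toy)
print(resolved["final_genre"].tolist())
```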

Problem #2: missing genre indicators

Enrichment results

🇩🇪 +51%

🇫🇷 +55%

3. Experiments

Exploratory data analysis as a methodological choice

  • No strong hypothesis on what the structure of a libretto would look like
  • "Let the data speak for themselves"

A quite simple pipeline

Vectorisation of plays according to structural features

EDA on different textual aspects

Feature selection

  1. Get numeric features from the metadata table
  2. Drop features deemed irrelevant to play structure (e.g. num_p, num_l, num_female_speakers)
  3. Look for highly correlated features and remove one in each highly correlated pair

num_of_segments, num_of_speakers, num_of_person_groups, word_count_sp, word_count_stage, average_degree, density, average_clustering, max_degree, num_connected_components, diameter, average_path_length

A mixture of network measures, size statistics, and speech distribution metrics
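Step 3 of the feature selection (removing one feature from each highly correlated pair) could be sketched as follows. The 0.9 threshold, the function name and the toy columns are assumptions; the slides do not state which cut-off or which member of each pair was dropped.

```python
import numpy as np
import pandas as pd


def drop_correlated(features: pd.DataFrame, threshold: float = 0.9):
    """Drop the later column of every pair whose absolute correlation exceeds threshold."""
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop), to_drop


# Synthetic data: the second column is almost a rescaled copy of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame(
    {
        "feature_a": base,
        "feature_a_scaled": 0.9 * base + rng.normal(scale=0.01, size=200),
        "feature_b": rng.normal(size=200),
    }
)
reduced, dropped = drop_correlated(toy)
print(dropped)
```

This variant always keeps the column that appears first in the table; a different tie-breaking rule would be equally defensible.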

Experiment #1

Recognising clusters

Procedure

  1. Split the corpora into roughly 50-year spans (3 for 🇩🇪, 5 for 🇫🇷) to follow the genre's evolution closely
  2. Apply dimensionality reduction methods (PCA) to the vectorised plays
  3. Results were unsatisfactory: no meaningful clustering, no signs of libretto being a unitary genre
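A minimal sketch of the procedure above, assuming scikit-learn, a 'year_normalized' column for dating, and standardisation of the features before PCA (the slides mention PCA only, so the scaling step is an assumption). The data are random and serve only to show the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def assign_period(years: pd.Series, start: int, span: int = 50) -> pd.Series:
    """Label each year with the first year of its span-wide window."""
    return start + ((years - start) // span) * span


def pca_by_period(plays, feature_cols, start, span=50):
    """Project the plays of each period onto their first two principal components."""
    plays = plays.assign(period=assign_period(plays["year_normalized"], start, span))
    parts = []
    for period, group in plays.groupby("period"):
        scaled = StandardScaler().fit_transform(group[feature_cols])
        coords = PCA(n_components=2).fit_transform(scaled)
        part = pd.DataFrame(coords, columns=["pc1", "pc2"], index=group.index)
        parts.append(part.assign(period=period, genre=group["genre"]))
    return pd.concat(parts)


# Random toy corpus: 40 plays dated 1670-1748, four structural features.
feature_cols = ["word_count_sp", "word_count_stage", "density", "diameter"]
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(40, 4)), columns=feature_cols)
toy["year_normalized"] = 1670 + 2 * np.arange(40)
toy["genre"] = ["Comedy", "Libretto"] * 20
projected = pca_by_period(toy, feature_cols, start=1670)
print(projected.groupby("period").size())
```

Fitting a separate PCA per period, as here, makes the axes of different periods non-comparable; fitting once on the whole corpus and then colouring by period is the alternative.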

Semi-automatic labelling of libretti as comic/non-comic, based on their subtitles (e.g. komische Oper → comic libretto)

Refining categories

Results: clustering still problematic, BUT significant topological patterns emerge

One interesting example: the 🇫🇷 1670-1719 timeframe

[Plot annotations: comic space, tragic zone, non-comic libretti]

Experiment #2

Measuring feature significance

1. Computing the statistical significance of feature variation

2. Training a classifier

  • Random Forest Classifier

  • 5-fold cross validation on all data

  • Iterative selection of the best n_estimators parameter (10-1000)

  • Removed highly correlated features (see correlation matrix)
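The classifier setup above might be sketched like this with scikit-learn. The candidate grid, the accuracy scoring (scikit-learn's default for classifiers) and the function name are assumptions; the slides only give the 10-1000 range for n_estimators. The toy labels depend on the first feature alone, so that feature should dominate the importances.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def select_n_estimators(X, y, candidates=(10, 50, 100), cv=5, seed=0):
    """Pick n_estimators by mean cross-validated accuracy, then refit on all data."""
    scores = {}
    for n in candidates:
        clf = RandomForestClassifier(n_estimators=n, random_state=seed)
        scores[n] = cross_val_score(clf, X, y, cv=cv).mean()
    best = max(scores, key=scores.get)
    model = RandomForestClassifier(n_estimators=best, random_state=seed).fit(X, y)
    return best, scores, model.feature_importances_


# Synthetic data: the class is fully determined by the first column.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = (X[:, 0] > 0).astype(int)
best_n, cv_scores, importances = select_n_estimators(X, y)
print(best_n, importances.round(2))
```

For the real range one would extend `candidates` up to 1000, at the cost of longer run time.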

Correlation matrix

Classifier feature importances

Panels: 🇩🇪, 🇫🇷

Most relevant features 🇫🇷

  1. word_count_stage
  2. word_count_sp
  3. num_connected_components
  4. density
  5. num_of_speakers
  6. diameter

Most relevant features 🇩🇪

  1. word_count_sp
  2. num_of_person_groups
  3. average_degree

Experiment #3

Plotting individual features

Charting the most interpretable features as scatterplots

  • four-class implementation

  • plotting each play individually

  • LOWESS-based smoothing curves to make trends visible
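The smoothing step could be reproduced along these lines, assuming statsmodels' `lowess` function; the `frac` value, the helper name and the synthetic data are illustrative choices, not taken from the slides. Plotting is omitted: the returned arrays can be passed to any plotting library on top of the per-play scatter.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess


def lowess_trend(years, values, frac=0.5):
    """Return (x, smoothed y) for a LOWESS curve, sorted by x."""
    # lowess takes the response first and the predictor second.
    smoothed = lowess(values, years, frac=frac)
    return smoothed[:, 0], smoothed[:, 1]


# Synthetic feature that rises slowly over time, plus noise.
rng = np.random.default_rng(0)
years = np.linspace(1650, 1900, 100)
values = 0.01 * (years - 1650) + rng.normal(scale=0.3, size=100)
trend_x, trend_y = lowess_trend(years, values)
print(trend_y[0] < trend_y[-1])
```

In a four-class setup one such curve would be fitted per genre class, so that the trends can be compared on the same axes.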

4. Findings and discussion

1. Distinctive traits of libretti

Libretti consistently have less spoken text and more stage directions; the trend is more prominent in French, but also visible in German

2. An interesting pattern: independence of non-comic libretti (as far as some structural features are concerned)

🇩🇪: num_of_person_groups / word_count_sp

🇫🇷: density / num_of_speakers

🇩🇪 4-class classifier, confusion matrix: it is easier to confuse comedies and comic libretti

🇫🇷 4-class classifier, confusion matrix: it is easier to confuse comedies and comic libretti

3. The French dramatic space is more formalised than the German one

  • Looking at the PCA clusterings, it seems slightly easier to discriminate between different genres in 🇫🇷
  • Historical reasons:
    • 🇫🇷 corpus starts from an "age of normative aesthetics" (Boileau, d'Aubignac)
    • 🇩🇪 corpus starts from an age where deconstruction of French classical models was underway (Lessing) → more formal freedom?

Even the two types of French libretti are more distinct than the German ones

Limitations

  • Size of the corpora and markup quality
  • Comparative approach: lack of 🇮🇹
  • Difficulties in modelling relations between dramatic texts on the basis of formal features → could we do better?

Comparison: topic modelling (Schöch 2017)

  • Individual structural features might be useful for distinguishing libretti from non-libretti (e.g. text length), or comedies from tragedies (density)

  • However, it is generally not easy to distinguish between plays formalised as vectors of multiple features

  • Drama often seems too homogeneous, in terms of structural properties, for discriminative clustering

  • Need to employ better features or rethink operationalisation patterns

In lieu of a conclusion

Thanks for listening!
