Computational approaches to opera libretti
An experiment on DraCor Corpora
Luca Giovannini
Daniil Skorinkin
Workshop version - 09.02.2023
Opera and libretti
- A new, “artificial” genre born in the early 17th c. in Italy, and rapidly exported across Europe
- Traditionally: focus on music more than words
- Librettology: still an analogue discipline
- Few computational investigations
Research questions
Is it possible to consider libretti a unitary genre with its own structural features?
Do libretti possess a peculiar "genre signal" which sets them apart from contemporary comedies and tragedies?
How did they structurally evolve in comparison to the other genres?
Paper walkthrough 🚀
- Data preparation
- Feature selection
- Data exploration
- Results and discussion
1. Data preparation
Corpus preparation
- Get metadata from GerDraCor 🇩🇪 & FreDraCor 🇫🇷 via the DraCor API (programmable corpora go brrr!)
- Investigate the 'libretto' column:
  - 55 True (explicitly marked libretti) for 🇩🇪
  - 58 True (explicitly marked libretti) for 🇫🇷
- Compare the 'libretto' and 'normalized_genre' columns:
  - For 🇩🇪 'libretto' and 'normalized_genre' were mutually exclusive (= 0 multi-label plays)
  - For 🇫🇷 16 multi-label plays
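The metadata step above could be sketched roughly as follows. This is a minimal sketch, not the authors' code: the base URL and the `/corpora/{name}/metadata/csv` route are assumptions to check against the current DraCor API documentation, the helper names are hypothetical, and the column names `libretto` and `normalized_genre` simply follow the slides.

```python
import csv
import io
import urllib.request

# Assumed base URL and route; verify against the current DraCor API docs.
API_BASE = "https://dracor.org/api/v1"


def metadata_url(corpus):
    """Build the (assumed) CSV metadata URL for a corpus such as 'ger' or 'fre'."""
    return f"{API_BASE}/corpora/{corpus}/metadata/csv"


def parse_metadata(csv_text):
    """Parse CSV text into one dict per play."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def fetch_metadata(corpus):
    """Download and parse a corpus metadata table (needs network access)."""
    with urllib.request.urlopen(metadata_url(corpus)) as response:
        return parse_metadata(response.read().decode("utf-8"))


def is_true(value):
    """Treat 'True'/'true' (any casing) as a set flag."""
    return str(value).strip().lower() == "true"


def count_libretti(rows, column="libretto"):
    """Count plays explicitly flagged as libretti."""
    return sum(is_true(row.get(column, "")) for row in rows)


def multi_label(rows, flag="libretto", genre="normalized_genre"):
    """Plays that are flagged as libretti AND carry a normalized genre."""
    return [
        row for row in rows
        if is_true(row.get(flag, "")) and (row.get(genre) or "").strip()
    ]
```

With network access, `count_libretti(fetch_metadata("ger"))` would give the number of explicitly marked libretti in the German corpus, and `multi_label(...)` the plays carrying both labels.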
Corpus preparation
- Create the new 'libretto or genre' column
- For the 16 multi-label 🇫🇷 plays prefer 'libretto' over 'normalized_genre'
- Initial 'libretto or genre' stats:
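A minimal sketch of how the merged column might be derived, assuming each play is a dict with the `libretto` and `normalized_genre` fields named on the slides. The function name and the output label `"Libretto"` are illustrative choices, not taken from the original notebook.

```python
from collections import Counter


def libretto_or_genre(row, flag="libretto", genre="normalized_genre"):
    """Prefer the libretto flag; fall back to the normalized genre, else None."""
    if str(row.get(flag, "")).strip().lower() == "true":
        return "Libretto"
    value = (row.get(genre) or "").strip()
    return value or None


def label_stats(rows):
    """Count plays per merged label (None = no genre information at all)."""
    return Counter(libretto_or_genre(row) for row in rows)
```

Because the flag is checked first, a multi-label play (flagged as libretto and also carrying a genre) ends up in the libretto class, as the slide prescribes.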
Corpus enrichment
- Retrieve all items with the 'subtitle' containing one of these labels for operatic subgenres: ballet de cour, ballet-héroïque, burlesque, comédie-ballet, divertissement, drame lyrique, entrée, grand opéra, intermède, Lehrstück, Liederspiel, Märchenoper, masque, Monodrama, opéra-ballet, opéra bouffon, opéra comique, opéra-féerie, pantomime, pastorale-héroïque, Posse, Schuldrama, Schuloper, Singspiel, Spieloper, tragédie en musique, vaudeville, Zauberoper, Zeitoper
- Check them manually and append to 'libretti'
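The subtitle search might look like the sketch below. It assumes a case-insensitive substring match on a `subtitle` field; the matching rule and the function name are assumptions, and the label list is the one given on the slide. Substring matching over-generates (for example, "Posse" also matches longer compounds), which is why the manual check remains necessary.

```python
# Labels for operatic subgenres, as listed on the slide.
OPERATIC_LABELS = [
    "ballet de cour", "ballet-héroïque", "burlesque", "comédie-ballet",
    "divertissement", "drame lyrique", "entrée", "grand opéra", "intermède",
    "Lehrstück", "Liederspiel", "Märchenoper", "masque", "Monodrama",
    "opéra-ballet", "opéra bouffon", "opéra comique", "opéra-féerie",
    "pantomime", "pastorale-héroïque", "Posse", "Schuldrama", "Schuloper",
    "Singspiel", "Spieloper", "tragédie en musique", "vaudeville",
    "Zauberoper", "Zeitoper",
]


def operatic_candidates(rows, labels=OPERATIC_LABELS, column="subtitle"):
    """Return (row, matched_labels) pairs for plays whose subtitle hits a label."""
    lowered = [label.lower() for label in labels]
    hits = []
    for row in rows:
        subtitle = (row.get(column) or "").lower()
        matched = [label for label in lowered if label in subtitle]
        if matched:
            hits.append((row, matched))
    return hits
```

The returned candidates are only a shortlist for manual review before anything is appended to the libretti.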
Problem #1: blurred boundaries for the concept of opera
Corpus enrichment
- Retrieve Wikidata genres through the plays' Wikidata IDs
- Map genres manually to one of our 5 categories (Comedy, Tragedy, Tragicomedy, Libretto, None)
- Add the Wikidata genre to those plays which had neither 'normalized_genre' nor 'libretto' filled
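One possible shape for this enrichment step is sketched below. The SPARQL query uses Wikidata's genre property (P136) and would be sent to the Wikidata query service; the tiny mapping table is purely illustrative (the actual mapping was done manually and is not reproduced in the slides), and the function and field names are hypothetical.

```python
# Illustrative mapping only; the real manual mapping is not given in the slides.
ILLUSTRATIVE_MAP = {
    "comedy": "Comedy",
    "tragedy": "Tragedy",
    "tragicomedy": "Tragicomedy",
    "opera": "Libretto",
}


def genre_query(wikidata_ids):
    """Build a SPARQL query asking for the genre (P136) of each given item."""
    values = " ".join(f"wd:{qid}" for qid in wikidata_ids)
    return (
        "SELECT ?play ?genreLabel WHERE { "
        f"VALUES ?play {{ {values} }} "
        "?play wdt:P136 ?genre . "
        'SERVICE wikibase:label { bd:serviceParam wikibase:language "en". } '
        "}"
    )


def fill_missing_genre(row, wikidata_genres, mapping=ILLUSTRATIVE_MAP,
                       column="libretto or genre"):
    """Keep an existing label; otherwise use the first mappable Wikidata genre."""
    existing = row.get(column)
    if existing:
        return existing
    for label in wikidata_genres:
        mapped = mapping.get(label.lower())
        if mapped:
            return mapped
    return "None"  # the fifth category: no usable genre information
```

Existing labels always win, so Wikidata only fills plays that had neither a normalized genre nor a libretto flag.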
Problem #2: missing genres
Enrichment results
2. Feature selection
Feature selection
- Get numeric features from the metadata table
- Drop features deemed irrelevant to play structure (e.g. num_p, num_l, num_female_speakers)
- Look for highly correlated features (cf. Szemes & Vida 2023)
- Remove one in each highly correlated pair
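The correlation filter can be sketched as follows, assuming Pearson correlation and the |r| > 0.75 threshold shown on the correlation-network slide. Which member of a pair is dropped is not stated in the slides; dropping the later-listed feature is an arbitrary rule adopted here for illustration only.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant feature has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(features, threshold=0.75):
    """features: dict name -> list of values. Returns (kept, dropped) name lists.

    For each pair with |r| above the threshold, the later-listed feature is
    dropped (an arbitrary tie-breaking rule, assumed for this sketch).
    """
    dropped = []
    for a, b in combinations(list(features), 2):
        if a in dropped or b in dropped:
            continue
        if abs(pearson(features[a], features[b])) > threshold:
            dropped.append(b)
    kept = [name for name in features if name not in dropped]
    return kept, dropped
```

For instance, if `max_degree` rises in lockstep with `num_of_speakers`, only the latter would be kept.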
Correlation matrix
Correlation network
edge = high correlation (>0.75 or <-0.75)
dropped in 🇩🇪: average_path_length, diameter, max_degree, num_connected_components
dropped in 🇫🇷: num_of_segments, average_path_length, max_degree
3. Data exploration
Exploration attempt #1
- Split the corpora into roughly 50-year spans (3 for 🇩🇪, 5 for 🇫🇷) to follow closely the genre's evolution
- Apply dimensionality reduction methods (PCA, UMAP, t-SNE) to our "bag-of-features" plays
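As an illustration of the dimensionality-reduction step, here is a minimal PCA sketch with NumPy, assuming each play is a row of numeric features that are standardised before projection. The slides do not say which implementation was used; UMAP and t-SNE would need additional packages and are not shown.

```python
import numpy as np


def pca_2d(feature_matrix, n_components=2):
    """Project standardised feature vectors onto their first principal components.

    Returns (coords, explained_ratio): one coordinate row per play, and the
    share of total variance captured by each retained component.
    """
    X = np.asarray(feature_matrix, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns at zero instead of dividing by 0
    Z = (X - X.mean(axis=0)) / std
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    coords = Z @ Vt[:n_components].T
    explained = (S ** 2) / np.sum(S ** 2)
    return coords, explained[:n_components]
```

Plotting `coords` coloured by genre label for one time span at a time gives the kind of scatter plot discussed in this section.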
Distribution across timeframes
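A small sketch of how plays could be assigned to time spans and counted per genre, using the German span boundaries that appear on the PCA slides. The field names `year` and `label` are hypothetical placeholders for whatever the metadata table calls them.

```python
from bisect import bisect_right
from collections import Counter

# German span boundaries as listed on the PCA slides (the last span is longer).
GER_SPANS = [(1770, 1819), (1820, 1869), (1870, 1921)]


def span_for(year, spans):
    """Return the 'start-end' label of the span containing year, or None."""
    starts = [start for start, _ in spans]
    i = bisect_right(starts, year) - 1
    if i >= 0 and year <= spans[i][1]:
        return f"{spans[i][0]}-{spans[i][1]}"
    return None


def distribution(rows, spans, year_key="year", label_key="label"):
    """Count plays per (span, genre label); plays outside all spans are skipped."""
    counts = Counter()
    for row in rows:
        span = span_for(row[year_key], spans)
        if span is not None:
            counts[(span, row[label_key])] += 1
    return counts
```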
None of the dimensionality reduction methods worked well... with the possible exception of the 🇫🇷 1670-1719 segment
Rethinking libretti
- We could not see any progressive "genrification" of libretti... at least with the collection of features we used
- Let's take into account their generic alignment!
Exploration attempt #2
- Split the corpora into 50-year spans
- 🆕 Differentiate between comic and non-comic libretti using their subtitles (e.g. komische Oper)
- Apply dimensionality reduction methods (PCA, UMAP, t-SNE) to timeframes
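The comic / non-comic split could be approximated with a subtitle heuristic like the one below. The marker list is an illustrative assumption (it covers "komische Oper" and "opéra comique" only); the actual criteria used for the slides are not spelled out.

```python
# Illustrative markers only; the real criteria are not given in the slides.
COMIC_MARKERS = ("komisch", "comique")


def libretto_subclass(subtitle, markers=COMIC_MARKERS):
    """Label a libretto as comic if its subtitle contains any comic marker."""
    text = (subtitle or "").lower()
    if any(marker in text for marker in markers):
        return "comic libretto"
    return "non-comic libretto"
```

Applied to the plays already labelled as libretti, this yields the two subclasses used in the plots for each time span.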
This is how the 🇫🇷 1670-1719 timeframe looks with the two libretti subclasses:
Plot regions: comic space, tragic zone, non-comic libretti, autonomous region
PCA 🇩🇪 timespans
1770-1819
1820-1869
1870-1921
PCA 🇫🇷 timespans
1620-1669
1670-1719
1720-1769
1770-1819
1820-1889
Exploring features' significance #1
- binary: libretti vs non-libretti (tragicomedies removed)
- Statistical significance tests (Shapiro-Wilk, Wilcoxon)
- random forest classifier
Statistical testing
- Feature-wise comparison of libretti against non-libretti with Wilcoxon Rank Sum test
- Resulting p-values for each feature distribution:
Boxplot distributions for 🇫🇷 word_count_stage and word_count_sp
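For reference, the rank-sum comparison can be written out in a few lines. This is a standard-library sketch of the Wilcoxon rank-sum test with the normal approximation (no tie or continuity correction), not the authors' code; in practice a statistics library such as SciPy (`scipy.stats.ranksums`) would normally be used.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test; returns (z statistic, p-value)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                      # rank sum of the first sample
    expected = n1 * (n1 + n2 + 1) / 2        # its mean under the null
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - expected) / sd
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided normal tail probability
    return z, p
```

Calling `rank_sum_test(libretti_values, other_values)` once per feature gives one p-value per feature distribution.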
Classifier training
- Random Forest Classifier
- 5-fold cross validation on all data
- Iterative selection of the best n_estimators parameter (10-1000)
- Looking at feature importances
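The training loop might look like the following scikit-learn sketch. It runs on synthetic data with made-up feature names, and scans only three candidate values of `n_estimators` to stay fast, whereas the slides mention the range 10-1000; the scoring metric (default accuracy) is also an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def best_n_estimators(X, y, candidates=(10, 50, 100), folds=5, seed=0):
    """Pick the forest size with the best mean cross-validated accuracy."""
    scores = {}
    for n in candidates:
        clf = RandomForestClassifier(n_estimators=n, random_state=seed)
        scores[n] = cross_val_score(clf, X, y, cv=folds).mean()
    best = max(scores, key=scores.get)
    return best, scores


def ranked_importances(X, y, names, n_estimators, seed=0):
    """Fit on all data and return (feature, importance) pairs, largest first."""
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    clf.fit(X, y)
    return sorted(zip(names, clf.feature_importances_),
                  key=lambda pair: pair[1], reverse=True)


# Synthetic stand-in for the play-feature table (NOT the DraCor data).
rng = np.random.default_rng(0)
y = np.array([0] * 60 + [1] * 60)          # 0 = non-libretto, 1 = libretto
X = rng.normal(size=(120, 3))
X[:, 0] += 3 * y                            # only the first feature is informative
names = ["feature_a", "feature_b", "feature_c"]  # hypothetical feature names

best, scores = best_n_estimators(X, y)
ranking = ranked_importances(X, y, names, best)
```

On the real tables, `names` would be the retained structural features and `ranking` the basis for the feature-importance plots.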
Classifier feature importances
🇩🇪
🇫🇷
Most relevant features, 🇫🇷
word_count_stage
word_count_sp
num_connected_components
density
num_of_speakers
diameter
Most relevant features, 🇩🇪
word_count_sp
num_of_person_groups
average_degree
Exploring features' significance #2
- four-class implementation
- plotting each play individually
- LOWESS-based smoothing curves to make trends visible
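To make the smoothing step concrete, here is a simplified LOWESS-style smoother in plain Python: a local linear fit with tricube weights over the nearest fraction of points, without the robustness iterations of full LOWESS. It is a sketch of the technique, not the implementation behind the plots; libraries such as statsmodels ship a complete version.

```python
def tricube(u):
    """Tricube kernel: weight 1 at distance 0, falling to 0 at distance 1."""
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def lowess_like(xs, ys, frac=0.5):
    """Smooth ys over xs with locally weighted linear regression.

    frac is the share of points used in each local fit. No robustness
    iterations are performed (a simplification of full LOWESS).
    """
    n = len(xs)
    k = min(n, max(3, int(round(frac * n))))
    smoothed = []
    for x0 in xs:
        # Bandwidth = distance to the k-th nearest neighbour of x0.
        bandwidth = sorted(abs(x - x0) for x in xs)[k - 1]
        if bandwidth > 0:
            weights = [tricube(abs(x - x0) / bandwidth) for x in xs]
        else:
            weights = [1.0 if x == x0 else 0.0 for x in xs]
        sw = sum(weights)
        mx = sum(w * x for w, x in zip(weights, xs)) / sw
        my = sum(w * y for w, y in zip(weights, ys)) / sw
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
        sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(my + slope * (x0 - mx))
    return smoothed
```

With `xs` as publication years and `ys` as one feature (e.g. spoken word counts) for one genre class, the returned values trace the trend line drawn over the individual plays.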
4. Results and discussion
1. Empirical verification of literary criticism
Libretti have less spoken text and more stage directions
trend more prominent in French, but visible also in German
2. An interesting pattern: independence of non-comic libretti (as far as some structural features are concerned)
🇩🇪: num_groups / word_count_sp
🇫🇷: density / num_speakers
🇩🇪 4-class classifier, confusion matrix
it is easier to confuse comedies and comic libretti
🇫🇷 4-class classifier, confusion matrix
it is easier to confuse comedies and comic libretti
3. The French dramatic space is more formalised than the German one
- Looking at the PCA clusterings, it seems easier to discriminate between different genres in 🇫🇷
- Historical reasons (due to text availability): the two types of French libretti are more distinct than the German ones
(some) limitations
- Corpora extension and markup quality
- Comparative approach: lack of 🇮🇹
- Difficulties in modelling relations between texts on the basis of formal features → need for better operationalisation and implementation?
Closing reflections 🤔
- Individual structural features might be useful for distinguishing libretti from non-libretti (e.g. text length), or comedies from tragedies (density)
- However, it is not easy to distinguish between plays formalised as vectors of multiple features
- Drama seems too homogeneous, in terms of structural properties, for discriminative clustering
- Topic modelling actually seems to work better in distinguishing genres, as per Shaw's famous quote
Thanks for listening!