Computational approaches to opera libretti
An experiment on DraCor Corpora
Luca Giovannini
Daniil Skorinkin
![](https://s3.amazonaws.com/media-p.slid.es/uploads/2140904/images/10183902/pasted-from-clipboard.png)
Workshop version - 09.02.2023
![](https://s3.amazonaws.com/media-p.slid.es/uploads/2140904/images/10184786/pasted-from-clipboard.png)
Opera and libretti
- A new, “artificial” genre born in Italy in the early 17th c. and rapidly exported across Europe
- Traditionally: focus on music more than words
- Librettology: still an analogue discipline
- Few computational investigations
Research questions
Is it possible to consider libretti a unitary genre with its own structural features?
Do libretti possess a peculiar "genre signal" which sets them apart from contemporary comedies and tragedies?
How did they structurally evolve in comparison to the other genres?
Paper walkthrough 🚀
- Data preparation
- Feature selection
- Data exploration
- Results and discussion
1. Data preparation
Corpus preparation
- Get metadata from GerDraCor 🇩🇪 & FreDraCor 🇫🇷 via the DraCor API (programmable corpora go brrr!)
- Investigate the 'libretto' column:
  - 55 True (explicitly marked libretti) for 🇩🇪
  - 58 True (explicitly marked libretti) for 🇫🇷
- Compare the 'libretto' and 'normalized_genre' columns:
  - For 🇩🇪, 'libretto' and 'normalized_genre' were mutually exclusive (= 0 multi-label plays)
  - For 🇫🇷, 16 multi-label plays
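
The column comparison above can be sketched with pandas. This is a minimal illustration, assuming the metadata table is already loaded into a DataFrame; the four rows and the `name` values are invented, and only the column names 'libretto' and 'normalized_genre' come from the slides.

```python
import pandas as pd

# Toy stand-in for a DraCor metadata table; the rows are invented for illustration.
meta = pd.DataFrame(
    {
        "name": ["play-a", "play-b", "play-c", "play-d"],
        "libretto": [True, False, True, False],
        "normalized_genre": [None, "Comedy", "Tragedy", None],
    }
)

# How many plays are explicitly marked as libretti?
n_libretti = int(meta["libretto"].sum())

# A play counts as "multi-label" when it is flagged as a libretto
# AND also carries a normalized genre.
multi_label = meta[meta["libretto"] & meta["normalized_genre"].notna()]

print(n_libretti, list(multi_label["name"]))
```

On the real tables, an empty `multi_label` frame would correspond to the 🇩🇪 case and a 16-row frame to the 🇫🇷 case.
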
Corpus preparation
- Create the new 'libretto or genre' column
  - For the 16 multi-label 🇫🇷 plays, prefer 'libretto' over 'normalized_genre'
- Initial 'libretto or genre' stats:
![](https://s3.amazonaws.com/media-p.slid.es/uploads/641147/images/10191645/RatiosInitial.jpg)
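
One possible way to build the merged column, with 'libretto' taking precedence over 'normalized_genre'. This is a sketch on invented rows; the label string "Libretto" is an assumption about how the category is spelled in the merged column.

```python
import pandas as pd

# Invented rows: the first one is a multi-label play (libretto AND comedy).
meta = pd.DataFrame(
    {
        "libretto": [True, False, True, False],
        "normalized_genre": ["Comedy", "Comedy", None, None],
    }
)

# Start from the normalized genre, then overwrite with "Libretto"
# wherever the libretto flag is set (libretto wins for multi-label plays).
meta["libretto or genre"] = meta["normalized_genre"].mask(meta["libretto"], "Libretto")

# Share of each category, keeping plays without any label visible.
shares = meta["libretto or genre"].value_counts(normalize=True, dropna=False)
print(shares)
```
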
Corpus enrichment
- Retrieve all items whose 'subtitle' contains one of these labels for operatic subgenres: ballet de cour, ballet-héroïque, burlesque, comédie-ballet, divertissement, drame lyrique, entrée, grand opéra, intermède, Lehrstück, Liederspiel, Märchenoper, masque, Monodrama, opéra-ballet, opéra bouffon, opéra comique, opéra-féerie, pantomime, pastorale-héroïque, Posse, Schuldrama, Schuloper, Singspiel, Spieloper, tragédie en musique, vaudeville, Zauberoper, Zeitoper
- Check them manually and append to 'libretti'
Problem #1: blurred boundaries for the concept of opera
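
The subtitle lookup can be sketched with a single case-insensitive regular expression. The function name `candidate_libretti` and the example subtitles are hypothetical, and only a handful of the labels is included to keep the example short.

```python
import re

# A few of the operatic labels; extend with the full list as needed.
OPERATIC_LABELS = ["opéra comique", "tragédie en musique", "Singspiel", "Zauberoper", "comédie-ballet"]

# One case-insensitive pattern; re.escape keeps hyphens and spaces literal.
LABEL_RE = re.compile("|".join(re.escape(label) for label in OPERATIC_LABELS), re.IGNORECASE)

def candidate_libretti(plays):
    """Return (play id, matched label) pairs for plays whose subtitle mentions a label."""
    hits = []
    for play_id, subtitle in plays:
        match = LABEL_RE.search(subtitle or "")  # tolerate missing subtitles
        if match:
            hits.append((play_id, match.group(0)))
    return hits

# Invented examples
plays = [
    ("p1", "Ein Singspiel in drei Akten"),
    ("p2", "Trauerspiel"),
    ("p3", None),
    ("p4", "Opéra comique en un acte"),
]
print(candidate_libretti(plays))
```

Substring matching of this kind only produces candidates: short labels such as "Posse" or "burlesque" can also appear in subtitles of spoken plays, which is why the manual check remains necessary.
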
Corpus enrichment
- Retrieve Wikidata Genres through the plays' Wikidata IDs
- Map genres manually to one of our 5 categories (Comedy, Tragedy, Tragicomedy, Libretto, None)
- Add the Wikidata genre to those plays which had neither normalized_genre nor libretto filled
Problem #2: missing genres
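
A sketch of the gap-filling step, assuming the Wikidata genre labels have already been retrieved into a column. The column names `wikidata_genre` and `enriched_genre` and the mapping entries are hypothetical; the actual manual mapping is not reproduced in the slides.

```python
import pandas as pd

# Hypothetical excerpt of a manual mapping from Wikidata genre labels to our categories.
WIKIDATA_TO_CATEGORY = {
    "opera": "Libretto",
    "operetta": "Libretto",
    "comedy": "Comedy",
    "tragedy": "Tragedy",
    "tragicomedy": "Tragicomedy",
}

# Invented rows
meta = pd.DataFrame(
    {
        "libretto or genre": ["Comedy", None, None, None],
        "wikidata_genre": ["tragedy", "operetta", "farce", None],
    }
)

# Unmapped or missing Wikidata labels become NaN and therefore fill nothing.
mapped = meta["wikidata_genre"].map(WIKIDATA_TO_CATEGORY)

# Existing labels are kept; Wikidata is consulted only where the label is missing.
meta["enriched_genre"] = meta["libretto or genre"].fillna(mapped)
print(meta)
```
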
Enrichment results
![](https://s3.amazonaws.com/media-p.slid.es/uploads/641147/images/10192302/RatiosEnriched.png)
2. Feature selection
Feature selection
- Get numeric features from the metadata table
- Drop features deemed irrelevant to play structure (e.g. num_p, num_l, num_female_speakers)
- Look for highly correlated features (cf. Szemes & Vida 2023)
- Remove one feature from each highly correlated pair
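
The correlation filter can be sketched as follows. This is a common greedy recipe, not necessarily the exact procedure used here: for every pair above the threshold it flags the later column. The helper name `drop_correlated` and the toy data (including which features correlate) are invented; the 0.75 default mirrors the threshold used for the correlation network.

```python
import numpy as np
import pandas as pd

def drop_correlated(features: pd.DataFrame, threshold: float = 0.75) -> list:
    """Return column names to drop so that no remaining pair has |r| > threshold."""
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # A column is flagged if it correlates strongly with any earlier column.
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Invented data: the second column is almost a rescaled copy of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame(
    {
        "num_of_speakers": base,
        "max_degree": 2 * base + rng.normal(scale=0.1, size=200),
        "density": rng.normal(size=200),
    }
)
to_drop = drop_correlated(toy)
print(to_drop)
```

Because the rule is order-dependent, the column order decides which member of a pair survives; choosing by interpretability instead is equally defensible.
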
Correlation matrix
![](https://s3.amazonaws.com/media-p.slid.es/uploads/641147/images/10192575/cormat.png)
Correlation network
![](https://s3.amazonaws.com/media-p.slid.es/uploads/641147/images/10192578/german_correlated_2.png)
edge = high correlation (>0.75 or <-0.75)
dropped in 🇩🇪
average_path_length
diameter
max_degree
num_connected_components
dropped in 🇫🇷
num_of_segments
average_path_length
max_degree
3. Data exploration
Exploration attempt #1
- Split the corpora into roughly 50-year spans (3 for 🇩🇪, 5 for 🇫🇷) to follow closely the genre's evolution
- Apply dimensionality reduction methods (PCA, UMAP, t-SNE) to our "bag-of-features" plays
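
A minimal sketch of the two steps, using NumPy only. `span_label` assumes strictly fixed-width spans, which simplifies the "roughly 50-year" division used here, and `pca_2d` is a plain SVD-based PCA on standardised features; both function names are hypothetical, and UMAP and t-SNE are not covered.

```python
import numpy as np

def span_label(year: int, start: int, width: int = 50) -> str:
    """Label the fixed-width span containing `year`, counting spans from `start`."""
    lower = start + ((year - start) // width) * width
    return f"{lower}-{lower + width - 1}"

def pca_2d(X: np.ndarray) -> np.ndarray:
    """Project the rows of X (plays x features) onto the first two principal components."""
    # Standardise so that large-scale features (e.g. word counts) do not dominate.
    # Assumes no feature is constant, otherwise the division fails.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal axes, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:2].T

print(span_label(1685, start=1670))

# Invented feature matrix: 30 plays, 4 numeric features.
rng = np.random.default_rng(1)
coords = pca_2d(rng.normal(size=(30, 4)))
print(coords.shape)
```
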
Distribution across timeframes
![](https://s3.amazonaws.com/media-p.slid.es/uploads/641147/images/10198573/Screenshot_2023-02-06_at_16.03.35.png)
None of the dimensionality reduction methods worked well... with the possible exception of the 🇫🇷 1670-1719 segment
![](https://s3.amazonaws.com/media-p.slid.es/uploads/641147/images/10198126/french1670_librettounified.png)
Rethinking libretti
- We could not see any progressive "genrification" of libretti... at least with the collection of features we used
- Let's take into account their generic alignment!
Exploration attempt #2
- Split the corpora into 50-year spans
- 🆕 Differentiate between comic and non-comic libretti using their subtitles (e.g. komische Oper)
- Apply dimensionality reduction methods (PCA, UMAP, t-SNE) to timeframes
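
The subtitle-based split could look like the sketch below. The marker substrings are assumptions chosen for illustration, not the actual list used for the paper, and `libretto_subclass` is a hypothetical helper.

```python
# Assumed marker substrings signalling a comic libretto (illustrative, not exhaustive).
COMIC_MARKERS = ("komisch", "comique", "bouffon")

def libretto_subclass(subtitle):
    """Classify a libretto as comic or non-comic from its subtitle alone."""
    text = (subtitle or "").lower()  # missing subtitles fall through to non-comic
    if any(marker in text for marker in COMIC_MARKERS):
        return "comic libretto"
    return "non-comic libretto"

print(libretto_subclass("Komische Oper in zwei Akten"))
print(libretto_subclass("Tragédie en musique"))
```

Treating every libretto without a comic marker as non-comic is a deliberate simplification; libretti with uninformative subtitles end up in the non-comic class by default.
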
This is how the 🇫🇷 1670-1719 timeframe looks with the two libretti subclasses:
![](https://s3.amazonaws.com/media-p.slid.es/uploads/641147/images/10198261/pcafrench_data_1670_1719.png)
comic space
tragic zone
non-comic libretti
autonomous region
PCA 🇩🇪 timespans
1770-1819
1820-1869
1870-1921
![](https://s3.amazonaws.com/media-p.slid.es/uploads/641147/images/10198267/pca-german.png)
PCA 🇫🇷 timespans
![](https://s3.amazonaws.com/media-p.slid.es/uploads/641147/images/10198268/pca-french.png)
1620-1669
1670-1719
1720-1769
1770-1819
1820-1889
Exploring features' significance #1
- Binary setup: libretti vs non-libretti (tragicomedies removed)
- Statistical significance tests (Shapiro-Wilk, Wilcoxon)
- Random forest classifier
Statistical testing
- Feature-wise comparison of libretti against non-libretti with Wilcoxon Rank Sum test
- Resulting p-values for each feature distribution:
![](https://s3.amazonaws.com/media-p.slid.es/uploads/641147/images/10198587/Screenshot_2023-02-06_at_16.04.07.png)
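
For a single feature, the testing step can be sketched with SciPy's `shapiro` and `ranksums`. The two samples below are randomly generated stand-ins for one feature (for instance a word count) in the two groups, not corpus data.

```python
import numpy as np
from scipy.stats import ranksums, shapiro

rng = np.random.default_rng(42)
# Invented, right-skewed samples standing in for one feature in each group.
libretti = rng.lognormal(mean=8.5, sigma=0.4, size=60)
non_libretti = rng.lognormal(mean=9.6, sigma=0.4, size=300)

# Shapiro-Wilk tests normality; a small p-value argues for a rank-based comparison.
_, p_normal = shapiro(non_libretti)

# Wilcoxon rank-sum test: are the two samples drawn from the same distribution?
statistic, p_value = ranksums(libretti, non_libretti)
print(p_normal, statistic, p_value)
```

Running one test per feature produces a table of p-values like the one shown above; with many features, a multiple-comparison correction may be worth considering.
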
Boxplot distributions for 🇫🇷 word_count_stage and word_count_sp
![](https://s3.amazonaws.com/media-p.slid.es/uploads/641147/images/10198345/french_word_count_stage_no_outliers.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/641147/images/10213356/french_word_count_sp.png)
Classifier training
- Random Forest Classifier
- 5-fold cross validation on all data
- Iterative selection of the best n_estimators parameter (10-1000)
- Looking at feature importances
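
A compact sketch of this training loop with scikit-learn. The data are synthetic (only the first feature carries class signal), and just three forest sizes are tried to keep the run short, whereas the range explored here was 10-1000.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Invented data: 200 plays, 4 features, binary label (libretto vs non-libretto).
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 4))
X[:, 0] += 2.0 * y  # only feature 0 separates the classes

# Compare forest sizes by mean accuracy over 5-fold cross-validation.
scores = {}
for n in (10, 100, 300):
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    scores[n] = cross_val_score(clf, X, y, cv=5).mean()
best_n = max(scores, key=scores.get)

# Refit on all data and read off the impurity-based feature importances.
forest = RandomForestClassifier(n_estimators=best_n, random_state=0).fit(X, y)
importances = forest.feature_importances_
print(best_n, importances.round(3))
```

Impurity-based importances sum to 1 and can be biased when features are correlated, which is one more reason to prune correlated features beforehand.
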
Classifier feature importances
![](https://s3.amazonaws.com/media-p.slid.es/uploads/641147/images/10198613/rfo_big.png)
🇩🇪
🇫🇷
Most relevant features, 🇫🇷
word_count_stage
word_count_sp
num_connected_components
density
num_of_speakers
diameter
Most relevant features, 🇩🇪
word_count_sp
num_of_person_groups
average_degree
Exploring features' significance #2
- Four-class implementation
- Plotting each play individually
- LOWESS-based smoothing curves to make trends visible
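
To show what a LOWESS curve computes, here is a simplified NumPy sketch: a local linear fit with tricube weights around each point, without the robustness iterations of full LOWESS. The function `lowess_simple` is a hypothetical illustration; an existing implementation such as the one in statsmodels would be the more usual choice for real plots.

```python
import numpy as np

def lowess_simple(x, y, frac=0.5):
    """Locally weighted linear regression with tricube weights (no robustness iterations)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Number of neighbours used in each local fit.
    k = min(len(x), max(3, int(np.ceil(frac * len(x)))))
    fitted = np.empty_like(y)
    for i, x0 in enumerate(x):
        dist = np.abs(x - x0)
        idx = np.argsort(dist)[:k]
        h = dist[idx].max() or 1.0            # bandwidth: distance to the farthest neighbour
        w = (1 - (dist[idx] / h) ** 3) ** 3   # tricube kernel
        # Weighted least squares for a line centred on x0, so the intercept is the fit at x0.
        A = np.column_stack([np.ones(len(idx)), x[idx] - x0])
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(A * sw[:, None], y[idx] * sw, rcond=None)
        fitted[i] = coef[0]
    return fitted

# Invented example: a noiseless linear trend is reproduced by the smoother.
years = np.arange(20.0)
values = 3 * years + 2
smoothed = lowess_simple(years, values)
print(np.allclose(smoothed, values))
```
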
4. Results and discussion
![](https://s3.amazonaws.com/media-p.slid.es/uploads/2140904/images/10208762/sc-fre-wordcounts.png)
1. Empirical verification of literary criticism
Libretti have less spoken text and more stage directions
trend more prominent in French, but visible also in German
2. An interesting pattern: independence of non-comic libretti (as far as some structural features are concerned)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/2140904/images/10201021/sc-fre-density-speakers.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/2140904/images/10201023/sc-ger-wcsp-groups.png)
🇩🇪 num_groups / word_count_sp
🇫🇷 density / num_speakers
🇩🇪 4-class classifier, confusion matrix
![](https://s3.amazonaws.com/media-p.slid.es/uploads/641147/images/10198634/confusion_matrix_ger.png)
it is easier to confuse comedies and comic libretti
🇫🇷 4-class classifier, confusion matrix
![](https://s3.amazonaws.com/media-p.slid.es/uploads/641147/images/10198636/confusion_matrix_fre.png)
it is easier to confuse comedies and comic libretti
3. The French dramatic space is more formalised than the German one
- Looking at the PCA clusterings, it seems easier to discriminate between different genres in 🇫🇷
- Historical reasons (due to text availability):
![](https://s3.amazonaws.com/media-p.slid.es/uploads/641147/images/10198637/PCA_both_corpora_libretti_only_with_centroids_properly_colored_.png)
the two types of French libretti are more distinct than the German ones
(some) limitations
- Corpora extension and markup quality
- Comparative approach: lack of 🇮🇹
- Difficulties in modelling relations between texts on the basis of formal features → need for better operationalisation and implementation?
Closing reflections 🤔
- Individual structural features might be useful for distinguishing libretti from non-libretti (e.g. text length), or comedies from tragedies (density)
- However, it is not easy to distinguish between plays formalised as vectors of multiple features
- Drama seems too homogeneous, in terms of structural properties, for discriminative clustering
- Topic modelling actually seems to work better at distinguishing genres, as per Shaw's famous quote
Thanks for listening!