Music Mood Classification Using the Million Song Dataset
Bhavika Tekwani
Problem
Given audio features for a song, can we predict what mood the song represents?
Do audio features help with mood identification?
Motivation
- Indexing
- Metadata generation
- Predicting success ("Hit Song Science")
- Recommender Systems
Data
Million Song Dataset | Spotify API |
---|---|
Artist, Song title | Speechiness |
Duration | Energy |
Loudness | Acousticness |
Key, Mode, Time Signature | Instrumentalness |
Tempo | Danceability |
Segments Pitches (Chroma features, 2D) | |
Segments Timbre (MFCC + PCA, 2D) | |
Beats | |
Hand-labelled 7,396 songs as 'happy' or 'sad'. The train/test split is 60/40.
Imputing missing values
- All songs in the Million Song Subset (10,000 songs) had 0 for Energy and Danceability i.e., they had not been analysed.
- Used Spotify's Web API to fetch Danceability, Energy, Acousticness, Instrumentalness and Speechiness metrics.
- If a song from the dataset was not on Spotify, I imputed the mean of the feature as the missing value.
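A minimal sketch of the mean imputation described above, assuming that songs missing from Spotify are marked with NaN in a feature matrix. The function name `impute_column_means` and the toy values are hypothetical, not taken from the original project.

```python
import numpy as np

def impute_column_means(X):
    """Replace NaN entries with the mean of the observed values in the same column."""
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)          # per-feature mean, ignoring NaN
    return np.where(np.isnan(X), col_means, X)

# Toy data (assumed): columns are danceability and energy,
# NaN marks a song that was not found on Spotify.
features = np.array([
    [0.8, 0.9],
    [np.nan, 0.5],
    [0.4, np.nan],
])
imputed = impute_column_means(features)
```

Mean imputation keeps every song in the dataset, at the cost of pulling the imputed songs towards the centre of each feature's distribution.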
Understanding the data
Low Level Segment Features
Timbre
Pitch
Descriptive Features
Speechiness, Danceability, Tempo, Loudness, Energy, Acousticness, Instrumentalness
Notational Features
Key, Mode, Time Signature
Feature Engineering
- Square loudness (dB) for interpretability
- Standardize energy, tempo and loudness (mean = 0, variance = 1)
- Segment aggregation: Convert segment level 2D information to track level 1D feature
- Key * Mode, Tempo * Mode to capture multiplicative interaction
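The scaling and interaction steps above could look roughly like this. It is a sketch only: the helper names `standardize` and `add_interactions` and the sample values are assumptions, and mode is taken to be encoded as 0 (minor) or 1 (major).

```python
import numpy as np

def standardize(x):
    """Rescale a feature to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def add_interactions(key, mode, tempo):
    """Return the Key * Mode and Tempo * Mode interaction features."""
    key, mode, tempo = (np.asarray(a, dtype=float) for a in (key, mode, tempo))
    return key * mode, tempo * mode

# Toy values (assumed) for three tracks.
tempo = np.array([90.0, 120.0, 150.0])
tempo_z = standardize(tempo)
key_x_mode, tempo_x_mode = add_interactions([0, 5, 7], [1, 0, 1], tempo)
```

With a 0/1 mode, each product is zero for minor-mode tracks and equal to the key or tempo for major-mode tracks, so the interaction lets a model treat key and tempo differently depending on mode.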
Segment Aggregation
- A segment is 0.3 seconds long. Each segment has a pitch and timbre.
- Pitch: 2D array of Chroma features. The shape varies from (100, 12) to (1600, 12).
- Timbre: 2D array of MFCC features. Shape varies from (100, 12) to (1600, 12).
- Mel Frequency Cepstral Coefficients (MFCCs) capture the logarithmic perception of loudness and pitch as heard by a human.
- Aggregation: Calculate the min, max, kurtosis, mean, standard deviation and variance of each segment, then average each statistic over all segments
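One possible reading of the aggregation step, sketched with NumPy: compute each statistic across the 12 values of every segment, then average that statistic over all segments of the track. The function name `aggregate_segments`, the use of excess (Fisher) kurtosis, and the random input are assumptions; the original implementation may differ.

```python
import numpy as np

def aggregate_segments(segments):
    """Collapse an (n_segments, 12) array into six track-level features."""
    seg = np.asarray(segments, dtype=float)
    mean = seg.mean(axis=1)
    var = seg.var(axis=1)
    fourth_moment = ((seg - mean[:, None]) ** 4).mean(axis=1)
    # Excess kurtosis per segment; defined as 0 where a segment is constant.
    safe_var = np.where(var > 0, var, 1.0)
    kurtosis = np.where(var > 0, fourth_moment / safe_var ** 2 - 3.0, 0.0)
    per_segment = {
        "min": seg.min(axis=1),
        "max": seg.max(axis=1),
        "kurtosis": kurtosis,
        "mean": mean,
        "std": np.sqrt(var),
        "var": var,
    }
    # Average every per-segment statistic over the whole track.
    return {name: float(values.mean()) for name, values in per_segment.items()}

# Stand-in for one track's chroma array (random values, not real data).
rng = np.random.default_rng(0)
pitches = rng.random((100, 12))
track_features = aggregate_segments(pitches)
```

Applied to both the pitch and timbre arrays, this turns two variable-length 2D arrays into a fixed number of scalars per track, which is what a standard classifier needs as input.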
Feature Selection
- Recursive Feature Elimination with a Random Forest Classifier and 5-fold cross-validation
- Used 25 of a possible 52 features
- Compared feature importance by model
- Most important descriptive feature was Danceability, followed by Energy, Speechiness and Beats
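A sketch of recursive feature elimination with a random forest and 5-fold cross-validation, using scikit-learn's `RFECV` on synthetic data. The slides do not say whether the number of retained features was fixed in advance or chosen by cross-validation, so letting `RFECV` choose it is an assumption, as are the dataset and hyperparameters.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in for the track-level feature matrix (not the real data).
X, y = make_classification(
    n_samples=200, n_features=12, n_informative=4, random_state=0
)

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=25, random_state=0),
    step=1,              # drop one feature per elimination round
    cv=5,                # 5-fold cross-validation
    scoring="accuracy",
)
selector.fit(X, y)

# Column indices of the features that survived elimination.
selected = np.flatnonzero(selector.support_)
```

To keep a fixed number of features instead, `sklearn.feature_selection.RFE` with `n_features_to_select` would be the closer fit.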
Modelwise Feature Importance
Model Comparison
Model | Features | CV score | Test accuracy |
---|---|---|---|
Random Forest | Segment + Desc | 73.33 | 75.44 |
Random Forest | Segment | 71.73 | 73.13 |
XGBoost | Segment + Desc | 73.33 | 75.24 |
XGBoost | Segment | 71.73 | 73.10 |
Gradient Boosting | Segment + Desc | 72.65 | 74.39 |
Gradient Boosting | Segment | 71.12 | 72.87 |
Extra Trees | Segment + Desc | 68.44 | 73.86 |
Extra Trees | Segment | 68.14 | 71.81 |
SVM | Segment + Desc | 73.33 | 73.26 |
SVM | Segment | 71.81 | 69.97 |
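For illustration, a CV score and a test accuracy like those reported could be produced along these lines. This sketch uses synthetic data, a 60/40 split as stated earlier, and 5-fold cross-validation on the training portion; the fold count for model comparison and all hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the labelled songs (not the real data).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 60/40 train/test split, stratified so both parts keep the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Mean accuracy across 5 folds of the training set.
cv_score = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy").mean()

# Accuracy on the held-out 40% after fitting on the full training set.
test_accuracy = model.fit(X_train, y_train).score(X_test, y_test)
```

Swapping the estimator for another classifier, or restricting `X` to the segment features only, would give the other rows of such a comparison.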
Further Work
- K-Nearest Neighbours with Mahalanobis distance
- Exploring whether lyrics can be added as features
Tools
Visualization: Seaborn, Matplotlib
Models and data handling: scikit-learn, XGBoost, pandas, NumPy
Spotify API wrapper: Spotipy
Data Wrangling: SQL
Thank You.