Bhavika Tekwani
Given audio features for a song, can we predict what mood the song represents?
Do audio features help with mood identification?
| Million Song Dataset | Spotify API |
|---|---|
| Artist, Song title | Speechiness |
| Duration | Energy |
| Loudness | Acousticness |
| Key, Mode, Time Signature | Instrumentalness |
| Tempo | Danceability |
| Segments Pitches (chroma features, 2D) | |
| Segments Timbre (MFCC + PCA, 2D) | |
| Beats | |
Hand-labelled 7,396 songs as 'happy' or 'sad'. The train/test split is 60/40.
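The descriptive features come from the Spotify API via Spotipy. A minimal sketch of that fetch follows, assuming the Spotify track IDs for the labelled songs have already been resolved; the credentials, the ID list, and the helper name are placeholders.

```python
# Sketch: pull the descriptive audio features for a list of track IDs with Spotipy.
# Client credentials and track IDs are placeholders, not values from the project.
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id="YOUR_CLIENT_ID", client_secret="YOUR_CLIENT_SECRET"))

DESCRIPTIVE = ["speechiness", "energy", "acousticness", "instrumentalness",
               "danceability", "tempo", "loudness"]

def fetch_descriptive_features(track_ids):
    """Return one dict of descriptive features per resolvable track ID."""
    rows = []
    # The audio-features endpoint accepts up to 100 IDs per request.
    for start in range(0, len(track_ids), 100):
        batch = track_ids[start:start + 100]
        for feats in sp.audio_features(batch):
            if feats is not None:
                rows.append({k: feats[k] for k in DESCRIPTIVE})
    return rows
```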
Low Level Segment Features (per-segment 2D arrays; see the aggregation sketch after this list)
Timbre
Pitch
Descriptive Features
Speechiness, Danceability, Tempo, Loudness, Energy, Acousticness, Instrumentalness
Notational Features
Key, Mode, Time Signature
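The segment-level timbre and pitch arrays vary in length from song to song, so they must be reduced to fixed-length vectors before any of the classifiers can use them. The poster does not spell out the aggregation, so the sketch below assumes a common choice: per-dimension mean and standard deviation over all segments.

```python
# Minimal sketch (assumed aggregation): collapse each song's variable-length
# 2D segment array (n_segments x 12 timbre or pitch dimensions) into a
# fixed-length vector of per-dimension means and standard deviations.
import numpy as np

def summarize_segments(segments_2d):
    """segments_2d: array-like of shape (n_segments, 12) for one song."""
    arr = np.asarray(segments_2d)
    return np.concatenate([arr.mean(axis=0), arr.std(axis=0)])  # shape (24,)

def song_segment_vector(timbre_2d, pitches_2d):
    """Fixed-length segment feature vector: 24 timbre stats + 24 pitch stats."""
    return np.concatenate([summarize_segments(timbre_2d),
                           summarize_segments(pitches_2d)])
```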
Recursive Feature Elimination with a Random Forest classifier and 5-fold cross-validation
Selected 25 of a possible 52 features
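A sketch of that selection step with scikit-learn's RFECV, using a random forest estimator and 5-fold cross-validation as stated above; `X_train`/`y_train` stand for the 60% training split, and the forest's hyperparameters are illustrative assumptions rather than the project's tuned values.

```python
# Recursive feature elimination with a random forest and 5-fold CV.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=200, random_state=0),
    step=1,                 # drop one feature per elimination round
    cv=5,
    scoring="accuracy",
)
selector.fit(X_train, y_train)              # X_train: all 52 candidate features
X_train_sel = selector.transform(X_train)   # keep only the selected subset
print("Selected features:", selector.n_features_)
```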
Compared feature importance by model.
The most important descriptive feature was Danceability, followed by Energy, Speechiness, and Beats.
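One way to make that comparison is to line up each fitted model's `feature_importances_` side by side; the sketch below assumes a dict of fitted tree-based models and a list of feature column names from the earlier steps.

```python
# Sketch: compare feature importances across fitted tree-based models.
# `models` (name -> fitted classifier) and `feature_names` are assumed to exist;
# SVC is excluded since it has no feature_importances_ attribute.
import pandas as pd

importance = pd.DataFrame(
    {name: clf.feature_importances_ for name, clf in models.items()},
    index=feature_names,
)
# Rank features by their random forest importance, highest first.
print(importance.sort_values("Random Forest", ascending=False).head(10))
```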
| Model | Features | CV score (%) | Test accuracy (%) |
|---|---|---|---|
| Random Forest | Segment + Desc | 73.33 | 75.44 |
| Random Forest | Segment | 71.73 | 73.13 |
| XGBoost | Segment + Desc | 73.33 | 75.24 |
| XGBoost | Segment | 71.73 | 73.10 |
| Gradient Boosting | Segment + Desc | 72.65 | 74.39 |
| Gradient Boosting | Segment | 71.12 | 72.87 |
| Extra Trees | Segment + Desc | 68.44 | 73.86 |
| Extra Trees | Segment | 68.14 | 71.81 |
| SVM | Segment + Desc | 73.33 | 73.26 |
| SVM | Segment | 71.81 | 69.97 |
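A sketch of how such a comparison can be run: 5-fold cross-validation on the training split and accuracy on the 40% hold-out set, for each model and each feature set. The feature matrices and hyperparameters shown are assumptions, not the tuned values behind the table.

```python
# Sketch: CV score on the training split and accuracy on the 40% hold-out set,
# for every model and feature set. X_seg_desc, X_seg, and y are assumed to exist;
# labels y are assumed to be encoded as 0 = sad, 1 = happy.
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              ExtraTreesClassifier)
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

models = {
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "XGBoost": XGBClassifier(n_estimators=200, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Extra Trees": ExtraTreesClassifier(n_estimators=200, random_state=0),
    "SVM": SVC(kernel="rbf", random_state=0),
}
feature_sets = {"Segment + Desc": X_seg_desc, "Segment": X_seg}

for feat_name, X in feature_sets.items():
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=0)  # 60/40 split from the poster
    for name, clf in models.items():
        cv = cross_val_score(clf, X_train, y_train, cv=5).mean()
        clf.fit(X_train, y_train)
        test_acc = accuracy_score(y_test, clf.predict(X_test))
        print(f"{name} ({feat_name}): CV {cv:.2%}, test {test_acc:.2%}")
```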
Visualization: Seaborn, Matplotlib
Models: scikit-learn, XGBoost, Pandas, NumPy
Spotify API wrapper: Spotipy
Data Wrangling: SQL