Combine NLP with supervised and unsupervised learning to classify math lectures
Objective: Use supervised and unsupervised learning techniques to best classify math lectures.
Data: closed captioning from 92 lectures, saved as XML files in a directory.
Part 1
- Clean the data
- Train Doc2Vec Model
- Dimensionality Reduction for Visualization
- Clustering and Similarity
  - Clustering the data
  - KMeans Silhouette Scores (9, 10, and 11 clusters)
  - Topic Extraction (NMF and LDA)

Part 2
- Modeling
  - Tf-idf vectorization
  - Initial model
  - Parameter Search
  - Parts of Speech
  - Final Model
Clean and tokenize the text
Each lecture is stored with its lecture id (str), its labels (subject and professor), and its spaCy Doc.
- Extract lemmas; remove punctuation and stop words
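A minimal sketch of this cleaning step, assuming English-language captions and the `en_core_web_sm` spaCy model (the specific model used is not stated in the slides):

```python
import spacy

# Assumes the small English model; the project may have used a different one.
nlp = spacy.load("en_core_web_sm")

def clean_and_tokenize(raw_text):
    """Return the lemmas of a lecture transcript, dropping punctuation and stop words."""
    doc = nlp(raw_text)
    return [tok.lemma_.lower() for tok in doc
            if not tok.is_punct and not tok.is_stop and not tok.is_space]

tokens = clean_and_tokenize("Today we will prove the fundamental theorem of calculus.")
```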
Use Doc2Vec to vectorize each lecture
- Converts each lecture i into a 65-dimensional vector [x_i1, x_i2, ..., x_i64, x_i65]
- Trained over 100 epochs (Epoch 1, 2, 3, ..., 100)
Extract the numerical representation for each lecture
*tagged_data[0] = 1 full lecture
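A sketch of the Doc2Vec step with gensim (version 4+ assumed), where `token_lists` holds the cleaned token list for each lecture and `lecture_ids` the matching ids; vector size 65 and 100 epochs come from the slides, the remaining hyperparameters are guesses:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# One TaggedDocument per lecture; tagged_data[0] = 1 full lecture, as noted above.
tagged_data = [TaggedDocument(words=tokens, tags=[lec_id])
               for tokens, lec_id in zip(token_lists, lecture_ids)]

model = Doc2Vec(vector_size=65,   # 65-dimensional lecture vectors
                min_count=2,      # assumed; not stated in the slides
                epochs=100)       # 100 training epochs
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)

# Extract the numerical representation for each lecture.
vecs = [model.dv[lec_id] for lec_id in lecture_ids]
```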
Calculate cosine similarity of lectures
Metrics for clustering and determining similarity
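Cosine similarity between the lecture vectors can be computed with scikit-learn, for example (a sketch, reusing `vecs` from the Doc2Vec step above):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

vec_matrix = np.vstack(vecs)            # shape: (n_lectures, 65)
sims = cosine_similarity(vec_matrix)    # sims[i, j] = cosine similarity of lectures i and j

# Find the most similar other lecture to lecture 0.
np.fill_diagonal(sims, -1.0)
most_similar_to_first = int(np.argmax(sims[0]))
```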
Reducing the Dimensionality for Visualization
Using PCA
Using t-SNE
The scatter plot of the t-SNE components is much easier to interpret. For this reason, t-SNE was chosen as the preferred method for reducing the dimensionality.
Y = pca.fit_transform(vecs)
Yt = tsne.fit_transform(vecs)
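A sketch of both reductions with scikit-learn, assuming `vec_matrix` is the (n_lectures x 65) array of Doc2Vec vectors built above; the t-SNE settings are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

vec_matrix = np.vstack(vecs)

pca = PCA(n_components=2)
Y = pca.fit_transform(vec_matrix)     # 2-D PCA projection

tsne = TSNE(n_components=2, perplexity=30, random_state=42)  # perplexity assumed
Yt = tsne.fit_transform(vec_matrix)   # 2-D t-SNE embedding, used for the plots below
```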
Clustering the Data
Agglomerative Clustering - 10 Clusters
Spectral Clustering (damping .8)
Mean Shift (various bandwidths)
KMeans - 10 Clusters
Clustering the Data with KMeans
KMeans 9 Clusters
KMeans 10 Clusters
KMeans 11 Clusters
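A sketch of how the KMeans silhouette scores for k = 9, 10, and 11 could be computed on the Doc2Vec vectors (the exact settings are assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in (9, 10, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    cluster_labels = km.fit_predict(vec_matrix)
    print(k, "clusters, silhouette score:", silhouette_score(vec_matrix, cluster_labels))
```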
Actual labels
9 Clusters by Subject: 59/69 points correctly clustered, 8/10 subjects correctly clustered
10 Clusters by Subject: 22/69 points correctly clustered, 3/10 subjects correctly clustered
11 Clusters by Subject: 27/69 points correctly clustered, 4/10 subjects correctly clustered
Results of Clustering
                9 Clusters    10 Clusters    11 Clusters
By Subject        85.5%         31.88%         39.13%
By Professor      46.37%        10.14%         33%
Score is based on cluster completeness (only lectures that fall perfectly within their true label's cluster are counted).
But wait... why is this Calculus lecture over here?
This Calculus lecture ended up far from its cluster in the t-SNE embedding.
Why can't KMeans distinguish between Data Structures and Algorithms?
Why does this AI lecture get clustered with Differential Equations?
Calculus is very close to Differential Equations.
In this context, Math for Computer Science, Artificial Intelligence, Algorithms, and Data Structures are more closely related to one another than to the other subjects. This relationship is captured in the coordinates of the lecture vectors.
Topic Extraction using Non-negative Matrix Factorization (NMF)
Data Structures
Algorithms
Do these look that different?
Topic Extraction using LDA (Latent Dirichlet Allocation)
Example topics: Winston AI 10, Differential Equations, AI
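A sketch of the two topic-extraction approaches with scikit-learn (NMF on TF-IDF weights, LDA on raw counts); the number of topics, vectorizer settings, and words shown per topic are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

docs = [" ".join(tokens) for tokens in token_lists]   # rejoin lemmas per lecture

tfidf = TfidfVectorizer(max_df=0.95, min_df=2)
nmf = NMF(n_components=10, random_state=42).fit(tfidf.fit_transform(docs))

counts = CountVectorizer(max_df=0.95, min_df=2)
lda = LatentDirichletAllocation(n_components=10, random_state=42).fit(counts.fit_transform(docs))

def top_words(model, feature_names, n=10):
    """Print the n highest-weighted words for each extracted topic."""
    for idx, topic in enumerate(model.components_):
        words = [feature_names[i] for i in topic.argsort()[-n:][::-1]]
        print(idx, " ".join(words))

top_words(nmf, tfidf.get_feature_names_out())
top_words(lda, counts.get_feature_names_out())
```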
Modeling with TF-IDF vectorization
Initial Results with TF-IDF vectors
Logistic Regression: 61%
Multinomial NB: 57%
Random Forest: 80%
K Neighbors (KNN): 63%
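A sketch of this baseline comparison, reusing `docs` from above and assuming the subject labels are in `labels`; the default hyperparameters and 5-fold setup are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X_tfidf = TfidfVectorizer().fit_transform(docs)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Multinomial NB": MultinomialNB(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
}
for name, clf in models.items():
    scores = cross_val_score(clf, X_tfidf, labels, cv=5)
    print(name, scores.mean())
```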
Other feature generation
- Extract parts of speech (POS)
- Count occurrence of POS by lecture
- Divide each POS count by lecture length
(Input: a list of spaCy Docs; a norm=True/False flag controls whether counts are divided by lecture length)
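A sketch of such a POS feature function, assuming each lecture is already a spaCy Doc; the function and its `norm` flag are illustrative names, not the project's actual code:

```python
from collections import Counter
import pandas as pd

def pos_features(spacy_docs, norm=True):
    """One row of part-of-speech counts per lecture; optionally length-normalized."""
    rows = []
    for doc in spacy_docs:
        counts = Counter(tok.pos_ for tok in doc)
        if norm:
            total = len(doc) or 1
            counts = {pos: n / total for pos, n in counts.items()}
        rows.append(counts)
    return pd.DataFrame(rows).fillna(0)

X_pos = pos_features(list_of_spacy_docs, norm=True)
```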
                                     Logistic Regression   Random Forest
Using POS counts only                        90%                74%
Using POS / len(total_lecture_pos)           24%                92%
Parameter Search
- tf-idf min_df = 19, score = 95.65%

Final Model
- New X = tf-idf vectors + POS vectors
- Random Forest (n_estimators=200, max_depth=4, min_samples_leaf=4, random_state=43, class_weight='balanced')
- tf-idf min_df = 25
- Cross validation, 5 folds (mean score)
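A sketch of the final model, stacking the tf-idf vectors (min_df=25) with the POS features from above and scoring with 5-fold cross validation; the way the features are combined is an assumption beyond what the slides show:

```python
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# min_df was tuned by looping over candidate values; 19 scored best in the search,
# while the final model shown in the slides uses min_df = 25.
X_tfidf = TfidfVectorizer(min_df=25).fit_transform(docs)
new_X = hstack([X_tfidf, csr_matrix(X_pos.values)]).tocsr()   # tf-idf vectors + POS vectors

rf = RandomForestClassifier(n_estimators=200, max_depth=4, min_samples_leaf=4,
                            random_state=43, class_weight='balanced')
scores = cross_val_score(rf, new_X, labels, cv=5)             # 5-fold cross validation
print(scores.mean())
```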
Clustering:
+ Able to identify similar subjects
- Unable to distinguish closely related subjects
- Unable to distinguish professors

Modeling:
+ Very accurate
+ Able to distinguish professors
For further study
- Generate an overall rating for each lecture
  - YouTube ratings, comments, view counts
- Predict the quality of newly posted content
  - Match the most relevant new content to a given user
- Scale the data collection
  - Programmatically obtain lecture subtitles from YouTube's API
- Create new features based on sentiment analysis of comments