Math Lectures

Combine NLP with supervised and unsupervised learning to classify math lectures

Objective:

Use closed captioning from 92 lectures

Save XML file in a directory
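A minimal sketch of turning one saved caption file into plain text. It assumes the XML uses a `<transcript>` root with `<text>` children, which the slides do not confirm; the function name `captions_to_text` and the sample string are illustrative only.

```python
import html
import xml.etree.ElementTree as ET


def captions_to_text(xml_string):
    """Join the text of every <text> element into one whitespace-normalized string."""
    root = ET.fromstring(xml_string)
    # Caption text is sometimes entity-escaped a second time, so unescape once more.
    pieces = [html.unescape(node.text) for node in root.iter("text") if node.text]
    return " ".join(" ".join(pieces).split())


# Hypothetical caption snippet in the assumed layout.
sample = (
    '<transcript>'
    '<text start="0.0" dur="2.1">today we start</text>'
    '<text start="2.1" dur="3.0">with   limits</text>'
    '</transcript>'
)
lecture_text = captions_to_text(sample)
```

If the real files use a different element name, only the `root.iter("text")` call should need to change.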

Objective: Use supervised and unsupervised learning techniques to best classify math lectures.

(Clean the data)

- Train Doc2Vec Model

- Dimensionality Reduction for Visualization

Clustering and Similarity

- Clustering the data

- KMeans Silhouette Scores

- 9, 10, and 11 clusters

- Topic Extraction (NMF and LDA)

Modeling

- Tf-idf vectorization

- Initial model

- Parameter Search

- Parts of Speech

- Final Model

Part 2

Part 1

Clean and tokenize the text

[Diagram: each lecture (lecture id, str, label) is converted to a spaCy Doc]

 - extract lemmas, remove punctuation and stop words
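The cleaning step can be sketched as a small filter over tokens. The function relies only on the spaCy `Token` attributes `lemma_`, `is_stop`, and `is_punct`, so a stand-in token type is used here to keep the example self-contained; `clean_tokens` and `FakeToken` are hypothetical names, and lowercasing the lemmas is an added assumption.

```python
from collections import namedtuple


def clean_tokens(doc):
    """Keep lowercased lemmas of tokens that are neither stop words nor punctuation."""
    return [tok.lemma_.lower() for tok in doc if not (tok.is_stop or tok.is_punct)]


# Stand-in for spaCy tokens; a real spaCy Doc can be passed to clean_tokens directly.
FakeToken = namedtuple("FakeToken", "lemma_ is_stop is_punct")
doc = [
    FakeToken("the", True, False),
    FakeToken("Derivative", False, False),
    FakeToken("be", True, False),
    FakeToken("continuous", False, False),
    FakeToken(".", False, True),
]
tokens = clean_tokens(doc)
```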

Use Doc2Vec to vectorize each lecture

- Converts each lecture into a 65 dimensional vector

[x_1, x_2, ..., x_64, x_65]

[Diagram: Doc2Vec training on each lecture i over Epoch 1, Epoch 2, Epoch 3, ..., Epoch 100]

Extract the numerical representation for each lecture

*tagged_data[0] = 1 full lecture

Calculate cosine similarity of lectures

Metrics for clustering and determining similarity
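For reference, cosine similarity between two lecture vectors is their dot product divided by the product of their lengths. The sketch below uses plain Python lists in place of the 65-dimensional Doc2Vec vectors; the function name and the sample vectors are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])                # unrelated vectors
```

Values near 1 indicate lectures pointing in nearly the same direction in the embedding space; values near 0 indicate little similarity.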

Reducing the Dimensionality

for Visualization

Using PCA

Using t-SNE

The scatter plot of the t-SNE components is much easier to interpret, so t-SNE was chosen as the preferred method for reducing the dimensionality.

Y = pca.fit_transform(vecs)

Yt = tsne.fit_transform(vecs)
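The two calls above could be set up roughly as follows with scikit-learn. The names `vecs`, `pca`, `tsne`, `Y`, and `Yt` come from the slides; the random stand-in data, `perplexity=10`, and `random_state=0` are assumptions made only so the sketch runs on its own.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in for the 65-dimensional lecture vectors (one row per lecture).
rng = np.random.default_rng(0)
vecs = rng.normal(size=(30, 65))

pca = PCA(n_components=2)
# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10, random_state=0)

Y = pca.fit_transform(vecs)    # linear projection onto 2 components
Yt = tsne.fit_transform(vecs)  # non-linear 2-D embedding
```

Both results have one 2-D point per lecture, so either can be passed straight to a scatter plot.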

Clustering the Data

Agglomerative Clustering - 10 Clusters

Spectral Clustering (damping .8)

Mean Shift (all bandwidth)

KMeans - 10 Clusters

Clustering the Data with KMeans

KMeans 9 Clusters

KMeans 10 clusters

KMeans 11 Clusters

Actual labels
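One way to compare 9, 10, and 11 clusters is to compute a silhouette score for each KMeans fit. This is a sketch on synthetic blobs, not the lecture vectors; the data, `n_init=10`, and `random_state=0` are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in: 69 points in 65 dimensions drawn around 10 centers.
X, _ = make_blobs(n_samples=69, centers=10, n_features=65, random_state=0)

scores = {}
for k in (9, 10, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # ranges from -1 to 1, higher is better
```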

9 clusters by Subject

59/69 points correctly clustered

8/10 subjects correctly clustered

10 Clusters by Subject

22/69 points correctly clustered

3/10 Subjects correctly clustered

11 Clusters by Subject

27/69 points correctly clustered

4/10 Subjects correctly clustered

Results of Clustering

              9 Clusters   10 Clusters   11 Clusters
By Subject    85.5%        31.88%        39.13%
By Professor  46.37%       10.14%        33%

Score is based on cluster completeness: only lectures that fall perfectly within their true label group are scored.
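The scoring rule is not spelled out further, so the sketch below encodes one possible reading: a subject counts only when all of its lectures share a single cluster and that cluster holds no other subject. The function name and the toy labels are hypothetical.

```python
from collections import defaultdict


def strict_cluster_score(true_labels, cluster_ids):
    """Return (lectures_scored, subjects_scored) under an all-or-nothing rule."""
    clusters_by_label = defaultdict(set)  # subject -> clusters its lectures fell into
    labels_by_cluster = defaultdict(set)  # cluster -> subjects found inside it
    counts = defaultdict(int)             # subject -> number of lectures
    for label, cluster in zip(true_labels, cluster_ids):
        clusters_by_label[label].add(cluster)
        labels_by_cluster[cluster].add(label)
        counts[label] += 1
    good = [
        label
        for label, clusters in clusters_by_label.items()
        if len(clusters) == 1 and len(labels_by_cluster[next(iter(clusters))]) == 1
    ]
    return sum(counts[label] for label in good), len(good)


# Toy example: only "calc" is both complete and alone in its cluster.
true_labels = ["calc", "calc", "ai", "ai", "algo", "ds"]
cluster_ids = [0, 0, 1, 2, 3, 3]
points, subjects = strict_cluster_score(true_labels, cluster_ids)
```

Dividing the first number by the total number of lectures gives the percentage; for instance, 59 of 69 lectures is about 85.5%.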

But wait...

Calculus? Why here?

This calculus lecture ended up far from its cluster during the t-SNE decomposition.

Why can't KMeans distinguish between Data Structures and Algorithms?

Why does this AI lecture get clustered with Differential Equations?

Calculus is very close to Differential Equations.

In this context, Math for Computer Science, Artificial Intelligence, Algorithms, and Data Structures are more closely related to one another than to the other subjects. This relationship is captured in the coordinates of the lectures.

Topic Extraction using Non-negative Matrix Factorization (NMF)

Data Structures

Algorithms

Do these look that different?

Topic Extraction using LDA (Latent Dirichlet Allocation)

Winston AI 10

Differential Equations

AI

Modeling with TF-IDF vectorization

Initial Results with TF-IDF vectors

Model                 Score
Logistic Regression   61%
Random Forest         80%
Multinomial NB        57%
K Neighbors (KNN)     63%

Other feature generation

- Extract parts of speech (POS)

- Count occurrence of POS by lecture

- Divide each POS by lecture length

Input: list of spaCy Docs, with norm=True or norm=False
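The POS features can be sketched as a per-lecture count with an optional division by lecture length. The `norm` flag mirrors the one named on the slide; the function name `pos_features` and the stand-in token type are hypothetical, and only the spaCy `Token` attribute `pos_` is assumed.

```python
from collections import Counter, namedtuple


def pos_features(doc, norm=True):
    """Count coarse POS tags in one lecture; divide by the token count when norm is True."""
    counts = Counter(tok.pos_ for tok in doc)
    if not norm:
        return dict(counts)
    total = sum(counts.values())
    return {pos: n / total for pos, n in counts.items()}


# Stand-in for spaCy tokens; a real spaCy Doc can be passed in directly.
Tok = namedtuple("Tok", "pos_")
doc = [Tok("NOUN"), Tok("NOUN"), Tok("VERB"), Tok("ADJ")]

raw = pos_features(doc, norm=False)
normed = pos_features(doc, norm=True)
```

Normalizing makes lectures of different lengths comparable, at the cost of shrinking every feature into the range 0 to 1.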

                                     Logistic Regression   Random Forest
using POS only                       90%                   74%
using POS / len(total_lecture_pos)   24%                   92%

Parameter Search

min_df = 19, score = 95.65%

tf-idf vectors + POS vectors = New X

Random Forest (n_estimators=200, max_depth=4, min_samples_leaf=4, random_state=43, class_weight='balanced')

tf-idf min_df = 25

Cross Validation, 5 folds, mean score
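A sketch of the combined-feature model, assuming "New X" means the tf-idf columns placed side by side with the POS columns. The Random Forest settings are the ones listed above; the toy texts, the random POS values, and `min_df=1` (used only because the toy corpus is tiny) are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score

# Toy stand-ins for cleaned lecture texts, subject labels, and POS features.
texts = ["derivative limit integral"] * 5 + ["graph node search tree"] * 5
labels = ["calculus"] * 5 + ["algorithms"] * 5
pos_vectors = np.random.default_rng(0).random((10, 3))  # hypothetical POS frequencies

tfidf = TfidfVectorizer(min_df=1).fit_transform(texts)

# "New X": tf-idf columns followed by POS columns.
X_new = np.hstack([tfidf.toarray(), pos_vectors])

clf = RandomForestClassifier(
    n_estimators=200,
    max_depth=4,
    min_samples_leaf=4,
    random_state=43,
    class_weight="balanced",
)
scores = cross_val_score(clf, X_new, labels, cv=5)  # 5-fold cross validation
mean_score = scores.mean()
```

Fitting the vectorizer inside a scikit-learn `Pipeline` instead would keep each fold's vocabulary from seeing its own test lectures.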

Clustering:

+ Able to identify similar subjects

- Unable to distinguish closely related subjects

- Unable to distinguish professors

Modeling:

+ Very accurate

+ Able to distinguish professors

For further study

- Generate an overall rating of each lecture

- YouTube ratings, comments, view counts

- Predict the quality of newly posted content

- Match the most relevant new content for a given user

- Scale the data collection

- Programmatically obtain lecture subtitles from YouTube's API

- Create new features based on sentiment analysis of comments

Using NLP to classify math lectures

By will-m
