Math Lectures



Combine NLP with supervised and unsupervised learning to classify math lectures
Objective:



Use closed captioning from 92 lectures
Save XML file in a directory
Objective: Use supervised and unsupervised learning techniques to best classify math lectures.
(Clean the data)
- Train Doc2Vec Model
- Dimensionality Reduction for Visualization
Clustering and Similarity
- Clustering the data
- KMeans Silhouette Scores
- 9, 10, and 11 clusters
- Topic Extraction (NMF and LDA)
Modeling
- Tf-idf vectorization
- Initial model
- Parameter Search
- Parts of Speech
- Final Model
Part 2
Part 1
Clean and tokenize the text
lecture id
str
label
label
Spacy Doc


- extract lemmas, remove punctuation and stop words

Use Doc2Vec to vectorize each lecture

- Converts each lecture into a 65 dimensional vector

[ x 1, x 2 ... x 64, x 65 ]

Lecture
i
i
i
i
i




Epoch 1
Epoch 2
Epoch 3
Epoch 100
...
Extract the numerical representation for each lecture

*tagged_data[0] = 1 full lecture
Calculate cosine similarity of lectures



Metrics for clustering and determining similarity
Reducing the Dimensionality
for Visualization
Using PCA
Using t-SNE


The scatter plot of the t-SNE components is much easier to interpret. For this reason t-SNE was chosen as the prefered method for reducing the dimensionality
Y = pca.fit_transform(vecs)
Yt = tsne.fit_transform(vecs)
Clustering the Data


Agglomerative Clustering - 10 Clusters


Spectral Clustering (damping .8)
Mean Shift (all bandwidth)
KMeans - 10 Clusters
Clustering the Data with KMeans

KMeans 9 Clusters

KMeans 10 clusters


KMeans 11 Clusters
Actual labels


9 clusters by Subject

59/69 points correctly clustered
8/10 subjects correctly clustered


10 Clusters by Subject
22/69 points correctly clustered
3/10 Subjects correctly clustered

11 Clusters by Subject
27/69 points correctly clustered
4/10 Subjects correctly clustered
Results of Clustering
9 Clusters
10 Clusters
11 Clusters
By Subject
By Professor
85.5%
31.88%
39.13%
46.37%
10.14%
33%

Score is based on cluster completeness. (Only lectures perfectly in their true label group are scored)
But wait...


Calculus?
Why here?
?
?
This calculus lecture ended up far from its cluster during the t-SNE decomposition
Why can't the KMeans discern between data structures and algorithms?
Why does this AI lecture get clustered to Differential Equations?
Calculus is very close to Differential Equations.
Math for Computer Science, Artificial Intelligence, Algorithms, Data Structures in this context are more closely related to one another than the others. This relationship is captured in coordinates of the lectures


Topic Extraction using Non negative matrix factorization
Data Structures
Algorithms
Do these look that different?
Topic Extraction using LDA


Winston AI 10
Differential Equations
(Latent Dirichlet Allocation)

AI
Modeling with TF-IDF vectorization

Initial Results with TF-IDF vectores
Logistic Regression
Multinomial NB
Random Forest
K Neighbors








Logistic Regression
61%
Random Forest
80%
Multinomial NB
57%
KNN
63%
Other feature generation





- Extract parts of speech (POS)
- Count occurrence of POS by lecture


- Divide each POS by lecture length
List of Spacy Docs

True
False
norm=
using POS only
Logistic Regression
Random Forest


90%


74%
using POS / len(total_lecture_pos)






24%
92%
Parameter Search

min_df = 19, score = 95.65%
tf-idf vectors
+
POS vectors
=





Random Forest (n_estimators=200, max_depth=4, min_samples_leaf=4, random_state=43, class_weight='balanced')
tf-idf df_min = 25


Cross Validation 5 folds
mean
New X
Modeling:
Clustering:
+ Able to identify similar subjects.
- Unable to decipher closely related subjects
- Unable to decipher professors
+ Very accurate
+ able to decipher professors
For further study
- Generate an overall rating of each lecture
- Youtube ratings, comments, view counts
- Predict the quality of newly posted content
- match most relevant new content for a given user
- Scale the data collection
- Programmatically obtain lecture subtitles from youtube's API
- Create new features based on sentiment analysis of comments
Using NLP to classfy math lectures
By will-m
Using NLP to classfy math lectures
Using NLP to classify math lectures
- 315