Vectorizing + Category Tree
Recap from last week
- Exclusion List
- Current Results
- categories.txt usage
- Vectorization
Without any Exclusion List

With Exclusion List (Categories - DMOZ)

With Exclusion List (~1000 Homonyms)

With Exclusion List (~100 000 Most Common Words)

With Exclusion List (~10 000 Most Common Words)

Current Status:
(much larger data set: 5 categories, 1000 seeds each)

Vectorizing
for the next step: Neural Networks

- First, build a vector of all TopN (N=10) terms from all 5 categories.
- The vector has 49 items in total: 5 x 10 = 50 terms, minus one term that repeats across categories.
- On our movie dataset, here is an example vector for one seed. An M is appended at the end to denote that this seed is a movie.
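
The bullets above can be sketched roughly as follows. This is only an illustration under assumptions: the function names, the toy terms, and the choice of a 0/1 presence encoding are hypothetical and not taken from the project.

```python
def build_vocabulary(top_terms_by_category):
    """Merge per-category TopN term lists into one ordered list without repeats."""
    vocabulary = []
    seen = set()
    for terms in top_terms_by_category.values():
        for term in terms:
            if term not in seen:
                seen.add(term)
                vocabulary.append(term)
    return vocabulary


def vectorize_seed(seed_terms, vocabulary, label):
    """Return a 0/1 presence vector over the vocabulary, with the label appended.

    Assumption: a seed is represented by the set of terms associated with it.
    """
    present = set(seed_terms)
    return [1 if term in present else 0 for term in vocabulary] + [label]


# Made-up toy data: 2 categories with 3 terms each; "star" repeats once,
# so the vocabulary has 2 * 3 - 1 = 5 entries.
top_terms = {
    "movies": ["film", "actor", "star"],
    "astronomy": ["planet", "star", "orbit"],
}
vocabulary = build_vocabulary(top_terms)
vector = vectorize_seed(["actor", "star", "director"], vocabulary, "M")
```

With 5 categories and N=10, the same de-duplication gives the 49-item vector mentioned above.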
Vectorizing

Each vector corresponds to a seed in the seed file
...
...
Category Tree
Tree structure from DMOZ's categories
Process of Building the Tree
- Preprocess text file to remove non-ASCII lines
- Build the tree from text file which has been preprocessed
- Also created functions to check whether a node exists and to get all of a node's children
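
A minimal sketch of these steps, assuming the category file holds one slash-separated path per line (for example Arts/Movies/Titles). The class, the function names, and the sample lines are hypothetical, not the project's actual code.

```python
class CategoryNode:
    """One category; children are keyed by their name."""

    def __init__(self, name):
        self.name = name
        self.children = {}


def is_ascii(line):
    """True if every character in the line is plain ASCII."""
    try:
        line.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


def preprocess(lines):
    """Drop blank lines and lines that contain non-ASCII characters."""
    return [line.strip() for line in lines if line.strip() and is_ascii(line)]


def build_tree(paths, root_name="Top"):
    """Insert every slash-separated path below a single root node."""
    root = CategoryNode(root_name)
    for path in paths:
        node = root
        for part in path.split("/"):
            if part not in node.children:
                node.children[part] = CategoryNode(part)
            node = node.children[part]
    return root


def find_node(root, path):
    """Walk down from the root; return the node, or None if the path is absent."""
    node = root
    for part in path.split("/"):
        node = node.children.get(part)
        if node is None:
            return None
    return node


def node_exists(root, path):
    return find_node(root, path) is not None


def get_children(root, path):
    """Names of the direct children of the node at `path` (empty if absent)."""
    node = find_node(root, path)
    return sorted(node.children) if node is not None else []


# Made-up sample lines; the third contains a non-ASCII character and is dropped.
raw_lines = ["Arts/Movies/Titles", "Arts/Music", "Arts/Cin\u00e9ma", "Science/Astronomy"]
tree = build_tree(preprocess(raw_lines))
arts_children = get_children(tree, "Arts")
```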
Duplicate Category

Solution:
create multiple trees, not a single tree
Characteristics
- Relatively easy to create super-tree
- The top-level roots are kept separate for easier searching
- Graph vs Tree - Two separate ones (or all-in-one)?
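
One way to picture the multiple-trees idea is a dictionary holding one tree per top-level root, so a category name that appears in several places can be looked up per root. The sketch below assumes slash-separated category paths and uses nested dictionaries; the function names and sample paths are hypothetical.

```python
def build_forest(paths):
    """Build one nested-dict tree per top-level category."""
    forest = {}
    for path in paths:
        node = forest
        for part in path.split("/"):
            node = node.setdefault(part, {})
    return forest


def paths_to(tree, name, prefix):
    """Yield every path inside one tree that ends at a node called `name`."""
    for child, subtree in tree.items():
        path = prefix + (child,)
        if child == name:
            yield "/".join(path)
        yield from paths_to(subtree, name, path)


def find_category(forest, name):
    """Map each top-level root to the paths where `name` occurs in its tree."""
    hits = {}
    for root, subtree in forest.items():
        found = [root] if root == name else []
        found.extend(paths_to(subtree, name, (root,)))
        if found:
            hits[root] = found
    return hits


# Made-up sample paths: "Movies" appears under two different roots.
forest = build_forest([
    "Arts/Movies",
    "Shopping/Entertainment/Movies",
    "Science/Astronomy",
])
movie_hits = find_category(forest, "Movies")
```

Here `movie_hits` lists the duplicate category once per root, which keeps each tree free of ambiguity while still allowing a search across all of them.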
Examples
Get all of Adult's children

By katiec089