Vectorizing + Category Tree

Recap from last week

  • Exclusion List
  • Current Results
  • categories.txt usage
  • Vectorization

Without any Exclusion List

With Exclusion List (Categories - DMOZ)

With Exclusion List (~1000 Homonyms)

With Exclusion List (~100 000 Most Common Words)

With Exclusion List (~10 000 Most Common Words)

Current Status:

(much larger data set: 5 categories, 1000 seeds each)

Vectorizing

for the next step: Neural Networks

  • First, build a vector of all TopN (N=10) terms from all 5 categories.
    • Total of 49 items in the vector: 5 × 10 = 50 terms, with one term repeated across categories.
  • On our movie dataset, here is an example vector for one seed. An M is appended at the end, denoting that this seed is a movie.
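
The vector construction above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the toy term lists, and the choice of term counts as vector values are all assumptions.

```python
def build_vocabulary(top_terms_by_category):
    """Merge per-category top-N term lists into one ordered list without repeats."""
    vocabulary = []
    seen = set()
    for terms in top_terms_by_category.values():
        for term in terms:
            if term not in seen:
                seen.add(term)
                vocabulary.append(term)
    return vocabulary


def vectorize(seed_terms, vocabulary, label):
    """Return one count per vocabulary term, with the class label appended."""
    # Assumption: vector entries are term counts; the slides do not say
    # whether counts, binary flags, or weights are used.
    counts = [seed_terms.count(term) for term in vocabulary]
    return counts + [label]


# Toy data (hypothetical): 2 categories with 3 top terms each, "star" repeated.
top_terms = {
    "movie": ["film", "director", "star"],
    "music": ["album", "band", "star"],
}
vocabulary = build_vocabulary(top_terms)
vector = vectorize(["film", "star", "star", "plot"], vocabulary, "M")
```

With 5 categories and N=10, the same de-duplication would turn 50 terms into the 49-item vector described above.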

Vectorizing

Each vector corresponds to a seed in the seed file

...

...

Category Tree

Tree structure from DMOZ's categories

Process of Building the Tree

  1. Preprocess the text file to remove non-ASCII lines
  2. Build the tree from the preprocessed text file
  3. Add functions to check whether a node exists and to get all of a node's children
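
The three steps above might look like the sketch below. It assumes each line of the text file is a slash-separated category path (for example `Arts/Movies/Titles`); the `Node` class, the function names, and the sample lines are hypothetical, and "children" is read here as direct children only.

```python
class Node:
    """One category in the tree; children are keyed by category name."""

    def __init__(self, name):
        self.name = name
        self.children = {}


def preprocess(lines):
    """Step 1: drop blank lines and any line containing non-ASCII characters."""
    return [ln.strip() for ln in lines if ln.strip() and ln.isascii()]


def build_tree(paths, root_name="Top"):
    """Step 2: insert every slash-separated path under a single root node."""
    root = Node(root_name)
    for path in paths:
        node = root
        for part in path.split("/"):
            if part not in node.children:
                node.children[part] = Node(part)
            node = node.children[part]
    return root


def find_node(root, path):
    """Walk the path from the root; return the node, or None if it is missing."""
    node = root
    for part in path.split("/"):
        node = node.children.get(part)
        if node is None:
            return None
    return node


def node_exists(root, path):
    """Step 3a: check whether a category path is present in the tree."""
    return find_node(root, path) is not None


def get_children(root, path):
    """Step 3b: return the sorted names of a node's direct children."""
    node = find_node(root, path)
    return sorted(node.children) if node else []


# Hypothetical sample input, including one non-ASCII line and one blank line.
lines = ["Arts/Movies/Titles", "Arts/Music", "Arts/Literatur\u00e9", "", "Adult/Arts"]
tree = build_tree(preprocess(lines))
arts_children = get_children(tree, "Arts")
```

Here `str.isascii()` requires Python 3.7 or later.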

Duplicate Category

Solution:

create multiple trees, not a single tree
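
One way to picture the multiple-tree idea is a dictionary with one tree per top-level category. This sketch assumes the duplicates are category names that recur under different top-level categories, uses nested dicts instead of a node class for brevity, and all names and sample paths are hypothetical.

```python
def build_forest(paths):
    """Return {top_level_name: nested-dict tree}, one tree per top-level category."""
    forest = {}
    for path in paths:
        top, *rest = path.split("/")
        node = forest.setdefault(top, {})
        for part in rest:
            node = node.setdefault(part, {})
    return forest


def trees_containing(forest, name):
    """List the top-level roots whose tree holds a category called `name`."""

    def contains(tree):
        # True if `name` is a child here or anywhere deeper in this subtree.
        return name in tree or any(contains(sub) for sub in tree.values())

    return sorted(top for top, tree in forest.items() if top == name or contains(tree))


# Hypothetical paths: "Arts" and "Movies" each appear under two different roots.
paths = ["Arts/Movies", "Adult/Arts/Movies", "Science/Physics"]
forest = build_forest(paths)
movie_roots = trees_containing(forest, "Movies")
```

Because each root owns its own tree, a repeated name such as `Arts` can exist in several trees without colliding, and a search can be limited to one root at a time.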

Characteristics

  • Relatively easy to create super-tree
  • The top-level roots are kept separate for easier searching
  • Graph vs Tree - Two separate ones (or all-in-one)? 

Examples

Get all of Adult's children

Vectorizing + Category Tree

By katiec089
