Improving the Categorizer

Overview

More robust data from crawler
DMOZ common word exclusion
Next steps

More Data Categories

Song titles
Restaurants
Movies
Cities
TV Shows

DMOZ

categories.txt from DMOZ (802287 categories)
"Common" words are removed from the TopN list

Note: Categories spelled in non-English characters

These are not included in the list of common words

Final Category List

[removed from TopN list]

After removing non-English character categories: 568219

After removing repeated categories: 156160
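
The two filtering steps above could be sketched as follows. This is a minimal illustration, not the original code. It assumes categories.txt holds one category name per line, treats any name containing a non-ASCII character as "non-English", and treats names that differ only in letter case as repeats. The function name `clean_categories` is hypothetical.

```python
def clean_categories(lines):
    """Drop non-ASCII names, then drop repeats, keeping first-seen order."""
    seen = set()
    cleaned = []
    for raw in lines:
        name = raw.strip()
        # Assumption: "non-English" means "contains a non-ASCII character".
        if not name or not name.isascii():
            continue
        # Assumption: repeats are matched case-insensitively.
        key = name.lower()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(name)
    return cleaned


sample = ["Restaurants", "Música", "Movies", "restaurants", "映画", "Cities"]
print(clean_categories(sample))  # ['Restaurants', 'Movies', 'Cities']
```

With the real file, the same function could be called as `clean_categories(open("categories.txt", encoding="utf-8"))`.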

...

...

Results

Restaurant TopN file:

[Given N = 10]

Run-Through

Get data from multiple sources for the different categories
Autocomplete the data for all categories
For each category, generate the TopN terms
Compare the TopN terms with the autocomplete phrases in the testing data
Get results (APR)
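
The comparison step could look roughly like the sketch below. The slides do not state the matching rule or how APR is computed, so this is only an assumption: a phrase is assigned to the category whose TopN list shares the most words with it, and the only score reported is the fraction of test phrases labeled correctly. The names `categorize` and `fraction_correct` and the sample data are hypothetical.

```python
def categorize(phrase, top_n_by_category):
    """Return the category whose TopN terms overlap most with the phrase.

    Returns None when no TopN term appears in the phrase.
    """
    words = set(phrase.lower().split())
    best_cat, best_hits = None, 0
    for cat, terms in top_n_by_category.items():
        hits = len(words & set(terms))
        if hits > best_hits:
            best_cat, best_hits = cat, hits
    return best_cat


def fraction_correct(labeled_phrases, top_n_by_category):
    """Share of (phrase, expected_category) pairs that are categorized correctly."""
    correct = sum(
        1
        for phrase, label in labeled_phrases
        if categorize(phrase, top_n_by_category) == label
    )
    return correct / len(labeled_phrases)


# Made-up TopN lists and test phrases, for illustration only.
top_n = {
    "restaurants": ["pizza", "menu", "delivery"],
    "movies": ["trailer", "cast", "review"],
}
tests = [
    ("pizza delivery near me", "restaurants"),
    ("new movie trailer", "movies"),
    ("weather today", "cities"),
]
print(fraction_correct(tests, top_n))
```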

Major division: how TopN is calculated
Method 1: Use TF-IDF to score how useful each word is.
Method 2: Use TF-IDF + DMOZ to remove common words, then recalculate the TopN.
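
A minimal sketch of both methods, under stated assumptions: each category's collected text is treated as one document, TF is the term count divided by the category's total word count, and IDF is log(number of categories / number of categories containing the term). The `common_words` parameter stands in for the DMOZ-derived word list, and the function name `top_n_terms` and the sample data are hypothetical.

```python
import math
from collections import Counter


def top_n_terms(docs_by_category, n=10, common_words=frozenset()):
    """Return {category: top-n terms} ranked by TF-IDF.

    Leave common_words empty for plain TF-IDF (Method 1); pass an
    exclusion list to drop those words before scoring (Method 2).
    """
    # Term counts per category, after dropping excluded words.
    counts = {
        cat: Counter(w for w in text.lower().split() if w not in common_words)
        for cat, text in docs_by_category.items()
    }
    num_docs = len(counts)
    # Document frequency: number of categories in which each term appears.
    df = Counter(term for c in counts.values() for term in c)

    result = {}
    for cat, c in counts.items():
        total = sum(c.values()) or 1
        scores = {
            term: (freq / total) * math.log(num_docs / df[term])
            for term, freq in c.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[cat] = ranked[:n]
    return result


# Made-up data, for illustration only.
data = {
    "restaurants": "pizza menu pizza delivery home",
    "movies": "trailer cast trailer review home",
}
print(top_n_terms(data, n=2))
print(top_n_terms(data, n=2, common_words={"home", "delivery"}))
```

In this toy data, "home" appears in both categories, so its IDF is zero and it ranks last even under Method 1. The exclusion list matters for words that score well under TF-IDF but are still too generic to identify a category.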

Previously

After DMOZ

TV Anomalies - DMOZ

Next

Improving the categorizer further with machine learning / neural networks
- Vectorizing TopN key terms
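
One possible reading of "vectorizing TopN key terms" is sketched below. The slides do not describe the encoding, so this is an assumption: the vocabulary is the union of all categories' TopN terms, and each phrase becomes a binary vector marking which vocabulary terms it contains. The names `build_vocabulary` and `vectorize` are hypothetical.

```python
def build_vocabulary(top_n_by_category):
    """Sorted list of every distinct TopN term across all categories."""
    return sorted({t for terms in top_n_by_category.values() for t in terms})


def vectorize(phrase, vocabulary):
    """Binary vector: 1 where the vocabulary term occurs in the phrase, else 0."""
    words = set(phrase.lower().split())
    return [1 if term in words else 0 for term in vocabulary]


# Made-up TopN lists, for illustration only.
top_n = {"restaurants": ["pizza", "menu"], "movies": ["trailer", "cast"]}
vocab = build_vocabulary(top_n)
print(vocab)                                  # ['cast', 'menu', 'pizza', 'trailer']
print(vectorize("pizza menu prices", vocab))  # [0, 1, 1, 0]
```

Vectors of this form have a fixed length, so they could be used as input features for a classifier.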

By katiec089