Improving the Categorizer
Overview
-
More robust data from crawler
-
DMOZ common word exclusion
-
Next steps
More Data Categories
-
Song titles
-
Restaurants
-
Movies
-
Cities
-
TV Shows
DMOZ
- categories.txt from DMOZ (802287)
- "common" words are removed from TopN list



Point: Categories spelled in non-English characters
Not included in list of common words
Final Category List
[removed from TopN list]

After removing non-English character categories: 568219
After removing repeated categories: 156160

...

...
Results
Restaurant TopN file:
[Given N = 10]

RunThrough
- Get data from multiple sources for different categories
- Autocomplete the data for all categories
- For each category, generate the topN terms.
- Compare the TopN terms with the autocomplete phrases on the testing data.
- Get results (APR)
- Major division: Calculation of TopN
- Method 1: Use TF-IDF to calculate the rate of a word being useful.
- Method 2: Use TF-IDF+DMOZ to remove the commonalities and recalculate the topN.
Previously

After DMOZ

TV Anomalies - DMOZ

Next
- Improving the categorizer even more with Machine Learning / Neural Networks
- Vectorizing TopN key terms
Improving the Categorizer
By katiec089
Improving the Categorizer
- 367