Improving the Categorizer
Overview
- TF-IDF Calculation
- Categorization update
- New Seeds
TF-IDF
- The term frequency counter now uses a TF-IDF calculation to produce the Top N key terms
- Scores are based on the number of times each term appears in autocomplete results
- Format of Top N key terms file remains the same
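A minimal sketch of how Top N key terms could be ranked with TF-IDF. It assumes each category's autocomplete results are treated as one document, that tokens are split on whitespace, and that the standard tf * log(N / df) weighting is used. The function name `top_n_terms` and the sample data are hypothetical; the actual script may tokenize or weight terms differently.

```python
import math
from collections import Counter


def top_n_terms(docs, n=3):
    """Return the n highest-scoring TF-IDF terms for each document.

    docs maps a label (assumed here to be a category) to a list of
    autocomplete strings.
    """
    # Tokenize: lowercase and split on whitespace (an assumption).
    tokens = {
        label: " ".join(lines).lower().split()
        for label, lines in docs.items()
    }

    # Document frequency: the number of documents each term occurs in.
    df = Counter()
    for words in tokens.values():
        df.update(set(words))

    total_docs = len(tokens)
    result = {}
    for label, words in tokens.items():
        counts = Counter(words)
        scores = {
            term: (count / len(words)) * math.log(total_docs / df[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[label] = ranked[:n]
    return result


# Hypothetical autocomplete results, for illustration only.
sample = {
    "Travel": ["cheap hotel deals", "hotel near airport"],
    "Sports": ["football team scores", "team near me"],
}
top_terms = top_n_terms(sample, n=1)
```

A term such as "near", which appears in every category, gets an IDF of zero and drops out of the top of the ranking, which is the main difference from a plain term frequency count.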

Categorization Update
Two Versions
categorizeNewIO.py
- Use on training data
- Calculates accuracy / precision / recall
- Each autocomplete entry is annotated with its known category
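The metrics above can be sketched as follows, assuming the known and predicted categories are available as two parallel lists and that precision and recall are reported per category. The function name `evaluate` and the sample labels are hypothetical and are not taken from categorizeNewIO.py.

```python
def evaluate(true_labels, predicted_labels, category):
    """Return (accuracy, precision, recall) for one category."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard against empty input and empty denominators.
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical labels, for illustration only.
known = ["Travel", "Travel", "Sports", "Music"]
predicted = ["Travel", "Sports", "Sports", "Travel"]
metrics = evaluate(known, predicted, "Travel")
```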


categorize.py [old]
- Use on test data
- Data of unknown categories
- Shows the number of items categorized, based on the Top N terms used
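One way the count of categorized items could be produced is sketched below. It assumes an item is assigned to the category whose Top N terms it shares the most words with, and stays uncategorized when it matches none. This matching rule, the function names, and the sample data are assumptions for illustration; categorize.py may work differently.

```python
def categorize(items, top_terms):
    """Pair each item with its best-matching category, or None."""
    assigned = []
    for item in items:
        words = set(item.lower().split())
        best, best_hits = None, 0
        for category, terms in top_terms.items():
            hits = len(words & set(terms))
            if hits > best_hits:
                best, best_hits = category, hits
        assigned.append((item, best))
    return assigned


def count_categorized(items, top_terms):
    """Count the items that matched at least one key term."""
    return sum(1 for _, cat in categorize(items, top_terms) if cat is not None)


# Hypothetical Top N terms and test items, for illustration only.
top_terms = {"Travel": ["hotel", "flight"], "Sports": ["team", "score"]}
items = ["cheap hotel deals", "best team score", "weather today"]
categorized = count_categorized(items, top_terms)
```

Running this with different values of N shows how coverage changes as more key terms are used.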

Knowledge Graph
Script for verifying whether a category exists.
Current Version
- Queries the KG API
- Stores the types of the query
- Does not limit results to a particular type
Issues
- Input parsing - encoding issues.
- Keep all of the phrases?
- Should we limit phrases?
- Searching
Example
Lists each title, with the categories returned for it underneath.
No data is lost in this step. Other fields, such as the relevancy score, are still present, but the list is not sorted by that score.
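The listing described above could look like the sketch below. It assumes the KG API is the Google Knowledge Graph Search API, whose JSON responses carry an `itemListElement` array with a `result` object (`name`, `@type`) and a `resultScore` per entry. The function names and the sample response are hypothetical, and the score value is made up for illustration.

```python
def list_types(response):
    """Collect title, types, and score for each result, in response order."""
    rows = []
    for element in response.get("itemListElement", []):
        result = element.get("result", {})
        rows.append({
            "title": result.get("name", ""),
            "types": result.get("@type", []),
            # The score is kept but not used for sorting.
            "score": element.get("resultScore"),
        })
    return rows


def format_rows(rows):
    """Render each title with its types listed underneath."""
    lines = []
    for row in rows:
        lines.append(row["title"])
        lines.extend("  - " + t for t in row["types"])
    return "\n".join(lines)


# Hypothetical response fragment, for illustration only.
sample_response = {
    "itemListElement": [
        {
            "result": {"name": "Example Hotel", "@type": ["Thing", "Hotel"]},
            "resultScore": 12.5,
        }
    ]
}
rows = list_types(sample_response)
listing = format_rows(rows)
```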

Just v1!
- As before, the URL can be changed and adapted to a different service if needed in the future.
- Result limits and types can be controlled via the script.
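A sketch of how the URL, limit, and types could be kept configurable. The default endpoint and the `query`, `key`, `limit`, and `types` parameters assume the Google Knowledge Graph Search API v1; the function name `build_request_url` and the placeholder key are hypothetical.

```python
from urllib.parse import urlencode

# Assumed default: the Google Knowledge Graph Search API v1 endpoint.
DEFAULT_URL = "https://kgsearch.googleapis.com/v1/entities:search"


def build_request_url(query, api_key, limit=10, types=(), base_url=DEFAULT_URL):
    """Build the search URL; pass base_url to target a different service."""
    params = [("query", query), ("key", api_key), ("limit", limit)]
    # Each type is sent as its own repeated "types" parameter.
    params.extend(("types", t) for t in types)
    return base_url + "?" + urlencode(params)


# "KEY" is a placeholder, not a real credential.
url = build_request_url("Hilton", "KEY", limit=5, types=["Hotel"])
```

Leaving `types` empty matches the current behavior of not restricting results to a particular type.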


Sample Response received

Gathering Other Seed Data
Process:
- Search for related category types at schema.org
- Search for websites that list items of the chosen category
Seed Data
- Travel → Hotel
- Sports → Sports Team
- 3C Products → Computer Store
- Politics → Politicians
- Music → Music Recording (Song Titles)
- Clothing → Clothing Store
- Arts & Culture → Video Games
- TV & Entertainment → TV Series
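The mapping above could be kept as a small lookup table, sketched here. The identifiers are schema.org type names assumed to correspond to the labels in the list; in particular, schema.org has no dedicated politician type, so `Person` is used as a stand-in. The names `SEED_TYPES` and `seed_type_for` are hypothetical.

```python
# Category -> schema.org type name (assumed spellings of the labels above).
SEED_TYPES = {
    "Travel": "Hotel",
    "Sports": "SportsTeam",
    "3C Products": "ComputerStore",
    "Politics": "Person",  # assumption: no specific politician type exists
    "Music": "MusicRecording",
    "Clothing": "ClothingStore",
    "Arts & Culture": "VideoGame",
    "TV & Entertainment": "TVSeries",
}


def seed_type_for(category):
    """Return the seed type for a category, or None if none is defined."""
    return SEED_TYPES.get(category)
```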