Improving the Categorizer

Overview

  • TF-IDF Calculation
  • Categorization Update
  • New Seeds

TF-IDF

  • Term frequency counter now uses a TF-IDF calculation to produce the Top N key terms
    • Based on the number of times each term appears in the autocomplete results
    • Format of the Top N key terms file remains the same
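A minimal sketch of how Top N key terms could be picked with TF-IDF. It assumes each category's autocomplete results are treated as one document and that terms are split on whitespace; the function name `top_n_terms` and the input layout are hypothetical, not taken from the actual script.

```python
import math
from collections import Counter


def top_n_terms(docs, n=10):
    """Return the n highest-scoring TF-IDF terms per category.

    docs: dict mapping a category name to a list of autocomplete
          result strings (assumed layout, for illustration only).
    """
    # Term counts per category: each category is one "document".
    counts_by_cat = {
        cat: Counter(tok for line in lines for tok in line.lower().split())
        for cat, lines in docs.items()
    }

    # Document frequency: in how many categories does each term appear?
    doc_freq = Counter()
    for counts in counts_by_cat.values():
        doc_freq.update(counts.keys())

    n_docs = len(docs)
    result = {}
    for cat, counts in counts_by_cat.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by score (descending), then alphabetically for stable ties.
        result[cat] = sorted(scores, key=lambda t: (-scores[t], t))[:n]
    return result
```

Terms that appear in every category get an IDF of zero, so generic words drop to the bottom of each list even when their raw frequency is high.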

Categorization Update

Two Versions

categorizeNewIO.py

  • Use on training data
  • Calculates accuracy / precision / recall
  • Each autocomplete entry is annotated with its known category
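The metrics above can be sketched as follows. This is an illustration only, assuming the known and predicted categories are available as two parallel lists; the function name `evaluate` is hypothetical and is not the interface of categorizeNewIO.py.

```python
from collections import Counter


def evaluate(known, predicted):
    """Compute overall accuracy plus per-category precision and recall.

    known, predicted: parallel lists of category labels (assumed layout).
    """
    if len(known) != len(predicted):
        raise ValueError("known and predicted must have the same length")

    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for k, p in zip(known, predicted):
        if k == p:
            true_pos[k] += 1
        else:
            false_pos[p] += 1  # predicted p, but it was really k
            false_neg[k] += 1  # missed an item that belonged to k

    accuracy = sum(true_pos.values()) / len(known) if known else 0.0

    per_category = {}
    for cat in set(known) | set(predicted):
        p_den = true_pos[cat] + false_pos[cat]
        r_den = true_pos[cat] + false_neg[cat]
        per_category[cat] = {
            "precision": true_pos[cat] / p_den if p_den else 0.0,
            "recall": true_pos[cat] / r_den if r_den else 0.0,
        }
    return accuracy, per_category
```

Reporting precision and recall per category makes it easier to see which seed lists are pulling in items that belong elsewhere.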

categorize.py [old]

  • Use on test data
    • Data of unknown categories
  • Shows the number of items categorized, based on the Top N terms used
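A rough sketch of categorizing unknown items against the Top N terms and counting how many were categorized. The matching rule (pick the category whose terms overlap the item most) and the names `categorize` and `categorized_count` are assumptions for illustration; the real categorize.py may work differently.

```python
def categorize(items, top_terms):
    """Assign each item to the category whose Top N terms it matches most.

    items:     list of autocomplete strings of unknown category.
    top_terms: dict mapping category -> iterable of Top N terms.
    Items that match no term are mapped to None.
    """
    term_sets = {cat: set(terms) for cat, terms in top_terms.items()}
    assignments = {}
    for item in items:
        tokens = set(item.lower().split())
        best_cat, best_hits = None, 0
        # Sorted iteration keeps tie-breaking deterministic.
        for cat in sorted(term_sets):
            hits = len(tokens & term_sets[cat])
            if hits > best_hits:
                best_cat, best_hits = cat, hits
        assignments[item] = best_cat
    return assignments


def categorized_count(assignments):
    """Number of items that received some category."""
    return sum(cat is not None for cat in assignments.values())
```

Running this for several values of N would show how the count of categorized items changes as more key terms are used.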

Knowledge Graph

Script for verifying whether a category exists.

Current Version

  • Queries the KG API
  • Stores the types returned for each query
  • Results are not limited to a particular type
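A hedged sketch of what such a lookup could look like. It assumes the KG API is the Google Knowledge Graph Search API (`entities:search` endpoint with `query`, `key`, `limit`, and `types` parameters); the function names are hypothetical and this is not the script described on the slide.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; pass base_url to point at a different service.
KG_URL = "https://kgsearch.googleapis.com/v1/entities:search"


def build_url(query, api_key, limit=10, types=None, base_url=KG_URL):
    """Build the request URL; `types` is optional, so results are unrestricted by default."""
    params = [("query", query), ("key", api_key), ("limit", limit)]
    for type_name in types or []:
        params.append(("types", type_name))
    return base_url + "?" + urlencode(params)


def extract_types(response):
    """Map each returned entity name to the list of types reported for it."""
    found = {}
    for element in response.get("itemListElement", []):
        result = element.get("result", {})
        name = result.get("name")
        if name is None:
            continue
        types = result.get("@type", [])
        if isinstance(types, str):
            types = [types]
        stored = found.setdefault(name, [])
        for type_name in types:
            if type_name not in stored:
                stored.append(type_name)
    return found


def fetch_types(query, api_key, **kwargs):
    """Query the service and return {entity name: [types]} (needs network access)."""
    with urlopen(build_url(query, api_key, **kwargs)) as resp:
        return extract_types(json.load(resp))
```

A phrase can then be treated as a verified category when `fetch_types` returns at least one entity for it; whether that threshold is appropriate is a design choice left open here.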

Issues

  • Input parsing - encoding issues.
  • Keep all of the phrases?
  • Should we limit phrases?
  • Searching

Example

Lists each title, with the categories returned for it underneath.

No data is lost in this view. Other fields, such as the relevancy score, are still present, but the list is not sorted by them.

Just v1!

  • As before, the URL can be changed to point at a different service if needed in the future.
  • Limits and types can be controlled via the script.

Sample Response received

Gathering Other Seed Data

Process:

  1. Search schema.org for types related to the category
  2. Find websites that list items of the chosen type

Seed Data

  1. Travel → Hotel
  2. Sports → Sports Team
  3. 3C Products → Computer Store
  4. Politics → Politicians
  5. Music → Music Recording (Song Titles)
  6. Clothing → Clothing Store
  7. Arts & Culture → Video Games
  8. TV & Entertainment → TV Series

KG Update

By katiec089
