1. Dataset acquisition: GENIA corpus
2. Baseline using TF-IDF: Assumptions
3. Number of terms to extract from the scored and ranked candidates: a function of corpus size and the number of annotated domain-relevant terms?
4. Switched from the smaller version of the dataset to the larger, original one: observations
5. Tentative plan of action
1. GENIA corpus: 2,000 MEDLINE abstracts in IOB2 (sequence-tagged) format.
2. Categories: Protein, DNA, Cell Type, Cell Line
3. Need for general corpus?
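A minimal sketch of how the annotated terms could be read out of IOB2-tagged lines, so they can serve as a gold list for evaluation. The function name `extract_terms`, the whitespace-separated `token tag` line layout, and the tag labels in the sample (`B-DNA`, `I-cell_type`) are assumptions for illustration, not taken from the actual files.

```python
def extract_terms(lines):
    """Collect (term, category) pairs from IOB2-tagged token lines.

    Assumes each non-empty line is 'token<whitespace>tag', where the tag is
    'B-<category>', 'I-<category>' or 'O'; blank lines separate sentences.
    """
    terms = []
    tokens, label = [], None
    for line in lines:
        parts = line.split()
        tag = parts[-1] if len(parts) >= 2 else "O"
        # Anything other than an I- tag closes the term being built.
        if tokens and not tag.startswith("I-"):
            terms.append((" ".join(tokens), label))
            tokens, label = [], None
        if tag.startswith("B-"):
            tokens, label = [parts[0]], tag[2:]
        elif tag.startswith("I-") and tokens:
            tokens.append(parts[0])
    if tokens:
        terms.append((" ".join(tokens), label))
    return terms


# Hypothetical sample in the assumed layout.
sample = [
    "IL-2\tB-DNA",
    "gene\tI-DNA",
    "expression\tO",
    "in\tO",
    "T\tB-cell_type",
    "cells\tI-cell_type",
    "",
]
print(extract_terms(sample))
```

Multi-token terms are joined with a single space; an `I-` tag with no preceding `B-` tag is skipped rather than treated as the start of a term.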
1. Baseline using TF-IDF: Observations
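For reference, one possible form of a TF-IDF baseline is sketched below; it is not necessarily the exact scoring used here. It assumes single-token candidates, relative term frequency, a natural-log IDF computed over the abstracts themselves, and summation of per-document scores into one corpus-level score. The names `tfidf_rank` and `top_k` are hypothetical.

```python
import math
from collections import Counter


def tfidf_rank(docs, top_k):
    """Rank candidate terms by TF-IDF summed over all documents.

    docs  : list of token lists, one per abstract (assumed pre-tokenised)
    top_k : number of top-ranked candidates to keep (hypothetical cut-off)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = Counter()
    for doc in docs:
        tf = Counter(doc)
        for term, count in tf.items():
            idf = math.log(n_docs / df[term])
            scores[term] += (count / len(doc)) * idf
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


# Toy documents, made up for illustration only.
docs = [
    ["protein", "binds", "dna"],
    ["protein", "kinase", "kinase"],
    ["protein", "cell"],
]
print(tfidf_rank(docs, top_k=2))
```

With this formulation a term that occurs in every document gets an IDF of zero, so it scores zero however frequent it is.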
By Anjali Bhavan