1. Dataset acquisition: GENIA corpus

2. Baseline using TF-IDF: Assumptions

3. Number of terms to extract from the scored and ranked candidates: a function of corpus size and the number of annotated domain-relevant terms?

4. Switched from the smaller version of the dataset to the larger, original one: observations

5. Tentative plan of action (PoA)


1. GENIA corpus: 2000 MEDLINE abstracts, provided in IOB2 (sequence-tagged) format.

2. Categories: Protein, DNA, Cell Type, Cell Line

3. Need for general corpus?
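To make the IOB2 format concrete, here is a minimal sketch of how annotated terms could be pulled out of the tagged files. It assumes one token per line with the tag in the last whitespace-separated column and a blank line between sentences; the function name `extract_terms`, the sample tokens, and the exact tag spellings (`B-DNA`, `I-cell_type`, ...) are illustrative assumptions, not taken from the corpus files.

```python
from typing import Iterable, List, Optional, Tuple


def extract_terms(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Collect (term, category) pairs from IOB2-tagged lines.

    Assumed layout: "<token> ... <tag>" per line, blank line = sentence break.
    """
    terms: List[Tuple[str, str]] = []
    tokens: List[str] = []
    category: Optional[str] = None

    def flush() -> None:
        # Close the term currently being built, if any.
        nonlocal tokens, category
        if tokens:
            terms.append((" ".join(tokens), category))
        tokens, category = [], None

    for line in lines:
        parts = line.split()
        if len(parts) < 2:  # blank or malformed line ends the current term
            flush()
            continue
        token, tag = parts[0], parts[-1]
        if tag.startswith("B-"):  # B- always starts a new term in IOB2
            flush()
            tokens, category = [token], tag[2:]
        elif tag.startswith("I-") and tokens and tag[2:] == category:
            tokens.append(token)  # continue the open term
        else:  # "O" tag, or a stray I- with no open term
            flush()
    flush()
    return terms


# Hypothetical sample in the assumed layout.
sample = [
    "IL-2\tB-DNA",
    "gene\tI-DNA",
    "expression\tO",
    "in\tO",
    "T\tB-cell_type",
    "cells\tI-cell_type",
    "",
    "NF-kappa\tB-protein",
    "B\tI-protein",
]
gold_terms = extract_terms(sample)
```

The resulting list of gold terms is what the ranked candidates would be evaluated against.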





1. Baseline using TF-IDF: Observations
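For reference, a rough sketch of what a TF-IDF baseline of this kind might look like. Everything here is an assumption made for illustration: candidates are lowercased whitespace tokens (unigrams only), term frequency is length-normalised per abstract, IDF is the unsmoothed log(N / df), and per-abstract scores are summed into one corpus-level score. The cutoff `top_k` stands in for the open question of how many ranked terms to extract.

```python
import math
from collections import Counter
from typing import List, Tuple


def tfidf_rank(docs: List[str], top_k: int) -> List[Tuple[str, float]]:
    """Rank unigram candidates by summed TF-IDF and keep the top_k."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of abstracts containing each token.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Sum the per-abstract TF-IDF of each token over the whole corpus.
    scores = Counter()
    for tokens in tokenized:
        tf = Counter(tokens)
        for term, count in tf.items():
            idf = math.log(n_docs / df[term])
            scores[term] += (count / len(tokens)) * idf

    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


# Toy input, not GENIA data.
toy_docs = ["the il-2 gene", "the t cells", "the il-2 receptor"]
ranked = tfidf_rank(toy_docs, top_k=6)
```

With this weighting, a token that occurs in every abstract gets an IDF of zero and sinks to the bottom of the ranking, which is the behaviour the baseline relies on to filter out general vocabulary.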









