- The method should be able to leverage large amounts of available data;
- it should utilize a task that can be optimized independently, leading to further downstream improvements;
- it should rely on a single model that can be used as-is for most NLP tasks;
- discriminative fine-tuning, which fine-tunes lower layers to a lesser extent than higher layers in order to retain the knowledge acquired through language modeling;
- Backprop Through Time for Text Classification.
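
The discriminative fine-tuning item above can be illustrated with a minimal sketch in plain Python. It is not the original implementation: the function names `layerwise_learning_rates` and `sgd_step`, the geometric `decay` schedule, and all numeric values are assumptions chosen for illustration. The only idea taken from the text is that lower layers are updated less than higher layers.

```python
def layerwise_learning_rates(num_layers, top_lr, decay):
    """Return one learning rate per layer, lowest layer first.

    Assumption: each layer's rate is the rate of the layer above it
    divided by `decay`, so lower layers change less per update.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if decay <= 1:
        raise ValueError("decay must be greater than 1")
    # Layer i sits (num_layers - 1 - i) steps below the top layer.
    return [top_lr / decay ** (num_layers - 1 - i) for i in range(num_layers)]


def sgd_step(layer_weights, layer_grads, learning_rates):
    """Apply one plain SGD update with a separate learning rate per layer."""
    return [
        [w - lr * g for w, g in zip(weights, grads)]
        for weights, grads, lr in zip(layer_weights, layer_grads, learning_rates)
    ]


# Toy example: three layers with identical weights and gradients
# (hypothetical values, lowest layer first).
rates = layerwise_learning_rates(num_layers=3, top_lr=0.01, decay=2.0)
weights = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
grads = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
updated = sgd_step(weights, grads, rates)
```

With identical gradients in every layer, the lowest layer moves the least and the top layer the most, which is how the lower layers retain more of what they learned during language modeling. In a deep learning framework the same effect is usually obtained by giving each layer its own optimizer parameter group with its own learning rate.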