The method should be able
to leverage large amounts of available data;
it should utilize a task, which can be optimized independently, leading to further downstream improvements;
it should rely on a single model that can be used as-is for most NLP tasks;
discriminative fine-tuning, that fine-tunes lower layers to a lesser extent than higher layers in order to retain
the knowledge acquired through language modeling
Backprop Through Time for Text Classification