Models (12 total: 6 LMs, each evaluated with 2 inference methods):
GPT-2 Large (774M parameters)
GPT-J (6B parameters)
Fairseq LMs (6.7B & 13B parameters)
GPT-3 (175B parameters)
MetaICL (Meta-trained GPT-2 Large)
Inference Methods:
Direct Method: Predict output directly from input and context.
Channel Method: Predict input given output and context.
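The two inference methods above can be sketched as scoring functions over candidate labels. This is a minimal illustration, not the paper's code: `lm_logprob` is a hypothetical stand-in for a real LM call that returns the log-probability of a continuation given a context, and the newline-delimited prompt format is an assumption.

```python
def lm_logprob(context, continuation):
    # Placeholder for a real LM scoring call (summed token log-probs);
    # a trivial length-based stand-in so the sketch runs end to end.
    return -float(len(context) + len(continuation))

def direct_score(demos, x, label):
    # Direct: condition on demonstrations + test input, score the label text.
    context = "".join(f"{dx}\n{dy}\n\n" for dx, dy in demos) + f"{x}\n"
    return lm_logprob(context, label)

def channel_score(demos, x, label):
    # Channel: condition on demonstrations (output-first) + label, score the input.
    context = "".join(f"{dy}\n{dx}\n\n" for dx, dy in demos) + f"{label}\n"
    return lm_logprob(context, x)

def predict(demos, x, labels, score_fn):
    # Pick the label with the highest log-probability under the chosen method.
    return max(labels, key=lambda lb: score_fn(demos, x, lb))
```

Both methods reduce to ranking the same fixed label set; they differ only in which side of the input/output pair the model is asked to score.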
Evaluation Data:
26 Datasets covering:
Sentiment analysis, paraphrase detection, hate speech detection, etc.
All 26 are classification or multi-choice tasks
Experimental Details:
Demonstrations: K = 16 examples per prompt.
Repetitions: each experiment run with 5 different random seeds.
Exception: Fairseq 13B and GPT-3 evaluated on a subset of 6 datasets with 3 seeds.
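The prompt construction above (K = 16 demonstrations per prompt, resampled per seed) can be sketched as follows. The newline-delimited template and uniform sampling are assumptions for illustration, not the paper's exact format.

```python
import random

K = 16  # demonstrations per prompt, as in the experimental setup

def build_prompt(train_pairs, test_input, seed):
    # One of the random seeds controls which K demonstrations are sampled.
    rng = random.Random(seed)
    demos = rng.sample(train_pairs, K)
    lines = []
    for x, y in demos:
        lines.append(x)  # demonstration input
        lines.append(y)  # demonstration label
    lines.append(test_input)  # the model predicts the label for this line
    return "\n".join(lines)
```

Re-running with each seed yields a different demonstration sample, which is what the repeated runs average over.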
Metrics:
Macro-F1 for classification tasks
Accuracy for multi-choice tasks
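Macro-F1, as used for the classification tasks above, averages per-class F1 scores with equal weight per class, so minority classes count as much as majority ones. A toy implementation for concreteness, not the paper's evaluation code:

```python
def macro_f1(y_true, y_pred):
    # Per-class F1, then an unweighted mean over all observed classes.
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

For multi-choice tasks plain accuracy suffices, since each example has exactly one correct option and class imbalance is not an issue in the same way.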