Calibrate Before Use: Improving Few-Shot Performance of Language Models#

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, Sameer Singh
Paper
GitHub

Tags: [prompt_engineering, llm]

Abstract#

  • Shows that few-shot prompting of large language models like GPT-3 can be highly unstable

  • Demonstrates that this instability arises from language models’ bias towards predicting certain answers

  • Mitigates these biases by first estimating the model’s bias towards each answer: the model is given the training prompt plus a content-free test input (e.g. “N/A”) and asked for its prediction

  • Then fits calibration parameters that make the prediction for this content-free input uniform across answers (see the sketch after this list)
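
A minimal NumPy sketch of this calibration step. The probabilities are made up, and the calibrate function follows the paper’s diagonal setting W = diag(p_cf)⁻¹, b = 0 (plain renormalization is used here in place of the softmax; the argmax prediction is the same):

```python
import numpy as np

def calibrate(p, p_cf):
    """Contextual calibration: divide out the model's bias p_cf, measured
    on a content-free input, then renormalize (W = diag(p_cf)^-1, b = 0)."""
    q = p / p_cf                # divide out the content-free bias
    return q / q.sum()          # renormalize to a distribution

# Made-up numbers: the model favors "Positive" even for "N/A".
p_cf = np.array([0.7, 0.3])     # P(Positive), P(Negative) for "N/A"
print(calibrate(p_cf, p_cf))    # -> [0.5 0.5]: content-free input is now uniform
print(calibrate(np.array([0.6, 0.4]), p_cf))  # -> ~[0.39 0.61]: bias divided out
```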

Introduction#

Few-shot learning#

  • Ability to learn tasks with limited examples

  • Few-shot “in-context” learning shows that a model can learn a task from examples in the prompt alone, without any parameter updates

  • Example, where the model should complete the final line with “Positive”:

Input: Subpar acting.  Sentiment: Negative
Input: Beautiful film. Sentiment: Positive
Input: Amazing.        Sentiment:
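
A sketch of how such a prompt can be scored, using GPT-2 via Hugging Face transformers as a small stand-in for GPT-3 (the model choice and the label-scoring approach are illustrative, not from the paper):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = (
    "Input: Subpar acting.  Sentiment: Negative\n"
    "Input: Beautiful film. Sentiment: Positive\n"
    "Input: Amazing.        Sentiment:"
)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # logits for the next token
probs = logits.softmax(dim=-1)

# Compare the model's probability for each answer's first subword token
# (the leading space matters for GPT-2's BPE vocabulary).
for label in (" Negative", " Positive"):
    token_id = tokenizer.encode(label)[0]
    print(label, probs[token_id].item())
```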

Instability#

  • Shows that GPT-3’s accuracy varies widely with the prompt format, the choice of training examples, and even their ordering

  • Causes of instability (the sketch after this list illustrates one of them):

    • Majority label bias: the model favors answers that appear frequently among the prompt’s training examples

    • Recency bias: the model favors the answer of the training example(s) closest to the end of the prompt

    • Common token bias: the model favors answers whose tokens are common in its pretraining data
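
A minimal sketch probing recency bias, again with GPT-2 as a stand-in (the prob_positive helper and the prompts are illustrative): the same two training examples are given in opposite orders, and a shift of P(Positive) toward whichever label comes last would illustrate the bias.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def prob_positive(prompt):
    """P(Positive) for the next token, normalized over the two labels."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    probs = logits.softmax(dim=-1)
    pos = probs[tokenizer.encode(" Positive")[0]]
    neg = probs[tokenizer.encode(" Negative")[0]]
    return (pos / (pos + neg)).item()

examples = [
    "Input: Subpar acting.  Sentiment: Negative",
    "Input: Beautiful film. Sentiment: Positive",
]
test = "Input: Amazing.        Sentiment:"

# Same examples, opposite orders: compare P(Positive) for the same test input.
for order in (examples, examples[::-1]):
    prompt = "\n".join(order + [test])
    print("last example:", order[-1], "-> P(Positive) =",
          round(prob_positive(prompt), 3))
```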