Calibrate Before Use: Improving Few-Shot Performance of Language Models#
Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, Sameer Singh
Paper
GitHub
Tags: [prompt_engineering, llm]
Abstract#
Shows that few-shot prompting for large language models like GPT-3 can be highly unstable
Demonstrates that this instability arises from language models' bias towards predicting certain answers
Mitigates these biases by first estimating the model's bias towards each answer: ask for its prediction when given the training prompt and a content-free test input (e.g. “N/A”)
Then fits calibration parameters that make the prediction for this content-free input uniform across answers (a sketch of this procedure follows)
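A minimal sketch of this contextual calibration procedure, assuming a hypothetical helper `label_probs(prompt)` that returns the model's normalized probabilities over the answer labels. Following the paper, the calibration matrix W is diagonal (the inverse of the content-free probabilities) and the bias vector b is zero:

```python
import numpy as np

def contextual_calibration(label_probs, prompt_prefix, content_free_inputs=("N/A",)):
    """Estimate the model's per-answer bias with content-free inputs,
    then return a classifier whose prediction on those inputs is uniform."""
    # Average the model's label probabilities over the content-free inputs.
    p_cf = np.mean([label_probs(prompt_prefix + cf) for cf in content_free_inputs], axis=0)
    p_cf = p_cf / p_cf.sum()

    # Paper's choice: W = diag(p_cf)^-1, b = 0. Applying W to p_cf itself
    # gives the all-ones vector, i.e. a uniform distribution after renormalizing.
    W = np.diag(1.0 / p_cf)

    def calibrated_probs(test_input):
        p = np.asarray(label_probs(prompt_prefix + test_input))
        q = W @ p           # reweight each answer by its inverse bias
        return q / q.sum()  # renormalize to a probability distribution

    return calibrated_probs
```

For example, if the content-free probabilities are (0.7, 0.3) for (Negative, Positive), a raw prediction of (0.6, 0.4) is reweighted to roughly (0.39, 0.61), flipping the predicted label once the bias is removed.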
Introduction#
Few-shot learning#
Ability to learn tasks with limited examples
Few-shot “in-context” learning shows that a model can learn a task from examples placed in its prompt, without any parameter updates (a prompt-construction sketch follows the example below)
Example:
Input: Subpar acting. Sentiment: Negative
Input: Beautiful film. Sentiment: Positive
Input: Amazing. Sentiment:
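A small sketch of how such a prompt can be assembled; `build_prompt` is a hypothetical helper, and the template mirrors the sentiment-classification example above:

```python
def build_prompt(train_examples, test_input):
    """Concatenate labeled demonstrations with the unlabeled test input,
    using the 'Input: ... Sentiment: ...' template from the example above."""
    lines = [f"Input: {text} Sentiment: {label}" for text, label in train_examples]
    lines.append(f"Input: {test_input} Sentiment:")
    return "\n".join(lines)

demos = [("Subpar acting.", "Negative"), ("Beautiful film.", "Positive")]
print(build_prompt(demos, "Amazing."))  # reproduces the prompt shown above
```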
Instability#
Shows that GPT-3's accuracy can vary wildly across different choices of prompt format, training examples, and example ordering
Causes of instability (a sketch for measuring the ordering effect follows this list):
Majority label bias: the model favors answers that appear most frequently in the prompt
Recency bias: the model favors answers that appear near the end of the prompt
Common token bias: the model favors answers whose tokens are common in its pre-training data
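One way to see the ordering instability is to evaluate accuracy over every permutation of the demonstrations. A rough sketch, assuming a hypothetical `classify(train_examples, test_input)` helper that prompts the model and returns its predicted label (the factorial cost limits this to a handful of demonstrations):

```python
import itertools

def accuracy_over_orderings(classify, train_examples, test_set):
    """Report the accuracy spread as the demonstration order varies,
    one of the instabilities the paper attributes to recency bias."""
    scores = []
    for perm in itertools.permutations(train_examples):
        correct = sum(classify(list(perm), x) == y for x, y in test_set)
        scores.append(correct / len(test_set))
    return min(scores), max(scores)
```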