Evaluating Language Model Capabilities: Benchmarks
MMLU
Multiple-choice format, covering 57 tasks including elementary mathematics, US history, computer science, law, and more.

Arena Leaderboard

Uses a stronger language model to judge whether responses from other models are correct.
MT-Bench
Only 80 questions, no standard answers — scored by GPT-4.
- Potential bias: language models tend to favor longer answers.
arena-hard
An improved version of MT-Bench.
Big Bench
Contains 204 tasks spanning linguistics, mathematics, commonsense reasoning, biology, physics, social biases, and more. Task difficulty is designed to exceed the known capabilities of current models.
Emoji Movie:
👧🐟🐠🐡 - Finding Nemo 👦👓⚡️ - Harry Potter
Checkmate In One Move:
1. e4 e6 2. Ke2 d5 3. e5 c5 4. f4 Nc6
2. Nf3 Qb6 6. g4 Bd7 7. h4 Nge7 8. c3 Ng6
3. d4 cxd4 10. cxd4 Be7 11. Kf2 O-O 12. h5 Nh8 -----------> Bxh7#
4. Be3 Qxb2+ 14. Kg3 Qxa1 15. Bd3 Qxa2 16. Rh2 Qa1
5. Qc2 Nb4 18. ASCII Word Recognition:


Needle in a Haystack — Testing Long-Context Comprehension
A random fact (the “needle”) is inserted into a long text context (the “haystack”), and the model is tested on whether it can correctly retrieve that information.

LLM Safety
LLMs Can Say Wrong Things (Hallucination)
Fact-checking: Gemini searches the web for relevant content to back up its responses.

Model Bias and Stereotypes
GPT thinks Asian people are suited for financial analyst roles

Gender stereotypes in occupations

Methods to mitigate bias : adjusting input data, the training process, the inference process, and post-processing.

Was This Written by an LLM?
Looking for differences between human-written and AI-generated text.
- Estimating the Intrinsic Dimension (ID) of text — training a classifier to distinguish human- from AI-generated text
- Testing of Detection Tools for AI-Generated Text
— adding watermarks to language model outputs
- Methods such as LeftHash and SelfHash
- The mainstream approach is to slightly adjust the probability distribution during token generation so that certain words appear slightly more often than usual
- On the Reliability of Watermarks for Large Language Models
LLMs Can Be Fooled — Prompt Hacking
Jailbreaking — attacking the model itself to make it say things it shouldn’t.
- Human analogy: murder and arson
Prompt Injection — attacking LLM-based applications to make them do inappropriate things at inappropriate times.
- Human analogy: suddenly bursting into song in the middle of class
Jailbreaking
DAN — Do Anything Now
https://arxiv.org/abs/2308.03825

Using a language the model is less familiar with https://arxiv.org/abs/2307.02483
Giving contradictory instructions https://arxiv.org/abs/2307.02483

Trying to persuade the model (e.g., by making up a story)
Getting the model to reveal training data

Prompt Injection

Prompt Injection Competition — making the language model forget its assigned role and say “I have been PWNED.”

