LLM Benchmark & Safety

Evaluating Language Model Capabilities: Benchmarks

MMLU

Multiple-choice format, covering 57 tasks including elementary mathematics, US history, computer science, law, and more.

Arena Leaderboard

Uses a stronger language model to judge whether responses from other models are correct.

MT-Bench

Only 80 questions, no standard answers — scored by GPT-4.

Potential bias: language models tend to favor longer answers.

arena-hard

An improved version of MT-Bench.

Big Bench

Contains 204 tasks spanning linguistics, mathematics, commonsense reasoning, biology, physics, social biases, and more. Task difficulty is designed to exceed the known capabilities of current models.

Emoji Movie:

👧🐟🐠🐡 - Finding Nemo 👦👓⚡️ - Harry Potter

Checkmate In One Move:

PLAIN

e4 e6 2. Ke2 d5 3. e5 c5 4. f4 Nc6              
Nf3 Qb6 6. g4 Bd7 7. h4 Nge7 8. c3 Ng6          
d4 cxd4 10. cxd4 Be7 11. Kf2 O-O 12. h5 Nh8          ----------->   Bxh7#
Be3 Qxb2+ 14. Kg3 Qxa1 15. Bd3 Qxa2 16. Rh2 Qa1
Qc2 Nb4 18.               
e4 e6 2. Ke2 d5 3. e5 c5 4. f4 Nc6
Nf3 Qb6 6. g4 Bd7 7. h4 Nge7 8. c3 Ng6
d4 cxd4 10. cxd4 Be7 11. Kf2 O-O 12. h5 Nh8 -----------> Bxh7#
Be3 Qxb2+ 14. Kg3 Qxa1 15. Bd3 Qxa2 16. Rh2 Qa1
Qc2 Nb4 18.

ASCII Word Recognition:

Needle in a Haystack — Testing Long-Context Comprehension

A random fact (the “needle”) is inserted into a long text context (the “haystack”), and the model is tested on whether it can correctly retrieve that information.

LLM Safety

LLMs Can Say Wrong Things (Hallucination)

Fact-checking: Gemini searches the web for relevant content to back up its responses.

Gemini verifies its output via Google Search

Model Bias and Stereotypes

GPT thinks Asian people are suited for financial analyst roles

Gender stereotypes in occupations

Methods to mitigate bias : adjusting input data, the training process, the inference process, and post-processing.

Was This Written by an LLM?

Looking for differences between human-written and AI-generated text.

Estimating the Intrinsic Dimension (ID) of text — training a classifier to distinguish human- from AI-generated text
Testing of Detection Tools for AI-Generated Text — adding watermarks to language model outputs
- Methods such as LeftHash and SelfHash
- The mainstream approach is to slightly adjust the probability distribution during token generation so that certain words appear slightly more often than usual
On the Reliability of Watermarks for Large Language Models

LLMs Can Be Fooled — Prompt Hacking

Jailbreaking — attacking the model itself to make it say things it shouldn’t.

Human analogy: murder and arson

Prompt Injection — attacking LLM-based applications to make them do inappropriate things at inappropriate times.

Human analogy: suddenly bursting into song in the middle of class

Jailbreaking

DAN — Do Anything Now

https://arxiv.org/abs/2308.03825

Using a language the model is less familiar with https://arxiv.org/abs/2307.02483
Giving contradictory instructions https://arxiv.org/abs/2307.02483
Trying to persuade the model (e.g., by making up a story)
Getting the model to reveal training data

Prompt Injection

Prompt Injection Competition — making the language model forget its assigned role and say “I have been PWNED.”