LLM Benchmark & Safety

Evaluating Language Model Capabilities: Benchmarks

MMLU

Multiple-choice format, covering 57 tasks including elementary mathematics, US history, computer science, law, and more.

Arena Leaderboard

Uses a stronger language model to judge whether responses from other models are correct.

MT-Bench

Only 80 questions, no standard answers — scored by GPT-4.

  • Potential bias: language models tend to favor longer answers.

arena-hard

An improved version of MT-Bench.

Big Bench

Contains 204 tasks spanning linguistics, mathematics, commonsense reasoning, biology, physics, social biases, and more. Task difficulty is designed to exceed the known capabilities of current models.

Emoji Movie:

👧🐟🐠🐡 - Finding Nemo 👦👓⚡️ - Harry Potter

Checkmate In One Move:

PLAIN
1. e4 e6 2. Ke2 d5 3. e5 c5 4. f4 Nc6              
2. Nf3 Qb6 6. g4 Bd7 7. h4 Nge7 8. c3 Ng6          
3. d4 cxd4 10. cxd4 Be7 11. Kf2 O-O 12. h5 Nh8          ----------->   Bxh7#
4. Be3 Qxb2+ 14. Kg3 Qxa1 15. Bd3 Qxa2 16. Rh2 Qa1
5. Qc2 Nb4 18.               

ASCII Word Recognition:

BENCH

Needle in a Haystack — Testing Long-Context Comprehension

A random fact (the “needle”) is inserted into a long text context (the “haystack”), and the model is tested on whether it can correctly retrieve that information.

LLM Safety

LLMs Can Say Wrong Things (Hallucination)

Fact-checking: Gemini searches the web for relevant content to back up its responses.

Gemini verifies its output via Google Search

Model Bias and Stereotypes

GPT thinks Asian people are suited for financial analyst roles

GPT thinks Asian people are suited for financial analyst roles

Gender stereotypes in occupations

Methods to mitigate bias : adjusting input data, the training process, the inference process, and post-processing.

Eliminating bias at each stage

Was This Written by an LLM?

Looking for differences between human-written and AI-generated text.

LLMs Can Be Fooled — Prompt Hacking

Jailbreaking — attacking the model itself to make it say things it shouldn’t.

  • Human analogy: murder and arson

Prompt Injection — attacking LLM-based applications to make them do inappropriate things at inappropriate times.

  • Human analogy: suddenly bursting into song in the middle of class

Jailbreaking

DAN — Do Anything Now

https://arxiv.org/abs/2308.03825

Prompt Injection

Prompt Injection Competition — making the language model forget its assigned role and say “I have been PWNED.”

License

Author: Aspi-Rin

Link: https://blog.aspi-rin.top/en/posts/llm-benchmark-safety/

License: CC BY 4.0

This work is licensed under a Creative Commons Attribution 4.0 International License. Please attribute the source.