📊 250 LLM Benchmarks & Evaluation Datasets

A curated list of 250+ benchmarks and datasets for evaluating Large Language Models (LLMs).


📂 Overview

  • Total entries: 250
  • Categories: Language & Reasoning, Safety, Retrieval, Multilingual, Conversation, Domain-Specific, Others
  • Format: Markdown list grouped by category
  • Use cases: Model evaluation, research, leaderboard building (see the loading sketch below)
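
Many of the datasets below are published on the Hugging Face Hub. As a rough starting point, a minimal evaluation loop looks like the sketch below; the dataset id, split, and column names are placeholders, since every benchmark in this list defines its own schema and official metric.

```python
# Minimal evaluation-loop sketch. The dataset id, split, and column names
# are placeholders -- check each benchmark's dataset page for its schema
# and official metric before using it.
from datasets import load_dataset

def exact_match_accuracy(predict, dataset) -> float:
    """Fraction of examples where the model's answer matches the gold answer."""
    correct = sum(
        predict(ex["question"]).strip() == ex["answer"].strip()
        for ex in dataset
    )
    return correct / len(dataset)

# ds = load_dataset("some-org/some-benchmark", split="test")  # placeholder id
# print(exact_match_accuracy(my_model_predict, ds))           # my_model_predict: str -> str
```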

🗂 Agents & tools use

ColBench

BFCL (Berkeley Function-Calling Leaderboard)

FlowBench

  • Description: A benchmark for workflow-guided planning that covers 51 different scenarios from 6 domains, with knowledge presented in text, code, and flowchart formats.
  • Paper: FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents https://arxiv.org/abs/2406.14884
  • Code: https://github.com/Justherozen/FlowBench
  • Dataset: see repo
  • Examples: 5313
  • License: see dataset page
  • Year: 2024

AutoTools

WorfBench

API-Bank

ToolLLM

ToolBench

AgentBench

MetaTool

WebArena

ToolQA

  • Description: A new dataset to evaluate the capabilities of LLMs in answering challenging questions with external tools. It offers two levels (easy/hard) across eight real-life scenarios.
  • Paper: ToolQA: A Dataset for LLM Question Answering with External Tools https://arxiv.org/abs/2306.13304
  • Code: https://github.com/night-chen/ToolQA
  • Dataset: see repo
  • Examples: not specified
  • License: see dataset page
  • Year: 2023

T-Eval

GAIA

MINT

WebShop

PaperBench

LLF-Bench

MultiAgentBench

CRMArena

CRMArena-Pro

FutureBench

SpreadsheetBench

  • Description: A challenging spreadsheet manipulation benchmark exclusively derived from real-world scenarios.
  • Paper: SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation https://arxiv.org/abs/2406.14991

TheAgentCompany

  • Description: An extensible benchmark for evaluating AI agents that interact with the world in ways similar to a digital worker: browsing the web, writing code, running programs, and communicating with coworkers.
  • Paper: TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks https://arxiv.org/abs/2412.14161

DSBench

BrowseComp

MLE-bench

🗂 Agents & tools use, domain-specific

TaxCalcBench

  • Description: A benchmark for determining models' abilities to calculate personal income tax returns given all of the necessary information.
  • Paper: TaxCalcBench: Evaluating Frontier Models on the Tax Calculation Task https://arxiv.org/abs/2507.16126

SciGym

🗂 Agents & tools use, language & reasoning

ACPBench

🗂 Bias & ethics

Global MMLU

  • Description: A translated version of MMLU that also includes cultural sensitivity annotations for a subset of the questions, with evaluation coverage across 42 languages (a loading sketch follows this entry).
  • Paper: Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation https://arxiv.org/abs/2412.03304
  • Code: No repository provided
  • Dataset: https://huggingface.co/datasets/CohereForAI/Global-MMLU
  • Examples: 601734
  • License: Apache-2.0 license
  • Year: 2024
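
Since Global MMLU ships on the Hugging Face Hub, per-language slices can be pulled directly. A small sketch, assuming language codes are used as config names and that each row carries a subject field (check the dataset card to confirm both):

```python
# Sketch: load one language subset of Global-MMLU and inspect subject
# coverage. The config name "fr" and the "subject" column are assumptions
# based on common MMLU-style conventions -- verify against the dataset card.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("CohereForAI/Global-MMLU", "fr", split="test")
print(Counter(ex["subject"] for ex in ds).most_common(5))
```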

Civil Comments

🗂 Bias & ethics, knowledge

SOCKET

🗂 Coding

CRUXEval (Code Reasoning, Understanding, and Execution Evaluation)

BigCodeBench

SWE-bench Verified

CrossCodeEval

EvalPlus

ClassEval

RepoBench

SWE-bench

Code Lingua

DS-1000

CodeXGLUE

APPS (Automated Programming Progress Standard)

MBPP (Mostly Basic Programming Problems)

HumanEval

LiveCodeBench

LiveCodeBench Pro

CodeElo

ResearchCodeBench

  • Description: A benchmark that evaluates LLMs’ ability to translate cutting-edge ML contributions from top 2024-2025 research papers into executable code.
  • Paper: ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code https://arxiv.org/abs/2506.02314

Spider 2.0

SciCode

🗂 Conversation & chatbots

MultiChallenge

MT-Bench-101

Chatbot Arena

MixEval

WildChat

Arena-Hard

MT-Bench

OpenDialKG

CoQA (Conversational Question Answering)

QuAC (Question Answering in Context)

SPC (Synthetic-Persona-Chat Dataset)

WildBench

SocialDial

🗂 Decision-making

Contrast Sets

  • Description: An annotation paradigm for NLP that helps close systematic gaps in test data. Contrast sets provide a local view of a model's decision boundary, which can be used to evaluate a model's true linguistic capabilities more accurately (a toy consistency check follows this entry).
  • Paper: Evaluating Models' Local Decision Boundaries via Contrast Sets https://arxiv.org/abs/2004.02709
  • Code: https://github.com/allenai/contrast-sets
  • Dataset: see repo
  • Examples: not specified
  • License: see dataset page
  • Year: 2020
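
To make the paradigm concrete, here is a toy consistency check in the spirit of the paper's contrast-consistency metric: a model scores on a bundle only if it is correct on the original instance and on every minimally perturbed contrast. All data below is invented for illustration.

```python
# Toy contrast-set bundle: an original instance plus minimal perturbations
# that flip the gold label. All text and labels are invented examples.
contrast_bundle = {
    "original": {"text": "The soup was hot and flavorful.", "label": "positive"},
    "contrasts": [
        {"text": "The soup was cold and flavorless.", "label": "negative"},
        {"text": "The soup was hot but flavorless.", "label": "negative"},
    ],
}

def contrast_consistency(predict, bundle) -> float:
    """1.0 only if the model is correct on the original AND every contrast."""
    items = [bundle["original"], *bundle["contrasts"]]
    return float(all(predict(it["text"]) == it["label"] for it in items))
```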

🗂 Domain-specific

ClinicBench

LegalBench

MedMCQA

TAT-QA

CUAD

MedQA

PubMedQA

MedConceptsQA

CUPCase

  • Description: CUPCase is built from 3,563 real-world clinical case reports, with diagnoses formulated both in open-ended textual format and as multiple-choice questions with distractors.
  • Paper: CUPCase: Clinically Uncommon Patient Cases and Diagnoses Dataset https://arxiv.org/abs/2503.06204
  • Code: No repository provided
  • Dataset: https://huggingface.co/datasets/ofir408/CupCase
  • Examples: 3562
  • License: Apache-2.0 license
  • Year: 2025

LAB-Bench (Language Agent Biology Benchmark)

PERRECBENCH

🗂 Domain-specific, agents & tools use, information retrieval & RAG

DIBS (Domain Intelligence Benchmark Suite)

  • Description: DIBS measures LLM performance on datasets curated to reflect specialized domain knowledge and common enterprise use cases that traditional academic benchmarks often overlook.
  • Paper: Benchmarking Domain Intelligence https://www.databricks.com/blog/benchmarking-domain-intelligence
  • Code: No repository provided
  • Dataset: No dataset link provided
  • Examples: not specified
  • License: see dataset page
  • Year: 2024

🗂 Domain-specific, language & reasoning

MediQ

  • Description: A framework for simulating realistic clinical interactions, where an Expert model asks information-seeking questions when needed and responds reliably.
  • Paper: MediQ: Question-Asking LLMs and a Benchmark for Reliable Interactive Clinical Reasoning https://arxiv.org/abs/2406.00922

🗂 Empathy

EmotionBench

EQ-Bench

🗂 Image generation

MC-Bench (Minecraft AI Benchmark)

  • Description: A platform for evaluating and comparing AI models by challenging them to create Minecraft builds.
  • Paper: https://mcbench.ai/
  • Code: https://github.com/mc-bench
  • Dataset: Not dataset-based
  • Examples: not specified
  • License: MIT License
  • Year: 2024

🗂 Information retrieval & RAG

NoLiMa

RULER

  • Description: A synthetic benchmark with flexible configurations for customized sequence length and task complexity. RULER expands upon the vanilla NIAH test to encompass variations with diverse types and quantities of needles (a toy NIAH construction follows this entry).
  • Paper: RULER: What's the Real Context Size of Your Long-Context Language Models? https://arxiv.org/abs/2404.06654
  • Code: https://github.com/NVIDIA/RULER
  • Dataset: see repo
  • Examples: 13
  • License: see dataset page
  • Year: 2024
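
For intuition, a toy version of the vanilla needle-in-a-haystack test that RULER generalizes can be built in a few lines: bury a random fact at a chosen depth in filler text, then check whether the model retrieves it. The model call below is a placeholder, not part of RULER's harness.

```python
# Toy needle-in-a-haystack (NIAH) construction, in the spirit of the
# vanilla test RULER extends. `my_model` is a placeholder for a real call.
import random

FILLER = "The grass is green. The sky is blue. The sun is warm. "

def build_niah_prompt(context_chars: int, depth: float, value: int) -> str:
    """Bury the needle at a fractional depth inside context_chars of filler."""
    haystack = (FILLER * (context_chars // len(FILLER) + 1))[:context_chars]
    pos = int(len(haystack) * depth)
    needle = f" The magic number for today is {value}. "
    return (
        haystack[:pos] + needle + haystack[pos:]
        + "\n\nWhat is the magic number for today? Answer with the number only."
    )

value = random.randint(1000, 9999)
prompt = build_niah_prompt(context_chars=8000, depth=0.5, value=value)
# passed = str(value) in my_model(prompt)  # one (length, depth) cell of the sweep
```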

Loong

WiCE

LFQA-Verification

FEVER

NeedleInAHaystack (NIAH)

CRAG

LongGenBench

FaithEval

MTRAG

  • Description: An end-to-end human-generated multi-turn RAG benchmark that reflects several real-world properties across diverse dimensions for evaluating the full RAG pipeline.
  • Paper: MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems https://arxiv.org/abs/2501.03468

ContextualBench

WixQA

  • Description: A benchmark suite featuring QA datasets grounded in the released knowledge base corpus, enabling holistic evaluation of retrieval and generation components (a retrieval-scoring sketch follows this entry).
  • Paper: WixQA: A Multi-Dataset Benchmark for Enterprise Retrieval-Augmented Generation https://arxiv.org/abs/2505.08643
  • Code: No repository provided
  • Dataset: https://huggingface.co/datasets/Wix/WixQA
  • Examples: 12842
  • License: MIT License
  • Year: 2025
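
As a sketch of scoring the retrieval half of such a pipeline, hit@k measures the fraction of questions for which any gold document appears among the top-k retrieved ids. The `retrieve` callable and the field names below are assumed interfaces, not WixQA's official harness.

```python
# Sketch: retrieval hit@k for a RAG benchmark. The `retrieve` callable and
# the "question"/"gold_doc_ids" fields are assumptions for illustration.
def hit_at_k(retrieve, examples, k: int = 5) -> float:
    """Fraction of questions where a gold document id appears in the top k."""
    hits = sum(
        bool(set(retrieve(ex["question"], k)) & set(ex["gold_doc_ids"]))
        for ex in examples
    )
    return hits / len(examples)
```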

🗂 Information retrieval & RAG, language & reasoning, domain-specific

OmniEval

  • Description: A RAG benchmark in the financial domain, with queries spanning five task classes and 16 financial topics.
  • Paper: OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain https://arxiv.org/abs/2412.13018
  • Code: https://github.com/RUC-NLPIR/OmniEval
  • Dataset: see repo
  • Examples: not specified
  • License: MIT License
  • Year: 2024

🗂 Information retrieval & RAG, language & reasoning, safety

FRAMES (Factuality, Retrieval, And reasoning MEasurement Set)

🗂 Information retrieval & RAG, multimodal

ViDoRe (Visual Document Retrieval Benchmark)

🗂 Information retrieval & RAG, safety

RAGTruth

🗂 Instruction-following

InfoBench

PandaLM

🗂 Instruction-following, conversation & chatbots

AlpacaEval

🗂 Knowledge

ConflictBank

  • Description: Evaluates knowledge conflicts from three aspects: 1) conflicts in retrieved knowledge, 2) conflicts within the models’ encoded knowledge, and 3) the interplay between these conflict forms.
  • Paper: ConflictBank: A Benchmark for Evaluating Knowledge Conflicts in Large Language Models https://arxiv.org/abs/2408.12076
  • Code: https://github.com/zhaochen0110/conflictbank
  • Dataset: see repo
  • Examples: 553000
  • License: CC-BY-SA-4.0
  • Year: 2024

FreshQA

🗂 Knowledge, language & reasoning

MMLU-Pro

BIG-Bench Hard

BIG-bench

MMLU

ARC

Humanity's Last Exam (HLE)

MegaScience

SKA-Bench

  • Description: A Structured Knowledge Augmented QA Benchmark that encompasses four widely used structured knowledge forms: Knowledge Graph (KG), Table, KG+Text, and Table+Text.
  • Paper: SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs https://arxiv.org/abs/2507.17178

🗂 Knowledge, language & reasoning, multimodal

ScienceQA

🗂 Knowledge, language & reasoning, safety

TruthfulQA

🗂 Language & reasoning

Graphwalks

ZebraLogic

NovelQA

Reveal

LongBench

InfiniteBench

Chain-of-Thought Hub

MuSR

GPQA

AGIEval

SummEdits

NPHardEval

e-CARE (explainable CAusal REasoning dataset)

PlanBench

GLUE-X

FOLIO

SpartQA

Natural Questions

ANLI

BoolQ

SuperGLUE

DROP (Discrete Reasoning Over Paragraphs)

HellaSwag

Winogrande

PIQA (Physical Interaction QA)

HotpotQA

GLUE (General Language Understanding Evaluation)

OpenBookQA

SQuAD2.0

SWAG

CommonsenseQA

RACE (ReAding Comprehension Dataset From Examinations)

SciQ

TriviaQA

MultiNLI (Multi-Genre Natural Language Inference)

SQuAD (Stanford Question Answering Dataset)

LAMBADA (LAnguage Modelling Broadened to Account for Discourse Aspects)

MS MARCO

RAFT

🗂 Language & reasoning, agents & tools use, safety, instruction-following

BiGGen-Bench

🗂 Language & reasoning, bias & ethics

Social Chemistry 101

🗂 Language & reasoning, coding, math, instruction-following

LiveBench

🗂 Language & reasoning, information retrieval & RAG

BEIR

NarrativeQA

🗂 Language & reasoning, information retrieval & RAG, multimodal

LOFT

🗂 Language & reasoning, instruction-following

IFEval

🗂 Language & reasoning, knowledge

ARB

AGIEval

  • Description: Built from human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests.
  • Paper: AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models https://arxiv.org/abs/2304.06364
  • Code: https://github.com/ruixiangcui/AGIEval
  • Dataset: see repo
  • Examples: 8000
  • License: see dataset page
  • Year: 2023

🗂 Language & reasoning, multilingual

MultiLoKo

INCLUDE

  • Description: An evaluation suite to measure the capabilities of multilingual LLMs in a variety of regional contexts across 44 written languages.
  • Paper: INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge https://arxiv.org/abs/2411.19799

MultiNRC

  • Description: Assesses LLMs on reasoning questions written by native speakers in French, Spanish, and Chinese. MultiNRC covers four core reasoning categories: language-specific linguistic reasoning, wordplay & riddles, cultural/tradition reasoning, and math reasoning with cultural relevance.
  • Paper: MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs https://arxiv.org/abs/2507.17476

🗂 Language & reasoning, multimodal, video

Video-MME

🗂 Language & reasoning, safety

HELM

🗂 LLM judge evaluation

JudgeBench

🗂 LLM-generated text detection

DetectRL

🗂 Math

AIME

CHAMP

TemplateGSM

HARD-Math

TheoremQA

MGSM (Multilingual Grade School Math)

GSMHard

SVAMP

MATH

GSM8K

AQUA-RAT

We-Math

MathArena

🗂 Math, domain-specific

TeleMath

  • Description: Designed to evaluate LLM performance in solving mathematical problems with numerical solutions in the telecommunications domain.
  • Paper: TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving https://arxiv.org/abs/2506.10674

🗂 Math, language & reasoning

DocMath-Eval

FinQA

🗂 Multimodal

SEED-Bench

Q-bench

M3Exam

A Multitask, Multilingual, Multimodal Evaluation of ChatGPT

MME

LVLM-eHub

MMBench

  • Description: A bilingual benchmark for assessing the multi-modal capabilities of vision-language models. Contains 2974 multiple-choice questions, covering 20 ability dimensions.
  • Paper: MMBench: Is Your Multi-modal Model an All-around Player? https://arxiv.org/abs/2307.06281
  • Code: https://github.com/open-compass/MMBench
  • Dataset: see repo
  • Examples: 2974
  • License: see dataset page
  • Year: 2023

HQHBench

  • Description: Evaluates the performance of LVLMs across different types of hallucination. It consists of 4000 free-form VQA image-instruction pairs, with 500 pairs for each hallucination type.
  • Paper: Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models https://arxiv.org/abs/2406.17115
  • Code: https://github.com/HQHBench/HQHBench
  • Dataset: see repo
  • Examples: 4000
  • License: see dataset page
  • Year: 2024

MM-Vet

  • Description: An evaluation benchmark that examines LMMs on complicated multimodal tasks.
  • Paper: MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities https://arxiv.org/abs/2308.02490

🗂 Multimodal, information retrieval & RAG

MMNeedle

🗂 Multimodal, language & reasoning

MMMU

WebQA

🗂 Safety

FACTS Grounding

HarmBench

SimpleQA

AgentHarm

StrongReject

AIR-Bench

OR-Bench

TofuEval

  • Description: An evaluation benchmark for topic-focused dialogue summarization. Contains binary sentence-level human annotations of summary factual consistency, along with detailed explanations for factually inconsistent sentences.
  • Paper: TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization https://arxiv.org/abs/2402.13249
  • Code: https://github.com/amazon-science/tofueval
  • Dataset: see repo
  • Examples: not specified
  • License: see dataset page
  • Year: 2024

BackdoorLLM

Hal-eval

LLM-AggreFact

ForbiddenQuestions

MaliciousInstruct

SycophancyEval

DecodingTrust

AdvBench

XSTest

OpinionQA

SafetyBench

HarmfulQA

QHarm

BeaverTails

DoNotAnswer

ExpertQA

HaluEval

RED-EVAL

ToxiGen

HHH (Helpfulness, Honesty, Harmlessness)

PersonalInfoLeak

AnthropicRedTeam

TURINGBENCH

HHH alignment

RealToxicityPrompts

RobustBench

CYBERSECEVAL 2

ConfAIde

🗂 Safety, agents & tools use

ToolEmu

🗂 Safety, bias & ethics

TrustLLM

TRUSTGPT

BOLD

BBQ

StereoSet

ETHICS

CrowS-Pairs (Crowdsourced Stereotype Pairs)

SEAT (Sentence Encoder Association Test)

WinoGender

🗂 Safety, domain-specific

HealthBench

  • Description: Covers realistic healthcare scenarios: emergency referrals, global health, health data tasks, context-seeking, expertise-tailored communication, response depth, and responding under uncertainty.
  • Paper: HealthBench: Evaluating Large Language Models Towards Improved Human Health https://cdn.openai.com/pdf/bd7a39d5-9e9f-47b3-903c-8b847ca650c7/healthbench_paper.pdf

🗂 Summarization

TLDR 9+

🗂 Summarization, language & reasoning

MLSUM

🗂 Video, multimodal

EvalCrafter


📜 License

This list is under the Apache-2.0 license.
Each dataset or benchmark has its own license; please check before use.
