TY - JOUR
T1 - A benchmark of expert-level academic questions to assess AI capabilities
AU - Center for AI Safety
AU - Scale AI
AU - HLE Contributors Consortium
AU - Hendrycks, Dan
AU - Mazeika, Mantas
AU - Zhang, Oliver
AU - Hausenloy, Jason
AU - Ren, Richard
AU - Kim, Ryan
AU - Khoja, Adam
AU - Li, Nathaniel
AU - Gatti, Alice
AU - Phan, Long
AU - Wang, Alexandr
AU - Yue, Summer
AU - Telluri, Anwith
AU - Wu, Aidan
AU - Wang, Kaixin
AU - Nagumalli, Laasya
AU - Nguyen, Leon
AU - Zhang, Alex
AU - Saha, Abhijeet
AU - Shah, Nihar
AU - Sun, David
AU - Samal, Soham
AU - Kasamsetty, Ritesh
AU - Yalam, Srikar
AU - Nasim, Zafir
AU - Le, Andrew
AU - Sundarapandiyan, Vijaykaarti
AU - Kulkarni, Vidhi
AU - Patel, Spandan
AU - Wu, Timothy
AU - Echeazu, Daryl
AU - Wang, Taozhi
AU - Osbey, Tyler
AU - Peng, Clark
AU - Singh, Aryan
AU - Sun, Xiangwan
AU - Yoon, Julia
AU - Zhao, Ben
AU - Yue, Roy
AU - Yang, Ryan
AU - Lee, Sam
AU - Maung, Erik
AU - Xiao, Tyler
AU - Wang, Gavin
AU - Xu, Ziqi
AU - Kalpathi, Tejas
AU - Chen, Kevin
AU - Zhou, Alan
AU - Agrawal, Rishit
AU - Kolt, Noam
N1 - Publisher Copyright: © The Author(s) 2026.
PY - 2026
Y1 - 2026/01/29/
AB - Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve more than 90% accuracy on popular benchmarks such as Measuring Massive Multitask Language Understanding, limiting informed measurement of state-of-the-art LLM capabilities. Here, in response, we introduce Humanity’s Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be an expert-level closed-ended academic benchmark with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable but cannot be quickly answered by internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a marked gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
UR - https://www.scopus.com/pages/publications/105028928953
U2 - 10.1038/s41586-025-09962-4
DO - 10.1038/s41586-025-09962-4
M3 - Article
C2 - 41606155
AN - SCOPUS:105028928953
SN - 0028-0836
VL - 649
SP - 1139
EP - 1146
JO - Nature
JF - Nature
IS - 8099
ER -