Can AI master every subject known to man?
Can you provide a translation of an ancient Palmyrene script found on a Roman tombstone? How many paired tendons are supported by a particular sesamoid bone in a hummingbird? These are just two of the many varied and challenging questions submitted to Humanity’s Last Exam, or HLE, the apparently unsolvable test reserved only for the best and brightest. But it’s not meant for us.
Ultimate academic test for AI
Don’t let the apocalyptic name fool you. HLE isn’t about humans becoming irrelevant; it’s about celebrating what we know that AI can’t yet touch. The benchmark, introduced in a study published in the journal ‘Nature’, was created to determine whether AI models such as ChatGPT and Gemini can answer the most difficult questions experts could come up with – and to pinpoint exactly where today’s AI fails and what remains out of reach.

HLE is a truly collaborative effort, with about 2 500 questions from close to 1 000 contributors affiliated with over 500 institutions across 50 countries. Contributors are mainly experts in the sciences, humanities and arts, covering more than 100 highly specialised fields. “What made this project extraordinary was the scale,” commented Tung Nguyen, associate professor in the Department of Computer Science and Engineering at Texas A&M, in a news release. “That diversity is exactly what exposes the gaps in today’s AI systems – perhaps ironically, it’s humans working together.”

AI systems didn’t exactly ace this tough test – at least not early on. Initial results in 2025 showed that many models scored less than 10 % on the exam. In March of this year, however, Gemini 3.1 Pro achieved 45.9 % accuracy, followed by GPT-5.4 at 40.3 %.
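To make those accuracy figures concrete, here is a minimal sketch of how a benchmark score of this kind can be computed: each model answer is compared against an expert reference answer, and the score is simply the share answered correctly. This is a hypothetical illustration assuming a simple exact-match grading rule; the Question class, score function and sample questions are invented for the example and are not HLE’s actual evaluation harness.

```python
# Hypothetical illustration of benchmark accuracy scoring (not HLE's real harness).
# A model's score is the fraction of questions whose answer matches the reference.

from dataclasses import dataclass


@dataclass
class Question:
    prompt: str     # the expert-written question
    reference: str  # the expert's reference answer


def normalise(answer: str) -> str:
    """Normalise an answer for exact-match comparison (case, surrounding space)."""
    return answer.strip().lower()


def score(questions: list[Question], model_answers: list[str]) -> float:
    """Return accuracy in percent: the share of answers matching the reference."""
    correct = sum(
        normalise(given) == normalise(q.reference)
        for q, given in zip(questions, model_answers, strict=True)
    )
    return 100.0 * correct / len(questions)


# Toy example: two questions, one answered correctly -> 50.0 % accuracy.
exam = [
    Question("Capital of the Palmyrene Empire?", "Palmyra"),
    Question("2 + 2 = ?", "4"),
]
print(score(exam, ["Palmyra", "5"]))  # 50.0
```

For scale: on HLE’s roughly 2 500 questions, a sub-10 % score corresponds to fewer than about 250 correct answers.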
Surpassing the boundaries of human knowledge
“When AI systems start performing extremely well on human benchmarks, it’s tempting to think they’re approaching human-level understanding,” explained Nguyen. “But HLE reminds us that intelligence isn’t just about pattern recognition – it’s about depth, context and specialized expertise.” Nguyen contributed 73 of the questions – the second-highest number – and wrote the most questions in maths and computer science. “For now, Humanity’s Last Exam stands as one of the clearest assessments of the gap between AI and human intelligence, and despite rapid technological advances, it remains wide.”

Nguyen emphasised that when AI surpasses traditional metrics, the resulting gap creates challenges that are more than just academic. “Without accurate assessment tools, policymakers, developers and users risk misinterpreting what AI systems can actually do. Benchmarks provide the foundation for measuring progress and identifying risks.”

HLE is a reality check for AI, showing that our unique knowledge still sets the bar higher than any algorithm can currently reach. “This isn’t a race against AI,” Nguyen concluded. “It’s a method for understanding where these systems are strong and where they struggle. That understanding helps us build safer, more reliable technologies. And, importantly, it reminds us why human expertise still matters.”