How Do Olympiad Medalists Judge LLMs in Competitive Programming?
Tuesday, June 17, 2025, 04:09 PM, from Slashdot
The researchers measured models and humans on the same Elo scale used by Codeforces and found that OpenAI's o4-mini-high, when stripped of terminal tools and limited to one try per task, lands at an Elo rating of 2,116 -- hundreds of points below the grandmaster cutoff and roughly the top 1.5 percent among human contestants. A granular tag-by-tag autopsy identified implementation-friendly, knowledge-heavy problems -- segment trees, graph templates, classic dynamic programming -- as the models' comfort zone; observation-driven puzzles such as game-theory endgames and tricky greedy constructions remain stubborn roadblocks. Because the dataset is harvested in real time as contests conclude, the authors argue it minimizes training-data leakage and offers a moving target for future systems. The broader takeaway is that impressive leaderboard jumps often reflect tool use, multiple retries, or easier benchmarks rather than genuine algorithmic reasoning, leaving a conspicuous gap between today's models and top human problem-solvers. Read more of this story at Slashdot.
https://slashdot.org/story/25/06/17/149238/how-do-olympiad-medalists-judge-llms-in-competitive-progr...
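To put the reported numbers in perspective, here is a minimal sketch using the classic Elo expected-score formula to compare the model's 2,116 rating against the Codeforces grandmaster cutoff of 2,400. The paper's own rating procedure may differ, so treat this purely as an illustration of the size of the gap, not as the authors' method.

```python
# Sketch: how far is an Elo of 2,116 from the Codeforces grandmaster cutoff (2,400)?
# Uses the standard Elo expected-score formula; the benchmark's exact rating
# computation may differ, so this is an illustration only.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the classic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

model_elo = 2116           # o4-mini-high, single attempt, no terminal tools (per the article)
grandmaster_cutoff = 2400  # Codeforces grandmaster threshold

print(f"Rating gap: {grandmaster_cutoff - model_elo} points")
print(f"Expected score vs. a 2400-rated opponent: {expected_score(model_elo, grandmaster_cutoff):.2f}")
# -> about 0.16, i.e. the model would be expected to lose most head-to-head matchups
```

Under this reading, a roughly 284-point deficit corresponds to an expected score of about 0.16 against a grandmaster-level contestant, which is the "conspicuous gap" the summary describes.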