
A new AI coding challenge has just published its first results – and they aren’t pretty
The K Prize, a multi-round AI coding challenge launched by Databricks and Perplexity co-founder Andy Konwinski, has announced its first results — and they are far from impressive. The winner of the contest, Brazilian prompt engineer Eduardo Rocha de Andrade, scored just 7.5% correct answers on the test.
While one might expect such an outcome to be met with criticism or even dismay, many AI researchers see it as a necessary step toward addressing the growing problem of benchmark contamination in AI evaluation. According to Konwinski, the K Prize is designed as a “contamination-free version” of the SWE-Bench system, using a timed entry deadline to prevent models from being trained specifically for the challenge.
The gap between these results and those on comparable benchmarks is striking: SWE-Bench itself boasts a top score of 75% on its easier ‘Verified’ test and 34% on its harder ‘Full’ test. Even so, Konwinski remains optimistic about the project’s potential to drive progress in AI research.
“I’m quite bullish about building new tests for existing benchmarks,” says Princeton researcher Sayash Kapoor, who proposed a similar idea in a recent paper. “Without such experiments, we can’t actually tell if the issue is contamination or just targeting the SWE-Bench leaderboard with a human in the loop.”
For Konwinski, this challenge serves as an open invitation to the rest of the AI community: “If you listen to the hype, it’s like we should be seeing AI doctors and AI lawyers and AI software engineers, and that’s just not true,” he emphasizes. “If we can’t even get more than 10% on a contamination-free SWE-Bench, that’s the reality check for me.”
Source: techcrunch.com