Scientists have devised a new way to measure how capable artificial intelligence (AI) systems are, based on how well they can keep up with, or beat, humans at challenging tasks.
While AIs can generally outperform humans at text prediction and knowledge tasks, they are less effective when given more substantive projects to carry out, such as remote executive assistance.
To quantify these gains in AI performance, a new study proposes measuring AI models by the length of the tasks they can complete, compared with how long those tasks take humans. The researchers published their findings March 30 on the preprint database arXiv, so the work has not yet been peer-reviewed.
“We find that measuring the length of tasks that models can complete is a helpful lens for understanding current AI capabilities. This makes sense: AI agents often seem to struggle with stringing together longer sequences of actions more than they lack skills or knowledge needed to solve single steps,” the researchers from AI organization Model Evaluation & Threat Research (METR) explained in a blog post accompanying the study.
The researchers found that AI models completed tasks that would take humans less than four minutes with a near-100% success rate. However, this dropped to 10% for tasks taking more than four hours. Older AI models performed worse at longer tasks than the latest systems.
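The study's exact statistical procedure is not spelled out here, but one plausible way to turn per-task results like these into a single "time horizon" figure is to fit a logistic curve of success probability against the logarithm of the human completion time and read off where it crosses 50%. The sketch below illustrates that idea; the task durations, success flags and the number it produces are invented for the example, not data from the study.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical results: human completion time (minutes) for each task,
# and whether the AI agent completed it successfully.
task_minutes = np.array([2, 3, 5, 10, 30, 60, 120, 240, 480, 960], dtype=float)
succeeded = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0], dtype=float)

def success_curve(log_minutes, midpoint, slope):
    # Logistic curve: success probability falls as log task duration grows.
    return 1.0 / (1.0 + np.exp(slope * (log_minutes - midpoint)))

# Fit on log-duration so minutes-long and hours-long tasks are weighted evenly.
(midpoint, slope), _ = curve_fit(success_curve, np.log(task_minutes), succeeded,
                                 p0=[np.log(60.0), 1.0])

# The "50% time horizon": the task length at which predicted success is 50%.
print(f"Estimated 50% time horizon: ~{np.exp(midpoint):.0f} human-minutes")
```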
This was to be expected: the study highlights that the length of tasks generalist AIs could complete with 50% reliability has been doubling roughly every seven months for the past six years.
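Taken at face value, that trend is a simple exponential: the 50%-reliability horizon grows as horizon(t) = current_horizon * 2^(t / 7), with t measured in months. A small illustrative projection follows, assuming a one-hour starting horizon (an assumption for the example, not a figure from the paper).

```python
# Doubling law implied by the reported trend:
#   horizon(t) = current_horizon * 2 ** (t / doubling_months)
current_horizon_hours = 1.0   # assumed starting point, purely illustrative
doubling_months = 7.0         # doubling time reported in the study

for months_ahead in (7, 14, 28, 42):
    horizon = current_horizon_hours * 2 ** (months_ahead / doubling_months)
    print(f"{months_ahead:2d} months out: tasks of roughly {horizon:.0f} human-hour(s)")
```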
To conduct their study, the researchers took a variety of AI models — from Sonnet 3.7 and GPT-4 to Claude 3 Opus and older GPT models — and pitted them against a suite of tasks. These ranged from easy assignments that typically take humans a couple of minutes (like looking up a basic factual question on Wikipedia) to ones that take human experts multiple hours — complex programming tasks like writing CUDA kernels or fixing a subtle bug in PyTorch, for example.
Testing tools including HCAST and RE-Bench were used; the former comprises 189 autonomous software tasks set up to assess AI agents' capabilities in machine learning, cybersecurity and software engineering, while the latter uses seven challenging, open-ended…