Evals are as hard as product development

Evaluating the capabilities of LLMs is going to be very hard, because a good evaluation needs to elicit top performance from existing language models.

To illustrate why, let’s look at Perplexity. Perplexity is an AI-powered search engine. It’s a successful product, with a large engineering effort behind it. There’s clearly a massive gap between getting an LLM to call a search API and whatever Perplexity is doing. There may be downstream capabilities that aren’t possible with simple search approaches, but are possible with Perplexity.

On top of the quality of Perplexity’s product, there’s its speed: queries resolve in seconds. Even if you spend a lot of time optimizing your Google-search-in-a-ReAct-loop setup, it will likely be slow. What new capabilities does going from minutes- to seconds-per-query unlock? I imagine a few.
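To make the comparison concrete, here is a minimal sketch of the kind of Google-search-in-a-ReAct-loop baseline I have in mind. The `call_llm` and `web_search` functions are toy stand-ins for a real model API and search API, and the action format is invented for illustration:

```python
# Minimal ReAct-style loop: the model alternates between emitting
# actions (search queries) and reading observations (search results).
# call_llm and web_search are toy stand-ins for real APIs.

def call_llm(prompt: str) -> str:
    # Toy model: answers once it has seen a search observation.
    if "Observation:" in prompt:
        return "Final Answer: Paris"
    return "Action: search[capital of France]"

def web_search(query: str) -> str:
    # Toy search API stand-in.
    return "Paris is the capital of France."

def react_loop(question: str, max_steps: int = 5) -> str:
    prompt = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(prompt)
        prompt += step + "\n"
        if step.startswith("Final Answer:"):
            return step[len("Final Answer:"):].strip()
        if step.startswith("Action: search["):
            query = step[len("Action: search["):-1]
            # Each real search round-trip adds seconds to minutes of latency.
            prompt += f"Observation: {web_search(query)}\n"
    return ""

print(react_loop("What is the capital of France?"))  # -> Paris
```

With real APIs, every iteration of this loop pays model latency plus search latency, which is why the naive version ends up minutes-per-query.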

This shows that to get state-of-the-art performance on subtasks in model evaluations, you'll need to invest a huge amount of engineering effort.

Some more examples of state-of-the-art being better than simple prompting:

The Ghost in the Minecraft paper shows that complex prompting techniques can make the difference between a model that is useless at Minecraft and one that unlocks most of the achievements in the game.

The complexity of Copilot’s implementation suggests that getting models to write code effectively will also require a lot of engineering effort (e.g. vector DBs, parsing ASTs and pulling in relevant content, feeding errors back into LLMs).
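As a rough illustration of the retrieval side of such a system (not Copilot's actual implementation), here is a toy sketch that ranks code snippets against a task description, using bag-of-words vectors in place of a learned embedding model and vector DB:

```python
# Toy retrieval-augmented code context: "embed" code chunks, then pull
# the most relevant ones into the prompt. A real system would use a
# learned embedding model and a vector DB instead of bag-of-words.
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Crude stand-in for an embedding model: token counts.
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_snippets(query: str, snippets: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(snippets, key=lambda s: cosine(q, embed(s)), reverse=True)
    return ranked[:k]

snippets = [
    "def parse_config(path): ...",
    "def connect_db(url, timeout): ...",
    "def render_template(name, ctx): ...",
]
print(top_snippets("fix the db connection timeout", snippets, k=1))
```

The engineering effort in a real system goes into everything this sketch skips: chunking code along AST boundaries, keeping the index fresh, and feeding compiler or test errors back into the next prompt.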

Similar things can be said for Sweep's implementation. The poor performance of simple approaches, like those in the SWE-bench paper, shows that this complexity is needed to solve real-world tasks.

Dataset creation & fine-tuning also have a huge role to play here. For example, the Wizard LM papers (1, 2, 3) showcase relatively complex systems for creating datasets & fine-tuning for specific tasks. For smaller models, these systems make the difference between weak and strong performance. I’d bet on similar things being true for the next generation of models.
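A minimal sketch of the Evol-Instruct idea behind the Wizard LM line of work: repeatedly ask a model to rewrite instructions into harder variants, then fine-tune on the evolved set. Here `call_llm` is a toy stand-in for a real model API, and the prompt wording is invented:

```python
# Sketch of Evol-Instruct-style dataset growth: evolve seed
# instructions into harder variants with an LLM, keep all generations.
# call_llm is a toy stand-in for a real model API.

EVOLVE_PROMPT = (
    "Rewrite the following instruction so it is more complex, "
    "but still answerable:\n{instruction}"
)

def call_llm(prompt: str) -> str:
    # Toy stand-in: a real system would call a model here.
    instruction = prompt.rsplit("\n", 1)[-1]
    return instruction + " Explain your reasoning step by step."

def evolve_dataset(seeds: list[str], rounds: int = 2) -> list[str]:
    dataset = list(seeds)
    frontier = list(seeds)
    for _ in range(rounds):
        frontier = [call_llm(EVOLVE_PROMPT.format(instruction=i)) for i in frontier]
        dataset.extend(frontier)
    return dataset

data = evolve_dataset(["Sort a list in Python."], rounds=2)
print(len(data))  # 1 seed + 2 evolved generations -> 3
```

The real systems also filter failed evolutions and generate responses for each instruction, which is where most of the complexity lives.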

What does this mean for evals? Simply put, if you don’t invest resources developing cutting-edge scaffolding / data generation / fine-tuning, you will miss capabilities made possible by the next generation of models.

The product companies & research labs will be trying very hard to elicit these capabilities as soon as they’re made available.

The obvious workaround for this is to do more hand-holding, and focus on evaluating higher-level behaviors.

For example: While it's very hard to develop a system to generically pull in the correct code snippets across a large codebase, it's quite easy to hand-craft a few specific cases. An LLM passing these tests could indicate that it would pass a fully generic version, given the engineering investment.
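For instance, a hand-crafted case might fix the candidate snippets in advance and simply check whether the model picks the right one. The task, file names, and grading here are all made up for illustration:

```python
# A hand-crafted retrieval eval case: instead of building generic
# codebase-wide retrieval, we fix a small set of snippets and check
# whether the model selects the one a human marked as relevant.
# All names and content are invented for illustration.

case = {
    "task": "Fix the off-by-one error in pagination.",
    "snippets": {
        "auth.py": "def login(user, pw): ...",
        "pagination.py": "def page_bounds(n, size): return n*size, (n+1)*size",
        "logging.py": "def log(msg): ...",
    },
    "answer": "pagination.py",
}

def grade(model_choice: str) -> bool:
    # Pass iff the model picked the human-labeled relevant snippet.
    return model_choice == case["answer"]

print(grade("pagination.py"))  # True
```

A battery of cases like this trades generality for control: each one is cheap to write, easy to grade, and isolates the high-level judgment from the retrieval engineering.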

This shift in focus would require us to look less at the model's ability to navigate the nitty-gritty details of the world, and more at its ability to do high-level planning and to demonstrate the core cognitive abilities needed to perform complex tasks.