aider

o1 tops aider’s new polyglot leaderboard

2024-12-21T00:00:00+00:00

December 21, 2024

OpenAI’s new o1 model with “high” reasoning effort gets the top score on the new aider polyglot leaderboard, significantly ahead of other top LLMs. The new polyglot benchmark uses many popular coding languages and was designed to be much more challenging than aider’s original code editing benchmark. This more clearly distinguishes the performance of today’s strongest coding models and leaves headroom for future LLMs.

See the main aider leaderboard for benchmark results from more models. This article only contains a snapshot of results at the time of publication.

The polyglot benchmark

Like aider’s original code editing benchmark, the new polyglot benchmark is based on Exercism coding exercises.

The new polyglot benchmark:

Contains coding problems in C++, Go, Java, JavaScript, Python and Rust. The old benchmark was solely based on Python exercises.
Focuses on the most difficult 225 exercises out of the 697 that Exercism provides for those languages. The old benchmark simply included all 133 Python exercises, regardless of difficulty.

Motivation and goals

Aider’s original code editing benchmark was saturating as the top scores approached and then surpassed 80%. Sonnet’s score of 84.2% was based on solving 112 of the 133 exercises, leaving only 21 unsolved exercises. New champions were advancing the top score by solving just 1-2 more problems than the previous record. This made it hard to clearly measure the difference in code editing skill between these top models.

Part of the problem is that many of the original 133 Python problems are very easy and provide little challenge to today’s frontier LLMs. Models as old as GPT 3.5 Turbo were able to solve half of the 133 problems. Such easy problems simply inflate the benchmark scores of modern LLMs without providing any data about which models are better or worse.

The main goal for a new benchmark was to re-calibrate the scale so that today’s top coding LLMs would occupy a wide range of scores between about 5% and 50%. This should leave headroom for future LLMs and make it possible to more clearly compare the relative performance of top models.

Designing the polyglot benchmark

The new benchmark:

Tests LLMs with more coding languages, to increase diversity and source a larger pool of problems.
Includes just the most challenging coding problems and excludes easy problems that are solvable by most of today’s top coding LLMs.
Includes more total coding problems, to enable more granularity of comparison.
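To make the granularity point concrete, here is a small Python calculation of how far a single solved problem moves the score on each benchmark. The problem counts (133 and 225) and Sonnet's 112 solved exercises come from this article; the helper name is only for illustration.

```python
def points_per_problem(total_problems: int) -> float:
    """Percentage points gained by solving one additional problem."""
    return 100.0 / total_problems


old_step = points_per_problem(133)  # original Python-only benchmark
new_step = points_per_problem(225)  # polyglot benchmark

print(f"old benchmark: {old_step:.2f} points per problem")
print(f"new benchmark: {new_step:.2f} points per problem")

# Sonnet's score on the old benchmark: 112 of 133 exercises solved.
sonnet_old_score = round(100.0 * 112 / 133, 1)
print(f"Sonnet on the old benchmark: {sonnet_old_score}%")
```

With more problems in the pool, each solved problem moves the score by a smaller step, so small differences between models become easier to resolve.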

The new benchmark is based on Exercism coding problems from 6 of the most popular programming languages:

C++
Go
Java
JavaScript
Python
Rust

Exercism provides a total of 697 coding problems in those 6 languages. A set of 7 of today’s top coding models each attempted all 697 of the Exercism problems:

Sonnet
Haiku
o1 Mini
DeepSeek
GPT-4o
Qwen 32B Coder Instruct
GPT-4o Mini

Depending on the difficulty of the problems, a different number of solutions were found by the collection of 7 models:

Solutions found	Number of problems	Cumulative number of problems
0	66	66
1	61	127
2	50	177
3	48	225
4	53	278
5	71	349
6	90	439
7	258	697

In the table above, you can see that 258 of the problems were solved by all 7 LLMs. These problems are far too easy, and wouldn’t be good choices for the new benchmark. Instead, we need hard problems like the 66 that none of the 7 models were able to solve.

The new benchmark uses the 225 problems that were solved by 3 or fewer models. This achieves a balance between hard and moderate problems, and provides a large but not excessive total pool of problems. It also represents a good diversity of coding languages:

Language	Problems
C++	26
Go	39
Java	47
JavaScript	49
Python	34
Rust	30
Total	225
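The selection rule can be checked with a few lines of Python. This is only a sketch that re-derives the totals from the "Solutions found" table; the dictionary and function names are made up for illustration and are not part of the benchmark code.

```python
# Number of models (out of 7) that solved a problem -> number of such problems,
# copied from the "Solutions found" table.
solved_by_counts = {0: 66, 1: 61, 2: 50, 3: 48, 4: 53, 5: 71, 6: 90, 7: 258}


def problems_solved_by_at_most(max_models: int) -> int:
    """Count the problems solved by max_models or fewer of the 7 models."""
    return sum(n for solved, n in solved_by_counts.items() if solved <= max_models)


total = sum(solved_by_counts.values())
selected = problems_solved_by_at_most(3)
print(f"{selected} of {total} problems were solved by 3 or fewer models")
```

Moving the threshold up or down by one model changes the pool size by the corresponding row of the table, which is how the cumulative column was built.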

o1

OpenAI’s new o1 model established a very strong top score of 62% on the new benchmark. This still leaves 86 problems of headroom for future models to solve. Given the incredible pace of recent advancements, it will be interesting to see how long it will take for this new benchmark to saturate.

Benchmark problems

The 225 coding problems are available in the aider polyglot benchmark repo on GitHub.

Results

Model	Percent completed correctly	Percent using correct edit format	Command	Edit format
o1-2024-12-17 (high)	61.7%	91.5%	`aider --model openrouter/openai/o1`	diff
claude-3-5-sonnet-20241022	45.3%	100.0%	`aider --model claude-3-5-sonnet-20241022`	diff
gemini-exp-1206	38.2%	98.2%	`aider --model gemini/gemini-exp-1206`	whole
o1-mini-2024-09-12	32.9%	96.9%	`aider --model o1-mini`	whole
claude-3-5-haiku-20241022	28.0%	91.1%	`aider --model claude-3-5-haiku-20241022`	diff
gemini-2.0-flash-exp	22.2%	100.0%	`aider --model gemini/gemini-2.0-flash-exp`	whole
DeepSeek Chat V2.5	17.8%	92.9%	`aider --model deepseek/deepseek-chat`	diff
gpt-4o-2024-11-20	15.1%	96.0%	`aider --model gpt-4o-2024-11-20`	diff
Qwen2.5-Coder-32B-Instruct	8.0%	71.6%	`aider --model openai/Qwen/Qwen2.5-Coder-32B-Instruct # via hyperbolic`	diff
gpt-4o-mini-2024-07-18	3.6%	100.0%	`aider --model gpt-4o-mini-2024-07-18`	whole

QwQ is a code architect, not an editor

2024-12-03T00:00:00+00:00

December 03, 2024

QwQ 32B Preview is a “reasoning” model, which spends a lot of tokens thinking before rendering a final response. This is similar to OpenAI’s o1 models, which are most effective with aider when paired as an architect with a traditional LLM as an editor. In this mode, the reasoning model acts as an “architect” to propose a solution to the coding problem without regard for how to actually make edits to the source files. The “editor” model receives that proposal, and focuses solely on how to edit the existing source code to implement it.

Used alone without being paired with an editor, QwQ was unable to comply with even the simplest editing format. It was not able to reliably edit source code files. As a result, QwQ’s solo score on the benchmark was quite underwhelming (and far worse than the o1 models performing solo).

QwQ is based on Qwen 2.5 Coder 32B Instruct, and does better when paired with it as an architect + editor combo. However, this provided only a modest benchmark improvement over just using Qwen alone, and comes with a fairly high cost in terms of latency. Each request must wait for QwQ to return all its thinking text and the final solution proposal, and then for Qwen to turn that large response into actual file edits.

Pairing QwQ with other sensible editor models performed the same as or worse than just using Qwen 2.5 Coder 32B Instruct alone.

QwQ+Qwen seems to be the best way to use QwQ, achieving a score of 74%. That is well below the SOTA results for this benchmark: Sonnet alone scores 84%, and o1-preview + o1-mini as architect + editor scores 85%.

QwQ specific editing formats

I spent some time experimenting with a variety of custom editing formats for QwQ. In particular, I tried to parse the QwQ response and discard the long sections of “thinking” and retain only the “final” solution. None of this custom work seemed to translate into any significant improvement in the benchmark results.

Results

Model	Percent completed correctly	Percent using correct edit format	Command	Edit format
o1-preview	79.7%	93.2%	`aider --model o1-preview`	diff
QwQ + Qwen2.5 Coder 32B-I	73.6%	100.0%	`aider --model openrouter/qwen/qwq-32b-preview --editor-model openrouter/qwen/qwen-2.5-coder-32b-instruct --editor-edit-format editor-whole`	architect
Qwen2.5 Coder 32B-I	71.4%	94.7%	`aider --model openai/hf:Qwen/Qwen2.5-Coder-32B-Instruct --openai-api-base https://glhf.chat/api/openai/v1 (via GLHF)`	diff
QwQ + Haiku	71.4%	100.0%	`aider --model openrouter/qwen/qwq-32b-preview --editor-model claude-3-5-haiku-20241022 --edit-format editor-whole`	architect
o1-mini	70.7%	90.0%	`aider --model o1-mini`	whole
QwQ + DeepSeek V2.5	67.7%	100.0%	`aider --model openrouter/qwen/qwq-32b-preview --editor-model deepseek/deepseek-chat --edit-format editor-whole`	architect
QwQ	42.1%	91.0%	`aider --model openrouter/qwen/qwq-32b-preview`	whole

Open source model caveats

As discussed in a recent blog post, details matter with open source models. For clarity, new benchmark runs for this article were performed against OpenRouter’s endpoints for QwQ 32B Preview and Qwen 2.5 Coder 32B Instruct. For the other models, the benchmark was run directly against their providers’ APIs.

After recent extensive testing, OpenRouter’s Qwen 2.5 Coder 32B Instruct endpoint seems reliable. The provider Mancer was blocked due to the small context window it provides.

For QwQ 32B Preview, Fireworks was blocked because of its small context window.

Details matter with open source models

2024-11-21T00:00:00+00:00

November 21, 2024

Open source models like Qwen 2.5 Coder 32B Instruct are performing very well on aider’s code editing benchmark, rivaling closed source frontier models.

But pay attention to how your model is being served and quantized, as it can impact code editing skill. Open source models are often available at a variety of quantizations, and can be served with different token limits. These details matter when working with code.

The graph above and table below compare different versions of the Qwen 2.5 Coder 32B Instruct model, served both locally and from a variety of cloud providers.

The HuggingFace BF16 weights served via glhf.chat.
4bit and 8bit quants for mlx.
The results from OpenRouter’s mix of providers which serve the model with different levels of quantization.
Results from OpenRouter’s providers, both served via OpenRouter and directly to their own APIs.
Ollama locally serving different quantizations from the Ollama model library with 8k+ context windows.
An Ollama fp16 quantization served with Ollama’s default 2k context window.

Pitfalls and details

This benchmarking effort highlighted a number of pitfalls and details specific to open source models which can have a significant impact on their ability to correctly edit code:

Quantization – Open source models are often available at dozens of different quantizations. Most seem to only modestly decrease code editing skill, but stronger quantizations do have a real impact.
Context window – Cloud providers can decide how large a context window to accept, and they often choose differently. Ollama’s local API server defaults to a tiny 2k context window, and silently discards data that exceeds it. Such a small window has catastrophic effects on performance, without throwing obvious hard errors.
Output token limits – Open source models are often served with wildly differing output token limits. This has a direct impact on how much code the model can write or edit in a response.
Buggy cloud providers – While benchmarking Qwen 2.5 Coder 32B Instruct and DeepSeek V2.5, I discovered multiple cloud providers with broken or buggy API endpoints. Their results differed from what the advertised quantization and context sizes would lead you to expect. The harm caused to the code editing benchmark varied from serious to catastrophic. One provider scored 0.5% on the benchmark with DeepSeek V2.5, a highly capable model.

Closed source, proprietary models don’t typically have these issues. They are owned and operated by the organization that created them, and typically served with specific, predictable context window and output token limits. Their quantization level is usually unknown, but fixed and unchanging for all users.

Conclusions

When served competently, the best versions of the Qwen model rival GPT-4o, while the worst performing quantization is closer to the older GPT-4 Turbo. Even an otherwise excellent fp16 quantization falls to GPT-3.5 Turbo levels of performance if run with Ollama’s default 2k context window.

Benchmark results

These are results from single benchmark runs, so expect normal variance of +/- 1-2%.

Model	Percent completed correctly	Percent using correct edit format	Command	Edit format
Fireworks: unknown	72.2%	94.0%	`aider --model fireworks_ai/accounts/fireworks/models/qwen2p5-coder-32b-instruct`	diff
Deepinfra: BF16	72.2%	94.7%	`aider --model deepinfra/Qwen/Qwen2.5-Coder-32B-Instruct`	diff
mlx-community: 8bit	72.2%	92.5%	`aider --model openai/mlx-community/Qwen2.5-Coder-32B-Instruct-8bit`	diff
mlx-community: 4bit	72.2%	88.7%	`aider --model openai/mlx-community/Qwen2.5-Coder-32B-Instruct-4bit`	diff
Ollama: fp16	71.4%	90.2%	`aider --model ollama/qwen2.5-coder:32b-instruct-fp16`	diff
HuggingFace via GLHF: BF16	71.4%	94.7%	`aider --model openai/hf:Qwen/Qwen2.5-Coder-32B-Instruct --openai-api-base https://glhf.chat/api/openai/v1`	diff
Deepinfra via OpenRouter: BF16	69.9%	89.5%	`aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct`	diff
Hyperbolic: BF16	69.2%	91.7%	`aider --model openai/Qwen/Qwen2.5-Coder-32B-Instruct --openai-api-base https://api.hyperbolic.xyz/v1/`	diff
Hyperbolic via OpenRouter: BF16	68.4%	89.5%	`aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct`	diff
Fireworks via OpenRouter: unknown	67.7%	94.0%	`aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct`	diff
OpenRouter: multiple	67.7%	95.5%	`aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct`	diff
Ollama: q4_K_M	66.9%	94.0%	`aider --model ollama/qwen2.5-coder:32b-instruct-q4_K_M`	diff
Ollama: q2_K	61.7%	91.7%	`aider --model ollama/qwen2.5-coder:32b-instruct-q2_K`	diff
Ollama: fp16, 2k ctx	51.9%	46.2%	`aider --model ollama/qwen2.5-coder:32b-instruct-fp16 # num_ctx: 2048`	diff

Setting Ollama’s context window size

Ollama uses a 2k context window by default, which is very small for working with aider. Unlike most other LLM servers, Ollama does not throw an error if you submit a request that exceeds the context window. Instead, it just silently truncates the request by discarding the “oldest” messages in the chat to make it fit within the context window.
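The effect of this kind of silent truncation can be illustrated with a short Python sketch. This is not Ollama's actual algorithm: the function name, the whitespace-based token count and the sample chat are all assumptions made for illustration.

```python
def truncate_oldest(messages: list[str], max_tokens: int) -> list[str]:
    """Drop the oldest messages until the remainder fits in max_tokens.

    Token counting is a crude whitespace split, purely for illustration.
    """
    kept: list[str] = []
    used = 0
    # Walk from newest to oldest, keeping messages while they still fit.
    for message in reversed(messages):
        cost = len(message.split())
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    return kept


chat = [
    "system: edit files using the diff format",
    "user: here is the full source of app.py ...",
    "user: please rename the function",
]
print(truncate_oldest(chat, max_tokens=12))
```

In this toy example only the final request survives. The system prompt describing the edit format and the source file are both dropped, and nothing signals an error to the caller.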

Except for the single 2k context result, all of the Ollama results above were collected with at least an 8k context window. An 8k window is large enough to attempt all the coding problems in the benchmark. Aider sets Ollama’s context window to 8k by default, starting in aider v0.65.0.

You can change the Ollama server’s context window with a .aider.model.settings.yml file like this:

- name: ollama/qwen2.5-coder:32b-instruct-fp16
  extra_params:
    num_ctx: 8192

Choosing providers with OpenRouter

OpenRouter allows you to ignore specific providers in your preferences. This can be used to limit your OpenRouter requests to be served by only your preferred providers.

Notes

This article went through many revisions as I received feedback from numerous members of the community. Here are some of the noteworthy learnings and changes:

The first version of this article included incorrect Ollama models.
Earlier Ollama results used the too-small default 2k context window, artificially harming the benchmark results.
The benchmark results appear to have uncovered a problem in the way OpenRouter was communicating with Hyperbolic. They fixed the issue on 11/24/24, shortly after it was pointed out.

Separating code reasoning and editing

2024-09-26T00:00:00+00:00

September 26, 2024

Aider now has experimental support for using two models to complete each coding task:

An Architect model is asked to describe how to solve the coding problem.
An Editor model is given the Architect’s solution and asked to produce specific code editing instructions to apply those changes to existing source files.

Splitting up “code reasoning” and “code editing” in this manner has produced SOTA results on aider’s code editing benchmark. Using o1-preview as the Architect with either DeepSeek or o1-mini as the Editor produced the SOTA score of 85%. Using the Architect/Editor approach also significantly improved the benchmark scores of many models, compared to their previous “solo” baseline scores (striped bars).

Motivation

This approach was motivated by the release of OpenAI’s o1 models. They are strong at reasoning, but often fail to output properly formatted code editing instructions. It helps to instead let them describe the solution however they prefer and then pass that output to a more traditional LLM. This second Editor LLM can then interpret the solution description and produce the code editing instructions needed to update the existing source code.

This approach has recently become attractive for aider due to rapid improvements in the speed and costs of frontier models. In particular, chaining older LLMs would have been quite slow and incompatible with aider’s goal of providing an interactive, pair programming AI coding experience.

Code reasoning and code editing

Normally aider asks the model to solve a coding problem in one prompt, asking the LLM to explain the solution and return a well formatted series of file edits. All of aider’s editing formats require the LLM to return source code edits in a specific text format, so that aider can process the edits and apply them to the local source files.

Because this all happens in a single prompt/response round trip to the LLM, the model has to split its attention between solving the coding problem and conforming to the edit format.

The Architect/Editor approach splits this into two inference steps, possibly using two different LLMs:

Solve the coding problem (Architect).
Turn the proposed solution into a series of well formed code edits (Editor).

The Architect/Editor approach allows the Architect to focus on solving the coding problem and describe the solution however comes naturally to it. Similarly, the Editor can focus all of its attention on properly formatting the edits without needing to reason much about how to solve the coding problem.
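The two inference steps can be sketched in Python as a simple chain of two calls. This is a minimal illustration, not aider's implementation: the `Model` type, the prompt wording and the stand-in models are all hypothetical, and a real setup would call an LLM API in their place.

```python
from typing import Callable

# A "model" here is any function that maps a prompt to a text response.
Model = Callable[[str], str]


def architect_editor(task: str, source: str, architect: Model, editor: Model) -> str:
    """Run the two inference steps and return the editor's output."""
    # Step 1: the architect describes a solution in whatever form it likes.
    proposal = architect(
        f"Describe how to change this code to accomplish the task.\n\n"
        f"Task: {task}\n\nCode:\n{source}"
    )
    # Step 2: the editor turns that proposal into properly formatted edits.
    return editor(
        f"Apply the proposed changes using the required edit format.\n\n"
        f"Proposal:\n{proposal}\n\nCode:\n{source}"
    )


# Stand-in models so the sketch runs without any API access.
def fake_architect(prompt: str) -> str:
    return "Change the printed text from Hello to Goodbye."


def fake_editor(prompt: str) -> str:
    return 'def greeting():\n    print("Goodbye")\n'


result = architect_editor(
    task="Say goodbye instead of hello",
    source='def greeting():\n    print("Hello")\n',
    architect=fake_architect,
    editor=fake_editor,
)
print(result)
```

Because the two roles are just separate calls, each can be assigned to a different model without changing the rest of the flow.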

We can assign the Architect and Editor roles to LLMs which are well suited to their needs. Strong reasoning models like o1-preview make excellent Architects, while the Editor role can be assigned to an appropriate model based on cost, speed and code editing skill.

Results

The graph above and the table below show aider’s code editing benchmark scores for various combinations of Architect and Editor models.

Some noteworthy observations:

Pairing o1-preview as Architect with either Deepseek or o1-mini as Editor sets a SOTA significantly above the previous best score. This result is obtained with the “whole” editing format, requiring the Editor to output a full updated copy of each edited source file. Both of these steps are therefore quite slow, so probably not practical for interactive use with aider.
Pairing OpenAI’s o1-preview with Anthropic’s Sonnet as the Editor produces the second best result. This is an entirely practical configuration for users able to work with both providers.
Pairing many models with themselves in the Architect/Editor configuration can provide significant benefits. Sonnet, GPT-4o and GPT-4o-mini all scored higher when used as an Architect/Editor pair.
Deepseek is surprisingly effective as an Editor model. It seems remarkably capable at turning proposed coding solutions into new, updated versions of the source files. Using the efficient “diff” editing format, Deepseek helps all the Architect models except for Sonnet.

Try it!

The development version of aider has built-in defaults to support Architect/Editor coding with o1-preview, o1-mini, GPT-4o and Claude 3.5 Sonnet. Run aider with --architect or get started quickly like this:

pip install -U aider-chat

# Change directory into a git repo
cd /to/your/git/repo

# Work with Claude 3.5 Sonnet as the Architect and Editor
export ANTHROPIC_API_KEY=your-key-goes-here
aider --sonnet --architect

# Work with OpenAI models, using gpt-4o as the Editor
export OPENAI_API_KEY=your-key-goes-here
aider --4o --architect
aider --o1-mini --architect
aider --o1-preview --architect

More info

Aider has a number of “chat modes”, and “architect” is available as a new chat mode. The --architect switch is a shortcut for --chat-mode architect. For more details, see documentation on aider’s chat modes.

Full results

Below are the benchmark results using various models as the Architect, paired with various models as the Editor. Each section includes a “baseline” result, where the model works by itself in aider’s normal “code” editing mode (not as part of an Architect/Editor configuration). This “solo” baseline represents the performance previously available when using this model with aider.

Architect	Editor	Edit Format	Pass Rate
o1-preview	o1-mini	whole	85.0%
o1-preview	deepseek	whole	85.0%
o1-preview	claude-3-5-sonnet	diff	82.7%
o1-preview	deepseek	diff	80.5%
o1-preview	gpt-4o	diff	80.5%
o1-preview	Baseline	diff	79.7%
claude-3.5-sonnet	claude-3.5-sonnet	diff	80.5%
claude-3.5-sonnet	deepseek	diff	78.9%
claude-3.5-sonnet	deepseek	whole	78.9%
claude-3.5-sonnet	Baseline	diff	77.4%
gpt-4o	gpt-4o	diff	75.2%
gpt-4o	deepseek	diff	74.4%
gpt-4o	deepseek	whole	73.7%
gpt-4o	Baseline	diff	71.4%
o1-mini	deepseek	whole	71.4%
o1-mini	gpt-4o	diff	70.7%
o1-mini	deepseek	diff	69.2%
o1-mini	Baseline	diff	61.1%
gpt-4o-mini	gpt-4o-mini	whole	60.2%
gpt-4o-mini	Baseline	whole	55.6%

o1-preview is SOTA on the aider leaderboard

2024-09-12T00:00:00+00:00

September 12, 2024

OpenAI o1-preview is SOTA on the aider leaderboard

o1-preview

OpenAI o1-preview scored 79.7% on aider’s code editing benchmark, a state of the art result. It achieved this result with the “whole” edit format, where the LLM returns a full copy of the source code file with changes.

It is much more practical to use aider’s “diff” edit format, which allows the LLM to return search/replace blocks to efficiently edit the source code. This saves significant time and token costs.
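To show why search/replace blocks are cheap to return and apply, here is a minimal Python sketch that applies one such block to a file's text. It assumes the `<<<<<<< SEARCH`, `=======` and `>>>>>>> REPLACE` marker layout that aider's diff format uses. The function itself is hypothetical and much stricter than aider, which tries hard to accept near-misses.

```python
def apply_search_replace(block: str, original: str) -> str:
    """Apply one SEARCH/REPLACE block to the text of a source file."""
    lines = block.splitlines(keepends=True)
    # Locate the three marker lines; anything before SEARCH (such as the
    # filename line) is ignored by this sketch.
    start = next(i for i, line in enumerate(lines) if line.startswith("<<<<<<< SEARCH"))
    middle = next(i for i, line in enumerate(lines) if line.startswith("======="))
    end = next(i for i, line in enumerate(lines) if line.startswith(">>>>>>> REPLACE"))
    search = "".join(lines[start + 1 : middle])
    replace = "".join(lines[middle + 1 : end])
    if search not in original:
        raise ValueError("SEARCH text not found in the original file")
    # Replace only the first occurrence of the SEARCH text.
    return original.replace(search, replace, 1)


block = '''greeting.py
<<<<<<< SEARCH
def greeting():
    print("Hello")
=======
def greeting():
    print("Goodbye")
>>>>>>> REPLACE
'''
original = 'def greeting():\n    print("Hello")\n\ngreeting()\n'
updated = apply_search_replace(block, original)
print(updated)
```

The model only has to emit the lines that change plus enough context to locate them, rather than a full copy of the file.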

Using the diff edit format the o1-preview model had a strong benchmark score of 75.2%. This likely places o1-preview between Sonnet and GPT-4o for practical use, but at significantly higher cost.

o1-mini

OpenAI o1-mini is priced similarly to GPT-4o and Claude 3.5 Sonnet, but scored below those models. It also works best with the whole edit format.

Future work

The o1-preview model had trouble conforming to aider’s diff edit format. The o1-mini model had trouble conforming to both the whole and diff edit formats. Aider is extremely permissive and tries hard to accept anything close to the correct formats.

It is surprising that such strong models had trouble with the syntactic requirements of simple text output formats. It seems likely that aider could optimize its prompts and edit formats to better harness the o1 models.

Using aider with o1

OpenAI’s new o1 models are supported in v0.57.0 of aider:

aider --model o1-mini
aider --model o1-preview

These are initial benchmark results for the o1 models, based on aider v0.56.1-dev. See the aider leaderboards for up-to-date results based on the latest aider releases.

Model	Percent completed correctly	Percent using correct edit format	Command	Edit format
o1-preview (whole)	79.7%	100.0%	`aider --model o1-preview`	whole
claude-3.5-sonnet (diff)	77.4%	99.2%	`aider --sonnet`	diff
o1-preview (diff)	75.2%	84.2%	`aider --model o1-preview`	diff
claude-3.5-sonnet (whole)	75.2%	100.0%	`aider --model openrouter/anthropic/claude-3.5-sonnet --edit-format whole`	whole
gpt-4o-2024-08-06 (diff)	71.4%	98.5%	`aider --model openai/gpt-4o-2024-08-06`	diff
o1-mini (whole)	70.7%	90.0%	`aider --model o1-mini`	whole
o1-mini (diff)	62.4%	85.7%	`aider --model o1-mini --edit-format diff`	diff
gpt-4o-mini (whole)	55.6%	100.0%	`aider --model gpt-4o-mini`	whole

Sonnet seems as good as ever

2024-08-26T00:00:00+00:00

August 26, 2024

Recently there has been a lot of speculation that Sonnet has been dumbed-down, nerfed or is otherwise performing worse. Sonnet seems as good as ever, when performing the aider code editing benchmark via the API.

Below is a graph showing the performance of Claude 3.5 Sonnet over time. It shows every clean, comparable benchmark run performed since Sonnet launched. Benchmarks were performed for various reasons, usually to evaluate the effects of small changes to aider’s system prompts.

The graph shows variance, but no indication of a noteworthy degradation. There is always some variance in benchmark results, typically +/- 2% between runs with identical prompts.

It’s worth noting that these results would not capture any changes made to Anthropic web chat’s use of Sonnet.

This graph shows the performance of Claude 3.5 Sonnet on aider’s code editing benchmark over time. ‘Pass Rate 1’ represents the initial success rate, while ‘Pass Rate 2’ shows the success rate after a second attempt with a chance to fix testing errors. The aider LLM code editing leaderboard ranks models based on Pass Rate 2.
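The relationship between the two pass rates can be sketched in Python. The per-exercise outcomes below are invented for illustration and are not benchmark data; the function name is likewise hypothetical.

```python
from typing import Optional

# Hypothetical per-exercise outcomes: the attempt number (1 or 2) on which the
# tests passed, or None if both attempts failed.
passed_on_attempt: list[Optional[int]] = [1, 1, 2, None, 1, 2, None, 1]


def pass_rate(results: list[Optional[int]], max_attempt: int) -> float:
    """Percentage of exercises solved within max_attempt attempts."""
    solved = sum(1 for r in results if r is not None and r <= max_attempt)
    return 100.0 * solved / len(results)


pass_rate_1 = pass_rate(passed_on_attempt, max_attempt=1)
pass_rate_2 = pass_rate(passed_on_attempt, max_attempt=2)
print(f"Pass Rate 1: {pass_rate_1:.1f}%  Pass Rate 2: {pass_rate_2:.1f}%")
```

Pass Rate 2 can never be lower than Pass Rate 1, since every exercise solved on the first attempt also counts toward the second.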

LLMs are bad at returning code in JSON

2024-08-14T00:00:00+00:00

August 14, 2024

LLMs produce lower quality code if they’re asked to return it as part of a structured JSON response. This seems to be true for many top models, including those with specialized support for JSON. Benchmarks show that models struggle with syntax errors in the code they write, related to quoting and escaping it into JSON. The benchmark results also imply a decreased capacity for solving coding problems due to the burden of JSON formatting.

Figure 1: Aider coding benchmark scores of models using either plain markdown text or JSON to return code. Pass rate (%) averaged over 5 runs. Models produce better code when they return it as markdown text, as compared to returning code in a structured JSON response.

Background

People often ask why aider uses a plain text format for LLMs to specify code edits (below), rather than relying on LLM tools and structured JSON responses.

greeting.py
<<<<<<< SEARCH
def greeting():
    print("Hello")
=======
def greeting():
    print("Goodbye")
>>>>>>> REPLACE

People expect that it would be easier and more reliable to use tool calls, which would involve a structured JSON response more like this:

{
    "filename": "greeting.py",
    "search": "def greeting():\n    print(\"Hello\")\n",
    "replace": "def greeting():\n    print(\"Goodbye\")\n"
}

This question becomes increasingly relevant as LLM providers continue to improve their tooling for reliably generating JSON. For example, OpenAI recently announced the ability to strictly enforce that JSON responses will be syntactically correct and conform to a specified schema.

But just producing valid JSON is not sufficient for AI code generation – the code inside the JSON matters too. It has to be high quality code that solves the assigned coding task without errors or bugs. Unfortunately, LLMs write worse code when they’re asked to wrap it in JSON.

In some sense this shouldn’t be surprising. Just look at the very simple JSON example above, with the escaped quotes \" and newlines \n mixed into the code. Imagine the additional complexity if the code itself contained quoted strings with their own escape sequences.

Would you write better code by typing it out normally or typing it as a properly escaped JSON string?
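The extra escaping burden is easy to see with Python's standard `json` module. The sample function below is made up for illustration; it contains its own escaped quotes and a `\n` escape, so every backslash and quote has to be escaped a second time when wrapped in JSON.

```python
import json

# A small piece of code that contains its own escape sequences.
code = r'''def greeting():
    print("Say \"hi\"\n")
'''

wrapped = json.dumps({"content": code})
print(wrapped)

# The JSON form needs many more backslashes than the code itself.
print("backslashes in code:", code.count("\\"))
print("backslashes in JSON:", wrapped.count("\\"))

# Decoding recovers the original code exactly.
assert json.loads(wrapped)["content"] == code
```

A serializer gets this right mechanically, but a model generating the JSON text token by token has to track both layers of escaping at once.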

Quantifying the benefits of plain text

Previous aider benchmark results showed the superiority of returning code as plain text compared to JSON-wrapped function calls. Those results were obtained over a year ago, against models far less capable than those available today. OpenAI’s newly announced support for “strict” JSON suggests the possibility that modern models might be able to return quality code inside a structured JSON response.

The results presented here are based on the aider “code editing” benchmark of 133 practice exercises from the Exercism Python repository. The benchmark was simplified somewhat to focus on the differences between plain text and JSON responses. In particular, models were restricted to a single attempt to solve each task without a second try to fix errors.

The performance of each model was compared across different strategies for returning code:

Markdown – the model returned the whole source code file in standard markdown triple-backtick fences.
JSON – the model used a tool function call to return the whole source code file in a structured JSON response.
JSON (strict) – the same as the “JSON” strategy, but with strict=True. Only gpt-4o-2024-08-06 supported this setting.

The markdown strategy was the same as aider’s “whole” edit format, where the LLM returns an entire updated copy of the source file like this:

Here is the program you asked for which prints "Hello":

greeting.py
```
def greeting():
    print("Hello")
```

Both JSON strategies required the LLM to call the write_file function with an explanation/plan and the entire updated copy of the source file. The LLM didn’t have to specify the filename, since the benchmark operates on one source file at a time.

{
    "explanation": "Here is the program you asked for which prints \"Hello\"",
    "content": "def greeting():\n    print(\"Hello\")\n"
}

This experimental setup was designed to quantify the effects of JSON-wrapping on the LLM’s ability to write code to solve a task.

Results

Four of the strongest code editing models were benchmarked to assess the impact of JSON-wrapping code:

claude-3-5-sonnet-20240620
deepseek-coder (V2 0724)
gpt-4o-2024-05-13
gpt-4o-2024-08-06

Each combination of model and code wrapping strategy was benchmarked 5 times on all 133 problems.

Overall coding skill

As shown in Figure 1, all of the models did worse on the benchmark when asked to return code in a structured JSON response. Most did significantly worse, performing well below their result with the markdown strategy.

Some noteworthy observations:

OpenAI’s gpt-4o-2024-05-13 was the only model where the markdown and JSON results were close. Using JSON only dropped the score by 0.4 percentage points, a difference which is within the margin of error for 5 trials.
The use of OpenAI’s new strict mode offered no improvement as compared to non-strict JSON. Both JSON results were well below the markdown result.
The results from Sonnet and DeepSeek Coder suffered the worst harm from JSON wrapping.

Syntax errors

Models tend to make more syntax errors in the code they write when asked to wrap it in JSON. The models can reliably produce valid JSON, but code inside is more prone to syntax errors.

Figure 2 shows the number of syntax errors found in the code produced by each model and code wrapping strategy. It totals up the SyntaxError and IndentationError errors from all 5 runs, for each model and strategy combination.
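One way to produce this kind of tally is to try parsing each generated file and record which error class is raised. The sketch below is an assumption about how such counting could be done, not the benchmark's actual method; the sample snippets and function name are invented.

```python
import ast
from collections import Counter
from typing import Optional


def syntax_error_kind(source: str) -> Optional[str]:
    """Return the kind of parse error in source, or None if it parses."""
    try:
        ast.parse(source)
    except IndentationError:
        # IndentationError is a subclass of SyntaxError, so check it first.
        return "IndentationError"
    except SyntaxError:
        return "SyntaxError"
    return None


samples = [
    'def greeting():\n    print("Hello")\n',  # valid code
    'def greeting():\nprint("Hello")\n',      # function body is not indented
    "print('There'll be bottles')\n",          # unescaped single quote
]

tally = Counter()
for source in samples:
    kind = syntax_error_kind(source)
    if kind is not None:
        tally[kind] += 1
print(dict(tally))
```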

Below is an example of a SyntaxError created by gpt-4o-2024-05-13 using the JSON code wrapping strategy. It appears that the model got confused about escaping and quoting while trying to format the JSON response.

Traceback (most recent call last):
  ...   
  File "bottle-song/bottle_song.py", line 9
    lyrics.append(f'There'll be {i - 1} green bottles hanging on the wall.')
                                                                          ^
SyntaxError: unterminated string literal (detected at line 9)

The problematic line of code contains a single-quoted string which also contains a single-quote character. It should have been output as the following chunk of JSON, with a double backslash in There\\'ll. The first backslash JSON-escapes the second, so that a single \ survives JSON-decoding and produces There\'ll in the resulting code. That would correctly escape the single-quote inside the single-quoted string.

...lyrics.append(f'There\\'ll be {i - 1} green bottles hanging on the wall.')\n...
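The two layers of escaping can be verified with Python's `json` and `ast` modules. This sketch wraps the line in JSON string quotes, decodes it, and checks that the result parses; it then removes the backslash to reproduce the failure. The variable names are only for illustration.

```python
import ast
import json

# The JSON string, written as a raw Python string so the backslashes appear
# exactly as they would in the model's JSON output.
json_chunk = r'''"lyrics.append(f'There\\'ll be {i - 1} green bottles hanging on the wall.')\n"'''

decoded = json.loads(json_chunk)
print(decoded)

# The decoded code parses cleanly because the single quote is escaped.
ast.parse(decoded)

# Without the backslash, the string literal ends early and parsing fails.
broken = decoded.replace("\\'", "'")
try:
    ast.parse(broken)
except SyntaxError as err:
    print("SyntaxError:", err.msg)
```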

Figure 2: Number of SyntaxError and IndentationError errors found in model generated code, totaled from 5 runs. Models tend to make more syntax and formatting errors when asked to wrap code in JSON.

Beyond syntax errors

Sonnet’s results seem to indicate that the negative effects of JSON-wrapping go beyond just syntactic difficulties. Sonnet avoided syntax errors regardless of the code wrapping strategy, but its benchmark scores in Figure 1 were nonetheless lower with JSON. This implies that JSON-wrapping may distract or challenge models in a way that reduces their ability to reason about solving coding problems.

Conclusions

While the specific results differ from the similar July 2023 experiments, the conclusion remains unchanged: LLMs are bad at returning code in structured JSON responses.

OpenAI appears to be making progress in allowing LLMs to return JSON-wrapped code without harming the code quality. But it seems premature to consider switching from plain text to JSON-wrapped code at this time.

Notes on the aider leaderboard

The results presented here are not directly comparable to results from the main aider LLM leaderboard. A number of settings were changed to simplify the benchmark in order to focus on comparing plain text and JSON-wrapped code.

Coding with Llama 3.1, new DeepSeek Coder & Mistral Large

July 25, 2024

Five noteworthy models have been released in the last few days, with a wide range of code editing capabilities. Here are their results from aider’s code editing leaderboard with Claude 3.5 Sonnet and the best GPT-3.5 model included for scale.

77% claude-3.5-sonnet
73% DeepSeek Coder V2 0724
66% llama-3.1-405b-instruct
60% Mistral Large 2 (2407)
59% llama-3.1-70b-instruct
58% gpt-3.5-turbo-0301
38% llama-3.1-8b-instruct

You can code with all of these models using aider like this:

$ python -m pip install -U aider-chat

# Change directory into a git repo to work on
$ cd /to/your/git/repo

$ export DEEPSEEK_API_KEY=your-key-goes-here
$ aider --model deepseek/deepseek-coder

$ export MISTRAL_API_KEY=your-key-goes-here
$ aider --model mistral/mistral-large-2407

$ export OPENROUTER_API_KEY=your-key-goes-here
$ aider --model openrouter/meta-llama/llama-3.1-405b-instruct
$ aider --model openrouter/meta-llama/llama-3.1-70b-instruct
$ aider --model openrouter/meta-llama/llama-3.1-8b-instruct

See the installation instructions and other documentation for more details.

DeepSeek Coder V2 0724

DeepSeek Coder V2 0724 was by far the biggest surprise and strongest code editing model, coming in 2nd on the leaderboard. It can efficiently edit code with SEARCH/REPLACE, unlike the prior DeepSeek Coder version. This unlocks the ability to edit large files.

This new Coder version got 73% on the benchmark, very close to Sonnet’s 77% but 20-50X less expensive!

Llama 3.1

Meta released the Llama 3.1 family of models, which have performed well on many evals.

The flagship Llama 3.1 405B instruct only secured #7 on aider’s leaderboard, well behind frontier models like Claude 3.5 Sonnet & GPT-4o.

The 405B model can use SEARCH/REPLACE to efficiently edit code, but with a decrease in the benchmark score. When using this “diff” editing format, its score dropped from 66% to 64%.

The smaller 70B model was competitive with GPT-3.5, while the 8B model lags far behind. Both seem unable to reliably use SEARCH/REPLACE to edit files. This limits them to editing smaller files that can fit into their output token limit.

Mistral Large 2 (2407)

Mistral Large 2 (2407) scored only 60% on aider’s code editing benchmark. This puts it just ahead of the best GPT-3.5 model. It doesn’t seem able to reliably use SEARCH/REPLACE to efficiently edit code, which limits its use to small source files.

Sonnet is the opposite of lazy

July 01, 2024

Claude 3.5 Sonnet represents a step change in AI coding. It is incredibly industrious, diligent and hard working. Unexpectedly, this presented a challenge: Sonnet was often writing so much code that it was hitting the 4k output token limit, truncating its coding in mid-stream.

Aider now works around this 4k limit and allows Sonnet to produce as much code as it wants. The result is surprisingly powerful. Sonnet’s score on aider’s refactoring benchmark jumped from 55.1% up to 64.0%. This moved Sonnet into second place, ahead of GPT-4o and behind only Opus.

Users who tested Sonnet with a preview of aider’s latest release were thrilled:

Works like a charm. It is a monster. It refactors files of any size like it is nothing. The continue trick with Sonnet is truly the holy grail. Aider beats [other tools] hands down. I’m going to cancel both subscriptions. – Emasoft
Thanks heaps for this feature - it’s a real game changer. I can be more ambitious when asking Claude for larger features. – cngarrison
Fantastic…! It’s such an improvement not being constrained by output token length issues. [I refactored] a single JavaScript file into seven smaller files using a single Aider request. – John Galt

Hitting the 4k token output limit

All LLMs have various token limits, the most familiar being their context window size. But they also have a limit on how many tokens they can output in response to a single request. Sonnet and the majority of other models are limited to returning 4k tokens.

Sonnet’s amazing work ethic caused it to regularly hit this 4k output token limit for a few reasons:

1. Sonnet is capable of outputting a very large amount of correct, complete new code in one response.
2. Similarly, Sonnet can specify long sequences of edits in one go, like changing a majority of lines while refactoring a large file.
3. Sonnet tends to quote large chunks of a file when performing SEARCH/REPLACE edits. Beyond token limits, this is very wasteful.

Good problems

Problems (1) and (2) are “good problems” in the sense that Sonnet is able to write more high quality code than any other model! We just don’t want it to be interrupted prematurely by the 4k output limit.

Aider now allows Sonnet to return code in multiple 4k token responses. Aider seamlessly combines them so that Sonnet can return arbitrarily long responses. This gets all the upsides of Sonnet’s prolific coding skills, without being constrained by the 4k output token limit.
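The general idea of stitching several truncated responses into one can be sketched as a simple loop. This is only an illustration under stated assumptions, not aider's implementation: the `complete` callable, its `(chunk, hit_limit)` return value and the `make_fake_model` stand-in are all hypothetical and do not correspond to any real API.

```python
from typing import Callable, Tuple

# Hypothetical interface: given the text produced so far, return the next
# chunk plus a flag saying whether the output limit cut the response short.
CompleteFn = Callable[[str], Tuple[str, bool]]


def generate_unbounded(complete: CompleteFn, max_rounds: int = 20) -> str:
    """Keep asking for continuations until the model finishes on its own."""
    so_far = ""
    for _ in range(max_rounds):
        chunk, hit_limit = complete(so_far)
        so_far += chunk
        if not hit_limit:
            break
    return so_far


def make_fake_model(full_text: str, limit: int) -> CompleteFn:
    """Stand-in for an LLM that can emit at most `limit` characters per call."""

    def complete(prefix: str) -> Tuple[str, bool]:
        remaining = full_text[len(prefix):]
        return remaining[:limit], len(remaining) > limit

    return complete


long_answer = "abcdefghij"
combined = generate_unbounded(make_fake_model(long_answer, limit=4))
assert combined == long_answer
```

The `max_rounds` cap is a design choice for the sketch, so that a model which never stops cannot loop forever.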

Wasting tokens

Problem (3) is more complicated, as Sonnet isn’t just being stopped early – it’s actually wasting a lot of tokens, time and money.

Faced with a few small changes spread far apart in a source file, Sonnet would often prefer to do one giant SEARCH/REPLACE operation of almost the entire file. It would be far faster and less expensive to instead do a few surgical edits.

Aider now prompts Sonnet to discourage these long-winded SEARCH/REPLACE operations and promotes much more concise edits.
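To get a feel for how much a whole-file replacement can waste, the sketch below uses Python's standard `difflib` to compare the number of lines touched by minimal edits with the number of lines quoted when replacing the entire file. The 100-line file and the `changed_line_count` helper are made up for illustration, and lines are only a rough proxy for tokens.

```python
import difflib


def changed_line_count(before, after):
    """Count lines that appear in non-matching hunks on either side."""
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    total = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            total += (i2 - i1) + (j2 - j1)
    return total


# A hypothetical 100-line file with two small changes spread far apart.
before = [f"line {n}" for n in range(100)]
after = list(before)
after[3] = "changed near the top"
after[90] = "changed near the bottom"

surgical = changed_line_count(before, after)  # lines in two small hunks
whole_file = len(before) + len(after)         # one giant SEARCH/REPLACE
print(surgical, whole_file)
```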

Aider with Sonnet

The latest release of aider has specialized support for Claude 3.5 Sonnet:

Aider allows Sonnet to produce as much code as it wants, by automatically and seamlessly spreading the response out over a sequence of 4k token API responses.
Aider carefully prompts Sonnet to be concise when proposing code edits. This reduces Sonnet’s tendency to waste time, tokens and money returning large chunks of unchanging code.
Aider now uses Claude 3.5 Sonnet by default if the ANTHROPIC_API_KEY is set in the environment.

See aider’s install instructions for more details, but you can get started quickly with aider and Sonnet like this:

$ python -m pip install -U aider-chat

$ export ANTHROPIC_API_KEY=your-key-goes-here # Mac/Linux
$ setx   ANTHROPIC_API_KEY your-key-goes-here # Windows, restart shell after setx

$ aider

Aider is SOTA for both SWE Bench and SWE Bench Lite

June 02, 2024

Aider scored 18.9% on the main SWE Bench benchmark, achieving a state-of-the-art result. The current top leaderboard entry is 13.8% from Amazon Q Developer Agent. The best result reported elsewhere seems to be 13.9% from Devin.

This result on the main SWE Bench builds on aider’s recent SOTA result on the easier SWE Bench Lite.

All of aider’s results reported here are pass@1 results, obtained without using the SWE Bench hints_text. Aider was benchmarked on the same 570 randomly selected SWE Bench problems that were used in the Devin evaluation. See the references for more details on the data presented in this chart.

Interactive, not agentic

Aider achieved this result mainly through its existing features that focus on static code analysis, reliable LLM code editing, and pragmatic UX for automatically fixing linting and testing errors. Aider intentionally has quite limited and narrow “agentic behavior” to avoid long delays, high token costs and the need for users to repeatedly code review incorrect solutions. It’s also worth noting that aider currently does not use RAG, vector search, tools or give the LLM access to search the web or unilaterally execute code.

Aider is first and foremost an interactive tool for engineers to get real work done in real code bases using a chat interface. Aider provides a pair programming UX where users can ask for a change and see code edits performed in real-time. Aider can also offer additional help like fixing lint or test errors, but the user is always in full interactive control. This allows them to quickly steer misunderstandings back on course and avoid wasting time and token costs.

Benchmark methodology

Benchmarking was conducted as follows:

Aider with GPT-4o was launched in each problem’s git repository with the problem statement submitted as the opening chat message from “the user”.
After that aider ran as normal, except all of aider’s suggestions were always accepted without user approval.
A simple harness was used to retry the SWE Bench problem if aider produced code that wasn’t plausibly correct. Plausibly correct means that aider reported that it had successfully edited the repo without causing syntax errors or breaking any pre-existing tests.
If the solution from aider with GPT-4o wasn’t plausible, the harness launched aider to try again from scratch using Claude 3 Opus.
If no plausible solution was found after those two tries, the harness picked the “most plausible” solution with the fewest edit/lint/test problems.
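The selection logic in these steps can be sketched roughly as follows. This is not the actual harness code, which lives in the aider SWE Bench repository. The `Attempt` class, `pick_solution` function and `run_attempt` callback are hypothetical names, and "fewest problems" is simplified here to a plain sum of error counts.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Attempt:
    model: str
    edit_errors: int = 0
    lint_errors: int = 0
    test_errors: int = 0

    @property
    def problems(self) -> int:
        return self.edit_errors + self.lint_errors + self.test_errors

    @property
    def plausible(self) -> bool:
        # Plausible: no outstanding edit, lint or test problems were reported.
        return self.problems == 0


def pick_solution(models: List[str], run_attempt: Callable[[str], Attempt]) -> Attempt:
    """Try each model in order and stop at the first plausible attempt."""
    attempts = []
    for model in models:
        attempt = run_attempt(model)
        if attempt.plausible:
            return attempt
        attempts.append(attempt)
    # No plausible attempt: fall back to the one with the fewest problems.
    return min(attempts, key=lambda a: a.problems)


# Canned results standing in for real aider runs.
canned = {
    "gpt-4o": Attempt("gpt-4o", lint_errors=2, test_errors=1),
    "opus": Attempt("opus", test_errors=1),
}
chosen = pick_solution(["gpt-4o", "opus"], canned.__getitem__)
print(chosen.model, chosen.problems)
```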

It’s important to be clear that aider and the benchmark harness only had access to the pre-existing tests in each problem’s repo. The held out “acceptance tests” were only used after benchmarking to compute statistics on which problems aider correctly resolved.

This is the same approach that was used for aider’s recent SOTA result on SWE Bench Lite. For the Lite benchmark, aider alternated between GPT-4o and Opus for up to six total attempts. To manage the cost of running the main SWE Bench benchmark, aider was limited to two total attempts: one with GPT-4o and one with Opus.

For a detailed discussion of the benchmark methodology, see the article about aider’s SWE Bench Lite results. Also, the aider SWE Bench repository on GitHub contains the harness and statistics code used for the benchmarks.

The benchmarking process was similar to how a developer might use aider to resolve a GitHub issue:

They could launch aider in their repo with the command below, which tells aider they want to accept every suggestion and to use pytest to run tests.
$ aider --yes --test-cmd pytest
They could start the chat by pasting in the URL or text of a GitHub issue. Aider will pull in the URL’s content and then try and resolve the issue.
If aider doesn’t produce code that lints and tests clean, the user might decide to use git to revert the changes, and try again with aider --opus.

Aider with GPT-4o alone was SOTA

Using aider with GPT-4o to make a single attempt at resolving each problem achieved a score of 17.0%. This was itself a state-of-the-art result, before being surpassed by the main result being reported here that used aider with both GPT-4o & Opus.

Aider with GPT-4o & Opus

The benchmark harness started by using aider with GPT-4o to try and resolve each problem. For problems where this didn’t produce a plausible solution, the harness tried again using aider with Opus. So at most, two attempts were made for each problem.

The table below breaks down the proposed solutions that were found from each attempt at the 570 problems. A proposed solution is either:

A plausible solution where aider reported no outstanding errors from editing, linting and testing.
Or, the “most plausible” solution generated by either attempt, with the fewest outstanding editing, linting or testing errors.

The table also provides details on the 108 solutions that were ultimately verified as correctly resolving their issue.

Attempt	Agent	Number of proposed solutions	Percent of proposed solutions	Number of correctly resolved solutions	Percent of correctly resolved solutions	Score on SWE Bench
1	Aider with GPT-4o	419	73.5%	87	80.6%	15.3%
2	Aider with Opus	151	26.5%	21	19.4%	3.7%
Total		570	100%	108	100%	18.9%
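The percentages in the table follow directly from the raw counts, as this short calculation shows. The variable names are chosen for illustration, and the counts are copied from the table.

```python
TOTAL_PROBLEMS = 570

proposed = {"Aider with GPT-4o": 419, "Aider with Opus": 151}
resolved = {"Aider with GPT-4o": 87, "Aider with Opus": 21}


def pct(part, whole):
    """Percentage rounded to one decimal place, as in the table."""
    return round(100 * part / whole, 1)


total_resolved = sum(resolved.values())
for agent in proposed:
    print(
        agent,
        pct(proposed[agent], TOTAL_PROBLEMS),  # percent of proposed solutions
        pct(resolved[agent], total_resolved),  # percent of resolved solutions
        pct(resolved[agent], TOTAL_PROBLEMS),  # contribution to the score
    )
print("Total", total_resolved, pct(total_resolved, TOTAL_PROBLEMS))
```

The two per-attempt scores, 15.3% and 3.7%, add up to 19.0% only because each is rounded separately. The overall score computed from the raw counts is 108 / 570, which rounds to 18.9%.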

Non-plausible but correct solutions?

A solution doesn’t actually have to be plausible in order to correctly resolve the issue. Recall that plausible is simply defined as aider reporting that it successfully completed all file edits, repaired and resolved any linting errors and resolved any test failures. But there are many reasons why aider might fail to do those things and yet still produce a solution that will pass acceptance testing:

There may have been pre-existing failing tests in the repo, before aider even started working on the SWE Bench problem. Aider may not have resolved such issues, and yet they may not be relevant to the acceptance testing. The SWE Bench acceptance testing just confirms that tests pass or fail in the same pattern as the “gold patch” developed by a human to resolve the problem. Some tests may fail during acceptance testing, and that’s ok as long as they failed for the gold patch too.
There may have been pre-existing linting problems in the repo. If lingering linting issues affected code paths that are not well tested, they may not impact acceptance testing.
Aider may have reported file editing errors because it thought the LLM specified edits that it wasn’t able to successfully apply. This can only happen when the LLM specified edits in a way that doesn’t comply with the editing instructions in the system prompt. Given that the LLM isn’t complying with the system prompt, it may have become confused and asked for redundant or otherwise irrelevant edits. Such outstanding edit errors might not be fatal for acceptance testing.
Etc.

Keeping all this in mind, we can understand why GPT-4o accounts for 15.3% of the benchmark score in the table above, but benchmarking with just one attempt of aider with GPT-4o scored 17.0%. When an Opus attempt is allowed after GPT-4o, it may propose some incorrect solutions which are “more plausible” than some of GPT-4o’s non-plausible solutions. These more plausible, incorrect solutions can eclipse some of the earlier non-plausible correct solutions that GPT-4o generated. This is why GPT-4o’s score in the table showing the combined GPT-4o & Opus results (15.3%) is lower than the result from just one try using aider with GPT-4o (17.0%).

For these reasons, adding additional attempts is not guaranteed to monotonically increase the number of resolved problems. New solutions may resolve some new problems but they may also eclipse and discard some of the previous non-plausible correct solutions.
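A toy example makes the eclipsing effect concrete. The three problems and their outcomes below are invented for illustration. Each attempt is a pair of the number of outstanding problems aider reported and whether the solution would pass acceptance testing. The `choose` function is a simplified stand-in for the harness's selection rule.

```python
def choose(attempts):
    """Return the first plausible attempt, else the one with fewest problems."""
    for attempt in attempts:
        if attempt[0] == 0:
            return attempt
    return min(attempts, key=lambda a: a[0])


# (outstanding problems, passes acceptance tests); first GPT-4o, then Opus.
problems = {
    # GPT-4o: non-plausible but correct. Opus: plausible but incorrect.
    "A": [(1, True), (0, False)],
    # GPT-4o: non-plausible and incorrect. Opus: plausible and correct.
    "B": [(2, False), (0, True)],
    # GPT-4o: plausible and correct, so no Opus attempt is made.
    "C": [(0, True)],
}

one_attempt = {name for name, tries in problems.items() if choose(tries[:1])[1]}
two_attempts = {name for name, tries in problems.items() if choose(tries)[1]}
print(sorted(one_attempt), sorted(two_attempts))
```

In this made-up data the second attempt gains problem B but loses problem A, so the set of resolved problems changes even though its size does not.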

Luckily, the net effect of additional attempts usually increases or at least maintains the number of resolved problems. This was the case for all the attempts made in both this main SWE Bench result and the earlier Lite result.

Computing the benchmark score

The benchmark harness produced one proposed solution for each of the 570 SWE Bench problems.

A separate evaluation script was used to test each of these solutions with the full test suite, including the held out acceptance tests. For this final acceptance testing, any edits that aider made to tests were discarded. This ensured that the correct, unmodified test suite was used for acceptance testing. The evaluation script compared each proposed solution’s test results with results from testing the “gold” patch that was developed by a human to correctly resolve the issue. If they matched, the proposed solution correctly resolved the issue.
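The comparison against the gold patch can be pictured as matching per-test outcomes. The sketch below is a simplification for illustration, not the SWE Bench evaluation script. The `resolves_issue` function and the test names are hypothetical.

```python
def resolves_issue(proposed_results, gold_results):
    """A proposed solution counts as resolved when every test has the same
    pass/fail outcome as it does under the human-written gold patch."""
    return proposed_results == gold_results


# Hypothetical outcomes: True means the test passed.
gold = {"test_new_feature": True, "test_existing": True, "test_flaky_legacy": False}

# Same pattern as the gold patch, including the test that fails in both.
matching = {"test_new_feature": True, "test_existing": True, "test_flaky_legacy": False}

# The acceptance test for the issue still fails, so the pattern differs.
not_matching = {"test_new_feature": False, "test_existing": True, "test_flaky_legacy": False}

print(resolves_issue(matching, gold), resolves_issue(not_matching, gold))
```

Note that `test_flaky_legacy` fails in the matching case too. That does not count against the solution, because it fails for the gold patch as well.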

These acceptance tests were only ever run outside of aider and the benchmark harness, and only to compute statistics about the correctly resolved instances. They were never run, used, or even visible during aider’s attempts to resolve the problems.

Aider correctly resolved 108 out of 570 SWE Bench instances that were benchmarked, or 18.9%.

Acknowledgments

Much thanks to the team behind the SWE Bench family of AI coding benchmarks. Also thanks to Albert Örwall who has dockerized the SWE Bench evaluation scripts making it faster, easier, and more reliable to run the acceptance tests.

References

All of aider’s results reported here are pass@1 results, obtained without using the SWE Bench hints_text.

The “aider agent” internally makes multiple “attempts” at solving the problem, but it picks and returns one single candidate solution. Only that one candidate solution is evaluated with the acceptance tests and contributes to the benchmark score. Thus it is a pass@1 result.

This is in contrast to a pass@N result for N>1, where N attempts are made and all N solutions are evaluated by the acceptance tests. If any of the N solutions pass, that counts as a pass@N success.
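The distinction can be expressed in a few lines. The function names and the sample outcomes below are hypothetical and only illustrate the two definitions.

```python
def pass_at_n(candidate_passes):
    """pass@N: success if any of the N evaluated candidates passes."""
    return any(candidate_passes)


def pass_at_1(candidate_passes, chosen_index):
    """pass@1 with internal selection: only the one chosen candidate is
    evaluated by the acceptance tests, whatever the others would have done."""
    return candidate_passes[chosen_index]


# Hypothetical acceptance outcomes for three internal attempts at one problem.
outcomes = [False, True, False]

print(pass_at_n(outcomes))                  # True: the second candidate passes
print(pass_at_1(outcomes, chosen_index=0))  # False: the chosen one fails
```

The agent has to pick its candidate without seeing the acceptance results, which is why a pass@1 score is a stricter measure than pass@N on the same set of attempts.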

Below are the references for the other pass@1 unhinted SWE-Bench results displayed in the graph at the beginning of this article.

The graph contains average pass@1 results for AutoCodeRover. The AutoCodeRover GitHub page features their pass@3 results without being clearly labeled. Table 2 of their paper reports an ACR-avg result of 10.59% which is an average pass@1 result.