<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://aider.chat/feed.xml" rel="self" type="application/atom+xml" /><link href="https://aider.chat/" rel="alternate" type="text/html" /><updated>2025-01-02T21:26:38+00:00</updated><id>https://aider.chat/feed.xml</id><title type="html">aider</title><subtitle>aider is AI pair programming in your terminal</subtitle><entry><title type="html">o1 tops aider’s new polyglot leaderboard</title><link href="https://aider.chat/2024/12/21/polyglot.html" rel="alternate" type="text/html" title="o1 tops aider’s new polyglot leaderboard" /><published>2024-12-21T00:00:00+00:00</published><updated>2024-12-21T00:00:00+00:00</updated><id>https://aider.chat/2024/12/21/polyglot</id><content type="html" xml:base="https://aider.chat/2024/12/21/polyglot.html"><![CDATA[<p class="post-date">December 21, 2024</p>

<h1 class="no_toc" id="o1-tops-aiders-new-polyglot-leaderboard">o1 tops aider’s new polyglot leaderboard</h1>

<canvas id="editChart" width="800" height="450" style="margin-top: 20px"></canvas>

<p>OpenAI’s new o1 model with “high” reasoning effort
gets the top score on the
new 
<a href="/docs/leaderboards/">aider polyglot leaderboard</a>, significantly ahead of
other top LLMs.
The new polyglot benchmark uses many popular coding languages
and was designed to be 
<em>much more challenging</em> than aider’s original
<a href="/docs/leaderboards/edit.html">code editing benchmark</a>.
This more clearly distinguishes 
the performance of
today’s strongest coding models and
leaves headroom for future LLMs.</p>

<p class="note">See the main 
<a href="https://aider.chat/docs/leaderboards/">aider leaderboard</a>
for benchmark results from more models.
This article only contains a snapshot
of results at the time of publication.</p>

<h2 id="the-polyglot-benchmark">The polyglot benchmark</h2>

<p>Like aider’s original code editing benchmark,
the new polyglot benchmark is based on Exercism
coding exercises.</p>

<p>The new polyglot benchmark:</p>

<ul>
  <li>Contains coding problems in C++, Go, Java, JavaScript, Python and Rust. 
The old benchmark was solely based on Python exercises.</li>
  <li>Focuses on the <em>most difficult</em> 225 exercises out of the 697 that
Exercism provides for those languages.
The old benchmark simply included all 133 Python exercises,
regardless of difficulty.</li>
</ul>

<h2 id="motivation-and-goals">Motivation and goals</h2>

<p>Aider’s original code editing benchmark was 
saturating as the top scores approached and then surpassed 80%.
Sonnet’s score of 84.2% was based on solving 112 of the 133
exercises, leaving only 21 unsolved exercises.
New champions were advancing the top score by
solving just 1-2 more problems than the previous record.
This made it hard to clearly 
measure the
difference in code editing skill between these top models.</p>

<p>Part of the problem is that many of the original
133 Python problems are very easy 
and provide
little challenge to today’s frontier LLMs.
Models as old as GPT 3.5 Turbo were able to solve half of the
133 problems.
Such easy problems simply inflate the benchmark scores 
of modern LLMs without
providing any data about which models are better or worse.</p>

<p>The main goal for a new benchmark 
was to re-calibrate the scale so that
today’s top coding LLMs 
would occupy a wide range of scores between about 5% and 50%.
This should leave headroom for future LLMs and
make it possible to
more clearly compare the relative performance of top models.</p>

<h2 id="designing-the-polyglot-benchmark">Designing the polyglot benchmark</h2>

<p>The new benchmark:</p>

<ul>
  <li>Tests LLMs with more coding languages, to increase diversity and source a larger pool of problems.</li>
  <li>Includes just the most challenging coding problems and excludes easy problems that are solvable by most of today’s top coding LLMs.</li>
  <li>Includes more total coding problems, to enable more granularity of comparison.</li>
</ul>
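<p>To make the granularity point concrete, here is a minimal sketch of how much one additional solved problem moves the score on each benchmark.
The function name <code>pointsPerProblem</code> is an illustrative assumption; the problem counts (133 and 225) come from this article.</p>

```javascript
// Score change, in percentage points, from solving one more problem.
function pointsPerProblem(totalProblems) {
  return 100 / totalProblems;
}

const oldStep = pointsPerProblem(133); // original Python-only benchmark
const newStep = pointsPerProblem(225); // new polyglot benchmark

console.log(oldStep.toFixed(2)); // 0.75
console.log(newStep.toFixed(2)); // 0.44
```

<p>With more problems, each one accounts for a smaller share of the score, so two models that differ by a few problems are separated in finer steps.</p>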

<p>The new benchmark is based on Exercism coding problems
from 6 of the most popular programming languages:</p>

<ul>
  <li>C++</li>
  <li>Go</li>
  <li>Java</li>
  <li>JavaScript</li>
  <li>Python</li>
  <li>Rust</li>
</ul>

<p>Exercism provides a total of 697 coding problems in those 6 languages.
A set of 7 of today’s top coding models each attempted all 697 of
the Exercism problems:</p>

<ul>
  <li>Sonnet</li>
  <li>Haiku</li>
  <li>o1 Mini</li>
  <li>DeepSeek</li>
  <li>GPT-4o</li>
  <li>Qwen 32B Coder Instruct</li>
  <li>GPT-4o Mini</li>
</ul>

<p>Depending on its difficulty,
each problem was solved by a different number of the
7 models:</p>

<table>
  <thead>
    <tr>
      <th>Solutions<br />found</th>
      <th>Number of<br />problems</th>
      <th>Cumulative number<br />of problems</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0</td>
      <td>66</td>
      <td>66</td>
    </tr>
    <tr>
      <td>1</td>
      <td>61</td>
      <td>127</td>
    </tr>
    <tr>
      <td>2</td>
      <td>50</td>
      <td>177</td>
    </tr>
    <tr>
      <td>3</td>
      <td>48</td>
      <td>225</td>
    </tr>
    <tr>
      <td>4</td>
      <td>53</td>
      <td>278</td>
    </tr>
    <tr>
      <td>5</td>
      <td>71</td>
      <td>349</td>
    </tr>
    <tr>
      <td>6</td>
      <td>90</td>
      <td>439</td>
    </tr>
    <tr>
      <td>7</td>
      <td>258</td>
      <td>697</td>
    </tr>
  </tbody>
</table>

<p>In the table above, you can see that 258 of the problems were solved
by all 7 LLMs.
These problems are far too easy, and wouldn’t be good choices for the new benchmark.
Instead, we need hard problems like the
66 that none of the 7 models were able to solve.</p>
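<p>The cumulative column in the table above is a running total of the per-bucket counts.
A minimal sketch, with the counts copied from the table and a helper name (<code>cumulativeCounts</code>) chosen only for illustration:</p>

```javascript
// Number of problems solved by exactly k of the 7 models (array index = k),
// copied from the table above.
const problemsBySolutions = [66, 61, 50, 48, 53, 71, 90, 258];

// Running total: how many problems were solved by k or fewer models.
function cumulativeCounts(counts) {
  let total = 0;
  return counts.map((n) => (total += n));
}

const cumulative = cumulativeCounts(problemsBySolutions);
console.log(cumulative); // [66, 127, 177, 225, 278, 349, 439, 697]
```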

<p>The new benchmark uses 
the 225 problems that were solved by 3 or fewer models.
This achieves a balance between hard and moderate problems,
and provides a large but not excessive total pool of problems.
It also represents a good diversity of coding languages:</p>

<table>
  <thead>
    <tr>
      <th>Language</th>
      <th>Problems</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>C++</td>
      <td>26</td>
    </tr>
    <tr>
      <td>Go</td>
      <td>39</td>
    </tr>
    <tr>
      <td>Java</td>
      <td>47</td>
    </tr>
    <tr>
      <td>JavaScript</td>
      <td>49</td>
    </tr>
    <tr>
      <td>Python</td>
      <td>34</td>
    </tr>
    <tr>
      <td>Rust</td>
      <td>30</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td><strong>225</strong></td>
    </tr>
  </tbody>
</table>
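<p>The selection rule described above (keep problems solved by 3 or fewer of the 7 models, then tally by language) could be sketched as follows.
This is only an illustration: the records, the field names <code>solvedBy</code> and <code>language</code>, and both helper functions are assumptions, not the script actually used to build the benchmark.</p>

```javascript
// Hypothetical per-problem records; names and counts are made up for illustration.
const problems = [
  { language: 'python', name: 'example-a', solvedBy: 0 },
  { language: 'rust', name: 'example-b', solvedBy: 3 },
  { language: 'go', name: 'example-c', solvedBy: 4 },
  { language: 'rust', name: 'example-d', solvedBy: 1 },
  { language: 'java', name: 'example-e', solvedBy: 7 },
];

// Threshold from the article: solved by 3 or fewer of the 7 models.
const MAX_SOLVERS = 3;

// Keep only the problems that few models could solve.
function selectHardProblems(records, maxSolvers) {
  return records.filter((p) => p.solvedBy <= maxSolvers);
}

// Count the selected problems per language.
function countByLanguage(records) {
  const counts = {};
  for (const p of records) {
    counts[p.language] = (counts[p.language] || 0) + 1;
  }
  return counts;
}

const selected = selectHardProblems(problems, MAX_SOLVERS);
console.log(selected.map((p) => p.name)); // ['example-a', 'example-b', 'example-d']
console.log(countByLanguage(selected));   // { python: 1, rust: 2 }
```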

<h2 id="o1">o1</h2>

<p>OpenAI’s new o1 model established a very strong
top score of 62% on the new benchmark.
This still leaves 86 problems of headroom for future models
to solve.
Given the incredible pace of recent advancements, it
will be interesting to see
how long it will take for this new benchmark to saturate.</p>
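<p>The headroom figure follows from the pass rate and the benchmark size.
A small sketch of the arithmetic, using the 61.7% pass rate reported for o1 in this article’s results table (the variable names are illustrative):</p>

```javascript
const TOTAL_PROBLEMS = 225;
const passRate = 61.7; // o1 (high), percent completed correctly

// Convert the percentage back into a whole number of solved problems.
const solved = Math.round((passRate / 100) * TOTAL_PROBLEMS);
const remaining = TOTAL_PROBLEMS - solved;

console.log(solved, remaining); // 139 86
```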

<h2 id="benchmark-problems">Benchmark problems</h2>

<p>The 225 coding problems are available in the
<a href="https://github.com/Aider-AI/polyglot-benchmark">aider polyglot benchmark repo</a>
on GitHub.</p>

<h2 id="results">Results</h2>

<table style="width: 100%; max-width: 800px; margin: auto; border-collapse: collapse; box-shadow: 0 2px 4px rgba(0,0,0,0.1); font-size: 14px;">
  <thead style="background-color: #f2f2f2;">
    <tr>
      <th style="padding: 8px; text-align: left;">Model</th>
      <th style="padding: 8px; text-align: center;">Percent completed correctly</th>
      <th style="padding: 8px; text-align: center;">Percent using correct edit format</th>
      <th style="padding: 8px; text-align: left;">Command</th>
      <th style="padding: 8px; text-align: center;">Edit format</th>
    </tr>
  </thead>
  <tbody>
    
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">o1-2024-12-17 (high)</td>
        <td style="padding: 8px; text-align: center;">61.7%</td>
        <td style="padding: 8px; text-align: center;">91.5%</td>
        <td style="padding: 8px;"><code>aider --model openrouter/openai/o1</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">claude-3-5-sonnet-20241022</td>
        <td style="padding: 8px; text-align: center;">45.3%</td>
        <td style="padding: 8px; text-align: center;">100.0%</td>
        <td style="padding: 8px;"><code>aider --model claude-3-5-sonnet-20241022</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">gemini-exp-1206</td>
        <td style="padding: 8px; text-align: center;">38.2%</td>
        <td style="padding: 8px; text-align: center;">98.2%</td>
        <td style="padding: 8px;"><code>aider --model gemini/gemini-exp-1206</code></td>
        <td style="padding: 8px; text-align: center;">whole</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">o1-mini-2024-09-12</td>
        <td style="padding: 8px; text-align: center;">32.9%</td>
        <td style="padding: 8px; text-align: center;">96.9%</td>
        <td style="padding: 8px;"><code>aider --model o1-mini</code></td>
        <td style="padding: 8px; text-align: center;">whole</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">claude-3-5-haiku-20241022</td>
        <td style="padding: 8px; text-align: center;">28.0%</td>
        <td style="padding: 8px; text-align: center;">91.1%</td>
        <td style="padding: 8px;"><code>aider --model claude-3-5-haiku-20241022</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">gemini-2.0-flash-exp</td>
        <td style="padding: 8px; text-align: center;">22.2%</td>
        <td style="padding: 8px; text-align: center;">100.0%</td>
        <td style="padding: 8px;"><code>aider --model gemini/gemini-2.0-flash-exp</code></td>
        <td style="padding: 8px; text-align: center;">whole</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">DeepSeek Chat V2.5</td>
        <td style="padding: 8px; text-align: center;">17.8%</td>
        <td style="padding: 8px; text-align: center;">92.9%</td>
        <td style="padding: 8px;"><code>aider --model deepseek/deepseek-chat</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">gpt-4o-2024-11-20</td>
        <td style="padding: 8px; text-align: center;">15.1%</td>
        <td style="padding: 8px; text-align: center;">96.0%</td>
        <td style="padding: 8px;"><code>aider --model gpt-4o-2024-11-20</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Qwen2.5-Coder-32B-Instruct</td>
        <td style="padding: 8px; text-align: center;">8.0%</td>
        <td style="padding: 8px; text-align: center;">71.6%</td>
        <td style="padding: 8px;"><code>aider --model openai/Qwen/Qwen2.5-Coder-32B-Instruct # via hyperbolic</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">gpt-4o-mini-2024-07-18</td>
        <td style="padding: 8px; text-align: center;">3.6%</td>
        <td style="padding: 8px; text-align: center;">100.0%</td>
        <td style="padding: 8px;"><code>aider --model gpt-4o-mini-2024-07-18</code></td>
        <td style="padding: 8px; text-align: center;">whole</td>
      </tr>
    
  </tbody>
</table>

<script src="https://unpkg.com/patternomaly/dist/patternomaly.js"></script>

<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>

<script>



document.addEventListener('DOMContentLoaded', function () {
  var ctx = document.getElementById('editChart').getContext('2d');
  const blueDiagonalPattern = pattern.draw('diagonal', 'rgba(54, 162, 235, 0.2)');
  const redDiagonalPattern = pattern.draw('diagonal', 'rgba(255, 99, 132, 0.2)');
  let displayedData = [];

  const HIGHLIGHT_MODEL = 'o1-2024';
  var leaderboardData = {
    labels: [],
    datasets: [{
      label: 'Percent completed correctly',
      data: [],
      backgroundColor: function(context) {
        const row = allData[context.dataIndex];
        if (row && row.edit_format === 'whole') {
          return blueDiagonalPattern;
        }
        const label = leaderboardData.labels[context.dataIndex] || '';
        return (label && label.includes(HIGHLIGHT_MODEL)) ? 'rgba(255, 99, 132, 0.2)' : 'rgba(54, 162, 235, 0.2)';
      },
      borderColor: function(context) {
        const label = context.chart.data.labels[context.dataIndex] || '';
        return (label && label.includes(HIGHLIGHT_MODEL)) ? 'rgba(255, 99, 132, 1)' : 'rgba(54, 162, 235, 1)';
      },
      borderWidth: 1
    }]
  };

  var allData = [];
  
    allData.push({
      model: 'o1-2024-12-17 (high)',
      pass_rate: 61.7,
      percent_cases_well_formed: 91.5,
      edit_format: 'diff'
    });
  
    allData.push({
      model: 'claude-3-5-sonnet-20241022',
      pass_rate: 45.3,
      percent_cases_well_formed: 100.0,
      edit_format: 'diff'
    });
  
    allData.push({
      model: 'gemini-exp-1206',
      pass_rate: 38.2,
      percent_cases_well_formed: 98.2,
      edit_format: 'whole'
    });
  
    allData.push({
      model: 'o1-mini-2024-09-12',
      pass_rate: 32.9,
      percent_cases_well_formed: 96.9,
      edit_format: 'whole'
    });
  
    allData.push({
      model: 'claude-3-5-haiku-20241022',
      pass_rate: 28.0,
      percent_cases_well_formed: 91.1,
      edit_format: 'diff'
    });
  
    allData.push({
      model: 'gemini-2.0-flash-exp',
      pass_rate: 22.2,
      percent_cases_well_formed: 100.0,
      edit_format: 'whole'
    });
  
    allData.push({
      model: 'DeepSeek Chat V2.5',
      pass_rate: 17.8,
      percent_cases_well_formed: 92.9,
      edit_format: 'diff'
    });
  
    allData.push({
      model: 'gpt-4o-2024-11-20',
      pass_rate: 15.1,
      percent_cases_well_formed: 96.0,
      edit_format: 'diff'
    });
  
    allData.push({
      model: 'Qwen2.5-Coder-32B-Instruct',
      pass_rate: 8.0,
      percent_cases_well_formed: 71.6,
      edit_format: 'diff'
    });
  
    allData.push({
      model: 'gpt-4o-mini-2024-07-18',
      pass_rate: 3.6,
      percent_cases_well_formed: 100.0,
      edit_format: 'whole'
    });
  

  function updateChart() {
    var selectedRows = document.querySelectorAll('tr.selected');
    var showAll = selectedRows.length === 0;

    displayedData = [];
    leaderboardData.labels = [];
    leaderboardData.datasets[0].data = [];

    allData.forEach(function(row, index) {
      var rowElement = document.getElementById('edit-row-' + index);
      if (showAll) {
        rowElement.classList.remove('selected');
      }
      if (showAll || rowElement.classList.contains('selected')) {
        displayedData.push(row);
        leaderboardData.labels.push(row.model);
        leaderboardData.datasets[0].data.push(row.pass_rate);
      }
    });

    leaderboardChart.update();
    leaderboardChart.render();
  }

  // Use displayedData in the backgroundColor callback instead of allData
  leaderboardData.datasets[0].backgroundColor = function(context) {
    const row = displayedData[context.dataIndex];
    const label = leaderboardData.labels[context.dataIndex] || '';
    if (label && label.includes(HIGHLIGHT_MODEL)) {
      if (row && row.edit_format === 'whole') return redDiagonalPattern;
      else return 'rgba(255, 99, 132, 0.2)';
    } else if (row && row.edit_format === 'whole') {
      return blueDiagonalPattern;
    } else {
      return 'rgba(54, 162, 235, 0.2)';
    }
  };

  var tableBody = document.querySelector('table tbody');
  allData.forEach(function(row, index) {
    var tr = tableBody.children[index];
    if (!tr) {
      // If the row doesn't exist, create it
      tr = document.createElement('tr');
      tableBody.appendChild(tr);
    }
    tr.id = 'edit-row-' + index;
    tr.style.cursor = 'pointer';
    tr.onclick = function() {
      this.classList.toggle('selected');
      updateChart();
    };
  });

  var leaderboardChart = new Chart(ctx, {
    type: 'bar',
    data: leaderboardData,
    options: {
      plugins: {
        legend: {
          display: true,
          labels: {
            generateLabels: function(chart) {
              return [
                {
                  text: 'Diff-like format',
                  fillStyle: 'rgba(54, 162, 235, 0.2)',
                  strokeStyle: 'rgba(54, 162, 235, 1)',
                  lineWidth: 1
                },
                {
                  text: 'Whole format',
                  fillStyle: blueDiagonalPattern,
                  strokeStyle: 'rgba(54, 162, 235, 1)',
                  lineWidth: 1
                }
              ];
            }
          }
        }
      },
      scales: {
        y: {
          beginAtZero: true,
          title: {
            display: true,
            text: 'Percent completed correctly'
          }
        },
        x: {
          ticks: {
            callback: function(value, index) {
              const label = this.getLabelForValue(value);
              if (label.length <= "claude-3-5-sonnet".length) {
                return label;
              }
              
              // Find all possible split positions
              const splitPositions = [];
              for (let i = 0; i < label.length; i++) {
                if (label[i] === '-' || label[i] === ' ') {
                  splitPositions.push(i);
                }
              }
              
              if (splitPositions.length === 0) {
                return label;
              }
              
              // Find split position closest to middle
              const middle = label.length / 2;
              const splitIndex = splitPositions.reduce((closest, current) => {
                return Math.abs(current - middle) < Math.abs(closest - middle) ? current : closest;
              });
              
              return [
                label.slice(0, splitIndex),
                label.slice(splitIndex + 1)
              ];
            }
          }
        }
      }
    }
  });

  updateChart();
  
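  // Add search functionality for edit table (only if the search input exists on this page)
  var editSearchInput = document.getElementById('editSearchInput');
  if (editSearchInput) editSearchInput.addEventListener('keyup', function() {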
    var searchWords = this.value.toLowerCase().split(' ').filter(word => word.length > 0);
    var tableBody = document.querySelector('table:first-of-type tbody');
    var rows = tableBody.getElementsByTagName('tr');
    
    displayedData = [];
    leaderboardData.labels = [];
    leaderboardData.datasets[0].data = [];
    
    for (var i = 0; i < rows.length; i++) {
      var rowText = rows[i].textContent;
      if (searchWords.every(word => rowText.toLowerCase().includes(word))) {
        rows[i].style.display = '';
        displayedData.push(allData[i]);
        leaderboardData.labels.push(allData[i].model);
        leaderboardData.datasets[0].data.push(allData[i].pass_rate);
      } else {
        rows[i].style.display = 'none';
      }
    }
    leaderboardChart.update();
  });
});

</script>

<style>
  tr.selected {
    color: #0056b3;
  }
  table {
    table-layout: fixed;
  }
  td, th {
    word-wrap: break-word;
    overflow-wrap: break-word;
  }
  td:nth-child(3), td:nth-child(4) {
    font-size: 12px;
  }
</style>]]></content><author><name></name></author><summary type="html"><![CDATA[o1 scores the top result on aider's new multi-language, more challenging coding benchmark.]]></summary></entry><entry><title type="html">QwQ is a code architect, not an editor</title><link href="https://aider.chat/2024/12/03/qwq.html" rel="alternate" type="text/html" title="QwQ is a code architect, not an editor" /><published>2024-12-03T00:00:00+00:00</published><updated>2024-12-03T00:00:00+00:00</updated><id>https://aider.chat/2024/12/03/qwq</id><content type="html" xml:base="https://aider.chat/2024/12/03/qwq.html"><![CDATA[<p class="post-date">December 03, 2024</p>

<h1 class="no_toc" id="qwq-is-a-code-architect-not-an-editor">QwQ is a code architect, not an editor</h1>

<canvas id="qwqChart" width="800" height="500" style="margin: 20px 0"></canvas>

<p>QwQ 32B Preview is a “reasoning” model, which spends a lot of tokens thinking before
rendering a final response.
This is similar to OpenAI’s o1 models, which are most effective with aider
<a href="https://aider.chat/2024/09/26/architect.html">when paired as an architect with a traditional LLM as an editor</a>.
In this mode, the reasoning model acts as an “architect” to propose a solution to the
coding problem without regard for how to actually make edits to the source files.
The “editor” model receives that proposal, and focuses solely on how to
edit the existing source code to implement it.</p>
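<p>The two-step flow described above can be sketched roughly as follows.
This is not aider’s implementation: <code>askArchitect</code>, <code>askEditor</code> and <code>architectThenEdit</code> are hypothetical stand-ins that return canned results instead of calling an LLM, and real model calls would be asynchronous.</p>

```javascript
// Hypothetical stand-in for the reasoning model: it describes a solution
// but does not touch any files.
function askArchitect(task, files) {
  const paths = Object.keys(files).join(', ');
  return 'Proposal for "' + task + '": rename foo to bar in ' + paths;
}

// Hypothetical stand-in for the editor model: it turns the proposal into
// concrete file contents (faked here with a simple text replacement).
function askEditor(proposal, files) {
  const edited = {};
  for (const [path, text] of Object.entries(files)) {
    edited[path] = text.replace(/foo/g, 'bar');
  }
  return edited;
}

// Step 1: the architect plans. Step 2: the editor applies the plan.
function architectThenEdit(task, files) {
  const proposal = askArchitect(task, files);
  const edited = askEditor(proposal, files);
  return { proposal, edited };
}

const result = architectThenEdit('rename foo to bar', { 'app.js': 'foo();' });
console.log(result.edited); // { 'app.js': 'bar();' }
```

<p>The point of the split is that the architect never needs to produce a valid edit format; only the editor does.</p>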

<p>Used alone without being paired with an editor, 
QwQ was unable to comply with even the simplest 
<a href="https://aider.chat/docs/more/edit-formats.html">editing format</a>.
It was not able to reliably edit source code files.
As a result, QwQ’s solo score on the benchmark was quite underwhelming
(and far worse than the o1 models performing solo).</p>

<p>QwQ is based on
Qwen 2.5 Coder 32B Instruct,
and does better when paired with it as an architect + editor combo.
However, this provided only a modest benchmark improvement over using Qwen alone,
and it comes with a fairly high cost in latency.
Each request must wait for QwQ to return all its thinking text
and the final solution proposal.
Then one must wait for Qwen to turn that large
response into actual file edits.</p>

<p>Pairing QwQ with other sensible editor models performed the same or worse than
just using Qwen 2.5 Coder 32B Instruct alone.</p>

<p>QwQ+Qwen seems to be the best way to use QwQ, achieving a score of 74%.
That is well below the
SOTA results for this benchmark: Sonnet alone scores 84%, and
o1-preview + o1-mini as architect + editor scores 85%.</p>

<h2 id="qwq-specific-editing-formats">QwQ specific editing formats</h2>

<p>I spent some time experimenting with a variety of custom editing formats
for QwQ.
In particular, I tried to parse the QwQ response and discard the long
sections of “thinking” and retain only the “final” solution.
None of this custom work seemed to translate 
into any significant improvement in the benchmark results.</p>
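<p>As a rough idea of what such parsing might look like, here is a minimal sketch that keeps only the text after a marker.
The marker string is purely hypothetical; it is not QwQ’s actual output format, and this is not the code used in these experiments.</p>

```javascript
// ASSUMPTION: this marker is hypothetical, chosen only for illustration.
const FINAL_MARKER = 'FINAL SOLUTION:';

// Discard everything up to and including the last marker; if the marker is
// missing, return the response unchanged.
function keepFinalSolution(response, marker = FINAL_MARKER) {
  const idx = response.lastIndexOf(marker);
  if (idx === -1) return response;
  return response.slice(idx + marker.length).trim();
}

const sample = 'Let me think about this...\nFINAL SOLUTION:\nprint("hello")';
console.log(keepFinalSolution(sample)); // print("hello")
```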

<h2 id="results">Results</h2>

<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>

<script>
document.addEventListener('DOMContentLoaded', function () {
  var ctx = document.getElementById('qwqChart').getContext('2d');
  var allData = [];
  
    allData.push({
      model: 'QwQ + Haiku',
      pass_rate_2: 71.4
    });
  
    allData.push({
      model: 'QwQ + DeepSeek V2.5',
      pass_rate_2: 67.7
    });
  
    allData.push({
      model: 'Qwen2.5 Coder 32B-I',
      pass_rate_2: 71.4
    });
  
    allData.push({
      model: 'QwQ + Qwen2.5 Coder 32B-I',
      pass_rate_2: 73.6
    });
  
    allData.push({
      model: 'QwQ',
      pass_rate_2: 42.1
    });
  
    allData.push({
      model: 'o1-mini',
      pass_rate_2: 70.7
    });
  
    allData.push({
      model: 'o1-preview',
      pass_rate_2: 79.7
    });
  

  // Sort data by pass_rate_2 in descending order
  allData.sort((a, b) => b.pass_rate_2 - a.pass_rate_2);

  var chart;
  
  function updateChart(filterText) {
    var filteredData = allData.filter(row => 
      row.model.toLowerCase().includes(filterText.toLowerCase())
    );
    
    var chartData = {
      labels: filteredData.map(row => row.model),
      datasets: [{
        data: filteredData.map(row => row.pass_rate_2),
        backgroundColor: filteredData.map(row => 
          (row.model === 'Qwen2.5 Coder 32B-I' || row.model === 'Sonnet (SOTA)' || row.model === 'o1-mini' || row.model === 'o1-preview' || row.model === 'QwQ') 
            ? 'rgba(75, 192, 192, 0.2)'   // Green for solo models
            : 'rgba(54, 162, 235, 0.2)'   // Blue for architect+editor
        ),
        borderColor: filteredData.map(row => 
          (row.model === 'Qwen2.5 Coder 32B-I' || row.model === 'Sonnet (SOTA)' || row.model === 'o1-mini' || row.model === 'o1-preview' || row.model === 'QwQ')
            ? 'rgba(75, 192, 192, 1)'     // Green border for solo models
            : 'rgba(54, 162, 235, 1)'     // Blue border for architect+editor
        ),
        borderWidth: 1
      }]
    };

    if (chart) {
      chart.data = chartData;
      chart.update();
    } else {
      chart = new Chart(ctx, {
        type: 'bar',
        data: chartData,
        options: {
          plugins: {
            legend: {
              display: true,
              position: 'top',
              labels: {
                font: {
                  size: 14
                },
                generateLabels: function(chart) {
                  return [
                    {
                      text: 'Solo model',
                      fillStyle: 'rgba(75, 192, 192, 0.2)',
                      strokeStyle: 'rgba(75, 192, 192, 1)',
                      lineWidth: 1,
                      fontColor: '#666'
                    },
                    {
                      text: 'Architect + Editor',
                      fillStyle: 'rgba(54, 162, 235, 0.2)',
                      strokeStyle: 'rgba(54, 162, 235, 1)',
                      lineWidth: 1,
                      fontColor: '#666'
                    }
                  ];
                }
              }
            }
          },
          scales: {
            y: {
              beginAtZero: true,
              title: {
                display: true,
                text: 'Aider code editing benchmark (%)',
                font: {
                  size: 18
                }
              },
              ticks: {
                font: {
                  size: 16
                }
              }
            },
            x: {
              ticks: {
                font: {
                  size: 16
                },
                callback: function(value, index) {
                  const label = this.getLabelForValue(value);
                  if (label.includes(" + ")) {
                    const parts = label.split(" + ");
                    return [parts[0] + " +", parts[1]];
                  }
                  return label;
                }
              }
            }
          }
        }
      });
    }
  }

  // Initial chart render
  updateChart('');

  // Connect search input to chart filtering (only if the input exists on this page)
  var qwqSearchInput = document.getElementById('qwqSearchInput');
  if (qwqSearchInput) qwqSearchInput.addEventListener('keyup', function() {
    updateChart(this.value);
  });
});

</script>

<table style="width: 100%; max-width: 800px; margin: auto; border-collapse: collapse; box-shadow: 0 2px 4px rgba(0,0,0,0.1); font-size: 14px;">
  <thead style="background-color: #f2f2f2;">
    <tr>
      <th style="padding: 8px; text-align: left;">Model</th>
      <th style="padding: 8px; text-align: center;">Percent completed correctly</th>
      <th style="padding: 8px; text-align: center;">Percent using correct edit format</th>
      <th style="padding: 8px; text-align: left;">Command</th>
      <th style="padding: 8px; text-align: center;">Edit format</th>
    </tr>
  </thead>
  <tbody>
    
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">o1-preview</td>
        <td style="padding: 8px; text-align: center;">79.7%</td>
        <td style="padding: 8px; text-align: center;">93.2%</td>
        <td style="padding: 8px;"><code>aider --model o1-preview</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">QwQ + Qwen2.5 Coder 32B-I</td>
        <td style="padding: 8px; text-align: center;">73.6%</td>
        <td style="padding: 8px; text-align: center;">100.0%</td>
        <td style="padding: 8px;"><code>aider --model openrouter/qwen/qwq-32b-preview --editor-model openrouter/qwen/qwen-2.5-coder-32b-instruct --editor-edit-format editor-whole</code></td>
        <td style="padding: 8px; text-align: center;">architect</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Qwen2.5 Coder 32B-I</td>
        <td style="padding: 8px; text-align: center;">71.4%</td>
        <td style="padding: 8px; text-align: center;">94.7%</td>
        <td style="padding: 8px;"><code>aider --model openai/hf:Qwen/Qwen2.5-Coder-32B-Instruct --openai-api-base https://glhf.chat/api/openai/v1 (via GLHF)</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">QwQ + Haiku</td>
        <td style="padding: 8px; text-align: center;">71.4%</td>
        <td style="padding: 8px; text-align: center;">100.0%</td>
        <td style="padding: 8px;"><code>aider --model openrouter/qwen/qwq-32b-preview --editor-model claude-3-5-haiku-20241022 --edit-format editor-whole</code></td>
        <td style="padding: 8px; text-align: center;">architect</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">o1-mini</td>
        <td style="padding: 8px; text-align: center;">70.7%</td>
        <td style="padding: 8px; text-align: center;">90.0%</td>
        <td style="padding: 8px;"><code>aider --model o1-mini</code></td>
        <td style="padding: 8px; text-align: center;">whole</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">QwQ + DeepSeek V2.5</td>
        <td style="padding: 8px; text-align: center;">67.7%</td>
        <td style="padding: 8px; text-align: center;">100.0%</td>
        <td style="padding: 8px;"><code>aider --model openrouter/qwen/qwq-32b-preview --editor-model deepseek/deepseek-chat --edit-format editor-whole</code></td>
        <td style="padding: 8px; text-align: center;">architect</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">QwQ</td>
        <td style="padding: 8px; text-align: center;">42.1%</td>
        <td style="padding: 8px; text-align: center;">91.0%</td>
        <td style="padding: 8px;"><code>aider --model openrouter/qwen/qwq-32b-preview</code></td>
        <td style="padding: 8px; text-align: center;">whole</td>
      </tr>
    
  </tbody>
</table>

<style>
  tr.selected {
    color: #0056b3;
  }
  table {
    table-layout: fixed;
  }
  td, th {
    word-wrap: break-word;
    overflow-wrap: break-word;
  }
  td:nth-child(3), td:nth-child(4) {
    font-size: 12px;
  }
</style>

<script>
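// Filter table rows as the user types (only if the input exists on this page)
var qwqTableSearchInput = document.getElementById('qwqSearchInput');
if (qwqTableSearchInput) qwqTableSearchInput.addEventListener('keyup', function() {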
    var input = this.value.toLowerCase();
    var rows = document.querySelectorAll('tbody tr');
    
    rows.forEach(function(row) {
        var text = row.textContent.toLowerCase();
        if(text.includes(input)) {
            row.style.display = '';
            row.classList.add('selected');
        } else {
            row.style.display = 'none';
            row.classList.remove('selected');
        }
    });
});
</script>

<h2 id="open-source-model-caveats">Open source model caveats</h2>

<p>As discussed in a recent blog post,
<a href="https://aider.chat/2024/11/21/quantization.html">details matter with open source models</a>.
For clarity, new benchmark runs for this article were
performed against OpenRouter’s endpoints for
QwQ 32B Preview and Qwen 2.5 Coder 32B Instruct.
For the other models, the benchmark was run directly against their providers’ APIs.</p>

<p>I recently did extensive testing of OpenRouter’s Qwen 2.5 Coder 32B Instruct endpoint,
and it seems reliable.
The provider Mancer was blocked due to the small context window it provides.</p>

<p>For QwQ 32B Preview, Fireworks was blocked because of its small context window.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[QwQ is a reasoning model like o1, and needs to be used as an architect with another model as editor.]]></summary></entry><entry><title type="html">Details matter with open source models</title><link href="https://aider.chat/2024/11/21/quantization.html" rel="alternate" type="text/html" title="Details matter with open source models" /><published>2024-11-21T00:00:00+00:00</published><updated>2024-11-21T00:00:00+00:00</updated><id>https://aider.chat/2024/11/21/quantization</id><content type="html" xml:base="https://aider.chat/2024/11/21/quantization.html"><![CDATA[<p class="post-date">November 21, 2024</p>

<h1 class="no_toc" id="details-matter-with-open-source-models">Details matter with open source models</h1>

<canvas id="quantChart" width="800" height="600" style="margin: 20px 0"></canvas>

<p>Open source models like Qwen 2.5 Coder 32B Instruct are performing very well on
aider’s code editing benchmark, rivaling closed source frontier models.</p>

<p>But pay attention to how your model is being served and quantized, 
as it can impact code editing skill.
Open source models are often available at a variety of quantizations,
and can be served with different token limits.
These details matter when working with code.</p>

<p>The graph above and table below compare different versions of the Qwen 2.5 Coder 32B Instruct model,
served both locally and from a variety of cloud providers.</p>

<ul>
  <li>The <a href="https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct">HuggingFace BF16 weights</a> served via <a href="https://glhf.chat">glhf.chat</a>.</li>
  <li><a href="https://t.co/cwX3DYX35D">4bit and 8bit quants for mlx</a>.</li>
  <li>The results from <a href="https://openrouter.ai/qwen/qwen-2.5-coder-32b-instruct/providers">OpenRouter’s mix of providers</a> which serve the model with different levels of quantization.</li>
  <li>Results from individual OpenRouter providers, accessed both via OpenRouter and directly through their own APIs.</li>
  <li>Ollama locally serving different quantizations from the <a href="https://ollama.com/library/qwen2.5-coder:32b-instruct-q4_K_M">Ollama model library</a> with 8k+
context windows.</li>
  <li>An Ollama fp16 quantization served with Ollama’s default 2k context window.</li>
</ul>

<h3 id="pitfalls-and-details">Pitfalls and details</h3>

<p>This benchmarking effort highlighted a number of pitfalls and details specific to open source
models which
can have a significant impact on their ability to correctly edit code:</p>

<ul>
  <li><strong>Quantization</strong> – Open source models are often available at dozens of different quantizations.
Most seem to only modestly decrease code editing skill, but more aggressive quantizations
do have a real impact.</li>
  <li><strong>Context window</strong> – Cloud providers can decide how large a context window to accept,
and they often choose differently. Ollama’s local API server
defaults to a tiny 2k context window,
and silently discards data that exceeds it. Such a small window has
catastrophic effects on performance, without throwing obvious hard errors.</li>
  <li><strong>Output token limits</strong> – Open source models are often served with wildly
differing output token limits. This has a direct impact on how much code the
model can write or edit in a response.</li>
  <li><strong>Buggy cloud providers</strong> – While benchmarking Qwen 2.5 Coder 32B Instruct
and DeepSeek V2.5, I discovered
multiple cloud providers with broken or buggy API endpoints.
Their results seemed to differ from what the advertised
quantization and context sizes would predict.
The harm caused to the code editing benchmark varied from serious
to catastrophic.
One provider scored 0.5% on the benchmark with DeepSeek V2.5, a highly capable model.</li>
</ul>

<p>Closed source, proprietary models don’t typically have these issues.
They are owned and operated by the organization that created them,
and typically served with specific, predictable context window and output token limits.
Their quantization level is usually unknown, but fixed and unchanging for all users.</p>

<h3 id="conclusions">Conclusions</h3>

<p>The best versions of the Qwen model rival GPT-4o, while the worst performing
quantization is more like the older GPT-4 Turbo when served competently.
Even an otherwise excellent fp16 quantization falls to GPT-3.5 Turbo levels of performance
if run with Ollama’s default 2k context window.</p>

<h3 class="no_toc" id="sections">Sections</h3>

<ul id="markdown-toc">
  <li><a href="#pitfalls-and-details" id="markdown-toc-pitfalls-and-details">Pitfalls and details</a></li>
  <li><a href="#conclusions" id="markdown-toc-conclusions">Conclusions</a></li>
  <li><a href="#benchmark-results" id="markdown-toc-benchmark-results">Benchmark results</a></li>
  <li><a href="#setting-ollamas-context-window-size" id="markdown-toc-setting-ollamas-context-window-size">Setting Ollama’s context window size</a></li>
  <li><a href="#choosing-providers-with-openrouter" id="markdown-toc-choosing-providers-with-openrouter">Choosing providers with OpenRouter</a></li>
  <li><a href="#notes" id="markdown-toc-notes">Notes</a></li>
</ul>

<h2 id="benchmark-results">Benchmark results</h2>

<p class="note">These are results from single benchmark runs, so expect normal variance of +/- 1-2%.</p>

<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>

<script>
document.addEventListener('DOMContentLoaded', function () {
  var ctx = document.getElementById('quantChart').getContext('2d');
  var allData = [];
  
    allData.push({
      model: 'HuggingFace via GLHF: BF16',
      pass_rate_2: 71.4
    });
  
    allData.push({
      model: 'Ollama: fp16',
      pass_rate_2: 71.4
    });
  
    allData.push({
      model: 'Hyperbolic: BF16',
      pass_rate_2: 69.2
    });
  
    allData.push({
      model: 'mlx-community: 4bit',
      pass_rate_2: 72.2
    });
  
    allData.push({
      model: 'mlx-community: 8bit',
      pass_rate_2: 72.2
    });
  
    allData.push({
      model: 'OpenRouter: multiple',
      pass_rate_2: 67.7
    });
  
    allData.push({
      model: 'Ollama: q4_K_M',
      pass_rate_2: 66.9
    });
  
    allData.push({
      model: 'Deepinfra: BF16',
      pass_rate_2: 72.2
    });
  
    allData.push({
      model: 'Fireworks: unknown',
      pass_rate_2: 72.2
    });
  
    allData.push({
      model: 'Ollama: q2_K',
      pass_rate_2: 61.7
    });
  
    allData.push({
      model: 'Fireworks via OpenRouter: unknown',
      pass_rate_2: 67.7
    });
  
    allData.push({
      model: 'Hyperbolic via OpenRouter: BF16',
      pass_rate_2: 68.4
    });
  
    allData.push({
      model: 'Deepinfra via OpenRouter: BF16',
      pass_rate_2: 69.9
    });
  
    allData.push({
      model: 'Ollama: fp16, 2k ctx',
      pass_rate_2: 51.9
    });
  

  // Sort data by pass_rate_2 in descending order
  allData.sort((a, b) => b.pass_rate_2 - a.pass_rate_2);

  var chart;
  
  function updateChart(filterText) {
    var filteredData = allData.filter(row => 
      row.model.toLowerCase().includes(filterText.toLowerCase())
    );
    
    var chartData = {
      labels: filteredData.map(row => row.model),
      datasets: [{
        label: 'Percent completed correctly',
        data: filteredData.map(row => row.pass_rate_2),
        backgroundColor: 'rgba(54, 162, 235, 0.2)',
        borderColor: 'rgba(54, 162, 235, 1)',
        borderWidth: 1
      }]
    };

    if (chart) {
      chart.data = chartData;
      chart.update();
    } else {
      chart = new Chart(ctx, {
        type: 'bar',
        data: chartData,
        options: {
          plugins: {
            legend: {
              display: false
            },
            title: {
              display: true,
              text: 'Aider code editing benchmark',
              font: {
                size: 16
              }
            }
          },
          scales: {
            y: {
              beginAtZero: true,
              title: {
                display: true,
                text: 'Percent completed correctly',
                font: {
                  size: 14
                }
              },
              ticks: {
                font: {
                  size: 16
                }
              }
            },
            x: {
              ticks: {
                font: {
                  size: 16
                }
              },
              title: {
                display: true,
                text: 'Provider: quantization',
                font: {
                  size: 14
                }
              }
            }
          }
        }
      });
    }
  }

  // Initial chart render
  updateChart('');

  // Connect search input to chart filtering
  document.getElementById('quantSearchInput').addEventListener('keyup', function() {
    updateChart(this.value);
  });
});

</script>

<p><input type="text" id="quantSearchInput" placeholder="Search..." style="width: 100%; max-width: 800px; margin: 10px auto; padding: 8px; display: block; border: 1px solid #ddd; border-radius: 4px;" /></p>

<table style="width: 100%; max-width: 800px; margin: auto; border-collapse: collapse; box-shadow: 0 2px 4px rgba(0,0,0,0.1); font-size: 14px;">
  <thead style="background-color: #f2f2f2;">
    <tr>
      <th style="padding: 8px; text-align: left;">Model</th>
      <th style="padding: 8px; text-align: center;">Percent completed correctly</th>
      <th style="padding: 8px; text-align: center;">Percent using correct edit format</th>
      <th style="padding: 8px; text-align: left;">Command</th>
      <th style="padding: 8px; text-align: center;">Edit format</th>
    </tr>
  </thead>
  <tbody>
    
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Fireworks: unknown</td>
        <td style="padding: 8px; text-align: center;">72.2%</td>
        <td style="padding: 8px; text-align: center;">94.0%</td>
        <td style="padding: 8px;"><code>aider --model fireworks_ai/accounts/fireworks/models/qwen2p5-coder-32b-instruct</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Deepinfra: BF16</td>
        <td style="padding: 8px; text-align: center;">72.2%</td>
        <td style="padding: 8px; text-align: center;">94.7%</td>
        <td style="padding: 8px;"><code>aider --model deepinfra/Qwen/Qwen2.5-Coder-32B-Instruct</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">mlx-community: 8bit</td>
        <td style="padding: 8px; text-align: center;">72.2%</td>
        <td style="padding: 8px; text-align: center;">92.5%</td>
        <td style="padding: 8px;"><code>aider --model openai/mlx-community/Qwen2.5-Coder-32B-Instruct-8bit</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">mlx-community: 4bit</td>
        <td style="padding: 8px; text-align: center;">72.2%</td>
        <td style="padding: 8px; text-align: center;">88.7%</td>
        <td style="padding: 8px;"><code>aider --model openai/mlx-community/Qwen2.5-Coder-32B-Instruct-4bit</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Ollama: fp16</td>
        <td style="padding: 8px; text-align: center;">71.4%</td>
        <td style="padding: 8px; text-align: center;">90.2%</td>
        <td style="padding: 8px;"><code>aider --model ollama/qwen2.5-coder:32b-instruct-fp16</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">HuggingFace via GLHF: BF16</td>
        <td style="padding: 8px; text-align: center;">71.4%</td>
        <td style="padding: 8px; text-align: center;">94.7%</td>
        <td style="padding: 8px;"><code>aider --model openai/hf:Qwen/Qwen2.5-Coder-32B-Instruct --openai-api-base https://glhf.chat/api/openai/v1</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Deepinfra via OpenRouter: BF16</td>
        <td style="padding: 8px; text-align: center;">69.9%</td>
        <td style="padding: 8px; text-align: center;">89.5%</td>
        <td style="padding: 8px;"><code>aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Hyperbolic: BF16</td>
        <td style="padding: 8px; text-align: center;">69.2%</td>
        <td style="padding: 8px; text-align: center;">91.7%</td>
        <td style="padding: 8px;"><code>aider --model openai/Qwen/Qwen2.5-Coder-32B-Instruct --openai-api-base https://api.hyperbolic.xyz/v1/</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Hyperbolic via OpenRouter: BF16</td>
        <td style="padding: 8px; text-align: center;">68.4%</td>
        <td style="padding: 8px; text-align: center;">89.5%</td>
        <td style="padding: 8px;"><code>aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Fireworks via OpenRouter: unknown</td>
        <td style="padding: 8px; text-align: center;">67.7%</td>
        <td style="padding: 8px; text-align: center;">94.0%</td>
        <td style="padding: 8px;"><code>aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">OpenRouter: multiple</td>
        <td style="padding: 8px; text-align: center;">67.7%</td>
        <td style="padding: 8px; text-align: center;">95.5%</td>
        <td style="padding: 8px;"><code>aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Ollama: q4_K_M</td>
        <td style="padding: 8px; text-align: center;">66.9%</td>
        <td style="padding: 8px; text-align: center;">94.0%</td>
        <td style="padding: 8px;"><code>aider --model ollama/qwen2.5-coder:32b-instruct-q4_K_M</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Ollama: q2_K</td>
        <td style="padding: 8px; text-align: center;">61.7%</td>
        <td style="padding: 8px; text-align: center;">91.7%</td>
        <td style="padding: 8px;"><code>aider --model ollama/qwen2.5-coder:32b-instruct-q2_K</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">Ollama: fp16, 2k ctx</td>
        <td style="padding: 8px; text-align: center;">51.9%</td>
        <td style="padding: 8px; text-align: center;">46.2%</td>
        <td style="padding: 8px;"><code>aider --model ollama/qwen2.5-coder:32b-instruct-fp16 # num_ctx: 2048</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
  </tbody>
</table>

<style>
  tr.selected {
    color: #0056b3;
  }
  table {
    table-layout: fixed;
  }
  td, th {
    word-wrap: break-word;
    overflow-wrap: break-word;
  }
  td:nth-child(3), td:nth-child(4) {
    font-size: 12px;
  }
</style>

<script>
document.getElementById('quantSearchInput').addEventListener('keyup', function() {
    var input = this.value.toLowerCase();
    var rows = document.querySelectorAll('tbody tr');
    
    rows.forEach(function(row) {
        var text = row.textContent.toLowerCase();
        if(text.includes(input)) {
            row.style.display = '';
            row.classList.add('selected');
        } else {
            row.style.display = 'none';
            row.classList.remove('selected');
        }
    });
});
</script>

<h2 id="setting-ollamas-context-window-size">Setting Ollama’s context window size</h2>

<p><a href="https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-specify-the-context-window-size">Ollama uses a 2k context window by default</a>,
which is very small for working with aider.
Unlike most other LLM servers, Ollama does not throw an error if you submit
a request that exceeds the context window.
Instead, it just silently truncates the request by discarding the “oldest” messages
in the chat to make it fit within the context window.</p>
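
<p>To make the effect of that truncation concrete, here is a toy simulation of “drop the oldest messages until the chat fits”.
It is only a sketch, not Ollama’s actual implementation: the <code class="language-plaintext highlighter-rouge">estimateTokens</code> heuristic (about 4 characters per token),
the <code class="language-plaintext highlighter-rouge">truncateOldest</code> helper and the sample chat are all assumptions made for illustration.</p>

```javascript
// Illustrative only: a toy model of "drop the oldest messages until the
// chat fits". This is not Ollama's actual implementation.

// Hypothetical token estimate: roughly 4 characters per token.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Keep the newest messages whose combined estimate fits within maxTokens.
function truncateOldest(messages, maxTokens) {
  const kept = [];
  let used = 0;
  // Walk from newest to oldest so the most recent messages survive.
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = estimateTokens(messages[i].content);
    if (used + cost > maxTokens) break;
    kept.unshift(messages[i]);
    used += cost;
  }
  return kept;
}

// Made-up chat: sizes are chosen so the total exceeds a 2k window.
const chat = [
  { role: 'system', content: 'x'.repeat(4000) }, // ~1000 tokens of instructions
  { role: 'user', content: 'y'.repeat(6000) },   // ~1500 tokens of source code
  { role: 'user', content: 'z'.repeat(2000) },   // ~500 tokens: the actual request
];

const survivors = truncateOldest(chat, 2048);
console.log(survivors.map(m => m.role));
```

<p>In this toy run the oldest message no longer fits in a 2k window and is dropped without any error being raised.
With an 8k limit all three messages are kept.</p>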

<p>Except for the single 2k context result,
all of the Ollama results above were collected with at least an 8k context window.
An 8k window is large enough to attempt all the coding problems in the benchmark.
Aider sets Ollama’s context window to 8k by default, starting in aider v0.65.0.</p>

<p>You can change the Ollama server’s context window with a 
<a href="https://aider.chat/docs/config/adv-model-settings.html#model-settings"><code class="language-plaintext highlighter-rouge">.aider.model.settings.yml</code> file</a>
like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>- name: ollama/qwen2.5-coder:32b-instruct-fp16
  extra_params:
    num_ctx: 8192
</code></pre></div></div>
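
<p>If you call Ollama’s HTTP API directly rather than through aider, the Ollama FAQ linked above describes passing the same
setting per request in the <code class="language-plaintext highlighter-rouge">options.num_ctx</code> field.
A minimal sketch, assuming a local server on the default port; the helper name <code class="language-plaintext highlighter-rouge">buildOllamaChatRequest</code> is hypothetical:</p>

```javascript
// Build the JSON body for Ollama's /api/chat endpoint with an explicit
// context window. buildOllamaChatRequest is a hypothetical helper name.
function buildOllamaChatRequest(model, messages, numCtx) {
  return {
    model: model,
    messages: messages,
    stream: false,
    options: { num_ctx: numCtx },
  };
}

const body = buildOllamaChatRequest(
  'qwen2.5-coder:32b-instruct-fp16',
  [{ role: 'user', content: 'Write a function that reverses a string.' }],
  8192
);
console.log(JSON.stringify(body, null, 2));

// To send it (assumes a local Ollama server on the default port):
// fetch('http://localhost:11434/api/chat', {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body: JSON.stringify(body),
// });
```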

<h2 id="choosing-providers-with-openrouter">Choosing providers with OpenRouter</h2>

<p>OpenRouter allows you to ignore specific providers in your
<a href="https://openrouter.ai/settings/preferences">preferences</a>.
This can be used to limit your OpenRouter requests to be
served by only your preferred providers.</p>
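
<p>OpenRouter’s provider routing options also appear to allow the same kind of filtering per request.
The sketch below builds such a request body. The <code class="language-plaintext highlighter-rouge">provider.ignore</code> field is an assumption based on
OpenRouter’s provider routing documentation, so verify it before relying on it.
The helper name and the provider name are placeholders.</p>

```javascript
// Sketch of an OpenRouter chat completion body that skips certain providers.
// The provider.ignore field is an assumption based on OpenRouter's provider
// routing documentation; buildOpenRouterRequest is a hypothetical helper.
function buildOpenRouterRequest(model, messages, ignoredProviders) {
  return {
    model: model,
    messages: messages,
    provider: { ignore: ignoredProviders },
  };
}

const request = buildOpenRouterRequest(
  'qwen/qwen-2.5-coder-32b-instruct',
  [{ role: 'user', content: 'Hello' }],
  ['ExampleProvider'] // placeholder name, not a real recommendation
);
console.log(JSON.stringify(request, null, 2));
```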

<h2 id="notes">Notes</h2>

<p>This article went through many revisions as I received feedback from
numerous members of the community.
Here are some of the noteworthy learnings and changes:</p>

<ul>
  <li>The first version of this article included incorrect Ollama models.</li>
  <li>Earlier Ollama results used the too-small default 2k context window,
artificially harming the benchmark results.</li>
  <li>The benchmark results appear to have uncovered a problem in the way
OpenRouter was communicating with Hyperbolic.
They fixed the issue on 11/24/24, shortly after it was pointed out.</li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[Open source LLMs are becoming very powerful, but pay attention to how you (or your provider) are serving the model. It can affect code editing skill.]]></summary></entry><entry><title type="html">Separating code reasoning and editing</title><link href="https://aider.chat/2024/09/26/architect.html" rel="alternate" type="text/html" title="Separating code reasoning and editing" /><published>2024-09-26T00:00:00+00:00</published><updated>2024-09-26T00:00:00+00:00</updated><id>https://aider.chat/2024/09/26/architect</id><content type="html" xml:base="https://aider.chat/2024/09/26/architect.html"><![CDATA[<p class="post-date">September 26, 2024</p>

<h1 id="separating-code-reasoning-and-editing">Separating code reasoning and editing</h1>

<p>Aider now has experimental support for using two models to complete each coding task:</p>

<ul>
  <li>An Architect model is asked to describe how to solve the coding problem.</li>
  <li>An Editor model is given the Architect’s solution and asked to produce specific code editing instructions to apply those changes to existing source files.</li>
</ul>

<p>Splitting up “code reasoning” and “code editing” in this manner
has produced SOTA results on
<a href="/docs/benchmarks.html#the-benchmark">aider’s code editing benchmark</a>.
Using o1-preview as the Architect with either DeepSeek or o1-mini as the
Editor produced the SOTA score of 85%.
Using the Architect/Editor approach
also significantly improved the benchmark scores of many
models, compared to their previous “solo” baseline scores (striped bars).</p>

<style>
  .shaded td {
    background-color: #f2f2f2;
    border-top: 1px solid #ccc;
  }
  .table-container {
    max-width: 100%;
    overflow-x: auto;
  }
  .responsive-table {
    border-collapse: separate;
    border-spacing: 0;
    width: 100%;
    font-size: 16px;
    border: 1px solid #ddd;
  }
  .responsive-table th, .responsive-table td {
    padding: 8px;
    text-align: left;
    border-bottom: 1px solid #ddd;
    word-break: break-word;
  }
  .responsive-table th {
    background-color: #e2e2e2;
  }
  .responsive-table th:first-child,
  .responsive-table td:first-child {
    border-left: 1px solid #ddd;
  }
  .responsive-table th:last-child,
  .responsive-table td:last-child {
    border-right: 1px solid #ddd;
  }
  
  @media screen and (max-width: 600px) {
    .responsive-table {
      font-size: 12px;
    }
    .responsive-table th, .responsive-table td {
      padding: 4px;
    }
  }
</style>

<style>
  #passRateChart {
    max-width: 100%;
    height: auto !important;
  }
</style>

<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>

<script src="https://cdn.jsdelivr.net/npm/chartjs-plugin-annotation@1.0.2"></script>

<canvas id="passRateChart" width="400" height="250"></canvas>
<script>
  document.addEventListener("DOMContentLoaded", function() {
    var ctx = document.getElementById('passRateChart').getContext('2d');
    
    // Function to determine aspect ratio and base font size based on screen width
    function getChartSettings() {
      if (window.innerWidth < 600) {
        return { aspectRatio: 1, baseFontSize: 8 }; // Square chart for small screens
      } else if (window.innerWidth < 800) {
        return { aspectRatio: 1.2, baseFontSize: 10 }; // Slightly wider for medium screens
      } else {
        return { aspectRatio: 1.4, baseFontSize: 12 }; // Wider for larger screens
      }
    }

    var chartSettings = getChartSettings();
    var baseFontSize = chartSettings.baseFontSize;

    var labels = [];
    var data = [];
    var colorMapping = {
      "claude-3.5-sonnet": "rgba(75, 192, 192, 0.2)",
      "gpt-4o": "rgba(255, 99, 132, 0.2)",
      "o1-preview": "rgba(54, 162, 235, 0.2)",
      "o1-mini": "rgba(255, 206, 86, 0.2)",
      "gpt-4o-mini": "rgba(153, 102, 255, 0.2)"
    };
    var borderColorMapping = {
      "claude-3.5-sonnet": "rgba(75, 192, 192, 1)",
      "gpt-4o": "rgba(255, 99, 132, 1)",
      "o1-preview": "rgba(54, 162, 235, 1)",
      "o1-mini": "rgba(255, 206, 86, 1)",
      "gpt-4o-mini": "rgba(153, 102, 255, 1)"
    };
    var backgroundColors = [];
    var borderColors = [];
    var patterns = {};
    for (var key in colorMapping) {
      patterns[key] = ctx.createPattern(createStripePattern(colorMapping[key]), 'repeat');
    }
    
    
      
        if ("o1-mini" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("o1-mini/whole");
        }
        data.push(85.0);
        if ("o1-mini" == "") {
          backgroundColors.push(patterns["o1-preview"]);
        } else {
          backgroundColors.push(colorMapping["o1-preview"]);
        }
        borderColors.push(borderColorMapping["o1-preview"]);
      
        if ("deepseek" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("deepseek/whole");
        }
        data.push(85.0);
        if ("deepseek" == "") {
          backgroundColors.push(patterns["o1-preview"]);
        } else {
          backgroundColors.push(colorMapping["o1-preview"]);
        }
        borderColors.push(borderColorMapping["o1-preview"]);
      
        if ("claude-3-5-sonnet" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("claude-3-5-sonnet/diff");
        }
        data.push(82.7);
        if ("claude-3-5-sonnet" == "") {
          backgroundColors.push(patterns["o1-preview"]);
        } else {
          backgroundColors.push(colorMapping["o1-preview"]);
        }
        borderColors.push(borderColorMapping["o1-preview"]);
      
        if ("deepseek" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("deepseek/diff");
        }
        data.push(80.5);
        if ("deepseek" == "") {
          backgroundColors.push(patterns["o1-preview"]);
        } else {
          backgroundColors.push(colorMapping["o1-preview"]);
        }
        borderColors.push(borderColorMapping["o1-preview"]);
      
        if ("gpt-4o" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("gpt-4o/diff");
        }
        data.push(80.5);
        if ("gpt-4o" == "") {
          backgroundColors.push(patterns["o1-preview"]);
        } else {
          backgroundColors.push(colorMapping["o1-preview"]);
        }
        borderColors.push(borderColorMapping["o1-preview"]);
      
        if ("" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("/diff");
        }
        data.push(79.7);
        if ("" == "") {
          backgroundColors.push(patterns["o1-preview"]);
        } else {
          backgroundColors.push(colorMapping["o1-preview"]);
        }
        borderColors.push(borderColorMapping["o1-preview"]);
      
    
      
        if ("claude-3.5-sonnet" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("claude-3.5-sonnet/diff");
        }
        data.push(80.5);
        if ("claude-3.5-sonnet" == "") {
          backgroundColors.push(patterns["claude-3.5-sonnet"]);
        } else {
          backgroundColors.push(colorMapping["claude-3.5-sonnet"]);
        }
        borderColors.push(borderColorMapping["claude-3.5-sonnet"]);
      
        if ("deepseek" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("deepseek/diff");
        }
        data.push(78.9);
        if ("deepseek" == "") {
          backgroundColors.push(patterns["claude-3.5-sonnet"]);
        } else {
          backgroundColors.push(colorMapping["claude-3.5-sonnet"]);
        }
        borderColors.push(borderColorMapping["claude-3.5-sonnet"]);
      
        if ("deepseek" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("deepseek/whole");
        }
        data.push(78.9);
        if ("deepseek" == "") {
          backgroundColors.push(patterns["claude-3.5-sonnet"]);
        } else {
          backgroundColors.push(colorMapping["claude-3.5-sonnet"]);
        }
        borderColors.push(borderColorMapping["claude-3.5-sonnet"]);
      
        if ("" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("/diff");
        }
        data.push(77.4);
        if ("" == "") {
          backgroundColors.push(patterns["claude-3.5-sonnet"]);
        } else {
          backgroundColors.push(colorMapping["claude-3.5-sonnet"]);
        }
        borderColors.push(borderColorMapping["claude-3.5-sonnet"]);
      
    
      
        if ("gpt-4o" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("gpt-4o/diff");
        }
        data.push(75.2);
        if ("gpt-4o" == "") {
          backgroundColors.push(patterns["gpt-4o"]);
        } else {
          backgroundColors.push(colorMapping["gpt-4o"]);
        }
        borderColors.push(borderColorMapping["gpt-4o"]);
      
        if ("deepseek" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("deepseek/diff");
        }
        data.push(74.4);
        if ("deepseek" == "") {
          backgroundColors.push(patterns["gpt-4o"]);
        } else {
          backgroundColors.push(colorMapping["gpt-4o"]);
        }
        borderColors.push(borderColorMapping["gpt-4o"]);
      
        if ("deepseek" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("deepseek/whole");
        }
        data.push(73.7);
        if ("deepseek" == "") {
          backgroundColors.push(patterns["gpt-4o"]);
        } else {
          backgroundColors.push(colorMapping["gpt-4o"]);
        }
        borderColors.push(borderColorMapping["gpt-4o"]);
      
        if ("" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("/diff");
        }
        data.push(71.4);
        if ("" == "") {
          backgroundColors.push(patterns["gpt-4o"]);
        } else {
          backgroundColors.push(colorMapping["gpt-4o"]);
        }
        borderColors.push(borderColorMapping["gpt-4o"]);
      
    
      
        if ("deepseek" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("deepseek/whole");
        }
        data.push(71.4);
        if ("deepseek" == "") {
          backgroundColors.push(patterns["o1-mini"]);
        } else {
          backgroundColors.push(colorMapping["o1-mini"]);
        }
        borderColors.push(borderColorMapping["o1-mini"]);
      
        if ("gpt-4o" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("gpt-4o/diff");
        }
        data.push(70.7);
        if ("gpt-4o" == "") {
          backgroundColors.push(patterns["o1-mini"]);
        } else {
          backgroundColors.push(colorMapping["o1-mini"]);
        }
        borderColors.push(borderColorMapping["o1-mini"]);
      
        if ("deepseek" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("deepseek/diff");
        }
        data.push(69.2);
        if ("deepseek" == "") {
          backgroundColors.push(patterns["o1-mini"]);
        } else {
          backgroundColors.push(colorMapping["o1-mini"]);
        }
        borderColors.push(borderColorMapping["o1-mini"]);
      
        if ("" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("/diff");
        }
        data.push(61.1);
        if ("" == "") {
          backgroundColors.push(patterns["o1-mini"]);
        } else {
          backgroundColors.push(colorMapping["o1-mini"]);
        }
        borderColors.push(borderColorMapping["o1-mini"]);
      
    
      
        if ("gpt-4o-mini" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("gpt-4o-mini/whole");
        }
        data.push(60.2);
        if ("gpt-4o-mini" == "") {
          backgroundColors.push(patterns["gpt-4o-mini"]);
        } else {
          backgroundColors.push(colorMapping["gpt-4o-mini"]);
        }
        borderColors.push(borderColorMapping["gpt-4o-mini"]);
      
        if ("" == "") {
          labels.push("Baseline");
        } else {       
          labels.push("/whole");
        }
        data.push(55.6);
        if ("" == "") {
          backgroundColors.push(patterns["gpt-4o-mini"]);
        } else {
          backgroundColors.push(colorMapping["gpt-4o-mini"]);
        }
        borderColors.push(borderColorMapping["gpt-4o-mini"]);
      
    
    labels.reverse();
    data.reverse();
    backgroundColors.reverse();
    borderColors.reverse();
    var chart = new Chart(ctx, {
      type: 'bar',
      data: {
        labels: labels,
        datasets: [{
          label: 'Pass Rate',
          data: data,
          backgroundColor: backgroundColors,
          borderColor: borderColors,
          borderWidth: 1
        }]
      },
      options: {
        responsive: true,
        maintainAspectRatio: true,
        aspectRatio: chartSettings.aspectRatio,
        scales: {
          y: { 
            beginAtZero: true,
            title: {
              display: true,
              text: 'Pass Rate (%)',
              font: {
                size: baseFontSize + 6
              }
            },
            ticks: {
              font: {
                size: baseFontSize
              }
            }
          },
          x: {
            title: {
              display: true,
              text: 'Editor model and edit format',
              font: {
                size: baseFontSize + 6
              }
            },
            ticks: {
              font: {
                size: baseFontSize + 4
              },
              maxRotation: 90, // Allow full rotation if needed
              minRotation: 45  // Start rotating at 45 degrees to fit more labels
            }
          }
        },
        plugins: {
          annotation: {
            annotations: {
              line1: {
                type: 'line',
                yMin: 79.7,
                yMax: 79.7,
                borderColor: 'rgba(255, 99, 132, 0.8)',
                borderWidth: 2,
                borderDash: [6, 6],
                label: {
                  content: 'Previous SOTA',
                  enabled: true,
                  position: 'start',
                  xAdjust: 10,
                  font: {
                    size: baseFontSize
                  }
                }
              }
            }
          },
          legend: {
            display: true,
            title: {
              display: true,
              text: 'Architect model',
              font: {
                size: baseFontSize + 2,
                weight: 'bold'
              }
            },
            labels: {
              font: {
                size: baseFontSize + 4
              },
              generateLabels: function(chart) {
                var colorMapping = {
                  "o1-preview": "rgba(54, 162, 235, 0.2)",
                  "claude-3.5-sonnet": "rgba(75, 192, 192, 0.2)",
                  "gpt-4o": "rgba(255, 99, 132, 0.2)",
                  "o1-mini": "rgba(255, 206, 86, 0.2)",
                  "gpt-4o-mini": "rgba(153, 102, 255, 0.2)"
                };
                return Object.keys(colorMapping).reverse().map(function(key) {
                  return {
                    text: key,
                    fillStyle: colorMapping[key],
                    strokeStyle: colorMapping[key].replace('0.2', '1'),
                    lineWidth: 1
                  };
                });
              }
            }
          }
        }
      }
    });

    // Update aspect ratio and font sizes on window resize
    window.addEventListener('resize', function() {
      var newSettings = getChartSettings();
      chart.options.aspectRatio = newSettings.aspectRatio;
      baseFontSize = newSettings.baseFontSize;
      
      // Update font sizes
      chart.options.scales.y.title.font.size = baseFontSize + 6;
      chart.options.scales.y.ticks.font.size = baseFontSize;
      chart.options.scales.x.title.font.size = baseFontSize + 6;
      chart.options.scales.x.ticks.font.size = baseFontSize + 4;
      chart.options.plugins.annotation.annotations.line1.label.font.size = baseFontSize;
      chart.options.plugins.legend.title.font.size = baseFontSize + 2;
      chart.options.plugins.legend.labels.font.size = baseFontSize + 4;
      
      chart.update();
    });
  });

  function createStripePattern(baseColor) {
    var canvas = document.createElement('canvas');
    canvas.width = 10;
    canvas.height = 10;
    var ctx = canvas.getContext('2d');

    ctx.fillStyle = baseColor;
    ctx.fillRect(0, 0, canvas.width, canvas.height);
    ctx.strokeStyle = 'rgba(0, 0, 0, 0.1)';
    ctx.lineWidth = 2;
    ctx.beginPath();
    ctx.moveTo(0, 0);
    ctx.lineTo(10, 10);
    ctx.stroke();

    return canvas;
  }
</script>

<h2 id="motivation">Motivation</h2>

<p>This approach was motivated by the release of OpenAI’s o1 models.
They are strong at reasoning, but often fail to output properly formatted
code editing instructions.
It helps to instead let them describe the solution
however they prefer and then pass that output to a more traditional LLM.
This second Editor LLM can then interpret the solution description and
produce the code editing instructions needed to update
the existing source code.</p>

<p>This approach has recently become attractive for aider due to 
rapid improvements in the speed and costs of frontier models.
In particular, chaining older LLMs would have been quite slow and
incompatible with aider’s goal of providing an interactive,
pair programming AI coding experience.</p>

<h2 id="code-reasoning-and-code-editing">Code reasoning and code editing</h2>

<p>Normally aider asks the model to solve a coding problem in one prompt,
asking the LLM to explain the solution and return 
a well formatted series of file edits.
All of <a href="/docs/more/edit-formats.html">aider’s editing formats</a>
require the LLM to return source code edits in a specific text
format, so that aider can process the edits and apply them to the local source files.</p>

<p>Because this all happens in a single prompt/response round trip to the LLM,
the model has to split its attention between 
solving the coding problem and conforming to the edit format.</p>

<p>The Architect/Editor approach splits this into two inference steps, possibly
using two different LLMs:</p>

<ol>
  <li>Solve the coding problem (Architect).</li>
  <li>Turn the proposed solution into a series of well formed code edits (Editor).</li>
</ol>

<p>The Architect/Editor approach allows the Architect to focus on solving the coding problem
and <em>describe the solution however comes naturally to it</em>.
Similarly, the Editor can focus all of its attention on properly formatting the edits
without needing to reason much about how to solve the coding problem.</p>
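
<p>The two steps above can be sketched roughly as follows. This is a minimal illustration, not aider’s implementation: the <code>architectEditor</code> function, its prompts, and the stub <code>stubArchitect</code> and <code>stubEditor</code> models are all hypothetical stand-ins for real LLM calls.</p>

```javascript
// Hypothetical sketch of a two-step Architect/Editor pipeline.
// `architect` and `editor` stand in for LLM calls; they are NOT aider APIs.
async function architectEditor(task, sourceCode, architect, editor) {
  // Step 1: the Architect describes a solution in whatever form it prefers.
  const proposal = await architect(
    "Describe how to solve this coding problem:\n" + task + "\n\nCode:\n" + sourceCode
  );
  // Step 2: the Editor turns that description into well-formed edits.
  const edits = await editor(
    "Turn this proposed solution into file edits:\n" + proposal + "\n\nCode:\n" + sourceCode
  );
  return { proposal, edits };
}

// Stub models so the sketch runs without any network access.
const stubArchitect = async (prompt) => "Rename function foo to bar.";
const stubEditor = async (prompt) =>
  prompt.includes("Rename function foo to bar.")
    ? [{ search: "function foo", replace: "function bar" }]
    : [];

const pipelineDone = architectEditor(
  "Rename foo to bar",
  "function foo() {}",
  stubArchitect,
  stubEditor
).then((result) => {
  console.log(result.edits);
  return result;
});
```

<p>Because each role is just a function argument here, the same model can fill both roles or two different models can be mixed, which mirrors the pairings explored in the benchmark.</p>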

<p>We can assign the Architect and Editor roles to LLMs that are well suited to their needs.
Strong reasoning models like o1-preview make excellent Architects, while
the Editor role can be assigned to an appropriate model based on cost, speed
and code editing skill.</p>

<h2 id="results">Results</h2>

<p>The graph above and the table below show
<a href="/docs/benchmarks.html#the-benchmark">aider’s code editing benchmark</a>
scores for various combinations of Architect and Editor models.</p>

<p>Some noteworthy observations:</p>

<ul>
  <li>Pairing o1-preview as Architect with either Deepseek or o1-mini as Editor sets a SOTA significantly above the previous best score. This result is obtained with the “whole” editing format, requiring the Editor to output a full updated copy of each edited source file. Both the Architect and Editor steps are therefore quite slow, so this pairing is probably not practical for interactive use with aider.</li>
  <li>Pairing OpenAI’s o1-preview with Anthropic’s Sonnet as the Editor produces the second best result. This is an entirely practical configuration for users able to work with both providers.</li>
  <li>Pairing many models with themselves in the Architect/Editor configuration can provide
significant benefits. 
Sonnet, GPT-4o and GPT-4o-mini all scored higher when used as an Architect/Editor pair.</li>
  <li>Deepseek is surprisingly effective as an Editor model. It seems remarkably capable at turning proposed coding solutions into new, updated versions of the source files. Using the efficient “diff” editing format, Deepseek helps all the Architect models except for Sonnet.</li>
</ul>
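
<p>The size of these gains can be recomputed from the pass rates in the full results table. The snippet below is a small illustrative sketch: the <code>results</code> array copies a subset of the table rows, and the <code>gainsOverBaseline</code> helper and its field names are made up for this example.</p>

```javascript
// Pass rates copied from the full results table.
// An editor of null marks the Architect's solo baseline.
const results = [
  { architect: "o1-preview", editor: "o1-mini", passRate: 85.0 },
  { architect: "o1-preview", editor: "claude-3.5-sonnet", passRate: 82.7 },
  { architect: "o1-preview", editor: null, passRate: 79.7 },
  { architect: "claude-3.5-sonnet", editor: "claude-3.5-sonnet", passRate: 80.5 },
  { architect: "claude-3.5-sonnet", editor: null, passRate: 77.4 },
  { architect: "gpt-4o", editor: "gpt-4o", passRate: 75.2 },
  { architect: "gpt-4o", editor: null, passRate: 71.4 },
  { architect: "gpt-4o-mini", editor: "gpt-4o-mini", passRate: 60.2 },
  { architect: "gpt-4o-mini", editor: null, passRate: 55.6 },
];

// Gain of each Architect/Editor pairing over that Architect's solo baseline,
// in percentage points, rounded to one decimal place.
function gainsOverBaseline(rows) {
  const baselines = new Map(
    rows.filter((r) => r.editor === null).map((r) => [r.architect, r.passRate])
  );
  return rows
    .filter((r) => r.editor !== null)
    .map((r) => ({
      pair: r.architect + " + " + r.editor,
      gain: Number((r.passRate - baselines.get(r.architect)).toFixed(1)),
    }));
}

const gains = gainsOverBaseline(results);
console.log(gains);
```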

<h2 id="try-it">Try it!</h2>

<p>The development version of aider 
has built-in defaults to support Architect/Editor coding with
o1-preview, o1-mini, GPT-4o and Claude 3.5 Sonnet.
Run aider with <code class="language-plaintext highlighter-rouge">--architect</code> or get started quickly like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip install -U aider-chat

# Change directory into a git repo
cd /to/your/git/repo

# Work with Claude 3.5 Sonnet as the Architect and Editor
export ANTHROPIC_API_KEY=your-key-goes-here
aider --sonnet --architect

# Work with OpenAI models, using gpt-4o as the Editor
export OPENAI_API_KEY=your-key-goes-here
aider --4o --architect
aider --o1-mini --architect
aider --o1-preview --architect
</code></pre></div></div>

<h2 id="more-info">More info</h2>

<p>Aider has a number of “chat modes”, and “architect” is available as a new chat mode.
The <code class="language-plaintext highlighter-rouge">--architect</code> switch is a shortcut for <code class="language-plaintext highlighter-rouge">--chat-mode architect</code>.
For more details, see the documentation on 
<a href="/docs/usage/modes.html">aider’s chat modes</a>.</p>

<h2 id="full-results">Full results</h2>

<p>Below are the benchmark results using various models as the Architect, paired with
various models as the Editor.
Each section includes a “baseline” result,
where the model works
by itself in aider’s normal “code” editing mode
(not as part of an Architect/Editor configuration).
This “solo” baseline represents the performance previously available when using
this model with aider.</p>

<div class="table-container">
  <table class="responsive-table">
    <thead>
      <tr>
        <th>Architect</th>
        <th>Editor</th>
        <th>Edit Format</th>
        <th>Pass Rate</th>
      </tr>
    </thead>
    <tbody>
      
        
        
          <tr class="">
            <td>o1-preview</td>
            <td>o1-mini</td>
            <td style="text-align: center;">whole</td>
            <td style="text-align: right;">85.0%</td>
          </tr>
        
          <tr class="">
            <td>o1-preview</td>
            <td>deepseek</td>
            <td style="text-align: center;">whole</td>
            <td style="text-align: right;">85.0%</td>
          </tr>
        
          <tr class="">
            <td>o1-preview</td>
            <td>claude-3.5-sonnet</td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">82.7%</td>
          </tr>
        
          <tr class="">
            <td>o1-preview</td>
            <td>deepseek</td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">80.5%</td>
          </tr>
        
          <tr class="">
            <td>o1-preview</td>
            <td>gpt-4o</td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">80.5%</td>
          </tr>
        
          <tr class="">
            <td>o1-preview</td>
            <td><b>Baseline</b></td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">79.7%</td>
          </tr>
        
      
        
        
          <tr class="shaded">
            <td>claude-3.5-sonnet</td>
            <td>claude-3.5-sonnet</td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">80.5%</td>
          </tr>
        
          <tr class="shaded">
            <td>claude-3.5-sonnet</td>
            <td>deepseek</td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">78.9%</td>
          </tr>
        
          <tr class="shaded">
            <td>claude-3.5-sonnet</td>
            <td>deepseek</td>
            <td style="text-align: center;">whole</td>
            <td style="text-align: right;">78.9%</td>
          </tr>
        
          <tr class="shaded">
            <td>claude-3.5-sonnet</td>
            <td><b>Baseline</b></td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">77.4%</td>
          </tr>
        
      
        
        
          <tr class="">
            <td>gpt-4o</td>
            <td>gpt-4o</td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">75.2%</td>
          </tr>
        
          <tr class="">
            <td>gpt-4o</td>
            <td>deepseek</td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">74.4%</td>
          </tr>
        
          <tr class="">
            <td>gpt-4o</td>
            <td>deepseek</td>
            <td style="text-align: center;">whole</td>
            <td style="text-align: right;">73.7%</td>
          </tr>
        
          <tr class="">
            <td>gpt-4o</td>
            <td><b>Baseline</b></td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">71.4%</td>
          </tr>
        
      
        
        
          <tr class="shaded">
            <td>o1-mini</td>
            <td>deepseek</td>
            <td style="text-align: center;">whole</td>
            <td style="text-align: right;">71.4%</td>
          </tr>
        
          <tr class="shaded">
            <td>o1-mini</td>
            <td>gpt-4o</td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">70.7%</td>
          </tr>
        
          <tr class="shaded">
            <td>o1-mini</td>
            <td>deepseek</td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">69.2%</td>
          </tr>
        
          <tr class="shaded">
            <td>o1-mini</td>
            <td><b>Baseline</b></td>
            <td style="text-align: center;">diff</td>
            <td style="text-align: right;">61.1%</td>
          </tr>
        
      
        
        
          <tr class="">
            <td>gpt-4o-mini</td>
            <td>gpt-4o-mini</td>
            <td style="text-align: center;">whole</td>
            <td style="text-align: right;">60.2%</td>
          </tr>
        
          <tr class="">
            <td>gpt-4o-mini</td>
            <td><b>Baseline</b></td>
            <td style="text-align: center;">whole</td>
            <td style="text-align: right;">55.6%</td>
          </tr>
        
      
    </tbody>
  </table>
</div>]]></content><author><name></name></author><summary type="html"><![CDATA[An Architect model describes how to solve the coding problem, and an Editor model translates that into file edits. This Architect/Editor approach produces SOTA benchmark results.]]></summary></entry><entry><title type="html">o1-preview is SOTA on the aider leaderboard</title><link href="https://aider.chat/2024/09/12/o1.html" rel="alternate" type="text/html" title="o1-preview is SOTA on the aider leaderboard" /><published>2024-09-12T00:00:00+00:00</published><updated>2024-09-12T00:00:00+00:00</updated><id>https://aider.chat/2024/09/12/o1</id><content type="html" xml:base="https://aider.chat/2024/09/12/o1.html"><![CDATA[<p class="post-date">September 12, 2024</p>

<h1 id="openai-o1-preview-is-sota-on-the-aider-leaderboard">OpenAI o1-preview is SOTA on the aider leaderboard</h1>

<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>

<canvas id="editChart" width="800" height="450" style="margin-top: 20px"></canvas>
<script>
  document.addEventListener('DOMContentLoaded', function () {
    var ctx = document.getElementById('editChart').getContext('2d');
    var leaderboardData = {
      labels: [],
      datasets: [{
        label: 'Percent completed correctly',
        data: [],
        backgroundColor: [],
        borderColor: [],
        borderWidth: 1
      }]
    };

    var allData = [];
    
      allData.push({
        model: 'o1-preview (whole)',
        pass_rate: 79.7,
        percent_cases_well_formed: 100.0,
        edit_format: 'whole'
      });
    
      allData.push({
        model: 'claude-3.5-sonnet (diff)',
        pass_rate: 77.4,
        percent_cases_well_formed: 99.2,
        edit_format: 'diff'
      });
    
      allData.push({
        model: 'o1-preview (diff)',
        pass_rate: 75.2,
        percent_cases_well_formed: 84.2,
        edit_format: 'diff'
      });
    
      allData.push({
        model: 'claude-3.5-sonnet (whole)',
        pass_rate: 75.2,
        percent_cases_well_formed: 100.0,
        edit_format: 'whole'
      });
    
      allData.push({
        model: 'gpt-4o-2024-08-06 (diff)',
        pass_rate: 71.4,
        percent_cases_well_formed: 98.5,
        edit_format: 'diff'
      });
    
      allData.push({
        model: 'o1-mini (whole)',
        pass_rate: 70.7,
        percent_cases_well_formed: 90.0,
        edit_format: 'whole'
      });
    
      allData.push({
        model: 'o1-mini (diff)',
        pass_rate: 62.4,
        percent_cases_well_formed: 85.7,
        edit_format: 'diff'
      });
    
      allData.push({
        model: 'gpt-4o-mini (whole)',
        pass_rate: 55.6,
        percent_cases_well_formed: 100.0,
        edit_format: 'whole'
      });
    

    function updateChart() {
      var selectedRows = document.querySelectorAll('tr.selected');
      var showAll = selectedRows.length === 0;

      leaderboardData.labels = [];
      leaderboardData.datasets[0].data = [];
      leaderboardData.datasets[0].backgroundColor = [];
      leaderboardData.datasets[0].borderColor = [];

      allData.forEach(function(row, index) {
        var rowElement = document.getElementById('edit-row-' + index);
        if (showAll) {
          rowElement.classList.remove('selected');
        }
        if (showAll || rowElement.classList.contains('selected')) {
          leaderboardData.labels.push(row.model);
          leaderboardData.datasets[0].data.push(row.pass_rate);
          
          switch (row.edit_format) {
            case 'whole':
              leaderboardData.datasets[0].backgroundColor.push('rgba(255, 99, 132, 0.2)');
              leaderboardData.datasets[0].borderColor.push('rgba(255, 99, 132, 1)');
              break;
            case 'diff':
              leaderboardData.datasets[0].backgroundColor.push('rgba(54, 162, 235, 0.2)');
              leaderboardData.datasets[0].borderColor.push('rgba(54, 162, 235, 1)');
              break;
            case 'udiff':
              leaderboardData.datasets[0].backgroundColor.push('rgba(75, 192, 192, 0.2)');
              leaderboardData.datasets[0].borderColor.push('rgba(75, 192, 192, 1)');
              break;
            case 'diff-fenced':
              leaderboardData.datasets[0].backgroundColor.push('rgba(153, 102, 255, 0.2)');
              leaderboardData.datasets[0].borderColor.push('rgba(153, 102, 255, 1)');
              break;
            default:
              leaderboardData.datasets[0].backgroundColor.push('rgba(201, 203, 207, 0.2)');
              leaderboardData.datasets[0].borderColor.push('rgba(201, 203, 207, 1)');
          }
        }
      });

      // Apply legend filtering
      var meta = leaderboardChart.getDatasetMeta(0);
      meta.data.forEach(function(bar, index) {
        if (leaderboardData.labels.includes(allData[index].model)) {
          bar.hidden = (allData[index].edit_format === 'whole' && meta.data[0].hidden) ||
                       (allData[index].edit_format !== 'whole' && meta.data[1].hidden);
        } else {
          bar.hidden = true;
        }
      });

      leaderboardChart.update();
    }

    var tableBody = document.querySelector('table tbody');
    allData.forEach(function(row, index) {
      var tr = tableBody.children[index];
      tr.id = 'edit-row-' + index;
      tr.style.cursor = 'pointer';
      tr.onclick = function() {
        this.classList.toggle('selected');
        updateChart();
      };
    });

    var leaderboardChart = new Chart(ctx, {
      type: 'bar',
      data: leaderboardData,
      options: {
        scales: {
          y: {
            beginAtZero: true,
            title: {
              display: true,
              text: 'Correct Exercises (%)'
            }
          },
          x: {
            ticks: {
              autoSkip: false,
              maxRotation: 90,
              minRotation: 0
            }
          }
        },
        plugins: {
          legend: {
            display: true,
            position: 'top',
            labels: {
              generateLabels: function(chart) {
                var uniqueFormats = [...new Set(allData.map(item => item.edit_format))];
                return uniqueFormats.map(format => {
                  var color;
                  switch (format) {
                    case 'whole':
                      color = { fill: 'rgba(255, 99, 132, 0.2)', stroke: 'rgba(255, 99, 132, 1)' };
                      break;
                    case 'diff':
                      color = { fill: 'rgba(54, 162, 235, 0.2)', stroke: 'rgba(54, 162, 235, 1)' };
                      break;
                    case 'udiff':
                      color = { fill: 'rgba(75, 192, 192, 0.2)', stroke: 'rgba(75, 192, 192, 1)' };
                      break;
                    case 'diff-fenced':
                      color = { fill: 'rgba(153, 102, 255, 0.2)', stroke: 'rgba(153, 102, 255, 1)' };
                      break;
                    default:
                      color = { fill: 'rgba(201, 203, 207, 0.2)', stroke: 'rgba(201, 203, 207, 1)' };
                  }
                  return {
                    text: format,
                    fillStyle: color.fill,
                    strokeStyle: color.stroke,
                    lineWidth: 1,
                    hidden: false
                  };
                });
              }
            },
            onClick: function(e, legendItem, legend) {
              var ci = legend.chart;
              var clickedFormat = legendItem.text;
              
              legendItem.hidden = !legendItem.hidden;
              
              ci.data.datasets[0].data.forEach(function(dataPoint, i) {
                var meta = ci.getDatasetMeta(0);
                if (allData[i].edit_format === clickedFormat) {
                  meta.data[i].hidden = legendItem.hidden;
                }
              });
              
              ci.update();
            }
          }
        }
      }
    });

    updateChart();
  });
</script>

<h2 id="o1-preview">o1-preview</h2>

<p>OpenAI o1-preview scored 79.7% on aider’s code editing benchmark,
a state-of-the-art result.
It achieved this result with the 
<a href="/docs/leaderboards/#notes-on-the-edit-format">“whole” edit format</a>,
where the LLM returns a full copy of the source code file with changes.</p>

<p>It is much more practical to use aider’s
<a href="/docs/leaderboards/#notes-on-the-edit-format">“diff” edit format</a>,
which allows the LLM to return search/replace blocks to 
efficiently edit the source code.
This saves significant time and token costs.</p>
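
<p>To illustrate why search/replace edits are cheaper than returning whole files, here is a minimal sketch of applying one such edit. The <code>{ search, replace }</code> object shape and the <code>applySearchReplace</code> helper are assumptions made for illustration; they are not aider’s actual edit format or code.</p>

```javascript
// Minimal sketch: apply one search/replace edit to a source string.
// The { search, replace } shape is an assumption for illustration,
// not aider's actual on-the-wire edit format.
function applySearchReplace(source, edit) {
  const index = source.indexOf(edit.search);
  if (index === -1) {
    throw new Error("search text not found in source");
  }
  // Replace only the first occurrence, leaving the rest of the file untouched.
  return (
    source.slice(0, index) +
    edit.replace +
    source.slice(index + edit.search.length)
  );
}

const original = "def greet():\n    print('hello')\n";
const updated = applySearchReplace(original, {
  search: "print('hello')",
  replace: "print('goodbye')",
});
console.log(updated);
```

<p>The model only has to emit the changed fragment and its replacement, so the output size scales with the size of the change rather than the size of the file.</p>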

<p>Using the diff edit format, the o1-preview model had a strong
benchmark score of 75.2%.
This likely places o1-preview between Sonnet and GPT-4o for practical use,
but at significantly higher cost.</p>

<h2 id="o1-mini">o1-mini</h2>

<p>OpenAI o1-mini is priced similarly to GPT-4o and Claude 3.5 Sonnet,
but scored below those models.
It also works best with the whole edit format.</p>

<h2 id="future-work">Future work</h2>

<p>The o1-preview model had trouble conforming to aider’s diff edit format.
The o1-mini model had trouble conforming to both the whole and diff edit formats.
These failures occurred even though aider is extremely permissive and tries hard to accept anything close
to the correct formats.</p>

<p>It is surprising that such strong models had trouble with
the syntactic requirements of simple text output formats.
It seems likely that aider could optimize its prompts and edit formats to
better harness the o1 models.</p>

<h2 id="using-aider-with-o1">Using aider with o1</h2>

<p>OpenAI’s new o1 models are supported in v0.57.0 of aider:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aider --model o1-mini
aider --model o1-preview
</code></pre></div></div>

<blockquote class="note">
  <p>These are initial benchmark results for the o1 models,
based on aider v0.56.1-dev.
See the <a href="/docs/leaderboards/">aider leaderboards</a> for up-to-date results
based on the latest aider releases.</p>
</blockquote>

<table style="width: 100%; max-width: 800px; margin: auto; border-collapse: collapse; box-shadow: 0 2px 4px rgba(0,0,0,0.1); font-size: 14px;">
  <thead style="background-color: #f2f2f2;">
    <tr>
      <th style="padding: 8px; text-align: left;">Model</th>
      <th style="padding: 8px; text-align: center;">Percent completed correctly</th>
      <th style="padding: 8px; text-align: center;">Percent using correct edit format</th>
      <th style="padding: 8px; text-align: left;">Command</th>
      <th style="padding: 8px; text-align: center;">Edit format</th>
    </tr>
  </thead>
  <tbody>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">o1-preview (whole)</td>
        <td style="padding: 8px; text-align: center;">79.7%</td>
        <td style="padding: 8px; text-align: center;">100.0%</td>
        <td style="padding: 8px;"><code>aider --model o1-preview</code></td>
        <td style="padding: 8px; text-align: center;">whole</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">claude-3.5-sonnet (diff)</td>
        <td style="padding: 8px; text-align: center;">77.4%</td>
        <td style="padding: 8px; text-align: center;">99.2%</td>
        <td style="padding: 8px;"><code>aider --sonnet</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">o1-preview (diff)</td>
        <td style="padding: 8px; text-align: center;">75.2%</td>
        <td style="padding: 8px; text-align: center;">84.2%</td>
        <td style="padding: 8px;"><code>aider --model o1-preview</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">claude-3.5-sonnet (whole)</td>
        <td style="padding: 8px; text-align: center;">75.2%</td>
        <td style="padding: 8px; text-align: center;">100.0%</td>
        <td style="padding: 8px;"><code>aider --model openrouter/anthropic/claude-3.5-sonnet --edit-format whole</code></td>
        <td style="padding: 8px; text-align: center;">whole</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">gpt-4o-2024-08-06 (diff)</td>
        <td style="padding: 8px; text-align: center;">71.4%</td>
        <td style="padding: 8px; text-align: center;">98.5%</td>
        <td style="padding: 8px;"><code>aider --model openai/gpt-4o-2024-08-06</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">o1-mini (whole)</td>
        <td style="padding: 8px; text-align: center;">70.7%</td>
        <td style="padding: 8px; text-align: center;">90.0%</td>
        <td style="padding: 8px;"><code>aider --model o1-mini</code></td>
        <td style="padding: 8px; text-align: center;">whole</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">o1-mini (diff)</td>
        <td style="padding: 8px; text-align: center;">62.4%</td>
        <td style="padding: 8px; text-align: center;">85.7%</td>
        <td style="padding: 8px;"><code>aider --model o1-mini --edit-format diff</code></td>
        <td style="padding: 8px; text-align: center;">diff</td>
      </tr>
    
      <tr style="border-bottom: 1px solid #ddd;">
        <td style="padding: 8px;">gpt-4o-mini (whole)</td>
        <td style="padding: 8px; text-align: center;">55.6%</td>
        <td style="padding: 8px; text-align: center;">100.0%</td>
        <td style="padding: 8px;"><code>aider --model gpt-4o-mini</code></td>
        <td style="padding: 8px; text-align: center;">whole</td>
      </tr>
    
  </tbody>
</table>

<style>
  tr.selected {
    color: #0056b3;
  }
  table {
    table-layout: fixed;
  }
  td, th {
    word-wrap: break-word;
    overflow-wrap: break-word;
  }
  td:nth-child(3), td:nth-child(4) {
    font-size: 12px;
  }
</style>]]></content><author><name></name></author><summary type="html"><![CDATA[Preliminary benchmark results for the new OpenAI o1 models.]]></summary></entry><entry><title type="html">Sonnet seems as good as ever</title><link href="https://aider.chat/2024/08/26/sonnet-seems-fine.html" rel="alternate" type="text/html" title="Sonnet seems as good as ever" /><published>2024-08-26T00:00:00+00:00</published><updated>2024-08-26T00:00:00+00:00</updated><id>https://aider.chat/2024/08/26/sonnet-seems-fine</id><content type="html" xml:base="https://aider.chat/2024/08/26/sonnet-seems-fine.html"><![CDATA[<p class="post-date">August 26, 2024</p>

<h1 id="sonnet-seems-as-good-as-ever">Sonnet seems as good as ever</h1>

<p>Recently there has been a lot of speculation that Sonnet has been
dumbed-down, nerfed or is otherwise performing worse.
However, Sonnet seems as good as ever when running the
<a href="/docs/benchmarks.html#the-benchmark">aider code editing benchmark</a>
via the API.</p>

<p>Below is a graph showing the performance of Claude 3.5 Sonnet over time.
It shows every clean, comparable benchmark run performed since Sonnet launched.
Benchmarks were performed for various reasons, usually
to evaluate the effects of small changes to aider’s system prompts.</p>

<p>The graph shows variance, but no indication of a noteworthy
degradation.
There is always some variance in benchmark results, typically +/- 2%
between runs with identical prompts.</p>
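
<p>To put that variance in perspective, the arithmetic below converts it into a number of exercises. It is a rough sketch that assumes the +/- 2% refers to percentage points of pass rate, and it uses the 133 test cases recorded for most of the runs in the data below.</p>

```javascript
// Assumes 133 test cases per run, as recorded in most of the benchmark data below.
const testCases = 133;

// One exercise flipping between pass and fail moves the pass rate by this many points.
const pointsPerExercise = 100 / testCases;

// Number of exercises that corresponds to a swing of 2 percentage points.
const exercisesForTwoPoints = 2 / pointsPerExercise;

console.log(pointsPerExercise.toFixed(2));     // points per exercise
console.log(exercisesForTwoPoints.toFixed(2)); // exercises per 2 points
```

<p>Under these assumptions, a swing of 2 points corresponds to only two or three exercises changing outcome between runs.</p>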

<p>It’s worth noting that these results would not capture any changes
made to Anthropic web chat’s use of Sonnet.</p>

<div class="chart-container" style="position: relative; height:400px; width:100%">
    <canvas id="sonnetPerformanceChart"></canvas>
</div>

<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>

<script src="https://cdn.jsdelivr.net/npm/moment@2.29.4/moment.min.js"></script>

<script src="https://cdn.jsdelivr.net/npm/chartjs-adapter-moment@1.0.1/dist/chartjs-adapter-moment.min.js"></script>

<script>
document.addEventListener('DOMContentLoaded', function() {
    var ctx = document.getElementById('sonnetPerformanceChart').getContext('2d');
    var sonnetData = [{"dirname":"2024-06-20-15-16-41--claude-3.5-sonnet-diff","test_cases":133,"model":"openrouter/anthropic/claude-3.5-sonnet","edit_format":"diff","commit_hash":"068609e-dirty","pass_rate_1":57.9,"pass_rate_2":74.4,"percent_cases_well_formed":97.0,"error_outputs":48,"num_malformed_responses":11,"num_with_malformed_responses":4,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model openrouter/anthropic/claude-3.5-sonnet","date":"2024-06-20","versions":"0.38.1-dev","seconds_per_case":21.6,"total_cost":0.0},{"dirname":"2024-06-24-12-48-43--claude-3.5-sonnet-udiff","test_cases":133,"model":"openrouter/anthropic/claude-3.5-sonnet","edit_format":"udiff","commit_hash":"7be08c7","pass_rate_1":62.4,"pass_rate_2":74.4,"percent_cases_well_formed":100.0,"error_outputs":10,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":10,"lazy_comments":0,"syntax_errors":1,"indentation_errors":2,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model openrouter/anthropic/claude-3.5-sonnet","date":"2024-06-24","versions":"0.39.1-dev","seconds_per_case":14.3,"total_cost":0.0},{"dirname":"2024-06-24-17-44-31--claude-3.5-sonnet-diff-less-chatty","test_cases":133,"model":"openrouter/anthropic/claude-3.5-sonnet","edit_format":"diff","commit_hash":"0d484e5","pass_rate_1":57.9,"pass_rate_2":74.4,"percent_cases_well_formed":99.2,"error_outputs":14,"num_malformed_responses":3,"num_with_malformed_responses":1,"user_asks":2,"lazy_comments":0,"syntax_errors":1,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":4,"command":"aider --model 
openrouter/anthropic/claude-3.5-sonnet","date":"2024-06-24","versions":"0.39.1-dev","seconds_per_case":16.0,"total_cost":0.0},{"dirname":"2024-06-24-17-50-46--claude-3.5-sonnet-diff-less-chatty2","test_cases":133,"model":"openrouter/anthropic/claude-3.5-sonnet","edit_format":"diff","commit_hash":3015495,"pass_rate_1":59.4,"pass_rate_2":76.7,"percent_cases_well_formed":99.2,"error_outputs":5,"num_malformed_responses":1,"num_with_malformed_responses":1,"user_asks":1,"lazy_comments":0,"syntax_errors":0,"indentation_errors":1,"exhausted_context_windows":0,"test_timeouts":2,"command":"aider --model openrouter/anthropic/claude-3.5-sonnet","date":"2024-06-24","versions":"0.39.1-dev","seconds_per_case":15.7,"total_cost":0.0},{"dirname":"2024-06-24-17-56-40--claude-3.5-sonnet-diff-less-chatty-sys-examples","test_cases":133,"model":"openrouter/anthropic/claude-3.5-sonnet","edit_format":"diff","commit_hash":"3015495-dirty","pass_rate_1":58.6,"pass_rate_2":75.9,"percent_cases_well_formed":100.0,"error_outputs":2,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":3,"command":"aider --model openrouter/anthropic/claude-3.5-sonnet","date":"2024-06-24","versions":"0.39.1-dev","seconds_per_case":15.9,"total_cost":0.0},{"dirname":"2024-07-04-14-32-08--claude-3.5-sonnet-diff-continue","test_cases":133,"model":"openrouter/anthropic/claude-3.5-sonnet","edit_format":"diff","commit_hash":"35f21b5","pass_rate_1":57.1,"pass_rate_2":77.4,"percent_cases_well_formed":99.2,"error_outputs":23,"num_malformed_responses":4,"num_with_malformed_responses":1,"user_asks":2,"lazy_comments":0,"syntax_errors":1,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model 
openrouter/anthropic/claude-3.5-sonnet","date":"2024-07-04","versions":"0.42.1-dev","seconds_per_case":17.6,"total_cost":3.6346},{"dirname":"2024-07-06-19-39-59--claude-3.5-sonnet-diff-platform","test_cases":133,"model":"openrouter/anthropic/claude-3.5-sonnet","edit_format":"diff","commit_hash":"e47c2a9-dirty","pass_rate_1":57.9,"pass_rate_2":78.2,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model openrouter/anthropic/claude-3.5-sonnet","date":"2024-07-06","versions":"0.42.1-dev","seconds_per_case":14.6,"total_cost":3.5616},{"dirname":"2024-07-24-17-11-07--claude-3.5-sonnet-diff-july24","test_cases":133,"model":"openrouter/anthropic/claude-3.5-sonnet","edit_format":"diff","commit_hash":"859a13e","pass_rate_1":59.4,"pass_rate_2":78.2,"percent_cases_well_formed":99.2,"error_outputs":6,"num_malformed_responses":1,"num_with_malformed_responses":1,"user_asks":1,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model openrouter/anthropic/claude-3.5-sonnet","date":"2024-07-24","versions":"0.45.2-dev","seconds_per_case":16.9,"total_cost":3.4981},{"dirname":"2024-07-28-20-23-42--claude-3.5-sonnet-diff-no-reminder","test_cases":94,"model":"openrouter/anthropic/claude-3.5-sonnet","edit_format":"diff","commit_hash":"e799e89-dirty","pass_rate_1":59.6,"pass_rate_2":83.0,"percent_cases_well_formed":98.9,"error_outputs":12,"num_malformed_responses":2,"num_with_malformed_responses":1,"user_asks":2,"lazy_comments":0,"syntax_errors":1,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model 
openrouter/anthropic/claude-3.5-sonnet","date":"2024-07-28","versions":"0.45.2-dev","seconds_per_case":15.7,"total_cost":2.434},{"dirname":"2024-08-14-00-46-09--claude-3.5-sonnet-diff-no-ipynb-again","test_cases":133,"model":"openrouter/anthropic/claude-3.5-sonnet","edit_format":"diff","commit_hash":"139f799","pass_rate_1":57.9,"pass_rate_2":75.9,"percent_cases_well_formed":98.5,"error_outputs":22,"num_malformed_responses":5,"num_with_malformed_responses":2,"user_asks":249,"lazy_comments":0,"syntax_errors":1,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model openrouter/anthropic/claude-3.5-sonnet","date":"2024-08-14","versions":"0.50.1-dev","seconds_per_case":18.0,"total_cost":3.7058},{"dirname":"2024-06-21-00-07-01--claude-3.5-sonnet-do-over","test_cases":133,"model":"openrouter/anthropic/claude-3.5-sonnet","edit_format":"diff","commit_hash":"fb26174-dirty","pass_rate_1":59.4,"pass_rate_2":80.5,"percent_cases_well_formed":99.2,"error_outputs":20,"num_malformed_responses":4,"num_with_malformed_responses":1,"user_asks":1,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model openrouter/anthropic/claude-3.5-sonnet","date":"2024-06-21","versions":"0.39.1-dev","seconds_per_case":18.3,"total_cost":0.0},{"dirname":"2024-06-21-00-18-25--claude-3.5-sonnet-do-over2","test_cases":133,"model":"openrouter/anthropic/claude-3.5-sonnet","edit_format":"diff","commit_hash":"fb26174-dirty","pass_rate_1":58.6,"pass_rate_2":77.4,"percent_cases_well_formed":98.5,"error_outputs":22,"num_malformed_responses":4,"num_with_malformed_responses":2,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model 
openrouter/anthropic/claude-3.5-sonnet","date":"2024-06-21","versions":"0.39.1-dev","seconds_per_case":17.3,"total_cost":0.0},{"dirname":"2024-06-24-00-09-40--claude-3.5-sonnet-chatty","test_cases":133,"model":"openrouter/anthropic/claude-3.5-sonnet","edit_format":"diff","commit_hash":"b44c246-dirty","pass_rate_1":59.4,"pass_rate_2":75.2,"percent_cases_well_formed":98.5,"error_outputs":21,"num_malformed_responses":5,"num_with_malformed_responses":2,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":2,"command":"aider --model openrouter/anthropic/claude-3.5-sonnet","date":"2024-06-24","versions":"0.39.1-dev","seconds_per_case":15.7,"total_cost":0.0},{"dirname":"2024-06-24-00-33-35--claude-3.5-sonnet-chatty-do-over","test_cases":133,"model":"openrouter/anthropic/claude-3.5-sonnet","edit_format":"diff","commit_hash":"bc1dfa3","pass_rate_1":58.6,"pass_rate_2":76.7,"percent_cases_well_formed":97.7,"error_outputs":26,"num_malformed_responses":6,"num_with_malformed_responses":3,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":2,"command":"aider --model openrouter/anthropic/claude-3.5-sonnet","date":"2024-06-24","versions":"0.39.1-dev","seconds_per_case":16.4,"total_cost":0.0},{"dirname":"2024-08-18-19-57-30--claude-3.5-sonnet-aug18","test_cases":133,"model":"openrouter/anthropic/claude-3.5-sonnet","edit_format":"diff","commit_hash":"5099a5c","pass_rate_1":54.9,"pass_rate_2":78.9,"percent_cases_well_formed":97.7,"error_outputs":47,"num_malformed_responses":11,"num_with_malformed_responses":3,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":2,"command":"aider --model 
openrouter/anthropic/claude-3.5-sonnet","date":"2024-08-18","versions":"0.50.2-dev","seconds_per_case":22.3,"total_cost":3.9008},{"dirname":"2024-08-18-20-23-50--claude-3.5-sonnet-aug18-cache-prompts","test_cases":133,"model":"openrouter/anthropic/claude-3.5-sonnet","edit_format":"diff","commit_hash":"53db8cf-dirty","pass_rate_1":56.4,"pass_rate_2":78.9,"percent_cases_well_formed":97.7,"error_outputs":16,"num_malformed_responses":4,"num_with_malformed_responses":3,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":3,"command":"aider --model openrouter/anthropic/claude-3.5-sonnet","date":"2024-08-18","versions":"0.50.2-dev","seconds_per_case":21.1,"total_cost":3.6918},{"dirname":"2024-08-18-23-11-04--claude-3.5-sonnet-aug18-cache-prompts-cold","test_cases":133,"model":"openrouter/anthropic/claude-3.5-sonnet","edit_format":"diff","commit_hash":"53db8cf-dirty","pass_rate_1":56.4,"pass_rate_2":78.2,"percent_cases_well_formed":97.0,"error_outputs":30,"num_malformed_responses":7,"num_with_malformed_responses":4,"user_asks":1,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":2,"command":"aider --model openrouter/anthropic/claude-3.5-sonnet","date":"2024-08-18","versions":"0.50.2-dev","seconds_per_case":21.8,"total_cost":3.7858},{"dirname":"2024-08-21-01-07-39--sonnet-diff-cache","test_cases":133,"model":"claude-3-5-sonnet-20240620","edit_format":"diff","commit_hash":"e12157b-dirty","pass_rate_1":57.1,"pass_rate_2":82.0,"percent_cases_well_formed":98.5,"error_outputs":12,"num_malformed_responses":2,"num_with_malformed_responses":2,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":2,"command":"aider --model 
claude-3-5-sonnet-20240620","date":"2024-08-21","versions":"0.51.2-dev","seconds_per_case":14.5,"total_cost":3.1795},{"dirname":"2024-08-21-00-50-49--shell-cmds-sonnet-user-remind","test_cases":133,"model":"openrouter/anthropic/claude-3.5-sonnet","edit_format":"diff","commit_hash":"919ea05","pass_rate_1":63.2,"pass_rate_2":79.7,"percent_cases_well_formed":98.5,"error_outputs":18,"num_malformed_responses":4,"num_with_malformed_responses":2,"user_asks":26,"lazy_comments":0,"syntax_errors":0,"indentation_errors":2,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model openrouter/anthropic/claude-3.5-sonnet","date":"2024-08-21","versions":"0.51.2-dev","seconds_per_case":16.3,"total_cost":3.4738},{"dirname":"2024-08-21-00-55-30--shell-cmds-sonnet-no-user-remind","test_cases":133,"model":"openrouter/anthropic/claude-3.5-sonnet","edit_format":"diff","commit_hash":"5c7707a","pass_rate_1":63.9,"pass_rate_2":80.5,"percent_cases_well_formed":97.7,"error_outputs":51,"num_malformed_responses":12,"num_with_malformed_responses":3,"user_asks":24,"lazy_comments":0,"syntax_errors":0,"indentation_errors":1,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model openrouter/anthropic/claude-3.5-sonnet","date":"2024-08-21","versions":"0.51.2-dev","seconds_per_case":17.7,"total_cost":3.899}];

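    // Reduce each benchmark run to its date and two pass rates,
    // then sort the runs chronologically for plotting.
    var chartData = sonnetData.map(item => ({
        x: moment(item.date).toDate(),
        y1: item.pass_rate_1,
        y2: item.pass_rate_2
    })).sort((a, b) => a.x - b.x);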

    new Chart(ctx, {
        type: 'scatter',
        data: {
            datasets: [{
                label: 'Pass Rate 1',
                data: chartData.map(item => ({ x: item.x, y: item.y1 })),
                backgroundColor: 'rgb(75, 192, 192)',
                pointRadius: 5,
                pointHoverRadius: 7
            }, {
                label: 'Pass Rate 2',
                data: chartData.map(item => ({ x: item.x, y: item.y2 })),
                backgroundColor: 'rgb(255, 99, 132)',
                pointRadius: 5,
                pointHoverRadius: 7
            }]
        },
        options: {
            responsive: true,
            maintainAspectRatio: false,
            scales: {
                y: {
                    beginAtZero: true,
                    title: {
                        display: true,
                        text: 'Pass Rate (%)',
                        font: {
                            size: 14
                        }
                    },
                    ticks: {
                        font: {
                            size: 12
                        }
                    }
                },
                x: {
                    type: 'time',
                    time: {
                        unit: 'day'
                    },
                    title: {
                        display: true,
                        text: 'Date',
                        font: {
                            size: 14
                        }
                    },
                    ticks: {
                        font: {
                            size: 12
                        }
                    }
                }
            },
            plugins: {
                title: {
                    display: true,
                    text: 'Claude 3.5 Sonnet Performance Over Time',
                    font: {
                        size: 18
                    }
                },
                legend: {
                    labels: {
                        font: {
                            size: 14
                        }
                    }
                },
                tooltip: {
                    callbacks: {
                        label: function(context) {
                            let label = context.dataset.label || '';
                            if (label) {
                                label += ': ';
                            }
                            if (context.parsed.y !== null) {
                                label += context.parsed.y.toFixed(1) + '%';
                            }
                            return label;
                        }
                    }
                }
            }
        }
    });
});
</script>

<blockquote>
  <p>This graph shows the performance of Claude 3.5 Sonnet on
<a href="/docs/benchmarks.html#the-benchmark">aider’s code editing benchmark</a>
over time. ‘Pass Rate 1’ is the success rate on the first attempt, while ‘Pass Rate 2’ is the success rate after a second attempt, in which the model gets a chance to fix errors reported by the tests.
The
<a href="https://aider.chat/docs/leaderboards/">aider LLM code editing leaderboard</a>
ranks models based on Pass Rate 2.</p>
</blockquote>]]></content><author><name></name></author><summary type="html"><![CDATA[Sonnet's score on the aider code editing benchmark has been stable since it launched.]]></summary></entry><entry><title type="html">LLMs are bad at returning code in JSON</title><link href="https://aider.chat/2024/08/14/code-in-json.html" rel="alternate" type="text/html" title="LLMs are bad at returning code in JSON" /><published>2024-08-14T00:00:00+00:00</published><updated>2024-08-14T00:00:00+00:00</updated><id>https://aider.chat/2024/08/14/code-in-json</id><content type="html" xml:base="https://aider.chat/2024/08/14/code-in-json.html"><![CDATA[<p class="post-date">August 14, 2024</p>

<h1 id="llms-are-bad-at-returning-code-in-json">LLMs are bad at returning code in JSON</h1>

<p>LLMs produce lower-quality code when they’re asked to return it as part of a structured JSON response. This seems to be true for many top models, including those with specialized support for JSON. Benchmarks show that models struggle with syntax errors in the code
they write, caused by having to quote and escape it into JSON.
The benchmark results also suggest that the burden of JSON formatting reduces the models’ capacity for solving coding problems.</p>
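
<p>As a rough illustration of that quoting and escaping burden, the sketch below serializes a tiny piece of code into JSON and counts the escape sequences it needs. It is not taken from the benchmark: the Python snippet, the <code>path</code> and <code>content</code> field names, and the <code>parses</code> helper are all made up for this example, and it assumes a Node.js runtime.</p>

```javascript
// A made-up two-line Python function standing in for model output.
const code = 'def greet(name):\n    print(f"Hello, {name}!")\n';

// Wrapped in JSON, the code becomes one string value, so every newline
// and double quote inside it must be written as an escape sequence.
// The field names here are hypothetical, not aider's actual schema.
const jsonReply = JSON.stringify({ path: 'greet.py', content: code });

// Count the backslash escapes that now appear in the JSON text.
const escapes = (jsonReply.match(/\\./g) || []).length;

// Returns true when the text is valid JSON, false otherwise.
function parses(text) {
  try {
    JSON.parse(text);
    return true;
  } catch (err) {
    return false;
  }
}

// The same kind of reply with the inner quotes left unescaped.
const broken = '{"path": "greet.py", "content": "print("hi")"}';

console.log(jsonReply);
console.log('escape sequences:', escapes);
console.log('round trip ok:', JSON.parse(jsonReply).content === code);
console.log('broken reply parses:', parses(broken));
```

<p>Even this two-line function needs several escapes to survive the round trip, and a single unescaped quote is enough to make the whole reply unparseable.</p>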

<style>
    .chart-container {
        position: relative;
        width: 100%;
        max-width: 800px;
        margin: 0 auto;
    }
</style>

<div class="chart-container">
    <canvas id="passRateChart"></canvas>
</div>

<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>

<script>
document.addEventListener('DOMContentLoaded', function () {
    var ctx = document.getElementById('passRateChart').getContext('2d');
    var chartContainer = document.querySelector('.chart-container');
    
    var yamlData = [{"dirname":"2024-08-15-13-17-11--json-no-lint-gpt-4o-2024-08-06-whole","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":60.2,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":4.3,"total_cost":0.7965},{"dirname":"2024-08-15-13-18-36--json-no-lint-gpt-4o-2024-08-06-func","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"JSON","commit_hash":"bac04a2","pass_rate_1":57.9,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":1,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":5.7,"total_cost":0.8417},{"dirname":"2024-08-15-13-21-55--json-no-lint-gpt-4o-2024-05-13-func","test_cases":133,"model":"gpt-4o-2024-05-13","edit_format":"JSON","commit_hash":"bac04a2","pass_rate_1":60.2,"percent_cases_well_formed":100.0,"error_outputs":2,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":1,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model 
gpt-4o-2024-05-13","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":7.1,"total_cost":1.2285},{"dirname":"2024-08-15-13-23-33--json-no-lint-claude-3.5-sonnet-whole","test_cases":133,"model":"claude-3.5-sonnet","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":60.2,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model claude-3.5-sonnet","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":10.5,"total_cost":1.6714},{"dirname":"2024-08-15-13-26-38--json-no-lint-deepseek-coder-whole","test_cases":133,"model":"deepseek-coder V2 0724","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":59.4,"percent_cases_well_formed":100.0,"error_outputs":2,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":2,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model deepseek-coder","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":27.9,"total_cost":0.0438},{"dirname":"2024-08-15-13-50-03--json-no-lint-gpt-4o-2024-08-06-whole-2","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":61.7,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model 
gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":4.2,"total_cost":0.7946},{"dirname":"2024-08-15-13-51-36--json-no-lint-gpt-4o-2024-08-06-func-2","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"JSON","commit_hash":"bac04a2","pass_rate_1":56.4,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":1,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":6.4,"total_cost":0.839},{"dirname":"2024-08-15-13-54-53--json-no-lint-gpt-4o-2024-05-13-func-2","test_cases":133,"model":"gpt-4o-2024-05-13","edit_format":"JSON","commit_hash":"bac04a2","pass_rate_1":60.2,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":1,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model gpt-4o-2024-05-13","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":7.7,"total_cost":1.221},{"dirname":"2024-08-15-13-56-21--json-no-lint-claude-3.5-sonnet-whole-2","test_cases":133,"model":"claude-3.5-sonnet","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":60.9,"percent_cases_well_formed":100.0,"error_outputs":1,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model claude-3.5-sonnet","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":16.5,"total_cost":1.6556},{"dirname":"2024-08-15-14-06-12--json-no-lint-deepseek-coder-whole-2","test_cases":133,"model":"deepseek-coder V2 
0724","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":60.9,"percent_cases_well_formed":100.0,"error_outputs":2,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":1,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model deepseek-coder","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":25.8,"total_cost":0.0439},{"dirname":"2024-08-15-14-11-45--json-no-lint-gpt-4o-2024-08-06-whole-3","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":60.9,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":4.3,"total_cost":0.7945},{"dirname":"2024-08-15-14-13-11--json-no-lint-gpt-4o-2024-08-06-func-3","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"JSON","commit_hash":"bac04a2","pass_rate_1":56.4,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":1,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":5.6,"total_cost":0.822},{"dirname":"2024-08-15-14-16-34--json-no-lint-gpt-4o-2024-05-13-func-3","test_cases":133,"model":"gpt-4o-2024-05-13","edit_format":"JSON","commit_hash":"bac04a2","pass_rate_1":58.6,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model 
gpt-4o-2024-05-13","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":8.7,"total_cost":1.2064},{"dirname":"2024-08-15-14-17-51--json-no-lint-claude-3.5-sonnet-whole-3","test_cases":133,"model":"claude-3.5-sonnet","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":60.2,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model claude-3.5-sonnet","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":11.0,"total_cost":1.6555},{"dirname":"2024-08-15-14-21-06--json-no-lint-deepseek-coder-whole-3","test_cases":133,"model":"deepseek-coder V2 0724","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":61.7,"percent_cases_well_formed":100.0,"error_outputs":3,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":2,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":3,"command":"aider --model deepseek-coder","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":24.4,"total_cost":0.0439},{"dirname":"2024-08-15-14-27-17--json-no-lint-gpt-4o-2024-08-06-whole-4","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":60.2,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model 
gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":4.3,"total_cost":0.8015},{"dirname":"2024-08-15-14-28-58--json-no-lint-gpt-4o-2024-08-06-func-4","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"JSON","commit_hash":"bac04a2","pass_rate_1":60.2,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":6.0,"total_cost":0.8394},{"dirname":"2024-08-15-14-32-58--json-no-lint-gpt-4o-2024-05-13-func-4","test_cases":133,"model":"gpt-4o-2024-05-13","edit_format":"JSON","commit_hash":"bac04a2","pass_rate_1":59.4,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":2,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model gpt-4o-2024-05-13","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":11.1,"total_cost":1.212},{"dirname":"2024-08-15-14-34-39--json-no-lint-claude-3.5-sonnet-whole-4","test_cases":133,"model":"claude-3.5-sonnet","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":60.9,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model claude-3.5-sonnet","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":11.3,"total_cost":1.6635},{"dirname":"2024-08-15-14-38-35--json-no-lint-deepseek-coder-whole-4","test_cases":133,"model":"deepseek-coder V2 
0724","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":59.4,"percent_cases_well_formed":100.0,"error_outputs":2,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":2,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model deepseek-coder","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":24.5,"total_cost":0.0438},{"dirname":"2024-08-15-14-44-11--json-no-lint-gpt-4o-2024-08-06-whole-5","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":60.9,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":4.6,"total_cost":0.8023},{"dirname":"2024-08-15-14-45-40--json-no-lint-gpt-4o-2024-08-06-func-5","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"JSON","commit_hash":"bac04a2","pass_rate_1":57.1,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":3,"command":"aider --model gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":6.3,"total_cost":0.8354},{"dirname":"2024-08-15-14-49-44--json-no-lint-gpt-4o-2024-05-13-func-5","test_cases":133,"model":"gpt-4o-2024-05-13","edit_format":"JSON","commit_hash":"bac04a2","pass_rate_1":59.4,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":4,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model 
gpt-4o-2024-05-13","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":10.5,"total_cost":1.2099},{"dirname":"2024-08-15-14-51-18--json-no-lint-claude-3.5-sonnet-whole-5","test_cases":133,"model":"claude-3.5-sonnet","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":60.2,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model claude-3.5-sonnet","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":11.4,"total_cost":1.6685},{"dirname":"2024-08-15-14-54-41--json-no-lint-deepseek-coder-whole-5","test_cases":133,"model":"deepseek-coder V2 0724","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":61.7,"percent_cases_well_formed":100.0,"error_outputs":2,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":2,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model deepseek-coder","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":24.5,"total_cost":0.0439},{"dirname":"2024-08-15-15-12-55--json-no-lint-strict-gpt-4o-2024-08-06-func-2","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"JSON (strict)","commit_hash":"bf2d5fe","pass_rate_1":57.1,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":5.9,"total_cost":0.8216},{"dirname":"2024-08-15-15-14-31--json-no-lint-strict-gpt-4o-2024-08-06-func-3","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"JSON 
(strict)","commit_hash":"bf2d5fe","pass_rate_1":54.1,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":2,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":6.3,"total_cost":0.841},{"dirname":"2024-08-15-15-16-14--json-no-lint-strict-gpt-4o-2024-08-06-func-4","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"JSON (strict)","commit_hash":"bf2d5fe","pass_rate_1":59.4,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":5.9,"total_cost":0.8203},{"dirname":"2024-08-15-15-17-50--json-no-lint-strict-gpt-4o-2024-08-06-func-5","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"JSON (strict)","commit_hash":"bf2d5fe","pass_rate_1":57.1,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":1,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":6.1,"total_cost":0.8415},{"dirname":"2024-08-15-17-36-22--json-no-lint-again-gpt-4o-2024-05-13-whole-1","test_cases":133,"model":"gpt-4o-2024-05-13","edit_format":"Markdown","commit_hash":"ed94379","pass_rate_1":60.2,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":7,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model 
gpt-4o-2024-05-13","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":6.8,"total_cost":1.511},{"dirname":"2024-08-15-17-38-13--json-no-lint-again-gpt-4o-2024-05-13-whole-2","test_cases":133,"model":"gpt-4o-2024-05-13","edit_format":"Markdown","commit_hash":"ed94379","pass_rate_1":60.9,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-05-13","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":7.0,"total_cost":1.4954},{"dirname":"2024-08-15-17-40-10--json-no-lint-again-gpt-4o-2024-05-13-whole-3","test_cases":133,"model":"gpt-4o-2024-05-13","edit_format":"Markdown","commit_hash":"ed94379","pass_rate_1":60.9,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model gpt-4o-2024-05-13","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":6.8,"total_cost":1.4999},{"dirname":"2024-08-15-17-41-30--json-no-lint-again-gpt-4o-2024-05-13-whole-4","test_cases":133,"model":"gpt-4o-2024-05-13","edit_format":"Markdown","commit_hash":"ed94379","pass_rate_1":58.6,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model 
gpt-4o-2024-05-13","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":7.4,"total_cost":1.4848},{"dirname":"2024-08-15-17-43-12--json-no-lint-again-gpt-4o-2024-05-13-whole-5","test_cases":133,"model":"gpt-4o-2024-05-13","edit_format":"Markdown","commit_hash":"ed94379","pass_rate_1":59.4,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-05-13","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":7.6,"total_cost":1.4948},{"dirname":"2024-08-15-19-35-32--json-no-lint-again-deepseek-coder-func-1","test_cases":133,"model":"deepseek-coder V2 0724","edit_format":"JSON","commit_hash":"3a2ac02-dirty","pass_rate_1":50.4,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":2,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model deepseek-coder","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":17.8,"total_cost":0.033},{"dirname":"2024-08-15-19-37-50--json-no-lint-again-deepseek-coder-func-2","test_cases":133,"model":"deepseek-coder V2 0724","edit_format":"JSON","commit_hash":"1a98c28","pass_rate_1":49.6,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":5,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model deepseek-coder","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":18.3,"total_cost":0.0336},{"dirname":"2024-08-15-19-40-20--json-no-lint-again-deepseek-coder-func-3","test_cases":133,"model":"deepseek-coder V2 
0724","edit_format":"JSON","commit_hash":"1a98c28","pass_rate_1":48.9,"percent_cases_well_formed":100.0,"error_outputs":1,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":5,"indentation_errors":1,"exhausted_context_windows":1,"test_timeouts":2,"command":"aider --model deepseek-coder","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":18.4,"total_cost":0.0337},{"dirname":"2024-08-15-19-44-07--json-no-lint-again-deepseek-coder-func-4","test_cases":133,"model":"deepseek-coder V2 0724","edit_format":"JSON","commit_hash":"1a98c28","pass_rate_1":53.4,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":2,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":2,"command":"aider --model deepseek-coder","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":17.6,"total_cost":0.033},{"dirname":"2024-08-15-19-46-48--json-no-lint-again-deepseek-coder-func-5","test_cases":133,"model":"deepseek-coder V2 0724","edit_format":"JSON","commit_hash":"1a98c28-dirty","pass_rate_1":53.4,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":11,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":2,"command":"aider --model deepseek-coder","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":18.0,"total_cost":0.0332},{"dirname":"2024-08-15-20-07-59--json-no-lint-again-claude-3.5-sonnet-func-1","test_cases":133,"model":"claude-3.5-sonnet","edit_format":"JSON","commit_hash":"1a98c28","pass_rate_1":54.1,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model 
claude-3.5-sonnet","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":9.5,"total_cost":1.5789},{"dirname":"2024-08-15-20-09-39--json-no-lint-again-claude-3.5-sonnet-func-2","test_cases":133,"model":"claude-3.5-sonnet","edit_format":"JSON","commit_hash":"1a98c28","pass_rate_1":55.6,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model claude-3.5-sonnet","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":9.2,"total_cost":1.5916},{"dirname":"2024-08-15-20-11-39--json-no-lint-again-claude-3.5-sonnet-func-3","test_cases":133,"model":"claude-3.5-sonnet","edit_format":"JSON","commit_hash":"1a98c28","pass_rate_1":53.4,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model claude-3.5-sonnet","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":10.3,"total_cost":1.5896},{"dirname":"2024-08-15-20-13-44--json-no-lint-again-claude-3.5-sonnet-func-4","test_cases":133,"model":"claude-3.5-sonnet","edit_format":"JSON","commit_hash":"1a98c28","pass_rate_1":55.6,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model 
claude-3.5-sonnet","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":9.2,"total_cost":1.6},{"dirname":"2024-08-15-20-15-51--json-no-lint-again-claude-3.5-sonnet-func-5","test_cases":133,"model":"claude-3.5-sonnet","edit_format":"JSON","commit_hash":"1a98c28","pass_rate_1":51.9,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model claude-3.5-sonnet","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":8.9,"total_cost":1.5936}];
    
    var models = [...new Set(yamlData.map(item => item.model))].sort();
    var editFormats = [...new Set(yamlData.map(item => item.edit_format))];
    
    var datasets = editFormats.map(format => ({
        label: format,
        data: models.map(model => {
            var items = yamlData.filter(d => d.model === model && d.edit_format === format);
            if (items.length === 0) return null;
            var average = items.reduce((sum, item) => sum + item.pass_rate_1, 0) / items.length;
            return parseFloat(average.toFixed(1));
        }),
        backgroundColor: function(context) {
            const format = context.dataset.label;
            if (format === 'Markdown') {
                return 'rgba(54, 162, 235, 0.8)';
            } else if (format.startsWith('JSON')) {
                const ctx = context.chart.ctx;
                // createPattern returns a CanvasPattern: striped for "JSON (strict)", solid for plain "JSON"
                const pattern = ctx.createPattern(createStripedCanvas(format === 'JSON (strict)'), 'repeat');
                return pattern;
            } else {
                return 'rgba(75, 192, 192, 0.8)';
            }
        },
    }));

    var data = {
        labels: models,
        datasets: datasets
    };

    function getAspectRatio() {
        var width = chartContainer.offsetWidth;
        // Gradually change aspect ratio from 2 (landscape) to 1 (square)
        return Math.max(1, Math.min(2, width / 300));
    }

    var config = {
        type: 'bar',
        data: data,
        options: {
            responsive: true,
            maintainAspectRatio: true,
            aspectRatio: getAspectRatio(),
            scales: {
                x: {
                    title: {
                        display: true,
                        text: 'Model'
                    }
                },
                y: {
                    beginAtZero: true,
                    title: {
                        display: true,
                        text: 'Pass Rate (%, average of 5 runs)'
                    },
                    max: 70
                }
            },
            plugins: {
                title: {
                    display: true,
                    text: 'Coding skill by model and code wrapping strategy',
                    font: {
                        size: 16
                    }
                },
                legend: {
                    position: 'top',
                },
                tooltip: {
                    callbacks: {
                        label: function(context) {
                            let label = context.dataset.label || '';
                            if (label) {
                                label += ': ';
                            }
                            if (context.parsed.y !== null) {
                                label += context.parsed.y.toFixed(1) + '%';
                            }
                            return label;
                        }
                    }
                }
            }
        },
        plugins: [{
            afterDraw: function(chart) {
                var ctx = chart.ctx;
                var isWideScreen = window.innerWidth > 768; // Assuming 768px as the breakpoint for wide screens
                if (isWideScreen) {
                    chart.data.datasets.forEach(function(dataset, i) {
                        var meta = chart.getDatasetMeta(i);
                        meta.data.forEach(function(bar, index) {
                            var data = dataset.data[index];
                            if (data !== null) {
                                ctx.fillStyle = '#000000';
                                ctx.textAlign = 'center';
                                ctx.textBaseline = 'bottom';
                                var displayText = data.toFixed(1) + '%';
                                ctx.fillText(displayText, bar.x, bar.y - 5);
                            }
                        });
                    });
                }
            }
        }]
    };

    var chart = new Chart(ctx, config);

    function resizeChart() {
        chart.options.aspectRatio = getAspectRatio();
        chart.resize();
    }

    window.addEventListener('resize', resizeChart);

    // Initial resize to set correct size
    resizeChart();
});

function createStripedCanvas(isStrict) {
    const patternCanvas = document.createElement('canvas');
    const patternContext = patternCanvas.getContext('2d');
    const size = 10;
    patternCanvas.width = size;
    patternCanvas.height = size;

    patternContext.fillStyle = 'rgba(255, 99, 132, 0.8)';
    patternContext.fillRect(0, 0, size, size);

    if (isStrict) {
        patternContext.strokeStyle = 'rgba(255, 255, 255, 0.8)';
        patternContext.lineWidth = 0.75;
        patternContext.beginPath();
        patternContext.moveTo(0, 0);
        patternContext.lineTo(size, size);
        patternContext.stroke();
    }

    return patternCanvas;
}
</script>

<blockquote>
  <p>Figure 1: Aider coding benchmark scores of models using either plain markdown text or JSON to return code.
Pass rate (%) averaged over 5 runs.
Models produce better code when they return it as markdown text,
as compared to returning code in a structured JSON response.</p>
</blockquote>

<h2 id="background">Background</h2>

<p>People often ask why aider uses a plain text format for LLMs to specify code edits (below),
rather than relying on LLM tools and structured JSON responses.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">greeting</span><span class="p">.</span><span class="n">py</span>
<span class="o">&lt;&lt;&lt;&lt;&lt;&lt;&lt;</span> <span class="n">SEARCH</span>
<span class="k">def</span> <span class="nf">greeting</span><span class="p">():</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"Hello"</span><span class="p">)</span>
<span class="o">=======</span>
<span class="k">def</span> <span class="nf">greeting</span><span class="p">():</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"Goodbye"</span><span class="p">)</span>
<span class="o">&gt;&gt;&gt;&gt;&gt;&gt;&gt;</span> <span class="n">REPLACE</span>
</code></pre></div></div>

<p>People expect that it would be easier and more reliable to use tool calls,
which would involve a structured JSON response more like this:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
    </span><span class="nl">"filename"</span><span class="p">:</span><span class="w"> </span><span class="s2">"greeting.py"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"search"</span><span class="p">:</span><span class="w"> </span><span class="s2">"def greeting():</span><span class="se">\n</span><span class="s2">    print(</span><span class="se">\"</span><span class="s2">Hello</span><span class="se">\"</span><span class="s2">)</span><span class="se">\n</span><span class="s2">"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"replace"</span><span class="p">:</span><span class="w"> </span><span class="s2">"def greeting():</span><span class="se">\n</span><span class="s2">    print(</span><span class="se">\"</span><span class="s2">Goodbye</span><span class="se">\"</span><span class="s2">)</span><span class="se">\n</span><span class="s2">"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>This question becomes increasingly relevant as LLM providers
continue to improve their tooling for reliably generating JSON.
For example, 
<a href="https://openai.com/index/introducing-structured-outputs-in-the-api/">OpenAI recently announced</a>
the ability to
strictly enforce that JSON responses will be syntactically correct 
and conform to a specified schema.</p>

<p>But just producing valid JSON is not sufficient for AI code generation –
the code inside the JSON matters too.
It has to be high quality code that solves the assigned coding task without errors or bugs.
Unfortunately, 
LLMs write worse code when they’re asked to 
wrap it in JSON.</p>

<p>In some sense this shouldn’t be surprising.
Just look at the very simple
JSON example above, with the escaped 
quotes <code class="language-plaintext highlighter-rouge">\"</code> and
newlines <code class="language-plaintext highlighter-rouge">\n</code>
mixed into the code.
Imagine the additional
complexity
if the code itself contained quoted strings
with their
own escape sequences.</p>

<p>Would <em>you</em> write better code by
typing it out normally
or typing it as a properly escaped 
JSON string?</p>
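<p>To make the escaping burden concrete, here is a minimal sketch using Python’s standard <code class="language-plaintext highlighter-rouge">json</code> module.
The <code class="language-plaintext highlighter-rouge">source</code> snippet is a hypothetical example, not taken from the benchmark.
It shows how a short function whose string literal has its own escape sequences looks once it is wrapped in JSON.</p>

```python
import json

# Hypothetical two-line function whose string literal carries its own
# escape sequences (escaped double quotes and an escaped newline).
source = 'def greeting():\n    print("She said \\"Hello\\"\\n")\n'

# Wrap the code the way a structured JSON response would carry it.
wrapped = json.dumps({"content": source})

print(source)   # the code as a person would type it
print(wrapped)  # the same code as a JSON string

# Every backslash in the code is doubled, and every double quote and
# line break gains a backslash of its own.
print(source.count("\\"), "backslashes before wrapping")
print(wrapped.count("\\"), "backslashes after wrapping")

# Decoding must give back exactly the original source.
assert json.loads(wrapped)["content"] == source
```

<p>The model has to produce the second form token by token while still reasoning about the first.</p>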

<h2 id="quantifying-the-benefits-of-plain-text">Quantifying the benefits of plain text</h2>

<p>Previous <a href="/2023/07/02/benchmarks.html">aider benchmark results</a>
showed
the superiority of returning code
as plain text compared to JSON-wrapped function calls.
Those results were obtained
over a year ago, against models far less capable than those available today.
OpenAI’s newly announced support for “strict” JSON
suggests the possibility that modern models might be able
to return quality code inside a structured JSON response.</p>

<p>The results presented here are based on
the 
<a href="/2023/07/02/benchmarks.html#the-benchmark">aider “code editing” benchmark</a>
of 133 practice exercises from the Exercism Python repository.
The benchmark was simplified somewhat to focus on the differences between
plain text and JSON responses.
In particular, models were 
restricted to a single attempt to solve each task
without a second try to fix errors.</p>

<p>The performance of each model was compared across different strategies for returning code:</p>

<ul>
  <li><strong>Markdown</strong> – the model returned the whole source code file in standard markdown triple-backtick fences.</li>
  <li><strong>JSON</strong> – the model used a tool function call to return the whole source code file in a structured JSON response.</li>
  <li><strong>JSON (strict)</strong> – the same as the “JSON” strategy, but with <code class="language-plaintext highlighter-rouge">strict=True</code>. Only gpt-4o-2024-08-06 supported this setting.</li>
</ul>

<p>The markdown strategy was the same as
aider’s “whole” edit format, where the
LLM returns an entire updated copy of the source file like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Here is the program you asked for which prints "Hello":

greeting.py
```
def greeting():
    print("Hello")
```
</code></pre></div></div>

<p>Both JSON strategies required the LLM to call the <code class="language-plaintext highlighter-rouge">write_file</code> function with
an explanation/plan and
the entire updated copy of the source file.
The LLM didn’t have to specify the filename,
since the benchmark operates on one source file at a time.</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
    </span><span class="nl">"explanation"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Here is the program you asked for which prints </span><span class="se">\"</span><span class="s2">Hello</span><span class="se">\"</span><span class="s2">"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"def greeting():</span><span class="se">\n</span><span class="s2">    print(</span><span class="se">\"</span><span class="s2">Hello</span><span class="se">\"</span><span class="s2">)</span><span class="se">\n</span><span class="s2">"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>This experimental setup was designed to quantify
the effects of JSON-wrapping on the LLM’s ability to write code to solve a task.</p>
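<p>As a rough illustration, the sketch below defines a <code class="language-plaintext highlighter-rouge">write_file</code> tool in the style of OpenAI function calling and decodes one call.
The field names <code class="language-plaintext highlighter-rouge">explanation</code> and <code class="language-plaintext highlighter-rouge">content</code> come from the example above.
The descriptions, the schema layout and the <code class="language-plaintext highlighter-rouge">extract_source</code> helper are assumptions made for illustration, not the definitions used in the benchmark.</p>

```python
import json

# Hypothetical tool definition in the style of OpenAI function calling.
# Only the names "write_file", "explanation" and "content" come from the
# article; everything else here is an illustrative assumption.
write_file_tool = {
    "type": "function",
    "function": {
        "name": "write_file",
        "description": "Write the entire updated source file.",
        "strict": True,  # would be set only for the "JSON (strict)" strategy
        "parameters": {
            "type": "object",
            "properties": {
                "explanation": {"type": "string"},
                "content": {"type": "string"},
            },
            "required": ["explanation", "content"],
            "additionalProperties": False,
        },
    },
}


def extract_source(arguments_json: str) -> str:
    """Decode the tool-call arguments and return the file content."""
    arguments = json.loads(arguments_json)
    required = write_file_tool["function"]["parameters"]["required"]
    missing = [key for key in required if key not in arguments]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return arguments["content"]


# Stand-in for the arguments string a model would return.
arguments_json = json.dumps({
    "explanation": 'Here is the program you asked for which prints "Hello"',
    "content": 'def greeting():\n    print("Hello")\n',
})

source = extract_source(arguments_json)
print(source)
```

<p>Note that schema enforcement only constrains the shape of the JSON.
Nothing in it checks that the decoded <code class="language-plaintext highlighter-rouge">content</code> string is valid code.</p>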

<h2 id="results">Results</h2>

<p>Four of the strongest code editing models were benchmarked
to assess the impact of JSON-wrapping code:</p>

<ul>
  <li>claude-3-5-sonnet-20240620</li>
  <li>deepseek-coder (V2 0724)</li>
  <li>gpt-4o-2024-05-13</li>
  <li>gpt-4o-2024-08-06</li>
</ul>

<p>Each combination of model and code wrapping strategy was benchmarked 5 times
on all 133 problems.</p>

<h3 id="overall-coding-skill">Overall coding skill</h3>

<p>As shown in Figure 1, 
all of the models did worse on the benchmark when asked to
return code in a structured JSON response.
Most did significantly worse, performing well below
their result with the markdown strategy.</p>

<p>Some noteworthy observations:</p>

<ul>
  <li>OpenAI’s gpt-4o-2024-05-13 was the only model where the markdown and JSON results were
close. Using JSON only dropped the score by 0.4 percentage points, a difference which is
within the margin of error for 5 trials.</li>
  <li>The use of OpenAI’s new strict mode offered no improvement
as compared to non-strict JSON.
Both JSON results were well below the markdown result.</li>
  <li>The results from Sonnet and DeepSeek Coder suffered the worst harm from JSON wrapping.</li>
</ul>
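<p>The small gap for gpt-4o-2024-05-13 can be checked by hand.
The sketch below averages the per-run <code class="language-plaintext highlighter-rouge">pass_rate_1</code> values for that model, copied from the run data embedded in this post’s chart scripts.
It assumes, as the chart code does, that each bar is a plain mean of the runs rounded to one decimal place.</p>

```python
from statistics import mean

# pass_rate_1 values for gpt-4o-2024-05-13, copied from the run data
# embedded in this post's chart scripts (5 runs per strategy).
pass_rates = {
    "Markdown": [60.2, 60.9, 60.9, 58.6, 59.4],
    "JSON": [60.2, 60.2, 58.6, 59.4, 59.4],
}

# Mean pass rate per strategy, rounded to one decimal like the chart.
averages = {fmt: round(mean(runs), 1) for fmt, runs in pass_rates.items()}

# Difference between the two strategies, in percentage points.
gap = round(averages["Markdown"] - averages["JSON"], 1)

print(averages)
print(gap)
```

<p>The individual runs for each strategy span a range of about two points, which is why a gap this small is hard to distinguish from noise.</p>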

<h3 id="syntax-errors">Syntax errors</h3>

<p>Models tend to make more syntax errors <em>in the code they write</em>
when asked to wrap it in JSON.
The models can reliably
produce valid JSON, but the code inside is more prone to syntax errors.</p>

<p>Figure 2 shows the number of syntax errors found in the code produced by each
model and code wrapping strategy.
It totals up the <code class="language-plaintext highlighter-rouge">SyntaxError</code> and <code class="language-plaintext highlighter-rouge">IndentationError</code> errors from all 5 runs,
for each model and strategy combination.</p>
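<p>The totalling described above can be sketched as follows.
The counts are the <code class="language-plaintext highlighter-rouge">syntax_errors</code> and <code class="language-plaintext highlighter-rouge">indentation_errors</code> fields for the DeepSeek Coder runs, copied from the run data embedded in this post’s chart scripts.
The grouping code itself is an illustrative assumption, not the script used to draw the figure.</p>

```python
from collections import Counter

# (edit_format, syntax_errors, indentation_errors) for the five
# "deepseek-coder V2 0724" runs per strategy, copied from the run data
# embedded in this post's chart scripts.
runs = [
    ("Markdown", 0, 0),
    ("Markdown", 0, 0),
    ("Markdown", 0, 0),
    ("Markdown", 0, 0),
    ("Markdown", 0, 0),
    ("JSON", 2, 0),
    ("JSON", 5, 0),
    ("JSON", 5, 1),
    ("JSON", 2, 0),
    ("JSON", 11, 0),
]

# Sum both error types across all runs of each strategy.
totals = Counter()
for edit_format, syntax_errors, indentation_errors in runs:
    totals[edit_format] += syntax_errors + indentation_errors

print(dict(totals))
```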

<p>Below is an example of a <code class="language-plaintext highlighter-rouge">SyntaxError</code> created by gpt-4o-2024-05-13 using the
JSON code wrapping strategy.
It appears that the model got confused about escaping and quoting while trying
to format the JSON response.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Traceback</span> <span class="p">(</span><span class="n">most</span> <span class="n">recent</span> <span class="n">call</span> <span class="n">last</span><span class="p">):</span>
  <span class="p">...</span>   
  <span class="n">File</span> <span class="s">"bottle-song/bottle_song.py"</span><span class="p">,</span> <span class="n">line</span> <span class="mi">9</span>
    <span class="n">lyrics</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">'There'</span><span class="n">ll</span> <span class="n">be</span> <span class="p">{</span><span class="n">i</span> <span class="o">-</span> <span class="mi">1</span><span class="p">}</span> <span class="n">green</span> <span class="n">bottles</span> <span class="n">hanging</span> <span class="n">on</span> <span class="n">the</span> <span class="n">wall</span><span class="p">.</span><span class="s">')
                                                                          ^
SyntaxError: unterminated string literal (detected at line 9)
</span></code></pre></div></div>

<p>The problematic line of code contains a single-quoted string which also
contains a single-quote character.
It should have been output as the following chunk of JSON, with
a double backslash in <code class="language-plaintext highlighter-rouge">There\\'ll</code>.
That is needed to JSON-escape the <code class="language-plaintext highlighter-rouge">\</code> so that it survives
JSON-decoding to 
produce <code class="language-plaintext highlighter-rouge">There\'ll</code> in the resulting code.
That would correctly escape the single-quote inside the single-quoted string.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>...lyrics.append(f'There\\'ll be {i - 1} green bottles hanging on the wall.')\n...
</code></pre></div></div>
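<p>The difference between the two encodings can be demonstrated with a short sketch.
Both JSON payloads below are hypothetical reconstructions of the single line in question, not the model’s actual output, and <code class="language-plaintext highlighter-rouge">compiles</code> is a helper introduced here for illustration.
Each payload is decoded with Python’s <code class="language-plaintext highlighter-rouge">json</code> module and the resulting code is checked with the built-in <code class="language-plaintext highlighter-rouge">compile</code>.</p>

```python
import json

# Two hypothetical JSON payloads carrying the same line of code.
# In the first, the apostrophe in "There'll" has no backslash before it.
# In the second, a JSON-escaped backslash (\\) precedes the apostrophe.
broken_json = r"""{"content": "lyrics.append(f'There'll be {i - 1} green bottles hanging on the wall.')\n"}"""
fixed_json = r"""{"content": "lyrics.append(f'There\\'ll be {i - 1} green bottles hanging on the wall.')\n"}"""


def compiles(source: str) -> bool:
    """Return True if the source parses as Python, without running it."""
    try:
        compile(source, "decoded_snippet", "exec")
        return True
    except SyntaxError:
        return False


# Both payloads are valid JSON and decode without error.
broken_code = json.loads(broken_json)["content"]
fixed_code = json.loads(fixed_json)["content"]

print(broken_code)
print(compiles(broken_code))

print(fixed_code)
print(compiles(fixed_code))
```

<p>Both payloads pass JSON validation, so enforcing valid JSON alone would not have caught the error.</p>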

<style>
    .chart-container {
        position: relative;
        width: 100%;
        max-width: 800px;
        margin: 0 auto;
    }
</style>

<div class="chart-container">
    <canvas id="syntaxErrorsChart"></canvas>
</div>

<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>

<script>
document.addEventListener('DOMContentLoaded', function () {
    var ctx = document.getElementById('syntaxErrorsChart').getContext('2d');
    var chartContainer = document.querySelector('.chart-container');
    
    var yamlData = [{"dirname":"2024-08-15-13-17-11--json-no-lint-gpt-4o-2024-08-06-whole","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":60.2,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":4.3,"total_cost":0.7965},{"dirname":"2024-08-15-13-18-36--json-no-lint-gpt-4o-2024-08-06-func","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"JSON","commit_hash":"bac04a2","pass_rate_1":57.9,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":1,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":5.7,"total_cost":0.8417},{"dirname":"2024-08-15-13-21-55--json-no-lint-gpt-4o-2024-05-13-func","test_cases":133,"model":"gpt-4o-2024-05-13","edit_format":"JSON","commit_hash":"bac04a2","pass_rate_1":60.2,"percent_cases_well_formed":100.0,"error_outputs":2,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":1,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model 
gpt-4o-2024-05-13","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":7.1,"total_cost":1.2285},{"dirname":"2024-08-15-13-23-33--json-no-lint-claude-3.5-sonnet-whole","test_cases":133,"model":"claude-3.5-sonnet","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":60.2,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model claude-3.5-sonnet","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":10.5,"total_cost":1.6714},{"dirname":"2024-08-15-13-26-38--json-no-lint-deepseek-coder-whole","test_cases":133,"model":"deepseek-coder V2 0724","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":59.4,"percent_cases_well_formed":100.0,"error_outputs":2,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":2,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model deepseek-coder","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":27.9,"total_cost":0.0438},{"dirname":"2024-08-15-13-50-03--json-no-lint-gpt-4o-2024-08-06-whole-2","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":61.7,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model 
gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":4.2,"total_cost":0.7946},{"dirname":"2024-08-15-13-51-36--json-no-lint-gpt-4o-2024-08-06-func-2","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"JSON","commit_hash":"bac04a2","pass_rate_1":56.4,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":1,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":6.4,"total_cost":0.839},{"dirname":"2024-08-15-13-54-53--json-no-lint-gpt-4o-2024-05-13-func-2","test_cases":133,"model":"gpt-4o-2024-05-13","edit_format":"JSON","commit_hash":"bac04a2","pass_rate_1":60.2,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":1,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model gpt-4o-2024-05-13","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":7.7,"total_cost":1.221},{"dirname":"2024-08-15-13-56-21--json-no-lint-claude-3.5-sonnet-whole-2","test_cases":133,"model":"claude-3.5-sonnet","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":60.9,"percent_cases_well_formed":100.0,"error_outputs":1,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model claude-3.5-sonnet","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":16.5,"total_cost":1.6556},{"dirname":"2024-08-15-14-06-12--json-no-lint-deepseek-coder-whole-2","test_cases":133,"model":"deepseek-coder V2 
0724","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":60.9,"percent_cases_well_formed":100.0,"error_outputs":2,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":1,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model deepseek-coder","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":25.8,"total_cost":0.0439},{"dirname":"2024-08-15-14-11-45--json-no-lint-gpt-4o-2024-08-06-whole-3","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":60.9,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":4.3,"total_cost":0.7945},{"dirname":"2024-08-15-14-13-11--json-no-lint-gpt-4o-2024-08-06-func-3","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"JSON","commit_hash":"bac04a2","pass_rate_1":56.4,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":1,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":5.6,"total_cost":0.822},{"dirname":"2024-08-15-14-16-34--json-no-lint-gpt-4o-2024-05-13-func-3","test_cases":133,"model":"gpt-4o-2024-05-13","edit_format":"JSON","commit_hash":"bac04a2","pass_rate_1":58.6,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model 
gpt-4o-2024-05-13","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":8.7,"total_cost":1.2064},{"dirname":"2024-08-15-14-17-51--json-no-lint-claude-3.5-sonnet-whole-3","test_cases":133,"model":"claude-3.5-sonnet","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":60.2,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model claude-3.5-sonnet","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":11.0,"total_cost":1.6555},{"dirname":"2024-08-15-14-21-06--json-no-lint-deepseek-coder-whole-3","test_cases":133,"model":"deepseek-coder V2 0724","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":61.7,"percent_cases_well_formed":100.0,"error_outputs":3,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":2,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":3,"command":"aider --model deepseek-coder","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":24.4,"total_cost":0.0439},{"dirname":"2024-08-15-14-27-17--json-no-lint-gpt-4o-2024-08-06-whole-4","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":60.2,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model 
gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":4.3,"total_cost":0.8015},{"dirname":"2024-08-15-14-28-58--json-no-lint-gpt-4o-2024-08-06-func-4","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"JSON","commit_hash":"bac04a2","pass_rate_1":60.2,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":6.0,"total_cost":0.8394},{"dirname":"2024-08-15-14-32-58--json-no-lint-gpt-4o-2024-05-13-func-4","test_cases":133,"model":"gpt-4o-2024-05-13","edit_format":"JSON","commit_hash":"bac04a2","pass_rate_1":59.4,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":2,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model gpt-4o-2024-05-13","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":11.1,"total_cost":1.212},{"dirname":"2024-08-15-14-34-39--json-no-lint-claude-3.5-sonnet-whole-4","test_cases":133,"model":"claude-3.5-sonnet","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":60.9,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model claude-3.5-sonnet","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":11.3,"total_cost":1.6635},{"dirname":"2024-08-15-14-38-35--json-no-lint-deepseek-coder-whole-4","test_cases":133,"model":"deepseek-coder V2 
0724","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":59.4,"percent_cases_well_formed":100.0,"error_outputs":2,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":2,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model deepseek-coder","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":24.5,"total_cost":0.0438},{"dirname":"2024-08-15-14-44-11--json-no-lint-gpt-4o-2024-08-06-whole-5","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":60.9,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":4.6,"total_cost":0.8023},{"dirname":"2024-08-15-14-45-40--json-no-lint-gpt-4o-2024-08-06-func-5","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"JSON","commit_hash":"bac04a2","pass_rate_1":57.1,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":3,"command":"aider --model gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":6.3,"total_cost":0.8354},{"dirname":"2024-08-15-14-49-44--json-no-lint-gpt-4o-2024-05-13-func-5","test_cases":133,"model":"gpt-4o-2024-05-13","edit_format":"JSON","commit_hash":"bac04a2","pass_rate_1":59.4,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":4,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model 
gpt-4o-2024-05-13","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":10.5,"total_cost":1.2099},{"dirname":"2024-08-15-14-51-18--json-no-lint-claude-3.5-sonnet-whole-5","test_cases":133,"model":"claude-3.5-sonnet","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":60.2,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model claude-3.5-sonnet","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":11.4,"total_cost":1.6685},{"dirname":"2024-08-15-14-54-41--json-no-lint-deepseek-coder-whole-5","test_cases":133,"model":"deepseek-coder V2 0724","edit_format":"Markdown","commit_hash":"bac04a2","pass_rate_1":61.7,"percent_cases_well_formed":100.0,"error_outputs":2,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":2,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model deepseek-coder","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":24.5,"total_cost":0.0439},{"dirname":"2024-08-15-15-12-55--json-no-lint-strict-gpt-4o-2024-08-06-func-2","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"JSON (strict)","commit_hash":"bf2d5fe","pass_rate_1":57.1,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":5.9,"total_cost":0.8216},{"dirname":"2024-08-15-15-14-31--json-no-lint-strict-gpt-4o-2024-08-06-func-3","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"JSON 
(strict)","commit_hash":"bf2d5fe","pass_rate_1":54.1,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":2,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":6.3,"total_cost":0.841},{"dirname":"2024-08-15-15-16-14--json-no-lint-strict-gpt-4o-2024-08-06-func-4","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"JSON (strict)","commit_hash":"bf2d5fe","pass_rate_1":59.4,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":5.9,"total_cost":0.8203},{"dirname":"2024-08-15-15-17-50--json-no-lint-strict-gpt-4o-2024-08-06-func-5","test_cases":133,"model":"gpt-4o-2024-08-06","edit_format":"JSON (strict)","commit_hash":"bf2d5fe","pass_rate_1":57.1,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":1,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-08-06","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":6.1,"total_cost":0.8415},{"dirname":"2024-08-15-17-36-22--json-no-lint-again-gpt-4o-2024-05-13-whole-1","test_cases":133,"model":"gpt-4o-2024-05-13","edit_format":"Markdown","commit_hash":"ed94379","pass_rate_1":60.2,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":7,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model 
gpt-4o-2024-05-13","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":6.8,"total_cost":1.511},{"dirname":"2024-08-15-17-38-13--json-no-lint-again-gpt-4o-2024-05-13-whole-2","test_cases":133,"model":"gpt-4o-2024-05-13","edit_format":"Markdown","commit_hash":"ed94379","pass_rate_1":60.9,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-05-13","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":7.0,"total_cost":1.4954},{"dirname":"2024-08-15-17-40-10--json-no-lint-again-gpt-4o-2024-05-13-whole-3","test_cases":133,"model":"gpt-4o-2024-05-13","edit_format":"Markdown","commit_hash":"ed94379","pass_rate_1":60.9,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":0,"command":"aider --model gpt-4o-2024-05-13","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":6.8,"total_cost":1.4999},{"dirname":"2024-08-15-17-41-30--json-no-lint-again-gpt-4o-2024-05-13-whole-4","test_cases":133,"model":"gpt-4o-2024-05-13","edit_format":"Markdown","commit_hash":"ed94379","pass_rate_1":58.6,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model 
gpt-4o-2024-05-13","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":7.4,"total_cost":1.4848},{"dirname":"2024-08-15-17-43-12--json-no-lint-again-gpt-4o-2024-05-13-whole-5","test_cases":133,"model":"gpt-4o-2024-05-13","edit_format":"Markdown","commit_hash":"ed94379","pass_rate_1":59.4,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model gpt-4o-2024-05-13","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":7.6,"total_cost":1.4948},{"dirname":"2024-08-15-19-35-32--json-no-lint-again-deepseek-coder-func-1","test_cases":133,"model":"deepseek-coder V2 0724","edit_format":"JSON","commit_hash":"3a2ac02-dirty","pass_rate_1":50.4,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":2,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model deepseek-coder","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":17.8,"total_cost":0.033},{"dirname":"2024-08-15-19-37-50--json-no-lint-again-deepseek-coder-func-2","test_cases":133,"model":"deepseek-coder V2 0724","edit_format":"JSON","commit_hash":"1a98c28","pass_rate_1":49.6,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":5,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model deepseek-coder","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":18.3,"total_cost":0.0336},{"dirname":"2024-08-15-19-40-20--json-no-lint-again-deepseek-coder-func-3","test_cases":133,"model":"deepseek-coder V2 
0724","edit_format":"JSON","commit_hash":"1a98c28","pass_rate_1":48.9,"percent_cases_well_formed":100.0,"error_outputs":1,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":5,"indentation_errors":1,"exhausted_context_windows":1,"test_timeouts":2,"command":"aider --model deepseek-coder","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":18.4,"total_cost":0.0337},{"dirname":"2024-08-15-19-44-07--json-no-lint-again-deepseek-coder-func-4","test_cases":133,"model":"deepseek-coder V2 0724","edit_format":"JSON","commit_hash":"1a98c28","pass_rate_1":53.4,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":2,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":2,"command":"aider --model deepseek-coder","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":17.6,"total_cost":0.033},{"dirname":"2024-08-15-19-46-48--json-no-lint-again-deepseek-coder-func-5","test_cases":133,"model":"deepseek-coder V2 0724","edit_format":"JSON","commit_hash":"1a98c28-dirty","pass_rate_1":53.4,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":11,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":2,"command":"aider --model deepseek-coder","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":18.0,"total_cost":0.0332},{"dirname":"2024-08-15-20-07-59--json-no-lint-again-claude-3.5-sonnet-func-1","test_cases":133,"model":"claude-3.5-sonnet","edit_format":"JSON","commit_hash":"1a98c28","pass_rate_1":54.1,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model 
claude-3.5-sonnet","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":9.5,"total_cost":1.5789},{"dirname":"2024-08-15-20-09-39--json-no-lint-again-claude-3.5-sonnet-func-2","test_cases":133,"model":"claude-3.5-sonnet","edit_format":"JSON","commit_hash":"1a98c28","pass_rate_1":55.6,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model claude-3.5-sonnet","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":9.2,"total_cost":1.5916},{"dirname":"2024-08-15-20-11-39--json-no-lint-again-claude-3.5-sonnet-func-3","test_cases":133,"model":"claude-3.5-sonnet","edit_format":"JSON","commit_hash":"1a98c28","pass_rate_1":53.4,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model claude-3.5-sonnet","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":10.3,"total_cost":1.5896},{"dirname":"2024-08-15-20-13-44--json-no-lint-again-claude-3.5-sonnet-func-4","test_cases":133,"model":"claude-3.5-sonnet","edit_format":"JSON","commit_hash":"1a98c28","pass_rate_1":55.6,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model 
claude-3.5-sonnet","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":9.2,"total_cost":1.6},{"dirname":"2024-08-15-20-15-51--json-no-lint-again-claude-3.5-sonnet-func-5","test_cases":133,"model":"claude-3.5-sonnet","edit_format":"JSON","commit_hash":"1a98c28","pass_rate_1":51.9,"percent_cases_well_formed":100.0,"error_outputs":0,"num_malformed_responses":0,"num_with_malformed_responses":0,"user_asks":0,"lazy_comments":0,"syntax_errors":0,"indentation_errors":0,"exhausted_context_windows":0,"test_timeouts":1,"command":"aider --model claude-3.5-sonnet","date":"2024-08-15","versions":"0.50.2-dev","seconds_per_case":8.9,"total_cost":1.5936}];
    
    var models = [...new Set(yamlData.map(item => item.model))].sort();
    var editFormats = [...new Set(yamlData.map(item => item.edit_format))];
    
    var datasets = editFormats.map(format => ({
        label: format,
        data: models.map(model => {
            var items = yamlData.filter(d => d.model === model && d.edit_format === format);
            if (items.length === 0) return null;
            var totalErrors = items.reduce((sum, item) => sum + item.syntax_errors + item.indentation_errors, 0);
            return totalErrors;
        }),
        backgroundColor: function(context) {
            const format = context.dataset.label;
            if (format === 'Markdown') {
                return 'rgba(54, 162, 235, 0.8)';
            } else if (format.startsWith('JSON')) {
                const ctx = context.chart.ctx;
                const gradient = ctx.createPattern(createStripedCanvas(format === 'JSON (strict)'), 'repeat');
                return gradient;
            } else {
                return 'rgba(75, 192, 192, 0.8)';
            }
        },
    }));

    var data = {
        labels: models,
        datasets: datasets
    };

    function getAspectRatio() {
        var width = chartContainer.offsetWidth;
        // Gradually change aspect ratio from 2 (landscape) to 1 (square)
        return Math.max(1, Math.min(2, width / 300));
    }

    var config = {
        type: 'bar',
        data: data,
        options: {
            responsive: true,
            maintainAspectRatio: true,
            aspectRatio: getAspectRatio(),
            scales: {
                x: {
                    title: {
                        display: true,
                        text: 'Model'
                    }
                },
                y: {
                    beginAtZero: true,
                    title: {
                        display: true,
                        text: 'Total syntax errors from 5 runs'
                    },
                    max: 35
                }
            },
            plugins: {
                title: {
                    display: true,
                    text: 'Syntax errors by model and code wrapping strategy',
                    font: {
                        size: 16
                    }
                },
                legend: {
                    position: 'top',
                },
                tooltip: {
                    callbacks: {
                        label: function(context) {
                            let label = context.dataset.label || '';
                            if (label) {
                                label += ': ';
                            }
                            if (context.parsed.y !== null) {
                                label += context.parsed.y;
                            }
                            return label;
                        }
                    }
                }
            }
        },
        plugins: [{
            afterDraw: function(chart) {
                var ctx = chart.ctx;
                chart.data.datasets.forEach(function(dataset, i) {
                    var meta = chart.getDatasetMeta(i);
                    meta.data.forEach(function(bar, index) {
                        var data = dataset.data[index];
                        if (data !== null) {
                            ctx.fillStyle = '#000000';
                            ctx.textAlign = 'center';
                            ctx.textBaseline = 'bottom';
                            ctx.fillText(data, bar.x, bar.y - 5);
                        }
                    });
                });
            }
        }]
    };

    var chart = new Chart(ctx, config);

    function resizeChart() {
        chart.options.aspectRatio = getAspectRatio();
        chart.resize();
    }

    window.addEventListener('resize', resizeChart);

    // Initial resize to set correct size
    resizeChart();
});
</script>

<blockquote>
  <p>Figure 2: Number of <code class="language-plaintext highlighter-rouge">SyntaxError</code> and <code class="language-plaintext highlighter-rouge">IndentationError</code> errors found in model-generated code,
totaled from 5 runs.
Models tend to make more syntax and formatting errors when asked to wrap code in JSON.</p>
</blockquote>

<h3 id="beyond-syntax-errors">Beyond syntax errors</h3>

<p>Sonnet’s results seem to indicate that the negative effects of JSON-wrapping
go beyond just syntactic difficulties.
Sonnet avoided syntax errors regardless of the code wrapping strategy,
but its benchmark scores in Figure 1 were nonetheless lower with JSON.
This implies that JSON-wrapping may distract or challenge models in a way that
reduces their ability to reason about solving coding problems.</p>
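
<p>As a rough illustration of the extra work JSON-wrapping may impose, the sketch below
serializes a tiny code snippet into a JSON string.
The snippet itself and the <code class="language-plaintext highlighter-rouge">content</code> field name are assumptions made for this example,
not aider’s actual tool schema.
It only shows that every newline and double quote has to be escaped correctly
before the code can be recovered.</p>

```javascript
// Illustrative only: a small Python snippet a model might want to return.
const code = [
  'def greet(name):',
  '    print("Hello, " + name)',
].join('\n');

// Plain text (Markdown-style): the code is emitted exactly as written.
const plain = code;

// JSON-wrapped: the same code must be serialized as a JSON string,
// so each newline becomes \n and each double quote becomes \".
// The "content" field name is a hypothetical stand-in for a tool-call argument.
const wrapped = JSON.stringify({ content: code });

console.log(plain);
console.log(wrapped);

// Round-tripping recovers the original code only if every escape is correct.
const recovered = JSON.parse(wrapped).content;
console.log(recovered === code);
```

<p>A single missed escape in the wrapped form makes the whole response unparseable,
whereas the plain text form has no such failure mode.</p>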

<h2 id="conclusions">Conclusions</h2>

<p>While the specific results differ from the similar
<a href="/2023/07/02/benchmarks.html">July 2023 experiments</a>,
the conclusion remains unchanged: LLMs are bad at returning code in
structured JSON responses.</p>

<p>OpenAI appears to be making progress in allowing LLMs to
return JSON-wrapped code
without harming the code quality.
But it seems premature to consider switching from plain text
to JSON-wrapped code at this time.</p>

<hr />

<h4 id="notes-on-the-aider-leaderboard">Notes on the aider leaderboard</h4>

<p><em>The results presented here are not directly comparable to results
from the main
<a href="https://aider.chat/docs/leaderboards/">aider LLM leaderboard</a>.
A number of settings were changed to simplify the benchmark
in order to focus on comparing plain text and JSON-wrapped code.</em></p>]]></content><author><name></name></author><summary type="html"><![CDATA[LLMs write worse code if you ask them to return the code wrapped in JSON via a tool function call.]]></summary></entry><entry><title type="html">Coding with Llama 3.1, new DeepSeek Coder &amp;amp; Mistral Large</title><link href="https://aider.chat/2024/07/25/new-models.html" rel="alternate" type="text/html" title="Coding with Llama 3.1, new DeepSeek Coder &amp;amp; Mistral Large" /><published>2024-07-25T00:00:00+00:00</published><updated>2024-07-25T00:00:00+00:00</updated><id>https://aider.chat/2024/07/25/new-models</id><content type="html" xml:base="https://aider.chat/2024/07/25/new-models.html"><![CDATA[<p class="post-date">July 25, 2024</p>

<h1 id="coding-with-llama-31-new-deepseek-coder--mistral-large">Coding with Llama 3.1, new DeepSeek Coder &amp; Mistral Large</h1>

<p><img src="/assets/2024-07-new-models.jpg" alt="Summary of code editing skill for the new models, with Sonnet and GPT-3.5 for scale." /></p>

<p>Five noteworthy models have been released in the last few days,
with a wide range of code editing capabilities.
Here are their results from
<a href="https://aider.chat/docs/leaderboards/">aider’s code editing leaderboard</a>
with Claude 3.5 Sonnet and the best GPT-3.5 model
included for scale.</p>

<ul>
  <li><strong>77% claude-3.5-sonnet</strong></li>
  <li>73% DeepSeek Coder V2 0724</li>
  <li>66% llama-3.1-405b-instruct</li>
  <li>60% Mistral Large 2 (2407)</li>
  <li>59% llama-3.1-70b-instruct</li>
  <li><strong>58% gpt-3.5-turbo-0301</strong></li>
  <li>38% llama-3.1-8b-instruct</li>
</ul>

<p>You can code with all of these models using aider like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python -m pip install -U aider-chat

# Change directory into a git repo to work on
$ cd /to/your/git/repo

$ export DEEPSEEK_API_KEY=your-key-goes-here
$ aider --model deepseek/deepseek-coder

$ export MISTRAL_API_KEY=your-key-goes-here
$ aider --model mistral/mistral-large-2407

$ export OPENROUTER_API_KEY=your-key-goes-here
$ aider --model openrouter/meta-llama/llama-3.1-405b-instruct
$ aider --model openrouter/meta-llama/llama-3.1-70b-instruct
$ aider --model openrouter/meta-llama/llama-3.1-8b-instruct
</code></pre></div></div>

<p>See the
<a href="https://aider.chat/docs/install.html">installation instructions</a>
and other
<a href="https://aider.chat/docs/usage.html">documentation</a>
for more details.</p>

<h2 id="deepseek-coder-v2-0724">DeepSeek Coder V2 0724</h2>

<p>DeepSeek Coder V2 0724 was by far the biggest surprise
and the strongest code editing model among the new releases, coming in 2nd on the leaderboard.
It can
efficiently edit code with SEARCH/REPLACE, unlike
the prior DeepSeek Coder version.
This unlocks the ability to edit large files.</p>

<p>This new Coder version got 73% on the benchmark,
very
close to Sonnet’s 77% but 20-50X less expensive!</p>

<h2 id="llama-31">Llama 3.1</h2>

<p>Meta released the
Llama 3.1 family of models,
which have performed well on many evals.</p>

<p>The flagship Llama 3.1 405B instruct only 
secured #7 on aider’s leaderboard, 
well behind frontier models like
Claude 3.5 Sonnet &amp; GPT-4o.</p>

<p>The 405B model can use SEARCH/REPLACE to efficiently
edit code, but with a decrease in the benchmark score.
When using this “diff” editing format, its score dropped 
from 66% to 64%.</p>

<p>The smaller 70B model was competitive with GPT-3.5, while
the 8B model lags far behind.
Both seem unable to reliably use SEARCH/REPLACE to edit files.
This limits them to editing smaller files that can
fit into their output token limit.</p>

<h2 id="mistral-large-2-2407">Mistral Large 2 (2407)</h2>

<p>Mistral Large 2 (2407) scored only 60% on aider’s code editing
benchmark. 
This puts it just ahead of the best GPT-3.5 model. 
It
doesn’t seem able to reliably use SEARCH/REPLACE to efficiently edit
code,
which limits its use to small source files.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Summary of code editing skill for the new models, with Sonnet and GPT-3.5 for scale.]]></summary></entry><entry><title type="html">Sonnet is the opposite of lazy</title><link href="https://aider.chat/2024/07/01/sonnet-not-lazy.html" rel="alternate" type="text/html" title="Sonnet is the opposite of lazy" /><published>2024-07-01T00:00:00+00:00</published><updated>2024-07-01T00:00:00+00:00</updated><id>https://aider.chat/2024/07/01/sonnet-not-lazy</id><content type="html" xml:base="https://aider.chat/2024/07/01/sonnet-not-lazy.html"><![CDATA[<p><a href="https://aider.chat/assets/sonnet-not-lazy.jpg"><img src="/assets/sonnet-not-lazy.jpg" alt="sonnet is the opposite of lazy" /></a></p>

<p class="post-date">July 01, 2024</p>

<h1 id="sonnet-is-the-opposite-of-lazy">Sonnet is the opposite of lazy</h1>

<p>Claude 3.5 Sonnet represents a step change
in AI coding.
It is incredibly industrious, diligent and hard working.
Unexpectedly,
this presented a challenge:
Sonnet
was often writing so much code that
it was hitting the 4k output token limit,
truncating its coding in mid-stream.</p>

<p>Aider now works
around this 4k limit and allows Sonnet to produce
as much code as it wants.
The result is surprisingly powerful.
Sonnet’s score on
<a href="https://aider.chat/docs/leaderboards/#code-refactoring-leaderboard">aider’s refactoring benchmark</a>
jumped from 55.1% up to 64.0%.
This moved Sonnet into second place, ahead of GPT-4o and
behind only Opus.</p>

<p>Users who tested Sonnet with a preview of 
<a href="https://aider.chat/HISTORY.html#aider-v0410">aider’s latest release</a>
were thrilled:</p>

<ul>
  <li><em>Works like a charm. It is a monster. It refactors files of any size like it is nothing. The continue trick with Sonnet is truly the holy grail. Aider beats [other tools] hands down. I’m going to cancel both subscriptions.</em> – <a href="https://github.com/Aider-AI/aider/issues/705#issuecomment-2200338971">Emasoft</a></li>
  <li><em>Thanks heaps for this feature - it’s a real game changer. I can be more ambitious when asking Claude for larger features.</em> – <a href="https://github.com/Aider-AI/aider/issues/705#issuecomment-2196026656">cngarrison</a></li>
  <li><em>Fantastic…! It’s such an improvement not being constrained by output token length issues. [I refactored] a single JavaScript file into seven smaller files using a single Aider request.</em> – <a href="https://discord.com/channels/1131200896827654144/1253492379336441907/1256250487934554143">John Galt</a></li>
</ul>

<h2 id="hitting-the-4k-token-output-limit">Hitting the 4k token output limit</h2>

<p>All LLMs have various token limits, the most familiar being their
context window size.
But they also have a limit on how many tokens they can output
in response to a single request.
Sonnet and the majority of other
models are limited to returning 4k tokens.</p>

<p>Sonnet’s amazing work ethic caused it to
regularly hit this 4k output token
limit for a few reasons:</p>

<ol>
  <li>Sonnet is capable of outputting a very large amount of correct,
complete new code in one response.</li>
  <li>Similarly, Sonnet can specify long sequences of edits in one go, 
like changing a majority of lines while refactoring a large file.</li>
  <li>Sonnet tends to quote large chunks of a
file when performing SEARCH &amp; REPLACE edits.
Beyond token limits, this is very wasteful.</li>
</ol>

<h2 id="good-problems">Good problems</h2>

<p>Problems (1) and (2) are “good problems”
in the sense that Sonnet is
able to write more high quality code than any other model!
We just don’t want it to be interrupted prematurely
by the 4k output limit.</p>

<p>Aider now allows Sonnet to return code in multiple 4k token
responses.
Aider seamlessly combines them so that Sonnet can return arbitrarily
long responses.
This gets all the upsides of Sonnet’s prolific coding skills,
without being constrained by the 4k output token limit.</p>
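
<p>The general idea can be sketched roughly as follows.
This is a minimal illustration under stated assumptions, not aider’s implementation:
<code class="language-plaintext highlighter-rouge">collectFullResponse</code>,
<code class="language-plaintext highlighter-rouge">requestChunk</code> and the
<code class="language-plaintext highlighter-rouge">stoppedAtLimit</code> flag are hypothetical names,
and a fake model that emits a fixed number of characters per call stands in for a real API.</p>

```javascript
// Hypothetical sketch: keep requesting output until the model stops on its own,
// passing back what was written so far so the model can continue from there.
// A real API call would be asynchronous; this version is synchronous for brevity.
function collectFullResponse(requestChunk, maxRounds = 10) {
  let combined = '';
  for (let round = 0; round !== maxRounds; round++) {
    // requestChunk is an assumed callback: given the text so far, it returns
    // { text, stoppedAtLimit } for the next piece of output.
    const { text, stoppedAtLimit } = requestChunk(combined);
    combined += text;
    if (!stoppedAtLimit) break; // the model finished naturally
  }
  return combined;
}

// Fake "model" for demonstration: emits at most `limit` characters per call.
function makeFakeModel(fullOutput, limit) {
  return (soFar) => {
    const text = fullOutput.slice(soFar.length, soFar.length + limit);
    const stoppedAtLimit = soFar.length + text.length !== fullOutput.length;
    return { text, stoppedAtLimit };
  };
}

const target = 'x'.repeat(25);
const result = collectFullResponse(makeFakeModel(target, 10));
console.log(result === target);
```

<p>The <code class="language-plaintext highlighter-rouge">maxRounds</code> cap is a design choice for the sketch:
it bounds cost if a model never reports that it has finished.</p>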

<h2 id="wasting-tokens">Wasting tokens</h2>

<p>Problem (3) is more complicated, as Sonnet isn’t just
being stopped early – it’s actually wasting a lot
of tokens, time and money.</p>

<p>Faced with a few small changes spread far apart in 
a source file,
Sonnet would often prefer to do one giant SEARCH/REPLACE
operation of almost the entire file.
It would be far faster and less expensive to instead 
do a few surgical edits.</p>

<p>Aider now prompts Sonnet to discourage these long-winded
SEARCH/REPLACE operations
and promotes much more concise edits.</p>
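
<p>The difference in cost can be sketched with a toy example.
The <code class="language-plaintext highlighter-rouge">applyEdits</code> helper and the
<code class="language-plaintext highlighter-rouge">{ search, replace }</code> edit shape below are assumptions
made for illustration, not aider’s actual edit format or code,
and character counts stand in as a crude proxy for output tokens.</p>

```javascript
// Hypothetical helper: apply a list of { search, replace } edits to file text.
// Each search block must match exactly once, otherwise the edit is rejected.
function applyEdits(source, edits) {
  return edits.reduce((text, { search, replace }) => {
    const first = text.indexOf(search);
    if (first === -1) throw new Error('search block not found');
    if (text.indexOf(search, first + 1) !== -1) {
      throw new Error('search block is ambiguous');
    }
    return text.slice(0, first) + replace + text.slice(first + search.length);
  }, source);
}

// A 200-line file with two small changes spread far apart.
const lines = Array.from({ length: 200 }, (_, i) => `value_${i} = ${i}`);
const original = lines.join('\n');

// Surgical: two tiny edits that quote only the lines being changed.
const surgical = [
  { search: 'value_3 = 3\n', replace: 'value_3 = 30\n' },
  { search: 'value_190 = 190\n', replace: 'value_190 = 1900\n' },
];

// Giant: one edit that quotes the entire file in both halves.
const updated = applyEdits(original, surgical);
const giant = [{ search: original, replace: updated }];

// Rough cost proxy: characters the model must output for each strategy.
const cost = (edits) =>
  edits.reduce((n, e) => n + e.search.length + e.replace.length, 0);

// Both strategies produce the same file; only the output size differs.
console.log(applyEdits(original, giant) === updated);
console.log(cost(surgical), cost(giant));
```

<p>In this toy case both strategies yield an identical file,
but the giant edit has to reproduce every unchanged line twice.</p>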

<h2 id="aider-with-sonnet">Aider with Sonnet</h2>

<p><a href="https://aider.chat/HISTORY.html#aider-v0410">The latest release of aider</a>
has specialized support for Claude 3.5 Sonnet:</p>

<ul>
  <li>Aider allows Sonnet to produce as much code as it wants,
by automatically and seamlessly spreading the response
out over a sequence of 4k token API responses.</li>
  <li>Aider carefully prompts Sonnet to be concise when proposing
code edits.
This reduces Sonnet’s tendency to waste time, tokens and money
returning large chunks of unchanging code.</li>
  <li>Aider now uses Claude 3.5 Sonnet by default if the <code class="language-plaintext highlighter-rouge">ANTHROPIC_API_KEY</code> is set in the environment.</li>
</ul>

<p>See 
<a href="https://aider.chat/docs/install.html">aider’s install instructions</a>
for more details, but
you can get started quickly with aider and Sonnet like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python -m pip install -U aider-chat

$ export ANTHROPIC_API_KEY=&lt;key&gt; # Mac/Linux
$ setx   ANTHROPIC_API_KEY &lt;key&gt; # Windows, restart shell after setx

$ aider
</code></pre></div></div>]]></content><author><name></name></author><summary type="html"><![CDATA[Claude 3.5 Sonnet can easily write more good code than fits in one 4k token API response.]]></summary></entry><entry><title type="html">Aider is SOTA for both SWE Bench and SWE Bench Lite</title><link href="https://aider.chat/2024/06/02/main-swe-bench.html" rel="alternate" type="text/html" title="Aider is SOTA for both SWE Bench and SWE Bench Lite" /><published>2024-06-02T00:00:00+00:00</published><updated>2024-06-02T00:00:00+00:00</updated><id>https://aider.chat/2024/06/02/main-swe-bench</id><content type="html" xml:base="https://aider.chat/2024/06/02/main-swe-bench.html"><![CDATA[<p class="post-date">June 02, 2024</p>

<h1 id="aider-is-sota-for-both-swe-bench-and-swe-bench-lite">Aider is SOTA for both SWE Bench and SWE Bench Lite</h1>

<p>Aider scored 18.9%
on the main
<a href="https://www.swebench.com">SWE Bench benchmark</a>,
achieving a state-of-the-art result. 
The current top leaderboard entry is 13.8%
from Amazon Q Developer Agent.
The best result reported elsewhere seems to be
<a href="https://www.cognition.ai/post/swe-bench-technical-report">13.9% from Devin</a>.</p>

<p>This result on the main SWE Bench builds on
<a href="https://aider.chat/2024/05/22/swe-bench-lite.html">aider’s recent SOTA result on the easier SWE Bench Lite</a>.</p>

<p><a href="https://aider.chat/assets/swe_bench.svg"><img src="/assets/swe_bench.svg" alt="SWE Bench results" /></a></p>

<p><strong>All of aider’s results reported here are pass@1 results,
obtained without using the SWE Bench <code class="language-plaintext highlighter-rouge">hints_text</code>.</strong>
Aider was benchmarked on the same
<a href="https://github.com/CognitionAI/devin-swebench-results/tree/main/output_diffs">570 randomly selected SWE Bench problems</a>
that were used in the
<a href="https://www.cognition.ai/post/swe-bench-technical-report">Devin evaluation</a>.
See the <a href="#references">references</a>
for more details on the data presented in this chart.</p>

<h2 id="interactive-not-agentic">Interactive, not agentic</h2>

<p>Aider achieved this result mainly through its existing features that focus on static
code analysis, reliable LLM code editing, and pragmatic UX for automatically
fixing linting and testing errors.
Aider intentionally has quite limited and narrow “agentic behavior”
to avoid long delays, high token costs
and the need for users to repeatedly code review incorrect solutions.
It’s also worth noting that aider currently does not use
RAG, vector search or tools, and does not give the LLM access to search the web
or unilaterally execute code.</p>

<p>Aider is first and foremost an interactive tool for engineers to get real work done in
real code bases using a chat interface.
Aider provides a pair programming UX where users can ask for a change 
and see code edits performed in real-time.
Aider can also offer additional help like fixing lint or test errors,
but the user is always in full interactive control.
This allows them to quickly steer misunderstandings back on course and
avoid wasting time and token costs.</p>

<h2 id="benchmark-methodology">Benchmark methodology</h2>

<p>Benchmarking was conducted as follows:</p>

<ul>
  <li>Aider with GPT-4o was launched in each problem’s git repository
with the problem statement
submitted as the opening chat message from “the user”.</li>
  <li>After that aider ran as normal, except all of aider’s
suggestions were always accepted without user approval.</li>
  <li>A <a href="https://github.com/Aider-AI/aider-swe-bench#the-aider-agent">simple harness</a> was used to retry the SWE Bench problem if aider produced code that wasn’t <em>plausibly correct</em>.
Plausibly correct means that aider reported that it had successfully edited the repo
without causing syntax errors or breaking any <em>pre-existing</em> tests.</li>
  <li>If the solution from aider with GPT-4o wasn’t plausible, the harness launched aider to try again from scratch using Claude 3 Opus.</li>
  <li>If no plausible solution was found after those two tries, the harness picked the “most plausible” solution with the fewest edit/lint/test problems.</li>
</ul>
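
<p>The retry logic described above can be sketched roughly as follows.
This is a minimal illustration, not the actual harness code:
the <code class="language-plaintext highlighter-rouge">Attempt</code> fields, the
<code class="language-plaintext highlighter-rouge">pick_solution</code> function and the
<code class="language-plaintext highlighter-rouge">run_aider</code> callback are hypothetical names,
and summing the edit/lint/test problem counts to rank non-plausible attempts is an assumption.</p>

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Attempt:
    """Outcome of one aider run on one problem (hypothetical structure)."""
    model: str
    diff: str
    edit_errors: int    # edits aider reported it could not apply
    lint_errors: int    # outstanding lint problems
    test_failures: int  # pre-existing tests left failing

    @property
    def num_problems(self) -> int:
        # Assumption: rank attempts by the plain sum of outstanding problems.
        return self.edit_errors + self.lint_errors + self.test_failures

    @property
    def plausible(self) -> bool:
        # Plausible means no outstanding edit, lint or test problems.
        return self.num_problems == 0


def pick_solution(run_aider: Callable[[str], Attempt], models: List[str]) -> Attempt:
    """Try each model in order and stop at the first plausible attempt."""
    attempts = []
    for model in models:
        attempt = run_aider(model)
        if attempt.plausible:
            return attempt
        attempts.append(attempt)
    # No plausible attempt: fall back to the "most plausible" one.
    # min() keeps the earliest attempt on ties (an assumption).
    return min(attempts, key=lambda a: a.num_problems)


# Canned results stand in for real aider runs in this sketch.
canned = {
    "gpt-4o": Attempt("gpt-4o", "diff-a", 0, 1, 2),
    "claude-3-opus": Attempt("claude-3-opus", "diff-b", 0, 0, 1),
}
chosen = pick_solution(lambda m: canned[m], ["gpt-4o", "claude-3-opus"])
print(chosen.model, chosen.plausible)  # claude-3-opus False
```

<p>In this toy run neither attempt is plausible, so the sketch returns the Opus attempt
because it has fewer outstanding problems.</p>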

<p>It’s important to be clear that
<em>aider and the benchmark harness
only had access to the pre-existing tests in each problem’s repo</em>.
The held out “acceptance tests” were <em>only</em> used
after benchmarking to compute statistics on which problems aider
correctly resolved.</p>

<p>This is the same approach
that was used for
<a href="https://aider.chat/2024/05/22/swe-bench-lite.html">aider’s recent SOTA result on SWE Bench Lite</a>.
For the Lite benchmark,
aider alternated between GPT-4o and Opus for up to six total attempts.
To manage the cost of running the main SWE Bench benchmark,
aider was limited to two total attempts:
one with GPT-4o and one with Opus.</p>

<p>For a detailed discussion of the benchmark
methodology, see the
<a href="https://aider.chat/2024/05/22/swe-bench-lite.html">article about aider’s SWE Bench Lite results</a>.
Also, the
<a href="https://github.com/Aider-AI/aider-swe-bench">aider SWE Bench repository on GitHub</a>
contains the harness and statistics code used for the benchmarks.</p>

<p>The benchmarking process was similar to how a developer might use aider to
resolve a GitHub issue:</p>

<ul>
  <li>They could launch aider in their repo with the command below, which
tells aider they want to accept every suggestion
and to use pytest to run tests.
    <ul>
      <li><code class="language-plaintext highlighter-rouge">aider --yes --test-cmd pytest</code></li>
    </ul>
  </li>
  <li>They could start the chat by pasting in the URL or text of a GitHub issue.
Aider will pull in the URL’s content and then try to resolve the issue.</li>
  <li>If aider doesn’t produce code that lints and tests clean, the user might decide to
<a href="https://aider.chat/docs/git.html">use git to revert the changes</a>,
and try again with <code class="language-plaintext highlighter-rouge">aider --opus</code>.</li>
</ul>

<h2 id="aider-with-gpt-4o-alone-was-sota">Aider with GPT-4o alone was SOTA</h2>

<p>Using aider with GPT-4o to make a single attempt at resolving each problem
achieved a score of 17.0%.
This was itself a state-of-the-art result, before being surpassed by the main
result being reported here
that used aider with both GPT-4o &amp; Opus.</p>

<h2 id="aider-with-gpt-4o--opus">Aider with GPT-4o &amp; Opus</h2>

<p>The benchmark harness started by using aider with GPT-4o to try
to resolve each problem.
For problems where this didn’t produce a plausible solution,
the harness tried again using aider with Opus.
So at most, two attempts were made for each problem.</p>

<p>The table below breaks down the proposed solutions that
were found from each attempt at the 570 problems.
A proposed solution is either:</p>

<ul>
  <li>A plausible solution where
aider reported no outstanding errors from editing, linting and testing.</li>
  <li>Or, the “most plausible” solution generated by either attempt, with the
<a href="https://aider.chat/2024/05/22/swe-bench-lite.html#finding-a-plausible-solution">fewest outstanding editing, linting or testing errors</a>.</li>
</ul>

<p>The table also provides details on the 108 solutions that were ultimately
verified as correctly resolving their issue.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: center">Attempt</th>
      <th>Agent</th>
      <th style="text-align: right">Number of<br />proposed<br />solutions</th>
      <th style="text-align: right">Percent of<br />proposed<br />solutions</th>
      <th style="text-align: right">Number of<br />correctly<br />resolved<br />solutions</th>
      <th style="text-align: right">Percent of<br />correctly<br />resolved<br />solutions</th>
      <th style="text-align: right">Score on<br />SWE Bench</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center">1</td>
      <td>Aider with GPT-4o</td>
      <td style="text-align: right">419</td>
      <td style="text-align: right">73.5%</td>
      <td style="text-align: right">87</td>
      <td style="text-align: right">80.6%</td>
      <td style="text-align: right">15.3%</td>
    </tr>
    <tr>
      <td style="text-align: center">2</td>
      <td>Aider with Opus</td>
      <td style="text-align: right">151</td>
      <td style="text-align: right">26.5%</td>
      <td style="text-align: right">21</td>
      <td style="text-align: right">19.4%</td>
      <td style="text-align: right">3.7%</td>
    </tr>
    <tr>
      <td style="text-align: center"><strong>Total</strong></td>
      <td> </td>
      <td style="text-align: right"><strong>570</strong></td>
      <td style="text-align: right"><strong>100%</strong></td>
      <td style="text-align: right"><strong>108</strong></td>
      <td style="text-align: right"><strong>100%</strong></td>
      <td style="text-align: right"><strong>18.9%</strong></td>
    </tr>
  </tbody>
</table>
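
<p>The percentages in the table follow directly from the counts.
The short check below recomputes them from the figures reported above;
the variable and function names are only illustrative.</p>

```python
TOTAL_PROBLEMS = 570  # problems benchmarked, from the article

# Counts taken from the table above.
rows = {
    "Aider with GPT-4o": {"proposed": 419, "resolved": 87},
    "Aider with Opus": {"proposed": 151, "resolved": 21},
}

total_resolved = sum(r["resolved"] for r in rows.values())


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place, as shown in the table."""
    return round(100 * part / whole, 1)


for agent, r in rows.items():
    print(
        agent,
        pct(r["proposed"], TOTAL_PROBLEMS),  # percent of proposed solutions
        pct(r["resolved"], total_resolved),  # percent of correctly resolved solutions
        pct(r["resolved"], TOTAL_PROBLEMS),  # contribution to the benchmark score
    )

print("Total", total_resolved, pct(total_resolved, TOTAL_PROBLEMS))  # Total 108 18.9
```

<p>Each row’s contribution to the score is its resolved count divided by all 570 problems,
which is why the two contributions sum to the overall result, up to rounding.</p>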

<h2 id="non-plausible-but-correct-solutions">Non-plausible but correct solutions?</h2>

<p>A solution doesn’t actually have to be plausible in order to correctly resolve the issue.
Recall that plausible is simply defined as aider
reporting that it successfully completed all file edits,
repaired and resolved any linting errors
and resolved any test failures.
But there are many reasons why aider might fail to do those things
and yet still produce a solution that will pass
acceptance testing:</p>

<ul>
  <li>There may have been pre-existing failing tests in the repo,
before aider even started working on the SWE Bench problem.
Aider may not have resolved such issues, and yet they may not be
relevant to the acceptance testing.
The SWE Bench acceptance testing just confirms that tests pass or fail
in the same pattern as the “gold patch” developed by a human to resolve the
problem.
Some tests may fail during acceptance testing,
and that’s ok as long as they failed for the gold
patch too.</li>
  <li>There may have been pre-existing linting problems in the repo.
If lingering linting issues affected code paths that are not well tested,
they may not impact acceptance testing.</li>
  <li>Aider may have reported file editing errors because it thought the LLM
specified edits that it wasn’t able to successfully apply.
This can only happen when the LLM specified edits in
a way that doesn’t comply with the editing instructions in the system prompt.
Given that the LLM isn’t complying with the system prompt,
it may have become confused and
asked for redundant or otherwise irrelevant edits.
Such outstanding edit errors might not be fatal for acceptance testing.</li>
  <li>Etc.</li>
</ul>

<p>Keeping all this in mind, we can understand why
GPT-4o accounts for 15.3% of the benchmark score in the table above,
but benchmarking with just one attempt of aider with GPT-4o scored 17.0%.
When an Opus attempt is allowed after GPT-4o,
it may propose some <em>incorrect</em> solutions which
are “more plausible” than some of GPT-4o’s non-plausible solutions.
These more plausible, incorrect solutions can
eclipse some of
the earlier non-plausible correct solutions that GPT-4o generated.
This is why GPT-4o’s score in the table 
showing the combined GPT-4o &amp; Opus results (15.3%)
is lower than the result from just one try using aider with GPT-4o (17.0%).</p>

<p>For these reasons, adding additional attempts is not guaranteed to monotonically
increase the number of resolved problems.
New solutions may resolve some new problems but they may also
eclipse and discard some of the previous non-plausible correct solutions.</p>
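
<p>A toy example may make the eclipsing effect concrete.
The attempt data below is invented for illustration, and the
<code class="language-plaintext highlighter-rouge">select</code> function is a simplified stand-in
for the harness, assuming it ranks non-plausible attempts by their number of outstanding problems.</p>

```python
# Each attempt is (label, outstanding_problems, resolves_issue).
# resolves_issue is only known after acceptance testing; the harness never sees it.
gpt4o_attempt = ("gpt-4o", 2, True)  # non-plausible, but actually correct
opus_attempt = ("opus", 1, False)    # "more plausible", yet incorrect


def select(attempts):
    """Return the first plausible attempt, else the one with the fewest problems."""
    for attempt in attempts:
        if attempt[1] == 0:
            return attempt
    return min(attempts, key=lambda a: a[1])


one_try = select([gpt4o_attempt])
two_tries = select([gpt4o_attempt, opus_attempt])

print(one_try[0], one_try[2])      # gpt-4o True
print(two_tries[0], two_tries[2])  # opus False
```

<p>With a single attempt the correct GPT-4o solution is kept.
Adding the second attempt replaces it with a solution that has fewer outstanding problems but does not resolve the issue.</p>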

<p>Luckily, the net effect of additional attempts
usually increases or at least maintains the
number of resolved problems.
This was the case for all the attempts made in both this main SWE Bench result and the
earlier Lite result.</p>

<h2 id="computing-the-benchmark-score">Computing the benchmark score</h2>

<p>The benchmark harness produced one proposed solution for each of
the 570 SWE Bench problems.</p>

<p>A separate evaluation script was used to
test each of these solutions with the full test suite,
including the held out acceptance tests.
For this final acceptance testing, any edits that aider made to tests
were discarded.
This ensured that the correct,
unmodified test suite was used for acceptance testing.
The evaluation script compared each proposed solution’s test results
with results from testing
the “gold” patch that was developed by a human to correctly resolve the issue.
If they matched, the proposed solution correctly resolved the issue.</p>
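
<p>The comparison step can be pictured with a small sketch.
This is a simplification of what the evaluation scripts do:
the test names and the dictionary format are hypothetical,
and only the idea of matching the gold patch’s pass/fail pattern comes from the description above.</p>

```python
def correctly_resolved(proposed_results: dict, gold_results: dict) -> bool:
    """A proposed solution counts as resolved if its per-test outcomes match the gold patch's."""
    return proposed_results == gold_results


# Hypothetical outcomes from running the unmodified test suite.
gold = {"test_fix": "pass", "test_existing": "pass", "test_legacy": "fail"}
proposed = {"test_fix": "pass", "test_existing": "pass", "test_legacy": "fail"}
regressed = dict(proposed, test_existing="fail")

print(correctly_resolved(proposed, gold))   # True
print(correctly_resolved(regressed, gold))  # False
```

<p>Note that <code class="language-plaintext highlighter-rouge">test_legacy</code> fails in both runs,
which is fine because it also fails for the gold patch.</p>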

<p>These acceptance tests were only ever run outside of aider
and the benchmark harness, and only to compute statistics about the
correctly resolved instances.
They were never run, used, or even visible during aider’s attempts to resolve the problems.</p>

<p>Aider correctly resolved 108 out of 570 SWE Bench instances that were benchmarked,
or 18.9%.</p>

<h2 id="acknowledgments">Acknowledgments</h2>

<p>Many thanks to the team behind the
<a href="https://www.swebench.com">SWE Bench</a>
family of AI coding benchmarks.
Also thanks to Albert Örwall who has
<a href="https://github.com/aorwall/SWE-bench-docker">dockerized the SWE Bench evaluation scripts</a>
making it faster, easier, and more reliable to run the acceptance tests.</p>

<h2 id="references">References</h2>

<p>All of aider’s results reported here are pass@1 results,
obtained without using the SWE Bench <code class="language-plaintext highlighter-rouge">hints_text</code>.</p>

<p>The “aider agent” internally makes multiple “attempts” at solving the problem,
but it picks and returns one single candidate solution.
Only that one candidate solution is evaluated with the acceptance tests
and contributes to the benchmark score.
Thus it is a pass@1 result.</p>

<p>This is in contrast to a pass@N result for N&gt;1, where N attempts are made
and all N solutions are evaluated by the acceptance tests.
If <em>any</em> of the N solutions pass, that counts as a pass@N success.</p>
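
<p>The difference between the two metrics can be sketched in a few lines.
The outcomes below are invented, and the function names are only illustrative.</p>

```python
def pass_at_1(selected_passes: bool) -> bool:
    """Only the single selected candidate is run against the acceptance tests."""
    return selected_passes


def pass_at_n(all_passes: list) -> bool:
    """All N candidates are run against the acceptance tests; any success counts."""
    return any(all_passes)


# Hypothetical acceptance outcomes for three candidate solutions to one problem.
outcomes = [False, True, False]
selected_index = 0  # the one candidate the agent chose to return

print(pass_at_1(outcomes[selected_index]))  # False
print(pass_at_n(outcomes))                  # True
```

<p>The same set of candidates can therefore fail under pass@1 and succeed under pass@N,
which is why the two kinds of results are not directly comparable.</p>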

<p>Below are the references for the other pass@1 unhinted SWE Bench results
displayed in the graph at the beginning of this article.</p>

<ul>
  <li><a href="https://www.cognition.ai/post/swe-bench-technical-report">13.9% Devin, benchmarked on 570 instances.</a></li>
  <li><a href="https://www.swebench.com">13.8% Amazon Q Developer Agent, benchmarked on 2,294 instances.</a></li>
  <li><a href="https://www.swebench.com">12.5% SWE-Agent + GPT-4, benchmarked on 2,294 instances.</a></li>
  <li><a href="https://arxiv.org/pdf/2404.05427v2">10.6% AutoCodeRover, benchmarked on 2,294 instances.</a></li>
  <li><a href="https://www.swebench.com">10.5% SWE-Agent + Opus, benchmarked on 2,294 instances.</a></li>
</ul>

<p>The graph contains average pass@1 results for AutoCodeRover.
The <a href="https://github.com/nus-apr/auto-code-rover">AutoCodeRover GitHub page</a>
features their pass@3 results
without clearly labeling them as such.
Table 2 of their
<a href="https://arxiv.org/pdf/2404.05427v2">paper</a>
reports an <code class="language-plaintext highlighter-rouge">ACR-avg</code> result of 10.59% which is an average pass@1 result.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Aider sets SOTA for the main SWE Bench, after recently setting SOTA for the Lite version.]]></summary></entry></feed>