SOTA on SWE-bench [Assisted] in October 2024

Interval   Probability
90-100%    1.3%
75-90%     18%
60-75%     8%
35-60%     17%
15-35%     47%
0-15%      9%

SWE-bench is a benchmark developed to evaluate whether language models can resolve real-world GitHub issues. The leaderboard lists various models and the percentage of SWE-bench instances each of them resolved; every instance corresponds to a GitHub issue. The leaderboard is divided into two main categories: Unassisted and Assisted.

  • Assisted: In this category, models are evaluated with the "oracle" retrieval setting. This setting provides the model with the correct files to edit, allowing the benchmark to primarily focus on a model's patch generation ability.

This question is only about the Assisted category of this benchmark.
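
For concreteness, here is a minimal sketch of what the oracle setting supplies to a model. It assumes the princeton-nlp/SWE-bench dataset on Hugging Face and its published fields (problem_statement, patch); the dataset name and field names come from the benchmark's public release, not from this market. The files edited by the gold patch are extracted from the diff headers, since those are exactly the files the model receives alongside the issue text.

```python
# A minimal sketch of the "oracle" retrieval setting, assuming the
# princeton-nlp/SWE-bench dataset on Hugging Face and its published
# fields (problem_statement, patch). Requires: pip install datasets
import re

from datasets import load_dataset


def oracle_files(gold_patch: str) -> list[str]:
    """Extract the file paths touched by the gold patch.

    In the oracle setting these are the files handed to the model
    alongside the issue text, so the benchmark measures patch
    generation rather than retrieval.
    """
    # Unified git diffs mark each edited file with a
    # "diff --git a/<path> b/<path>" header line.
    return re.findall(r"^diff --git a/(\S+) b/", gold_patch, flags=re.MULTILINE)


if __name__ == "__main__":
    instance = load_dataset("princeton-nlp/SWE-bench", split="test")[0]
    print("Issue:", instance["problem_statement"][:200], "...")
    print("Oracle files:", oracle_files(instance["patch"]))
```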

http://www.swebench.com/#
Current SOTA is <5%

The prediction market will resolve based on the SWE-bench leaderboard standings as of 11th October 2024.

In the extremely unlikely case that the number fits in two intervals, the lower one will be chosen.
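
To make that tie-break concrete, here is a small sketch using the intervals listed at the top of the market; scanning from the lowest interval upward is what makes a boundary value (say, exactly 15%) resolve to the lower bucket.

```python
# Sketch of the stated tie-break rule: a score that lands exactly on a
# shared boundary (e.g. 15.0) resolves to the lower of the two
# intervals, because the scan starts from the lowest interval.
INTERVALS = [(0, 15), (15, 35), (35, 60), (60, 75), (75, 90), (90, 100)]


def resolve_bucket(score: float) -> tuple[int, int]:
    for lo, hi in INTERVALS:
        if lo <= score <= hi:
            return (lo, hi)
    raise ValueError(f"score {score}% is outside 0-100%")


assert resolve_bucket(15.0) == (0, 15)   # boundary value goes low
assert resolve_bucket(15.1) == (15, 35)
```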


We still haven't seen Claude 3.5 Sonnet. It's likely that the winner on the leaderboard is worse than what the SOTA could be.

If there are no new submissions for the Assisted setting, this question will duplicate the Unassisted one and resolve the same.

It is not a duplicate. Right now everyone uses different agent systems that help the models solve the GitHub issues.

In the original paper, "assisted" meant that the authors themselves provided the precise files that had to be changed in order to complete the task.

Ok, thanks. You're right.