Will AI agents be able to regularly code small features for us in a year?
184 · Ṁ140k · 2025 · 56% chance

I'm thinking of something like https://mentat.ai/, but that actually works.

I will provide a paragraph or so describing the change I want made. Then it should create a GitHub PR, which I will review and leave only a few comments before merging. The whole process should take less than 30 minutes. This should work fairly reliably.

I tried this yesterday and it failed haha:
https://github.com/manifoldmarkets/manifold/pull/2694

See more discussion in my post:

https://jamesgrugett.com/p/software-automation-will-make-us

I bought yes because I've seen GitHub's Copilot Workspace already do promisingly well in my brief tests. By mid-2025, I can definitely see it being good enough to do real work on some codebases (especially if you have a good test suite).

opened a Ṁ20,000 YES at 60% order

if James doesn’t get accepted into AI grants, then there will be something better as an alternative, otherwise manicode will be coding features for us in a year

Will you @JamesGrugett provide additional repo-level, AI-specific documentation as you describe in https://manifold.markets/JamesGrugett/will-manicode-be-accepted-into-ai-g ?

From a reading of the question description text, I'd say that shouldn't be allowed: description mentions mentat.ai and "provide a paragraph or so"--both of which suggest no such AI-specific handholding.

Will Manicode be accepted into AI Grant batch 4?
24% chance. https://aigrant.com/

Here is my application (selected questions only):

Provide a short summary of your product

Better code generation than Cursor

Describe your product in more detail

Run manicode in your terminal. Ask it to do any coding task. It will make changes to your files.

...and it will do a really good job. Why?

It has full access to read and write to your files, run terminal commands, and scrape the web

It can: grab files it needs for context, edit multiple files at once (no copy-pasting), run the type checker, run tests, install dependencies, and search for documentation. These abilities are key to doing a good job and will only become more powerful as LLMs continue to level up.

It uses so-called "knowledge" files

LLMs perform so much better with extra context! With Manicode, we've come up with this idea to check in knowledge.md files in any directory, and write down extra bits of context, like which 3 files you need to edit in order to create a new endpoint. Or which patterns are being deprecated and which should be used. Or which directories can import from other directories. Every codebase has lots of implicit knowledge like this that you have to impart to your engineers. Once written down, it makes Claude really fly! It's truly a night and day difference.

It's synchronous, and you can give feedback

You're chatting with it. It takes ~30 seconds to get back to you and then you can tell it what you want to do next or what it did wrong. This keeps Manicode on track and aligned.

It learns

The flow of using Manicode is:

Ask it to do something
If it fails, point out its error
Manicode fixes the error and automatically writes down how it can improve for next time in a knowledge file
You push the commit, and now Manicode has become even more capable when the next engineer runs it in the codebase.

This is the magic loop that will make Manicode productive for experienced engineers in giant codebases.

We're unafraid to spend for better results

We can use as many parallel API calls with as much context as we can to produce the best code, because we know that the alternative is human labor, which is much more expensive.

We're targeting the largest market for software engineers

It's a tool for daily use by experts (not just junior engineers)
It's for software maintainers (not just people starting new projects)

We're starting with a console application, because it's simple and has great distribution

Every developer knows how to install new packages with npm or pip. Most developers already have the terminal accessible: as a pane in your vscode window, for example.

The timing is right

Claude Sonnet 3.5 passed some bar of coding competence, and the form factor of a fully capable agent that can change any file works now, whereas before you could only reliably edit one function at a time.

There is a moat after all

Handling every tech stack well, knowing when to run their tests and type check, integrating with git, linear, slack, and email, supporting database migrations, etc, etc, etc. You can build hundreds or thousands of special case prompt magic to improve things so that it always just magically works the first time. A startup arriving at this 6 months late wouldn't catch up.

Try it out!

> npm install -g manicode
> manicode

Intro video (https://www.youtube.com/embed/ZzT4HIhnzio)

Demo video https://www.loom.com/share/2067e3ad5fdf4565905f6aeb8f13b215?sid=de0e9ad8-447a-485a-bcb3-71b8a5a43665

Addendum

I submitted this last night. The few things I forgot to include:

The prototype is communicating with my server over websockets and so is significantly more complex than running a local script. It is already set up to work on any project immediately.
I intend to charge $100 per month per user to get off the free plan (and some usage based fees after that if you use it a huge amount).
Giving manicode full access to your files and terminal where it can run stuff without confirmation from the user sounds scary, but is actually not risky in reality, especially if you have version control. This quality of doing something that normal people think "goes too far" or seems unsafe is correlated with good startup ideas, because it means fewer people are likely to have thought of it. (E.g. For Airbnb: You let random strangers sleep in your house? Or Manifold: You let anyone ask and judge the resolution of their own question?)

They said they will let us know if we won by September 20th at the latest.
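
For readers wondering what "knowledge.md files in any directory" could look like in practice, here is a minimal sketch. It is not Manicode's actual implementation: the function name `collect_knowledge` and the root-to-leaf ordering are assumptions made purely for illustration. It gathers every knowledge.md that sits between the repo root and the file being edited, so that general notes come before more specific ones.

```python
from pathlib import Path


def collect_knowledge(repo_root: Path, target_file: Path,
                      name: str = "knowledge.md") -> list[Path]:
    """Return knowledge files from the repo root down to target_file's directory.

    Hypothetical helper: the ordering (most general first) is an assumption,
    not something the application text specifies.
    """
    repo_root = repo_root.resolve()
    # Raises ValueError if target_file is not inside repo_root.
    relative_dir = target_file.resolve().parent.relative_to(repo_root)

    # Build the chain of directories: root, root/a, root/a/b, ...
    directories = [repo_root]
    for part in relative_dir.parts:
        directories.append(directories[-1] / part)

    # Keep only the directories that actually contain a knowledge file.
    return [d / name for d in directories if (d / name).is_file()]
```

The contents of the returned files could then be concatenated into the prompt alongside the user's request.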

Hi, great question!

When I created this market, I didn't imagine I would be building my own AI agent for coding.

Regarding human-created context on the codebase, I do think that should be allowed! Adding a bit of documentation seems like fair game. If, however, the context were specifying in detail how to make the coding changes for the specific feature, that would seem unfair.

Also, I think a little bit of back-and-forth with the AI should be allowed, since I did specify you could leave some comments, and that it should take under 30 minutes.

I think manicode does not yet qualify, since I'm not sure it would work 90% of the time, without manual intervention or extended back-and-forth.

Thanks for clarifying.

To be frank, the fact that you are literally designing your own AI, presumably optimized for Manifold's GitHub functionality, wildly changes the odds on this question. Obviously we can't know what projects will spin up over the course of the year (so fair play), but the phrasing of this question came off to me as pointing at 3rd party, general AI agents rather than Manifold-bespoke AI agents.

I understand. I will try to raise the bar of expectations if it feels like manicode is especially good at the manifold codebase compared to others. I don't really think this will be the case though.

While it is not coding, AI code review could be helpful. Take for example https://coderabbit.ai. It does a pretty nice summary as well as code review. They are also free for open source so you could try them out.

Here is an example that shows how it could be useful: https://github.com/jsonresume/jsonresume.org/pull/131#issuecomment-2236198926

I have this at ~30%. Anyone want to explain their reasoning? 90% success is a very high bar. Compare it to SWE-bench, which includes test cases (presumably James doesn't always pre-specify these), and yet the current SOTA is only 20%.

Does "fairly reliably" roughly mean 75% success, 90%, 98%, ...?

90%!

Too subjective for me to bet much on. Expectations will shift as much or more than capabilities over the next year.

I think that in a year we'll see some outstanding successes when the feature is straightforward and uses a common pattern (e.g. add some CRUD route handlers to a REST API for a popular server framework).

But for more complicated things, and for codebases which go off the beaten path a bit, we'll still see broken PRs and code which superficially looks right but has an unusual number of subtle bugs.

In a year, I don't know if this market will resolve based on asking it to do something easy or hard, where the difficulty for a human might not correlate to difficulty for an AI-bot in an easily predictable way.

My general bias is that, with experience, a programmer will learn to avoid pitfalls of any tool, making the tool more useful over time, even without the tool changing at all.

I have a clear idea of what I'm looking for. It needs to be able to make good changes to the codebase for a variety of small-ish requests, which often involve some refactoring along the way. (Leaving code better after the change than before would be a good sign!)

I think this qualifies as a harder objective in your characterization. I'm totally on board with the idea that even now AI coding agents could become more useful operating within a more limited framework.

You've explored this a bit already -- do you know if any AI coding agents integrate with CI/CD to build & test the code they write? It seems like that could go a long way towards fixing the "code only superficially looks correct" issue.

If a first agent could write a comprehensive set of unit tests and end-to-end tests (including performance goals for desired level of scale), then it seems like you could let a second agent take as many implementation attempts as it needs to reach those goals.

That doesn't help with the broader "is AI generated code clean enough to directly incorporate into my codebase?" issue though. I suspect that we'll go through a period of "AI writes custom libraries to do a specific task. Humans don't mess with them, they just use them." That's not very different from how we treat compilers. If we want to alter the library, we'll tweak our requirements and let the AI generate it again, possibly using the old library for reference.
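
The "tests first, then retry until they pass" idea above can be sketched as a small loop. This is only an illustration under stated assumptions: `implement_until_green`, `propose`, and `run_tests` are hypothetical names, the "second agent" is stood in for by a list of canned attempts, and the tests are two hard-coded assertions rather than a real CI run.

```python
from typing import Callable, Optional


def implement_until_green(
    propose: Callable[[int, Optional[str]], str],
    run_tests: Callable[[str], Optional[str]],
    max_attempts: int = 5,
) -> Optional[str]:
    """Return the first candidate that passes the tests, or None if none does."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        # The implementing agent sees the previous failure message, if any.
        candidate = propose(attempt, feedback)
        feedback = run_tests(candidate)  # None means every test passed
        if feedback is None:
            return candidate
    return None


def run_tests(source: str) -> Optional[str]:
    """Stand-in for the first agent's test suite: exec the code, then assert."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"
        assert namespace["add"](-1, 1) == 0, "add(-1, 1) should be 0"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


# Canned attempts standing in for a real coding agent (assumption).
CANNED_ATTEMPTS = [
    "def add(a, b): return a - b",  # buggy first try
    "def add(a, b): return a + b",  # corrected second try
]


def propose(attempt: int, feedback: Optional[str]) -> str:
    return CANNED_ATTEMPTS[min(attempt, len(CANNED_ATTEMPTS)) - 1]


if __name__ == "__main__":
    print(implement_until_green(propose, run_tests))
```

The loop only guarantees that the result passes the tests it was given, which is why the quality of the first agent's test suite matters so much here.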

It's a good idea! Especially with languages that have types as another layer of checking.

MentatBot seemed to make lots of errors that could be tested, but they do say that testing approaches is a key part of how it works: https://mentat.ai/blog/mentatbot-sota-coding-agent

bought Ṁ500 YES

Seriously this is priced so ridiculously wrong

bought Ṁ2,000 YES

@JamesGrugett Is this just a ploy to get us to buy more mana so we can bet this up to 99%?

"Yeah, we're a tech debt as a service startup"

this market will have controversial resolution!

bought Ṁ100 YES

this market will have a controversial yes resolution**