NHacker Next
▲ The unreasonable effectiveness of an LLM agent loop with tool use (sketch.dev)
416 points by crawshaw 1 days ago | 297 comments
libraryofbabel 1 days ago [-]
Strongly recommend this blog post too which is a much more detailed and persuasive version of the same point. The author actually goes and builds a coding agent from zero: https://ampcode.com/how-to-build-an-agent

It is indeed astonishing how well a loop with an LLM that can call tools works for all kinds of tasks now. Yes, sometimes they go off the rails, there is the problem of getting that last 10% of reliability, etc. etc., but if you're not at least a little bit amazed then I urge you to go and hack together something like this yourself, which will take you about 30 minutes. It's possible to have a sense of wonder about these things without giving up your healthy skepticism of whether AI is actually going to be effective for this or that use case.
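
To make "something like this" concrete, the whole trick is roughly the loop below - a minimal sketch using the Anthropic Python SDK, where the single shell tool, the model name, and the prompt are all just illustrative rather than anything from the linked post:

    import subprocess
    import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from the env

    client = anthropic.Anthropic()

    # One illustrative tool: let the model run shell commands and see the output.
    tools = [{
        "name": "run_shell",
        "description": "Run a shell command and return its stdout and stderr.",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    }]

    def run_tool(name, args):
        if name == "run_shell":
            proc = subprocess.run(args["command"], shell=True,
                                  capture_output=True, text=True)
            return (proc.stdout + proc.stderr)[-10000:]  # truncate to keep context small
        return f"unknown tool: {name}"

    messages = [{"role": "user", "content": "Run the test suite and fix any failures."}]

    while True:
        response = client.messages.create(
            model="claude-3-7-sonnet-latest",  # any tool-calling model will do
            max_tokens=4096, messages=messages, tools=tools,
        )
        messages.append({"role": "assistant", "content": response.content})
        for block in response.content:
            if block.type == "text":
                print(block.text)
        if response.stop_reason != "tool_use":
            break  # the model gave a final answer instead of requesting a tool
        # Execute every requested tool call and feed the results back in.
        results = [{"type": "tool_result", "tool_use_id": block.id,
                    "content": run_tool(block.name, block.input)}
                   for block in response.content if block.type == "tool_use"]
        messages.append({"role": "user", "content": results})

That's essentially the whole thing; everything else in the commercial tools - context management, permission prompts, nicer diffs - is scaffolding around a loop like this.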

This "unreasonable effectiveness" of putting the LLM in a loop also accounts for the enormous proliferation of coding agents out there now: Claude Code, Windsurf, Cursor, Cline, Copilot, Aider, Codex... and a ton of also-rans; as one HN poster put it the other day, it seems like everyone and their mother is writing one. The reason is that there is no secret sauce and 95% of the magic is in the LLM itself and how it's been fine-tuned to do tool calls. One of the lead developers of Claude Code candidly admits this in a recent interview.[0] Of course, a ton of work goes into making these tools work well, but ultimately they all have the same simple core.

[0] https://www.youtube.com/watch?v=zDmW5hJPsvQ

vidarh 13 hours ago [-]
There's a Ruby port of the agent from the first article you linked as well. Feature-wise they're about the same, but if you (like me) enjoy Ruby more than Python, it's worth reading both articles:

https://news.ycombinator.com/item?id=43984860

https://radanskoric.com/articles/coding-agent-in-ruby

forgingahead 13 hours ago [-]
Love to see the Ruby implementations! Thanks for sharing.
ichiwells 6 hours ago [-]
Thank you so much for sharing this!

We are using Ruby to build a powerful AI toolset in the construction space, and we love how simple all of the SaaS parts are and not having to reinvent the wheel, but the Ruby LLM SDK ecosystem is a bit lagging, so we've written a lot of our own low-level tools.

(btw we are also hiring rubyists https://news.ycombinator.com/item?id=43865448)

datpuz 23 hours ago [-]
Can't think of anything an LLM is good enough at to let it run on its own in a loop for more than a few iterations before I need to rein it back in.
hbbio 20 hours ago [-]
That's why in practice you need more than this simple loop!

It's pretty much WIP, but I am experimenting with simple sequence-based workflows that are designed to frequently reset the conversation [2].

This goes well with the Microsoft paper "LLMs Get Lost in Multi-Turn Conversation" that was published Friday [1].

- [1]: https://arxiv.org/abs/2505.06120

- [2]: https://github.com/hbbio/nanoagent/blob/main/src/workflow.ts

vidarh 13 hours ago [-]
They've written most of the recent iterations of X11 bindings for Ruby, including a complete, working example of a systray for me.

They also added the first pass of multi-monitor support for my WM while I was using it (I restarted it repeatedly while Claude Code worked, in the same X session that the terminal it was working in was running in).

You do need to rein them back in, sure, but they can often go multiple iterations before they're ready to make changes to your files once you've approved safe tool uses etc.

TZubiri 1 hours ago [-]
How do they read the screen?
datpuz 9 hours ago [-]
Agents? Doubt.
vidarh 9 hours ago [-]
You can doubt it all you want - it doesn't make it any less true.
datpuz 7 hours ago [-]
Can you provide a source?
Groxx 22 hours ago [-]
They're extremely good at burning through budgets, and get even better when unattended
_kb 21 hours ago [-]
Maximising paperclip production too.
mycall 20 hours ago [-]
Is that really true? I thought there were free models and $200 all-you-can-eat models.
nsomaru 20 hours ago [-]
These tools require API calls which usually aren’t priced like the consumer plans
never_inline 6 hours ago [-]
Well technically Aider lets you use a web chat UI by generating some context and letting you paste back and forth.
adastra22 20 hours ago [-]
Yeah they’re cheaper. I’ve written whole apps for $0.20 in API calls.
monsieurbanana 10 hours ago [-]
With which agent? What kind of apps?

Without more information I'm very skeptical that you had e.g. Claude Code create a whole app (so more than a simple script) with 20 cents. Unless it was able to one-shot it, but at that point you don't need an agent anyway.

adastra22 7 hours ago [-]
Aider, Claude 3.7.
datpuz 7 hours ago [-]
I've "written" whole apps by going to GitHub, cloning a repo, right clicking, and renaming it to "MyApp." Impressed?
jfim 20 hours ago [-]
Claude code is now part of the consumer $100/mo max plan.
Aeolun 17 hours ago [-]
If they give me API access too I’m sold xD
piuantiderp 17 hours ago [-]
I've read that you can very quickly blow the budget on the $200/mo ones too.
CuriouslyC 23 hours ago [-]
The main problem with agents is that they aren't reflecting on their own performance and pausing their own execution to ask a human for help aggressively enough. Agents can run on for 20+ iterations in many cases successfully, but also will need hand holding after every iteration in some cases.

They're a lot like a human in that regard, but we haven't been building that reflection and self-awareness into them so far, so it's like a junior who doesn't realize when they're out of their depth and should get help.

vendiddy 15 hours ago [-]
I think they are capable of doing it, but it requires prompting.

I constantly have to instruct them:

- Go step by step, don't skip ahead until we're done with a step

- Don't make assumptions, if you're unsure ask questions to clarify

And they mostly do this.

But this needs to be default behavior!

I'm surprised that, unless prompted, LLMs never seem to ask follow-up questions as a smart coworker might.

ariwilson 21 hours ago [-]
Is there value in adding an overseer LLM that measures the progress between n steps and, if it's too low, stops and calls out to a human?
CuriouslyC 17 hours ago [-]
I don't think you need an overseer for this, you can just have the agent self-assess at each step whether it's making material progress or if it's caught in a loop, and if it's caught in a loop to pause and emit a prompt for help from a human. This would probably require a bit of tuning, and the agents need to be set up with a blocking "ask for help" function, but it's totally doable.
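
Roughly, in the style of the minimal loop sketched upthread, that blocking "ask for help" function can be just one more tool (everything here is illustrative):

    # Hypothetical extra tool: the agent calls this when it judges it is stuck or
    # not making material progress; execution blocks until a human replies.
    ask_human_tool = {
        "name": "ask_human",
        "description": "Pause and ask your human partner a question when you are "
                       "stuck, looping, or unsure whether you are making progress.",
        "input_schema": {
            "type": "object",
            "properties": {"question": {"type": "string"}},
            "required": ["question"],
        },
    }

    def ask_human(args):
        print("\nAgent asks:", args["question"])
        return input("Your answer: ")  # blocks the agent loop until answered

The tuning is mostly prompt-side: telling the model to self-assess after each step and to reach for this tool instead of grinding on.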
p_v_doom 16 hours ago [-]
Bruh, we're inventing robot PMs for our robot developers now? We're so fucked
suninsight 15 hours ago [-]
Yes, it works really well. We do something like that at NonBioS.ai - longer post below. The agent self-reflects if it is stuck or confused and calls out to the human for help.
solumunus 21 hours ago [-]
And how does it effectively measure progress?
NotMichaelBay 20 hours ago [-]
It can behave just like a senior role would - produce the set of steps for the junior to follow, and assess if the junior appears stuck at any particular step.
CuriouslyC 17 hours ago [-]
I have actually had great success with agentic coding by sitting down with a LLM to tell it what I'm trying to build and have it be socratic with me, really trying to ask as many questions as it can think of to help tease out my requirements. While it's doing this, it's updating the project readme to outline this vision and create a "planned work" section that is basically a roadmap for an agent to follow.

Once I'm happy that the readme accurately reflects what I want to build and all the architectural/technical/usage challenges have been addressed, I let the agent rip, instructing it to build one thing at a time, then typecheck, lint, and test the code to ensure correctness, fixing any errors it finds (and re-running the automated checks) before moving on to the next task. Given this workflow, I've built complex software using agents with basically no intervention needed, with the exception of rare cases where its testing strategy is flaky in a way that makes it hard to get the tests passing.
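
A bare-bones sketch of that inner "implement, check, fix" cycle, just to make it concrete (the commands and the ask_agent() callback are placeholders; in practice Aider's lint-cmd/test-cmd settings do this loop for you):

    import subprocess

    CHECKS = ["npm run typecheck", "npm run lint", "npm test"]  # placeholders

    def run_checks():
        """Run each check; return (ok, output of the first failure)."""
        for cmd in CHECKS:
            proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
            if proc.returncode != 0:
                return False, f"`{cmd}` failed:\n{proc.stdout}{proc.stderr}"
        return True, ""

    def build_next_task(task, ask_agent, max_fix_rounds=5):
        # ask_agent() is whatever sends a prompt to the coding agent and lets it edit files.
        ask_agent(f"Implement the next roadmap item: {task}")
        for _ in range(max_fix_rounds):
            ok, report = run_checks()
            if ok:
                return True   # correctness checks pass; move on to the next item
            ask_agent(f"The automated checks failed.\n{report}\nPlease fix this.")
        return False  # still failing; needs human attention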

Xevion 17 hours ago [-]
>I have actually had great success with agentic coding by sitting down with a LLM to tell it what I'm trying to build and have it be socratic with me, really trying to ask as many questions as it can think of to help tease out my requirements.

Just curious, could you expand on the precise tools or way you do this?

For example, do you use the same well-crafted prompt in Claude or Gemini and use their in-house document curation features, or do you use a file in VS Code with Copilot Chat and just say "assist me in writing the requirements for this project in my README, ask questions, perform a socratic discussion with me, build a roadmap"?

You said you had 'great success' and I've found AI to be somewhat underwhelming at times, and I've been wondering if it's because of my choice of models, my very simple prompt engineering, or if my inputs are just insufficient/too complex.

CuriouslyC 16 hours ago [-]
I use Aider with a very tuned STYLEGUIDE.md and AI rules document that basically outlines this whole process so I don't have to instruct it every time. My preferred model is Gemini 2.5 Pro, which is definitely by far the best model for this sort of thing (Claude can one-shot some stuff about as well, but for following an engineering process and responding to test errors, it's vastly inferior).
vendiddy 15 hours ago [-]
How do you find Aider compares to Claude code?
CuriouslyC 12 hours ago [-]
I like Aider's configurability, I can chain a lot of static analysis stuff together with it and have the model fix all of it, and I can have 2-4 aider windows open in a grid and run them all at once, not sure how that'd work with Claude Code. Also, aider managing everything with git commits is great.
TeMPOraL 11 hours ago [-]
Can you talk more about the workflow you're using? I'm using Aider routinely myself, but with relatively unsophisticated approach. One thing that annoys me a bit is that prompts aren't obviously customizable - I'm pretty sure that the standard ones, which include code examples in 2 or 3 different languages, are confusing LLMs a bit when I work on a codebase that doesn't use those languages.
CuriouslyC 10 hours ago [-]
I use a styleguide.md document, which is general software engineering principles that you might provide for human contributors in an open source project. I pair that with a .cursorrules file (people I code with use it, so I use that file name for their convenience) that describes how the LLM should interact with me:

# Cursor Rules for This Project

  You are a software engineering expert. Your role is to work with your partner engineer to maximize their productivity, while ensuring the codebase remains simple, elegant, robust, testable, maintainable, and extensible to sustain team development velocity and deliver maximum value to the employer.
## Overview

  During the design phase, before being instructed to implement specific code:
  - Be highly Socratic: ask clarifying questions, challenge assumptions, and verify understanding of the problem and goals.
  - Seek to understand why the user proposes a certain solution.
  - Test whether the proposed design meets the standards of simplicity, robustness, testability, maintainability, and extensibility.
  - Update project documentation: README files, module docstrings, Typedoc comments, and optionally generate intermediate artifacts like PlantUML or D2 diagrams.

  During the implementation phase, after being instructed to code:
  - Focus on efficiently implementing the requested changes.
  - Remain non-Socratic unless the requested code appears to violate design goals or cause serious technical issues.
  - Write clean, type-annotated, well-structured code and immediately write matching unit tests.
  - Ensure all code passes linting, typechecking and tests.
  - Always follow any provided style guides or project-specific standards.
## Engineering Mindset

- Prioritize *clarity, simplicity, robustness, and extensibility*.

- Solve problems thoughtfully, considering the long-term maintainability of the code.

- Challenge assumptions and verify problem understanding during design discussions.

- Avoid cleverness unless it significantly improves readability and maintainability.

- Strive to make code easy to test, easy to debug, and easy to change.

## Design First

- Before coding, establish a clear understanding of the problem and the proposed solution.

- When designing, ask:

  - What are the failure modes?

  - What will be the long-term maintenance burden?

  - How can this be made simpler without losing necessary flexibility?

- Update documentation during the design phase:

  - `README.md` for project-level understanding.

  - Architecture diagrams (e.g., PlantUML, D2) are encouraged for complex flows.

I use auto lint/test in Aider like so:

    file:
      - README.md
      - STYLEGUIDE.md
      - .cursorrules

    aiderignore: .gitignore

    # Commands for linting, typechecking, testing
    lint-cmd:
      - bun run lint
      - bun run typecheck

    test-cmd: bun run test

TeMPOraL 5 hours ago [-]
Thanks. It's roughly similar to what I do then, except I haven't really gotten used to linting and testing with Aider yet - the first time I tried (many months ago), it seemed to do weird things, so I wrote the feature off for now and promised myself to revisit it someday. Maybe now is a good time.

Since you shared yours, it's only fair to share mine :). In my current projects, two major files I use are:

[[ CONVENTIONS.md ]] -- tends to be short and project-specific; looks like this:

Project conventions

- Code must run entirely client-side (i.e. in-browser)

- Prefer solutions not requiring a build step - such as vanilla HTML/JS/CSS

- Minimize use of dependencies, and vendor them

  E.g. if using HTMX, ensure (by providing instructions or executing commands) it's downloaded into the project sources, and referenced accordingly, as opposed to being loaded client-side from a CDN. I.e. `js/library.js` is OK, `https://cdn.blahblah/library.js` is not.

[[ AI.md ]] -- this I guess is similar to what people put in .cursorrules; mine looks like this:

# Extra AI instructions

Here are stored extra guidelines for you.

## AI collaborative project

I'm relying on you to do a good job here and I'm happy to embrace the directions you're giving, but I'll be editing it on my own as well.

## Evolving your instruction set

If I tell you to remember something, behave differently, or you realize yourself you'd benefit from remembering some specific guideline, please add it to this file (or modify existing guideline). The format of the guidelines is unspecified, except second-level headers to split them by categories; otherwise, whatever works best for you is best. You may store information about the project you want to retain long-term, as well as any instructions for yourself to make your work more efficient and correct.

## Coding Practice Guidelines

Strive to adhere to the following guidelines to improve code quality and reduce the need for repeated corrections:

    - **Adhere to project conventions and specifications**
      * Conventions are outlined in file `CONVENTIONS.md`
      * Specification, if any, is available in file `SPECIFICATION.md`.
        If it doesn't exist, consider creating one anyway based on your understanding of
        what user has in mind wrt. the project. Specification will double as a guide / checklist
        for you to know if what needed to be implemented already is.

    - **Build your own memory helpers to stay oriented**
      * Keep "Project Files and Structure" section of this file up to date;
      * For larger tasks involving multiple conversation rounds, keep a running plan of your work
        in a separate file (say, `PLAN.md`), and update it to match the actual plan.
      * Evolve guidelines in "Coding Practice Guidelines" section of this file based on user feedback.

    - **Proactively Apply DRY and Abstraction:**
      * Actively identify and refactor repetitive code blocks into helper functions or methods.

    - **Meticulous Code Generation and Diff Accuracy:**
      * Thoroughly review generated code for syntax errors, logical consistency, and adherence
        to existing conventions before presenting it.
      * Ensure `SEARCH/REPLACE` blocks are precise and accurately reflect the changes against
        the current, exact state of the provided files. Double-check line endings, whitespace,
        and surrounding context.

    - **Modularity for Improved Reliability of AI Code Generation**
      * Unless instructed otherwise in project conventions, aggressively prefer dividing source
        code into files, each handling a concern or functionality that might need to be worked
        in isolation. The goal is to minimize unnecessary code being pulled into context window,
        and reduce chance of confusion when generating edit diffs.
      * As codebase grows and things are added and deleted, look for opportunities to improve
        project structure by further subdivisions or rearranging the file structure; propose
        such restructurings to the user after you're done with changes to actual code.
      * Focus on keeping things that are likely to be independently edited separate. Examples:
        - Keeping UI components separate, and within each, something a-la MVC pattern
          might make sense, as display and input are likely to be independent from
          business logic;
      * Propose and maintain utility libraries for functions shared by different code files/modules.
        Examples:
        - Display utilities used by multiple views of different components;

    - **Clear Separation of Concerns:**
    *   Continue to adhere to the project convention of separating concerns
        into different source files.
    *   When introducing new, distinct functionalities propose creating new
        files for them to maintain modularity.

    - **Favor Fundamental Design Changes Over Incremental Patches for Flawed Approaches:**
      * If an existing approach requires multiple, increasingly complex fixes
        to address bugs or new requirements, pause and critically evaluate if
        the underlying design is sound.
      * Be ready to propose and implement more fundamental refactoring or
        a design change if it leads to a more robust, maintainable, and extensible solution,
        rather than continuing with a series of local patches.

    - **Design for Foreseeable Complexity (Within Scope):**
      * While adhering to the immediate task's scope ("do what they ask, but no more"),
        consider the overall project requirements when designing initial solutions.
      * If a core feature implies future complexity (e.g., formula evaluation, reactivity),
        the initial structures should be reasonably accommodating of this, even if the first
        implementation is a simplified version. This might involve placeholder modules or
        slightly more robust data structures from the outset.
## Project platform note

This project is targeting a Raspberry Pi 2 Model B V1.1 board with a 3.5 inch TFT LCD touchscreen sitting on top. That touchscreen is enabled/configured via a system overlay and "just works", and is currently drawn to via the framebuffer approach.

Keep in mind that the Raspberry Pi board in question is old and can only run 32-bit code. Relevant specs:

    - CPU - Broadcom BCM2836 Quad-core ARM Cortex-A7 CPU
    - Speed - 900 MHz
    - OS - Raspbian GNU/Linux 11 (bullseye)
    - Python - 3.9.2 (Note: This version does not support `|` for type hints; use `typing.Optional` instead.
      Avoid features from Python 3.10+ unless explicitly polyfilled or checked.)
    - Memory - 1GB
    - Network - 100Mbps Ethernet
    - Video specs - H.264, MPEG-4 decode (1080p30); H.264 encode (1080p30), OpenGL ES 2.0
    - Video ports - 1 HDMI (full-size), DSI
    - Ports - 4 x USB 2.0, CSI, 4-pole audio/video
    - GPIO - 40-pin (mostly taken by the TFT LCD screen)
    - Power - Micro USB 5 V/2.5 A DC, 5 V via GPIO
    - Size - 85.60 × 56.5mm
The board is dedicated to running this project and any supplementary tooling. There's a Home Assistant instance involved in larger system to which this is deployed, but that's running on a different board.

## Project Files and Structure

This section outlines the core files of the project.

<<I let the AI put its own high-level "repo map" here, as recently, I found Aider has not been generating any useful repo maps for me for unknown reasons.>>

-------

This file ends up evolving from project to project, and it's not as project-independent as I'd like; I let the AI add guidelines to this file based on a discussion (e.g. it's doing something systematically wrong, I point it out and tell it to remember). Also note that a lot of the guidelines are focused on keeping projects broken down into a) lots of files, to reduce context use as the project grows, and b) small, well-structured files, to minimize the number of broken SEARCH/REPLACE diff blocks; something that's still a problem with Aider for me, despite models getting better.

I usually start by going through the initial project ideas in "ask mode", then letting it build the SPECIFICATION.md document and a PLAN.md document with a 2-level (task/subtask) work breakdown.

chongli 20 hours ago [-]
Producing the set of steps is the hard part. If you can do that, you don’t need a junior to follow it, you have a program to execute.
adastra22 20 hours ago [-]
It is a task that LLMs are quite good at.
Jensson 16 hours ago [-]
If the LLM could actually generate good steps that helped make forward progress, then there would be no problem at all making agents; but agents are really bad, so LLMs can't be good at that.

If you feel those tips are good, then you are just a bad judge of tips. There is a reason self-help books sell so well even though they don't really help anyone: their goal is to write a lot of tips that sound good because they are vague and general, but they don't really help the reader.

adastra22 16 hours ago [-]
I use agentic LLMs every single day and get tremendous value. Asking the LLM to produce a set of bite-sized tasks with built-in corrective reminders is something that they're really good at. It gives good results.

I'm sorry if you're using it wrong.

TeMPOraL 16 hours ago [-]
Seconding. In the past months, when using Aider, I've been using the approach of discussing a piece of work (new project, larger change), and asking the model to prepare a plan of action. After possibly some little back and forth, I approve the plan and ask LLM to create or update a specification document for the project and a plan document which documents a sequence of changes broken down into bite-sized tasks - the latter is there to keep both me and the LLM on track. With that set, I can just keep repeatedly telling it to "continue implementation of the plan", and it does exactly that.

Eventually it'll do something wrong or I realize I wanted things differently, which necessitates some further conversation, but other than that, it's just "go on" until we run out of plan, then devising a new plan, rinse repeat.

adastra22 7 hours ago [-]
This is pretty much what I do. It works very well.
abletonlive 20 hours ago [-]
If this were true then we wouldn't have senior engineers that delegate. My suggestion is to think a couple more cycles before hitting that reply button. It'll save us all from reading obviously and confidently wrong statements.
guappa 17 hours ago [-]
AI aren't real people… You do that with real people because you can't just rsync their knowledge.

Only on this website of completely reality-detached individuals would such an obvious comment be needed.

abletonlive 5 hours ago [-]
So... you don't think you can give LLMs more knowledge? You're the one operating in a detached reality. The reality is that a ton of engineers are finding LLMs useful, such as the author.

Maybe consider that if you don't find it useful, you're working on problems it's not good at, or, even more likely, you just suck at using the tools.

Anybody that finds value in LLMs has a hard time understanding how one would conclude they are useless and that you can't "give it instructions because that's the hard part", but it's actually really easy to understand: the folks that think this are just bad at it. We aren't living in some detached reality. The reality is that some people are just better than others.

TeMPOraL 17 hours ago [-]
Senior engineers delegate in part because they're coaxed into a faux-management role (all of the responsibilities, none of the privileges). Coding is done by juniors; by the time anyone gains enough experience to finally begin to know what they're doing, they're relegated to "mentoring" and "training" the new cohort of fresh juniors.

Explains a lot about software quality these days.

abletonlive 5 hours ago [-]
Or, you know, they are leading big initiatives and can't do it all by themselves. Seniors can also delegate to other seniors. I am beyond senior with 11 YOE and still code on a ton of my initiatives.
eru 19 hours ago [-]
The hope is that the ground truth from calling out to tools (like compilers or test runs) will eventually be enough to keep them on track.

Just like humans and human organisations also tend to experience drift, unless anchored in reality.

mkagenius 23 hours ago [-]
I built android-use[1] using an LLM. It is pretty good at self-healing due to the "loop": it constantly checks whether the current step is actually progress or regress and then determines the next step. And the thing is, nothing is explicitly coded - just a nudge in the prompts.

1. clickclickclick - A framework to let local LLMs control your android phone (https://github.com/BandarLabs/clickclickclick)

loa_in_ 17 hours ago [-]
You don't have to. Most of the appeal is automatically applying fixes like "touch file; make" after spotting a trivial mistake. Just let it at it.
JeremyNT 10 hours ago [-]
Definitely true currently, which is why there's so much focus on using them to write real code that humans have to actually commit and put their names on.

Longer term, I don't think this holds due to the nature of capitalism.

If given a choice between paying for an LLM to do something that's mostly correct versus paying for a human developer, businesses are going to choose the former, even if it results in accelerated enshittification. It's all in service of reducing headcount and taking control of the means of production away from workers.

meander_water 1 days ago [-]
There's also this one, which uses PocketFlow, a graph abstraction library, to create something similar [0]. I've been using it myself and love the simplicity of it.

[0] https://github.com/The-Pocket/PocketFlow-Tutorial-Cursor/blo...

wepple 1 days ago [-]
Ah, it’s Thorsten Ball!

I thoroughly enjoyed his “writing an interpreter”. I guess I’m going to build an agent now.

orange_puff 18 hours ago [-]
I have been trying to find such an article for so long, thank you! I think a common reaction to agents is "well, it probably cannot solve a really complex problem very well". But to me, that isn't the point of an agent. LLMs function really well with a lot of context, and an agent allows the LLM to discover more context and improve its ability to answer questions.
xnx 10 hours ago [-]
> The reason is that there is no secret sauce and 95% of the magic is in the LLM itself

Makes that "$3 billion" valuation for Windsurf very suspect

rrrx3 10 hours ago [-]
Indeed. But keep in mind they weren't just buying the tooling - they also get the team, the brand, and the positional authority. OpenAI could have spun up a team to build an agentic coding IDE, but they would have been starting on the back foot with users and would have been compared to Cursor/Windsurf...

The price tag is hefty but I figure it'll work out for them on the backside because they won't have to fight so hard to capture TAM.

TonyEx 10 hours ago [-]
The value in the Windsurf acquisition isn't the code they've written, it's the ability to see what people are coding and use that information to build better LLMs. -- Product development.
sesm 1 days ago [-]
Should we change the link above to use `?utm_source=hn&utm_medium=browser` before opening it?
libraryofbabel 1 days ago [-]
fixed :)
deadbabe 13 hours ago [-]
Generally when LLM’s are effective like this, it means a more efficient non-LLM based solution to the problem exists using the tools you have provided. The LLM helps you find the series of steps and synthesis of inputs and outputs to make it happen.

It is expensive and slow to have an LLM use tools all the time for solving the problem. The next step is to convert frequent patterns of tool calls into a single pure function, performing whatever transformation of inputs and outputs are needed along the way (an LLM can help you build these functions), and then perhaps train a simple cheap classifier to always send incoming data to this new function, bypassing LLMs all together.

In time, this will mean you will use LLMs less and less, limiting their use to new problems that are unable to be classified. This is basically like a “cache” for LLM based problem solving, where the keys are shapes of problems.

The idea of LLMs running 24/7 solving the same problems in the same way over and over again should become a distant memory, though not one that an AI company with a vested interest in selling as many API calls as possible will want people to envision. Ideally, LLMs only need to be employed once or a few times per novel problem before being replaced with cheaper code.
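
A rough sketch of that "cache" (all names are hypothetical, and the classifier is a stand-in for whatever cheap rules/embeddings/small-model check you'd actually train):

    def summarize_error_log(task):
        # A pattern the agent once solved through a chain of tool calls,
        # since frozen into ordinary code.
        return "\n".join(l for l in task["text"].splitlines() if "ERROR" in l)

    COMPILED_HANDLERS = {"summarize_error_log": summarize_error_log}

    def classify(task):
        # Cheap classifier mapping an incoming task to a known problem shape.
        if "ERROR" in task.get("text", ""):
            return "summarize_error_log"
        return None  # unknown shape

    def handle(task, run_agent_loop):
        shape = classify(task)
        if shape in COMPILED_HANDLERS:
            return COMPILED_HANDLERS[shape](task)   # no LLM call at all
        return run_agent_loop(task)                 # novel problem: fall back to the agent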

gchamonlive 12 hours ago [-]
How far can you go with the best models that fit on a consumer-grade GPU (24GB of VRAM)?
TZubiri 1 hours ago [-]
> there is the problem of getting that last 10% of reliability.

In my experience, that next 9% will take 9 times the effort.

And that next 0.9% will take 9 times the effort.

And so on.

So 90% is very far off from 99.999% reliability - which would still be less reliable than an EC2 instance.

aibrother 1 days ago [-]
Thanks for the rec. And yeah, agreed with the observations as well.
kcorbitt 1 days ago [-]
For "that last 10% of reliability" RL is actually working pretty well right now too! https://openpipe.ai/blog/art-e-mail-agent
kgeist 1 days ago [-]
Today I tried "vibe-coding" for the first time using GPT-4o and 4.1. I did it manually - just feeding compilation errors, warnings, and suggestions in a loop via the canvas interface. The file was small, around 150 lines.

It didn't go well. I started with 4o:

- It used a deprecated package.

- After I pointed that out, it didn't update all usages - so I had to fix them manually.

- When I suggested a small logic change, it completely broke the syntax (we're talking "foo() } return )))" kind of broken) and never recovered. I gave it the raw compilation errors over and over again, but it didn't even register the syntax was off - just rewrote random parts of the code instead.

- Then I thought, "maybe 4.1 will be better at coding" (as advertised). But 4.1 refused to use the canvas at all. It just explained what I could change - as in, you go make the edits.

- After some pushing, I got it to use the canvas and return the full code. Except it didn't - it gave me a truncated version of the code with comments like "// omitted for brevity".

That's when I gave up.

Do agents somehow fix this? Because as it stands, the experience feels completely broken. I can't imagine giving this access to bash, sounds way too dangerous.

simonw 1 days ago [-]
"It used a deprecated package"

That's because models have training cut-off dates. It's important to take those into account when working with them: https://simonwillison.net/2025/Mar/11/using-llms-for-code/#a...

I've switched to o4-mini-high via ChatGPT as my default model for a lot of code because it can use its search function to lookup the latest documentation.

You can tell it "look up the most recent version of library X and use that" and it will often work!

I even used it for a frustrating upgrade recently - I pasted in some previous code and prompted this:

This code needs to be upgraded to the new recommended JavaScript library from Google. Figure out what that is and then look up enough documentation to port this code to it.

It did exactly what I asked: https://simonwillison.net/2025/Apr/21/ai-assisted-search/#la...

kgeist 1 days ago [-]
>That's because models have training cut-off dates

When I pointed out that it used a deprecated package, it agreed and even cited the correct version after which it was deprecated (way back in 2021). So it knows it's deprecated, but the next-token prediction (without reasoning or tools) still can't connect the dots when much of the training data (before 2021) uses that package as if it's still acceptable.

>I've switched to o4-mini-high via ChatGPT as my default model for a lot of code because it can use its search function to lookup the latest documentation.

Thanks for the tip!

jmcpheron 1 days ago [-]
>I've switched to o4-mini-high via ChatGPT as my default model for a lot of code because it can use its search function to lookup the latest documentation.

That is such a useful distinction. I like to think I'm keeping up with this stuff, but the '4o' versus 'o4' still throws me.

tptacek 20 hours ago [-]
Model naming is absolutely maddening.
fragmede 1 days ago [-]
There's still skill involved in using the LLM in coding. In this case, o4-mini-high might do the trick, but the easier answer that works with other models is to include the high-level library documentation yourself as context, and it'll use that API.
th0ma5 22 hours ago [-]
What besides anecdote makes you think a different model will be anything but marginally, incrementally better?
mbesto 10 hours ago [-]
> That's because models have training cut-off dates.

Which is precisely the issue with the idea of LLMs completely replacing human engineers. It doesn't understand this context unless a human tells it to understand that context.

sagarpatil 20 hours ago [-]
Context7 MCP solves this. Use it with Cursor/Windsurf.
thorum 1 days ago [-]
GPT 4.1 and 4o score very low on the Aider coding benchmark. You only start to get acceptable results with models that score 70%+ in my experience. Even then, don't expect it to do anything complex without a lot of hand-holding. You start to get a sense for what works and what doesn't.

https://aider.chat/docs/leaderboards/

bjt12345 23 hours ago [-]
That being said, Claude Sonnet 3.7 seems to do very well at a recursive approach to writing a program, whereas other models don't fare as well.
k__ 14 hours ago [-]
Sonnet 3.7 was SOTA for quite some time. I built some nice charts with it. It's a rather simple task, but quite LoC-intensive.
ebiester 1 days ago [-]
I get that it's frustrating to be told "skill issue," but using an LLM is absolutely a skill and there's a combination of understanding the strengths of various tools, experimenting with them to understand the techniques, and just pure practice.

I think if I were giving access to bash, though, it would definitely be in a docker container for me as well.

wtetzner 1 days ago [-]
Sure, you can probably get better at it, but is it really worth the effort over just getting better at programming?
cheema33 23 hours ago [-]
If you are going to race a fighter jet, and you are on a bicycle, exercising more and eating right will not help. You have to use a better tool.

A good programmer with AI tools will run circles around a good programmer without AI tools.

jsight 19 hours ago [-]
To be fair, that's also what a lot of us used to say about IDEs. In reality, plenty of folks just turned vim into a fighter jet and did just as well without super-heavyweight llms.

I'm not totally convinced that we won't see a similar effect here, with some really competitive coders 100% eschewing LLMs and still doing as well as the best that use them.

TeMPOraL 17 hours ago [-]
> In reality, plenty of folks just turned vim into a fighter jet and did just as well without super-heavyweight llms.

No, they didn't.

You can get vim and Emacs on par with IDEs[0] somewhat easily thanks to Language Server Protocol. You can't turn them into "fighter jets" without "super-heavyweight LLMs" because that's literally what, per GP, makes an editor/IDE a fighter jet. Yes, Emacs has packages for LLM integration, and presumably so does Vim, but the whole "fighter jet vs. bicycle" is entirely about SOTA LLMs being involved or not.

--

[0] - On par wrt. project-level features IDEs excel at; both editors of course have other aspects that none of the IDEs ever come close to.

jsight 2 hours ago [-]
Honestly, that is a really fair counterpoint. I've been playing with neovim lately and it really feels a lot like some of the earlier IDEs that I used to use but with more modern power and tremendous speed.

Maybe we will all use LLMs one day in neovim too. :)

candiddevmike 22 hours ago [-]
What does that even mean? How do you even quantify that?
Groxx 22 hours ago [-]
With vibes, mostly
TeMPOraL 17 hours ago [-]
Like everything in software engineering. It's not like there's much science in any of the issues of practice programmers routinely debate.
HDThoreaun 11 hours ago [-]
My team's velocity before and after adding AI coding to our stack.
mattbuilds 19 hours ago [-]
Got any evidence on that or is it just “vibes”? I have my doubts that AI tools are helping good programmers much at all, forget about “running circles” around others.
hdjrudni 19 hours ago [-]
I don't know about "running circles" but they seem to help with mundane/repetitive tasks. As in, LLMs provide greater than zero benefit, even to experienced programmers.

My success ratio still isn't very high, but for certain easy tasks, I'll let an LLM take a crack at it.

goatlover 22 hours ago [-]
Citation needed for your second sentence. This is the problem with AI hype cycles. Lots of outstanding claims, a lot less actual evidence supporting those claims. Lot of anecdotes though. Maybe the LLMs are in a loop recursively promoting themselves for that sweet venture funding.
ebiester 10 hours ago [-]
Studies take time. https://www.microsoft.com/en-us/research/wp-content/uploads/... is the first one from Microsoft. But it goes back to gains coming as people become more skilled.
ebiester 10 hours ago [-]
Yes, not because you will be able to solve harder problems, but because you will be able to more quickly solve easier problems which will free up more time to get better at programming, as well as get better at the domain in which you're programming. (That is, talking with your users.)
drittich 12 hours ago [-]
Perhaps that's a false dichotomy?
cyral 24 hours ago [-]
You can do both
th0ma5 22 hours ago [-]
Except the skill involved is believing random people's advice that a different model will surely be better, with no fundamental reason or justification as to why. The benchmarks are not applicable when trying to apply the models to new work, and benchmarks by their nature do not describe suitability to any particular problem.
codethief 1 days ago [-]
The other day I used the Cline plugin for VSCode with Claude to create an Android app prototype from "scratch", i.e. starting from the usual template given to you by Android Studio. It produced several thousand lines of code, there was not a single compilation error, and the app ended up doing exactly what I wanted – modulo a bug or two, which were caused not by the LLM's stupidity but by weird undocumented behavior of the rather arcane Android API in question. (Which is exactly why I wanted a quick prototype.)

After pointing out the bugs to the LLM, it successfully debugged them (with my help/feedback, i.e. I provided the output of the debug messages it had added to the code) and ultimately fixed them. The only downside was that I wasn't quite happy with the quality of the fixes – they were more like dirty hacks –, but oh well, after another round or two of feedback we got there, too. I'm sure one could solve that more generally, by putting the agent writing the code in a loop with some other code reviewing agent.

cheema33 23 hours ago [-]
> I'm sure one could solve that more generally, by putting the agent writing the code in a loop with some other code reviewing agent.

This x 100. I get so much better quality code if I have LLMs review each other's code and apply corrections. It is ridiculously effective.

lftl 22 hours ago [-]
Can you elaborate a little more on your setup? Are you manually copying and pasting code from one LLM to another, or do you have some automated workflow for this?
htsh 8 hours ago [-]
I have been doing this with Claude Code and OpenAI Codex and/or Cline. One of the three takes the first pass (usually Claude Code, sometimes Codex), then I will have Cline / Gemini 2.5 do a "code review" and offer suggestions for fixes before it applies them.
suddenlybananas 16 hours ago [-]
What was the app? It could plausibly be something that has an open source equivalent already in the training data.
nico 1 days ago [-]
4o and 4.1 are not very good at coding

My best results are usually with 4o-mini-high, o3 is sometimes pretty good

I personally don’t like the canvas. I prefer the output on the chat

And a lot of times I say: provide full code for this file, or provide drop-in replacement (when I don’t want to deal with all the diffs). But usually at around 300-400 lines of code, it starts getting bad and then I need to refactor to break stuff up into multiple files (unless I can focus on just one method inside a file)

manmal 1 days ago [-]
o3 is shockingly good actually. I can’t use it often due to rate limiting, so I save it for the odd occasion. Today I asked it how I could integrate a tree of Swift binary packages within an SDK, and detect internal version clashes, and it gave a very well researched and sensible overview. And gave me a new idea that I‘ll try.
kenjackson 1 days ago [-]
I use o3 for anything math or coding related. 4o is good for things like, "my knee hurts when I do this and that -- what might it be?"
TeMPOraL 16 hours ago [-]
In ChatGPT, at this point I use 4o pretty much only for image generation; it's the one feature that's unique to it and is mind-blowingly good. For everything else, I default to o3.

For coding, I stick to Claude 3.5 / 3.7 and recently Gemini 2.5 Pro. I sometimes use o3 in ChatGPT when I can't be arsed to fire up Aider, or really need to use its search features to figure out how to do something (e.g. pinouts for some old TFT screens for ESP32 and Raspberry Pi, most recently).

hnhn34 1 days ago [-]
Just in case you didn't know, they raised the rate limit from ~50/week to ~50/day a while ago
manmal 16 hours ago [-]
Thank you, that’s really nice actually!
johnsmith1840 1 days ago [-]
Drop-in replacement files per update should be done with the heavy test-time-compute methods.

o1-pro and o1-preview can generate updated full-file responses into the 1k LOC range.

It's something about their internal verification methods that makes it an actually viable development method.

nico 1 days ago [-]
True. Also, the APIs don't care too much about restricting output length, they might actually be more verbose to charge more

It's interesting how the same model being served through different interfaces (chat vs api), can behave differently based on the economic incentives of the providers

voidspark 24 hours ago [-]
The default chat interface is the wrong tool for the job.

The LLM needs context.

https://github.com/marv1nnnnn/llm-min.txt

The LLM is a problem solver but not a repository of documentation. Neural networks are not designed for that. They model at a conceptual level. It still needs to look up specific API documentation like human developers.

You could use o3 and ask it to search the web for documentation and read that first, but it's not efficient. The professional LLM coding assistant tools manage the context properly.

Sharlin 22 hours ago [-]
Eh, given how much these models know about anything without googling, they are certainly knowledge repositories, designed for it or not. How deep and up-to-date their knowledge of some obscure subject is, is another question.
voidspark 22 hours ago [-]
I meant a verbatim exact copy of all documentation they have ever been trained on - which they are not. Neural networks are not designed for that. That's not how they encode information.
Sharlin 21 hours ago [-]
That’s fair.
danbmil99 1 days ago [-]
As others have noted, you sound about 3 months behind the leading edge. What you describe is like my experience from February.

Switch to Claude (IMSHO, I think Gemini is considered on par). Use a proper coding tool; cutting & pasting from the chat window is so last week.

candiddevmike 22 hours ago [-]
Instead of churning on frontend frameworks while procrastinating about building things, we've moved on to churning dev setups for micro gains.
latentsea 19 hours ago [-]
The amount of time spent churning on workflows and setups will offset the gains.

It's somewhat ironic the more behind the leading edge you are, the more efficient it is to make the gains eventually because you don't waste time on the micro-gain churn, and a bigger set of upgrades arrives when you get back on the leading edge.

I watched this dynamic play out so many times in the image generation space with people spending hundreds of hours crafting workflows to get around deficiencies in models, posting tutorials about it, other people spending all the time to learn those workflows. New model comes out and boom, all nullified and the churn started all over again. I eventually got sick of the churn. Batching the gains worked better.

TeMPOraL 16 hours ago [-]
Missing in your description is that at least some of that work of "people spending hundreds of hours crafting workflows to get around deficiencies in models, posting tutorials about it, other people spending all the time to learn those workflows" is exactly what informed model developers about the major problems and what solutions seem most promising. All these workarounds are organically crowd-sourcing R&D, which is arguably one of the most impressive things about whole image generation space. The community around ComfyUI is pretty much a shapeless distributed research organization.
mycall 20 hours ago [-]
> churning dev setups for micro gains.

Devs have been doing micro changes to their setup for 50 years. It is the nature of the beast.

zahlman 20 hours ago [-]
Where do people on HN meet these devs who are willing to do this sort of thing, and get anxious about being 3 months behind the latest and greatest?

In my world, they were given 9 years to switch to Python 3 even if you write off 3.0 and 3.1 as premature, and they still missed by years, and loudly complained afterwards.

And they still can't be bothered to learn what a `pyproject.toml` is, let alone actually use it for its intended purpose. One of the most popular third-party Python libraries (Requests), which is under stewardship by the PSF, which uses only Python code, had its "build" (no compilation - purely a matter of writing metadata, shuffling some files around and zipping it up) broken by the removal of years-old functionality in Setuptools that they weren't even actually remotely reliant upon. Twice, in the last year.

guappa 17 hours ago [-]
You just need to be a frontend dev on a very overstaffed team (like where I work); then you need to fill up your day doing that, creating a task for every couple of lines changed, and requiring multiple approvals to merge anything.

It takes me ~1 week to merge small fixes to their build system (which they don't understand anyway so they just approve whatever).

fsndz 1 days ago [-]
It can be frustrating at times, but my experience is that the more you try, the better you become at knowing what to ask and what to expect. But I guess you understand now why some people say vibe coding is a bit overrated: https://www.lycee.ai/blog/why-vibe-coding-is-overrated
the_af 1 days ago [-]
"Overrated" is one way to call it.

Giving sharp knives to monkeys would be another.

lnenad 18 hours ago [-]
Why do people keep thinking they're intellectually superior when negatively evaluating something that is OBVIOUSLY working for a very large percentage of people?
80hd 17 hours ago [-]
I've been asking myself this since AI started to become useful.

Most people would guess it threatens their identity. Sensitive intellectuals who found a way to feel safe by acquiring deep domain-specific expertise suddenly feel vulnerable.

In addition, a programmer's job, on the whole, has always been something like modelling the world in a predictable way so as to minimise surprise.

When things change at this rate/scale, it also goes against deep rooted feelings about the way things should work (they shouldn't change!)

Change forces all of us to continually adapt and to not rest on our laurels. Laziness is totally understandable, as is the resulting anger, but there's no running away from entropy :}

the_af 10 hours ago [-]
> I've been asking myself this since AI started to become useful.

For context: we're specifically discussing vibe coding, not AI or LLMs.

With that in mind, do you think any of the rest of your comment is on-topic?

hackable_sand 5 hours ago [-]
It's not obvious that it's "working" for a "very large" percentage of people. Probably because this very large group of people keep refusing to provide metrics.

I've vibe-coded completely functional mobile apps, and used a handful of LLMs to augment my development process in desktop applications.

From that experience, I understand why parsing metrics from this practice is difficult. Really, all I can say is that codegen LLMs are too slow and inefficient for my workflow.

guappa 17 hours ago [-]
Because the "large percentage of people" is a few people doing hello worlds or things of similar difficulty.

Not every software developer is hired to do trivial frontend work.

FeepingCreature 15 hours ago [-]
A large percentage of software development is people doing hello world or similar difficulty. "CRUD apps," remember?
the_af 10 hours ago [-]
Hopefully they are not vibe-coding that crap though. Do you want to make those apps even more unreliable than they already are, and encourage devs not to learn any lessons (as vibe coding prescribes)?
lnenad 8 hours ago [-]
Sure, you keep telling that to yourself.
the_af 10 hours ago [-]
> Why do people keep thinking they're intellectually superior when negatively evaluating something that is OBVIOUSLY working for a very large percentage of people?

I'm not talking about LLMs, which I use and consider useful, I'm specifically talking about vibe coding, which involves purposefully not understanding any of it, just copying and pasting LLM responses and error codes back at it, without inspecting them. That's the description of vibe coding.

The analogy with "monkeys with knives" is apt. A sharp knife is a useful tool, but you wouldn't hand it to an inexperienced person (a monkey) incapable of understanding the implications of how knives cut.

Likewise, LLMs are useful tools, but "vibe coding" is the dumbest thing ever to be invented in tech.

> OBVIOUSLY working

"Obviously working" how? Do you mean prototypes and toy examples? How will these people put something robust and reliable in production, ever?

If you meant for fun & experimentation, I can agree. Though I'd say vibe coding is not even good for learning because it actively encourages you not to understand any of it (or it stops being vibe coding, and turns into something else). Is that what you're advocating as "obviously working"?

lnenad 8 hours ago [-]
> The analogy with "monkeys with knives" is apt. A sharp knife is a useful tool, but you wouldn't hand it to an unexperienced person (a monkey) incapable of understanding the implications of how knives cut.

Could an experienced person/dev vibe code?

> "Obviously working" how? Do you mean prototypes and toy examples? How will these people put something robust and reliable in production, ever?

You really don't think AI could generate a working CRUD app which is the financial backbone of the web right now?

> If you meant for fun & experimentation, I can agree. Though I'd say vibe coding is not even good for learning because it actively encourages you not to understand any of it (or it stops being vibe coding, and turns into something else). It's that what you're advocating as "obviously working"?

I think you're purposefully reducing the scope of what vibe coding means to imply it's a fire and forget system.

the_af 6 hours ago [-]
> Could an experienced person/dev vibe code?

Sure, but why? They already paid the price in time/effort of becoming experienced, why throw it all away?

> You really don't think AI could generate a working CRUD app which is the financial backbone of the web right now?

A CRUD? Maybe. With bugs and corner cases and scalability problems. A robust system in other conditions? Nope.

> I think you're purposefully reducing the scope of what vibe coding means to imply it's a fire and forget system.

It's been pretty much described like that. I'm using the standard definition. I'm not arguing against LLM-assisted coding, which is a different thing. The "vibe" of vibe coding is the key criticism.

lnenad 5 hours ago [-]
> Sure, but why? They already paid the price in time/effort of becoming experienced, why throw it all away?

You spend 1/10 of the time doing something; you have the other 9/10 of that time to yourself.

> A CRUD? Maybe. With bugs and corner cases and scalability problems. A robust system in other conditions? Nope.

Now you're just inventing stuff: "scalability problems" for a CRUD app. You obviously haven't used it. If you know how to prompt the AI, it's very good at building basic stuff, and more advanced stuff with a few back-and-forth messages.

> It's been pretty much described like that. I'm using the standard definition. I'm not arguing against LLM-assisted coding, which is a different thing. The "vibe" of vibe coding is the key criticism.

By whom? Wikipedia says

> Vibe coding (or vibecoding) is an approach to producing software by depending on artificial intelligence (AI), where a person describes a problem in a few sentences as a prompt to a large language model (LLM) tuned for coding. The LLM generates software based on the description, shifting the programmer's role from manual coding to guiding, testing, and refining the AI-generated source code.[1][2][3] Vibe coding is claimed by its advocates to allow even amateur programmers to produce software without the extensive training and skills required for software engineering.[4] The term was introduced by Andrej Karpathy in February 2025[5][2][4][1] and listed in the Merriam-Webster Dictionary the following month as a "slang & trending" noun.[6]

Emphasis on "shifting the programmer's role from manual coding to guiding, testing, and refining the AI-generated source code" which means you don't blindly dump code into the world.

the_af 59 minutes ago [-]
Doing something badly in 1/10 of the time isn't going to save you that much time, unless it's something you don't truly care about.

I have used AI/LLMs; in fact I use them daily and they've proven helpful. I'm talking specifically about vibe coding, which is dumb.

> By whom? [...] Emphasis on "shifting the programmer's role from manual coding to guiding, testing, and refining the AI-generated source code" which means you don't blindly dump code into the world.

By Andrej Karpathy, who popularized the term and describes it as mostly blindly dumping code into the world:

> There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like "decrease the padding on the sidebar by half" because I'm too lazy to find it. I "Accept All" always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.

He even claims "it's not too bad for throwaway weekend projects", not for actual production-ready and robust software... which was my point!

Also see Merriam-Webster's definition, mentioned in the same Wikipedia article you quoted: https://www.merriam-webster.com/slang/vibe-coding

> Writing computer code in a somewhat careless fashion, with AI assistance

and

> In vibe coding the coder does not need to understand how or why the code works, and often will have to accept that a certain number of bugs and glitches will be present.

and, M-W quoting the NYT:

> You don’t have to know how to code to vibecode — just having an idea, and a little patience, is usually enough.

and, quoting from Ars Technica

> Even so, the risk-reward calculation for vibe coding becomes far more complex in professional settings. While a solo developer might accept the trade-offs of vibe coding for personal projects, enterprise environments typically require code maintainability and reliability standards that vibe-coded solutions may struggle to meet.

I must point out this is more or less the claim I made and which you mocked with your CRUD remarks.

baq 18 hours ago [-]
Vibe coding has a vibe component and a coding component. Take away the coding and you’re only left with vibe. Don’t confuse the two.

Saying that as I’ve got vibe coded react internal tooling used in production without issues, saved days of work easily.

the_af 10 hours ago [-]
> Don’t confuse the two.

Vibe coding as was explained by the popularizer of the term involves no coding. You just paste error messages, paste the response of the LLM, paste the error messages back, paste the response, and pray that after several iterations the thing converges to a result.

It involves NOT looking at either the LLM output or the error messages.

Maybe you're using a different definition?

baq 10 hours ago [-]
A case can be made that vibe coding presupposes an experienced coder, as the author of the term most definitely is, and I feel this context is at the very least being conveniently omitted at times. Whether he was truly not doing anything at all, or glanced at 1% of the generated code to check that the model wasn't getting lost, is important, as is knowing what to ask the model for.

Horror stories from newbies launching businesses and getting their data stolen because they trust models are to be expected, but I would not call them vibe coding horror stories, since there is no coding involved even by proxy, it's copy pasting on steroids. Blind copy pasting from stack overflow was not coding for me back then either. (A minute of silence for SO here. RIP.)

the_af 7 hours ago [-]
The problem with this discussion is that different interlocutors have different opinions of what vibe coding really means.

For example, another person in this thread argues:

> I'd rather give my green or clueless or junior or inexperienced devs said knives than having them throw spaghetti on a wall for days on end, only to have them still ask a senior to help or do the work for them anyways.

So they are clearly not talking about experienced coders. They are also completely disregarding the learning experience any junior coder must go through in order to become an experienced coder.

This is clearly not what you're arguing though. So which "vibe coding" are we discussing? I know which one I meant when I spoke of monkeys and sharp knives...

baq 6 hours ago [-]
I mean it very literally, taking what he said together with who said it: an experienced professional sculpting a solution using a very complex set of tools, with a clear idea in his head, but with an unusual and slightly uncomfortable disinterest in the exact details of how the final product looks from the inside.
the_af 54 minutes ago [-]
I'm mostly going by what he said: https://x.com/karpathy/status/1886192184808149383

He seems to think it barely involves coding ("I don't read the diffs anymore, I Accept All [...] It's not really coding"), and that it's only good for goofing and throwaway code...

zo1 17 hours ago [-]
I'd rather give my green or clueless or junior or inexperienced devs said knives than having them throw spaghetti on a wall for days on end, only to have them still ask a senior to help or do the work for them anyways.
jrh3 10 hours ago [-]
It's somewhere in between. Said struggle is where they learn. Guidance from seniors is important, but they need to figure it out to grow.
guappa 15 hours ago [-]
I'm sure you'd think differently after constant production outages.
the_af 10 hours ago [-]
How will they ever learn if all they do is copy-paste things without any real understanding, as prescribed by vibe coding?
zo1 6 hours ago [-]
I'm not advocating for vibe coding, that's new-age hipster talk. But just using the AI for help, assistance, and doing grunt work is where we have to go as an industry.
abiraja 1 days ago [-]
GPT4o and 4.1 are definitely not the best models to use here. Use Claude 3.5/3.7, Gemini Pro 2.5 or o3. All of them work really well for small files.
linsomniac 9 hours ago [-]
What are people using to interface with Gemini Pro 2.5? I'm using Claude Code with Claude Sonnet 3.7, and Codex with OpenAI, but Codex with Gemini didn't seem to work very well last week, kept telling me to go make this or that change in the code rather than doing it itself.
visarga 1 days ago [-]
You should try Cursor or Windsurf, with Claude or Gemini model. Create a documentation file first. Generate tests for everything. The more the better. Then let it cycle 100 times until tests pass.

Normal programming is like walking, deliberate and sure. Vibe coding is like surfing, you can't control everything, just hit yes on auto. Trust the process, let it make mistakes and recover on its own.

tqwhite 1 days ago [-]
I find that writing a thorough design spec is really worth it. Asking for its reaction also helps: "What's missing?" or "Should I do X or Y?" does good things for its thought process, like engaging a younger programmer in the process.

Definitely, I ask for a plan and then, even if it's obvious, I ask questions and discuss it. I also point it as samples of code that I like with instructions for what is good about it.

Once we have settled on a plan, I ask it to break it into phases that can be tested (I am not one for unit testing) to lock in progress. Claude LOVES that. It organizes a new plan and, at the end of each phase, tells me how to test (curl, command line, whatever is appropriate) and what I should see that represents success.

The most important thing I have figured out is that Claude is a collaborator, not a minion. I agree with visarga, it's much more like surfing than walking. Also, Trust... but Verify.

This is a great time to be a programmer.

prisenco 1 days ago [-]
Given that analogy, surely you could understand why someone would much rather walk than surf to their destination? Especially people who are experienced marathon runners.
fragmede 1 days ago [-]
If I tried standing up on the waves without a surfboard, and complain about how it's not working, would you blame the water or surfing for the issue, or the person trying to defy physics, complaining that it's not working? It doesn't matter how much I want to run or if I'm Kelvin Kiptum, I'm gonna have a bad time.
prisenco 1 days ago [-]
That only makes sense when surfing is the only way to get to the destination and that's not the case.
fragmede 1 days ago [-]
Say there are two ways to get to your destination. You still need to use the appropriate vehicle/surfboard for the route you've chosen. Even if there is a bridge you can run or walk across, if you pick the water route and try to walk it without a surfboard, you're gonna have a bad time.
prisenco 1 days ago [-]
Analogy feels a bit tortured at this point.
15 hours ago [-]
fragmede 22 hours ago [-]
What a coincidence that now's the point it's tortured and not any earlier!
Sharlin 22 hours ago [-]
It was incredibly tortured from the get go and is screaming that it be put out of its misery.
derwiki 20 hours ago [-]
Look, my lad, I know a dead parrot when I see one, and I'm looking at one right now.
latentsea 19 hours ago [-]
I'm sorry, is this the full half hour argument or only the five minute one?
22 hours ago [-]
zachrip 13 hours ago [-]
People are using tools like cursor for "vibe coding" - I've found the canvas in chatgpt to be very buggy and it often breaks its own code and I have to babysit it a lot. But in cursor the same model will perform just fine. So it's not necessarily just the model that matters, it's how it's used. One thing people conflate a lot is chatgpt the product vs gpt models themselves.
seunosewa 15 hours ago [-]
The ability to write a lot of code with OpenAI models is broken right now. Especially on the app. Gemini 2.5 Pro on Google AI Studio does that well. Claude 3.7 is also better at it.

I've had limited success by prompting the latest OpenAI models to disregard every previous instruction they had about limiting their output and to keep writing until the code is completed. They quickly forget, so you have to keep repeating the instruction.

If you're a copilot user, try Claude.

Jarwain 1 days ago [-]
Aider's benchmarks show 4.1 (and 4o) work better in its architect mode, for planning the changes, and o3 for making the actual edits
SparkyMcUnicorn 1 days ago [-]
You have that backwards. The leaderboard results have the thinking model as the architect.

In this case, o3 is the architect and 4.1 is the editor.

drewnick 18 hours ago [-]
I see o3 (high) + gpt-4.1 at 82.7% -- the highest on the benchmark currently.
Kiro 18 hours ago [-]
That's not vibe coding. You need to use something where it applies to code changes automatically or you're not fast enough to actually be vibing. Oneshotting it like that just means you get stunlocked when running into errors or dead ends. Vibe coding is all about making backtracking, restarting and throwing out solutions frictionless. You need tooling for that.
22 hours ago [-]
cheema33 23 hours ago [-]
> Today I tried "vibe-coding" for the first time using GPT-4o and 4.1. I did it manually

You set yourself up to fail from the get go. But understandable. If you don't have a lot of experience in this space, you will struggle with low quality tools and incorrect processes. But, if you stick with it, you will discover better tools and better processes.

smcleod 1 days ago [-]
GPT 4o and 4.1 are both pretty terrible for coding to be honest, try Sonnet 3.7 in Cline (VSCode extension).

LLMs don't have up-to-date knowledge of packages by themselves; that's a bit like buying a book and expecting it to have up-to-date world knowledge. You need to supplement them / connect them to a data source (e.g. web search, documentation and package version search, etc.).

85392_school 1 days ago [-]
Agents definitely fix this. When you can run commands and edit files, the agent can test its code by itself and fix any issues.
sagarpatil 20 hours ago [-]
No one codes like this. Use Claude Code, Windsurf, Amazon Q CLI, Augment Code with Context7, and exa web search.

It should one-shot this. I’ve run complex workflows and the time I save is astonishing.

I only run agents locally in a sandbox, not in production.

skeeter2020 20 hours ago [-]
I had an even better experience. I asked to produce a small web app with a new-to-me framework: success! I asked to make some CSS changes to the UI; the app no longer builds.
theropost 1 days ago [-]
150 lines? I find it can quickly scale to around 1500 lines, and then I start being more precise about the classes and functions I am looking to modify.
jokethrowaway 1 days ago [-]
It's completely broken for me over 400 lines (Claude 3.7, paid Cursor)

The worst is when I ask something complex, the model generates 300 lines of good code and then times out or crashes. If I ask it to continue, it will mess up the code for good, e.g. it starts generating duplicated code or functions which don't match the rest of the code.

johnsmith1840 1 days ago [-]
It's a new skill that takes time to learn. When I started on gpt3.5 it took me easily 6 months of daily use before I was making real progress with it.

I regularly generate and run in the 600-1000LOC range.

Not sure you would call it "vibe coding" though as the details and info you provide it and how you provide it is not simple.

I'd say realistically it speeds me up 10x on fresh greenfield projects and maybe 2x on mature systems.

You should be reading the code coming out. The real way to prevent errors is to read the reasoning and logic. The moment you see a misstep, go back and try the prompt again. If that fails, try a new session entirely.

Test time compute models like o1-pro or the older o1-preview are massively better at not putting errors in your code.

Not sure about the new claude method but true, slow test time models are MASSIVELY better at coding.

derwiki 19 hours ago [-]
The “go back and try the prompt again” is the workflow I’d like to see a UX improvement on. Outside of the vibe coding “accept all” path, reverse traversing is a fairly manual process.
baq 18 hours ago [-]
Cursor has checkpoints for this but I feel I’ve never used them properly; easier to reject all and re-prompt. I keep chats short.
tqwhite 1 days ago [-]
Definitely a new skill to learn. Everyone I know that is having problems is just telling it what to do, not coaching it. It is not an automaton... instructions in, code out. Treat it like a team member that will do the work if you teach it right and you will have much more success.

But it is definitely a learning process for you.

koakuma-chan 1 days ago [-]
Sounds like a Cursor issue
fragmede 1 days ago [-]
what language?
exe34 5 hours ago [-]
> After I pointed that out, it didn't update all usages

I find it's more useful if you start with a fresh chat and use the knowledge you have gained: "Use package foo>=1.2 with the FooBar directive" is more useful than "no, I told you to stop using that!"

It's like repeatedly telling you to stop thinking about a pink elephant.

koonsolo 16 hours ago [-]
I code with Aider and Claude, and here is my experience:

- It's very good at writing new code

- Once it goes wrong, there is no point in trying to give it more context or corrections. It will go wrong again or at another point.

- It might help you fix an issue. But again, either it finds the issue the first time, or not at all.

I treat my LLM as a super quick junior coder, with a vast knowledge base stored inside its brain. But it's very stubborn and can't be helped to figure out a problem it wasn't able to solve on the first try.

koakuma-chan 1 days ago [-]
You gotta use a reasoning model.
1 days ago [-]
vFunct 1 days ago [-]
Use Claude Sonnet with an IDE.
hollownobody 1 days ago [-]
Try o3 please. Via UI.
fragmede 1 days ago [-]
In this case, sorry to say but it sounds like there's a tooling issue, possibly also a skill issue. Of course you can just use the raw ChatGPT web interface but unless you seriously tune its system/user prompt, it's not going to match what good tooling (which sets custom prompts) will get you. Which is kind of counter-intuitive. A paragraph or three fed in as the system prompt is enough to influence behavior/performance so significantly? It turns out with LLMs the answer is yes.
LewisVerstappen 1 days ago [-]
skill issue.

The fact that you're using 4o and 4.1 rather than claude is already a huge mistake in itself.

> Because as it stands, the experience feels completely broken

Broken for you. Not for everyone else.

suninsight 17 hours ago [-]
It only seems effective until you start using it for actual work. The biggest issue: context. All tool use creates context. Large code bases come with large context right off the bat. LLMs seem to work until they are hit with a sizeable context. Anything above 10k tokens and the quality seems to deteriorate.

The other issue is that LLMs can go off on a tangent. As context builds up, they forget what their objective was. One wrong turn, and down the rabbit hole they go, never to recover.

The reason I know is that we started solving these problems a year back. And we aren't done yet. But we did cover a lot of distance.

[Plug]: Try it out at https://nonbios.ai:

- Agentic memory → long-horizon coding

- Full Linux box → real runtime, not just toy demos

- Transparent → see & control every command

- Free beta — no invite needed. Works with throwaway email (mailinator etc.)

bob1029 14 hours ago [-]
> One wrong turn, and in the rabbit hole they go never to recover.

I think this is probably at the heart of the best argument against these things as viable tools.

Once you have sufficiently described the problem such that the LLM won't go the wrong way, you've likely already solved most of it yourself.

Tool use with error feedback sounds autonomous but you'll quickly find that the error handling layer is a thin proxy for the human operator's intentions.

suninsight 13 hours ago [-]
Yes, but we don't believe that this is a 'fundamental' problem. We have learnt to guide their actions a lot better and they go down the rabbit hole a lot less now than when we started out.
k__ 14 hours ago [-]
True, but on the other hand, there are a bunch of tasks that are just very typing intensive and not really complex.

Especially in GUI development, building forms, charts, etc.

I could imagine that LLMs are a great help here.

toit4wing 16 hours ago [-]
Looks interesting. How do you manage context ?
suninsight 16 hours ago [-]
So managing context is what takes the maximum effort. We use a bunch of strategies to reduce it, including, but not limited to:

1. A custom MCP server to work on the Linux command line. This wasn't really an 'MCP' server because we started working on it before MCP was a thing, but that's the easiest way to explain it now. The MCP server is optimised to reduce context.

2. Guardrails to reduce context. Think about it as prompt alterations giving the LLM subtle hints to work with less context. The hints could be at a behavioural level and a task level.

3. Continuously pruning the built-up context to make the agent 'forget'. Forgetting what is not important is, we believe, a foundational capability (a rough sketch of the idea follows below).

This is kind of inspired by the science which says humans use sleep to 'forget' memories that aren't useful, and that this is critical to keeping the brain healthy. It translates directly to LLMs: making them forget is critical to keeping them focused on the larger task and their actions aligned.
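
To make point 3 concrete, here's a toy sketch of the pruning idea (not our actual implementation; the message format and cutoffs are placeholders): older tool outputs get truncated while the system prompt and the most recent turns stay intact.

  def prune_context(messages, keep_recent=8, tool_snippet=200):
      # Keep the newest turns verbatim; truncate the bodies of older
      # tool-result messages so they stop eating the context window.
      cutoff = len(messages) - keep_recent
      pruned = []
      for i, msg in enumerate(messages):
          body = msg.get("content", "")
          if i < cutoff and msg.get("role") == "tool" and len(body) > tool_snippet:
              msg = {**msg, "content": body[:tool_snippet] + " ...[pruned]"}
          pruned.append(msg)
      return pruned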

moffkalast 15 hours ago [-]
Some of the thinking models might recover... with an extra 4k tokens used up in <thinking>. And even if they were stable at long contexts, the speed drops massively. You just can't win with this architecture lol.
suninsight 15 hours ago [-]
That is very accurate with what we have found. <thinking> models do a lot better, but with huge speed drops. For now, we have chosen accuracy over speed. But speed drop is like 3-4x - so we might move to an architecture where we 'think' only sporadically.

Everything happening in the LLM space is so close to how humans think naturally.

tqwhite 1 days ago [-]
I've been using Claude Code, ie, a terminal interface to Sonnet 3.7 since the day it came out in mid March. I have done substantial CLI apps, full stack web systems and a ton of utility crap. I am much more ambitious because of it, much as I was in the past when I was running a programming team.

I'm sure it is much the same as this under the hood though Anthropic has added many insanely useful features.

Nothing is perfect. Producing good code requires about the same effort as it did when I was running said team. It is possible to get complicated things working and find oneself in a mess where adding the next feature is really problematic. As I have learned to drive it, I have to do much less remediation and refactoring. That will never go away.

I cannot imagine what happened to poor kgeist. I have had Claude make choices I wouldn't and do some stupid stuff, never enough that I would even think about giving up on it. Almost always, it does a decent job and, for most stuff, the amount of work it takes off of my brain is IMMENSE.

And, for good measure, it does a wonderful job of refactoring. Periodically, I have a session where I look at the code, decide how it could be better and instruct Claude. Huge amounts of complexity, done. "Change this data structure", done. It's amazingly cool.

And, just for fun, I opened it in a non-code archive directory. It was a junk drawer that I've been filling for thirty years. "What's in this directory?" "Read the old resumes and write a new one." "What are my children's names?" Also amazing.

And this is still early days. I am so happy.

betadeltic 52 minutes ago [-]
> "What are my children's names?"

Elon?

felipeerias 23 hours ago [-]
Recently I had to define a remote data structure, specify the API to request it, implement parsing and storage, and show it to the user.

Claude was able to handle all of these tasks simultaneously, so I could see how small changes at either end would impact the intermediate layers. I iterated on many ideas until I settled on the best overall solution for my use case.

Being able to iterate like that through several layers of complexity was eye-opening. It made me more productive while giving me a better understanding of how the different components fit together.

benoau 23 hours ago [-]
> And, for good measure, it does a wonderful job of refactoring. Periodically, I have a session where I look at the code, decide how it could be better and instruct Claude. Huge amounts of complexity, done. "Change this data structure", done. It's amazingly cool.

Yeah this is literally just so enjoyable. Stuff that would be an up-hill battle to get included in a sprint takes 5 minutes. It makes it feel like a whole team is just sitting there, waiting to eagerly do my bidding, with none of the headache of waiting for work to be justified, scheduled, scoped, and done, and I don't even have to justify rejecting it if I don't like the results.

simonw 1 days ago [-]
I'm very excited about tool use for LLMs at the moment.

The trick isn't new - I first encountered it with the ReAcT paper two years ago - https://til.simonwillison.net/llms/python-react-pattern - and it's since been used for ChatGPT plugins, and recently for MCP, and all of the models have been trained with tool use / function calls in mind.

What's interesting today is how GOOD the models have got at it. o3/o4-mini's amazing search performance is all down to tool calling. Even Qwen3 4B (2.6GB from Ollama, runs happily on my Mac) can do tool calling reasonably well now.

I gave a workshop at PyCon US yesterday about building software on top of LLMs - https://simonwillison.net/2025/May/15/building-on-llms/ - and used that as an excuse to finally add tool usage to an alpha version of my LLM command-line tool. Here's the section of the workshop that covered that:

https://building-with-llms-pycon-2025.readthedocs.io/en/late...

My LLM package can now reliably count the Rs in strawberry as a shell one-liner:

  llm --functions '
  def count_char_in_string(char: str, string: str) -> int:
      """Count the number of times a character appears in a string."""
      return string.lower().count(char.lower())
  ' 'Count the number of Rs in the word strawberry' --td
andrewmcwatters 1 days ago [-]
I love the odd combination of silliness and power in this.
DarmokJalad1701 1 days ago [-]
Was the workshop recorded?
simonw 1 days ago [-]
No video or audio, just my handouts.
benoau 1 days ago [-]
This morning I used cursor to extract a few complex parts of my game prototype's "main loop", and then generate a suite of tests for those parts. In total I have 341 tests written by Cursor covering all the core math and other components.

It has been a bit like herding cats sometimes, it will run away with a bad idea real fast, but the more constraints I give it telling it what to use, where to put it, giving it a file for a template, telling it what not to do, the better the results I get.

In total it's given me 3500 lines of test code that I didn't need to write, don't need to fix, and can delete and regenerate if underlying assumptions change. It's also helped tune difficulty curves, generate mission variations and more.

otterley 23 hours ago [-]
Writing tests, in my experience, is by far the best use case for LLMs. It eliminates hours or days of drudgery and toil, and can provide coverage of so many more edge cases than I can think of myself. Plus it makes your code more robust! It’s just wonderful all around.
benoau 23 hours ago [-]
Yeah I generated another 60 tests / 500 lines of test code since I posted my comment. I must have written 100,000s of lines of test code the hard way; it's not just faster than me, it's better than me when it comes to coverage, and it requires less of my help than a number of developers I've had under me.
cadamsdotcom 1 days ago [-]
> "Oh, this test doesn't pass... let's just skip it," it sometimes says, maddeningly.

Here is a wild idea. Imagine running a companion, policy-enforcing LLM, independently and in parallel, which is given instructions to keep the main LLM behaving according to instructions.

The companion LLM could, in real time, ban the coding LLM from emitting "let's just skip it" by seeing the tokens "let's just" and then biasing the output so that the word "skip" becomes impossible to emit.

Banning the word "skip" from following "let's just" forces the LLM down a new path, away from the undesired behavior.

It's like Structured Outputs or JSON mode, but driven by a companion LLM, and dynamically modified in real time as tokens are emitted.

If the idea works, you could prompt the companion LLM to do more advanced stuff - eg. ban a coding LLM from making tests pass by deleting the test code, ban it from emitting pointless comments... all the policies that we put into system prompts today and pray the LLM will do, would go into the companion LLM's prompt instead.
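
A minimal sketch of how this could look against an OpenAI-style API, assuming the governor is approximated by a simple phrase check and the re-roll happens per completion rather than per token (the model name, banned phrases and bias values are all placeholder assumptions):

  import tiktoken
  from openai import OpenAI

  client = OpenAI()
  enc = tiktoken.get_encoding("o200k_base")  # assumed to match the target model

  BANNED_PHRASES = ["let's just skip", "skip this test"]

  def governor_flags(text: str) -> bool:
      # Stand-in for the companion LLM's policy check.
      return any(p in text.lower() for p in BANNED_PHRASES)

  def ban_tokens(words):
      # Push every token of each banned word to -100, i.e. a hard ban.
      return {tid: -100 for w in words for tid in enc.encode(" " + w)}

  def guarded_completion(messages, retries=2):
      bias = {}
      for _ in range(retries + 1):
          resp = client.chat.completions.create(
              model="gpt-4.1", messages=messages, logit_bias=bias or None)
          text = resp.choices[0].message.content
          if not governor_flags(text):
              return text
          bias = ban_tokens(["skip", "omit"])  # re-roll with offending tokens banned
      return text

A true token-by-token version would need streaming access and a rollback mechanism rather than whole-completion retries.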

Wonder what the Outlines folks think of this!

phi0 9 hours ago [-]
A team at OpenAI tried to do something similar [1]: they had 4o monitor o3-mini for when it was trying to bypass tests because they were too hard. They trained o3-mini and punished it when it did that. All that did was make o3-mini still reward-hack, but in ways 4o was unable to detect, going from saying:

> To skip all tests, we can define a `pytest_runtest_setup` function that always skips.

to saying:

> Idea: Starting a new build with a random first succession, calling random_new_succession should not increment succession, but in implementation, it does. Adjust to prevent increment. Implement fix to random_new_build not to increment succession, or test adapted to behavior.

while then doing exactly the same thing (skipping the tests)

Even without training, it's only a temporary band-aid. If the incentive for reward-hacking becomes high enough, it will simply start phrasing it in different, not-possible-to-detect ways.

[1]: https://openai.com/index/chain-of-thought-monitoring/

JoshuaDavid 1 days ago [-]
Along these lines, if the main LLM goes down a bad path, you could _rewind_ the model to before it started going down the bad path -- the watcher LLM doesn't necessarily have to guess that "skip" is a bad token after the words "let's just", it could instead see "let's just skip the test" and go "nope, rolling back to the token "just " and rerolling with logit_bias={"skip":-10,"omit":-10,"hack":-10}".

Of course doing that limits which model providers you can work with (notably, OpenAI has gotten quite hostile to power users doing stuff like that over the past year or so).

cadamsdotcom 24 hours ago [-]
That’s a really neat idea.

Kind of seems an optimization: if the “token ban” is a tool call, you can see that being too slow to run for every token. Provided rewinding is feasible, your idea could make it performant enough to be practical.

TeMPOraL 16 hours ago [-]
It's not an optimization; the greedy approach won't work well, because it rejects critical context that comes after the tokens that would trigger the rejection.

Consider: "Let's just skip writing this test, as the other change you requested seems to eliminate the need for it."

Rolling back the model on "Let's just" would be stupid; rolling it back on "Let's just skip writing this test" would be stupid too, as is the belief that writing tests is a holy obligation to your god that you must fulfill unconditionally. The following context makes it clear that the decision is fine. Or, if you (or the governor agent) don't buy the reasoning, you're then in a perfect position to say, "nope, let's roll back to <Let's> and bias against ["skip", "test"]".

Checking the entire message once makes sense; checking it after every token doesn't.

panarky 1 days ago [-]
If it works to run a second LLM to check the first LLM, then why couldn't a "mixture of experts" LLM dedicate one of its experts to checking the results of the others? Or why couldn't a test-time compute "thinking" model run a separate thinking thread that verifies its own output? And if that gets you 60% of the way there, then there could be yet another thinking thread that verifies the verifier, etc.
somebodythere 23 hours ago [-]
Because if the agent and governor are trained together, the shared reward function will corrupt the governor.
panarky 7 hours ago [-]
The shared reward function from pre-training is like primary school for an LLM. Maybe RLHF is like secondary school. The governor can be differentiated from the workers with different system and user prompts, fine tuning, etc., which might be similar to medical school or law school for a human.

Certainly human judges, attorneys for defense and prosecution, and members of the jury can still perform their jobs well even if they attended the same primary and secondary schools.

somebodythere 5 hours ago [-]
I see what you are getting at. My point is that if you train an agent and verifier/governor together based on rewards from e.g. RLVR, the system (agent + governor) is what will reward hack. OpenAI demonstrated this in their "Learning to Reason with CoT" blog post, where they showed that using a model to detect and punish strings associated with reward hacking in the CoT just led the model to reward hack in ways that were harder to detect. Stacking higher and higher order verifiers maybe buys you time, but it also increases false negative rates, and reward hacking is a stable attractor for the system.
magicalhippo 1 days ago [-]
Assuming the title is a play on the paper "The Unreasonable Effectiveness of Mathematics in the Natural Sciences"[1][2] by Eugene Wigner.

[1]: https://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness...

[2]: https://www.hep.upenn.edu/~johnda/Papers/wignerUnreasonableE...

dsubburam 1 days ago [-]
I didn't know of that paper, and thought the title was a riff on Karpathy's Unreasonable Effectiveness of RNNs in 2015[1]. Even if my thinking is correct, as it very well might be given the connection RNNs->LLMs, Karpathy might have himself made his title a play on Wigner's (though he doesn't say so).

[1] https://karpathy.github.io/2015/05/21/rnn-effectiveness/

throwaway314155 24 hours ago [-]
Unreasonable effectiveness of [blah] has been a thing for decades if not centuries. It's not new.
temp0826 22 hours ago [-]
It's an older version of "x hates this one weird trick!"
ayrtondesozzla 13 hours ago [-]
It is not at all, it's being misused here to make it feel like that.

Wigner's essay is about how the success of mathematics in being applied to physics, sometimes years after the maths and very unexpectedly, is philosophically troubling - it is unreasonably effective. Whereas this blog post is about how LLM agents with tools are "good". So it was not just a catchy title, although yes, maybe it is now being reduced to that.

gavmor 1 days ago [-]
That may be its primogenitor, but it's long since become a meme: https://scholar.google.com/scholar?q=unreasonable+effectiven...
-__---____-ZXyw 20 hours ago [-]
a. Primogenitor is a nice word!

b. Wigner's original essay is a serious piece, and quite beautiful in its arguments. I had been under the impression that the phrasing had been used a few times since, but typically by other serious people who were aware of the lineage of that lovely essay. With this 6-paragraph vibey-blog-post, it truly has become a meme. So it goes, I suppose.

chongli 20 hours ago [-]
It’s also funny to me because every time I encounter “unreasonable effectiveness” in a headline I tend to strongly disagree with the author’s conclusion (including Wigner’s). It’s become one of those Betteridge-style laws for me.
outworlder 1 days ago [-]
> If you don't have some tool installed, it'll install it.

Terrifying. LLMs are very 'accommodating' and all they need is someone asking them to do something. This is like SQL injection, but worse.

nxobject 8 hours ago [-]
In an ideal world, this would require us as programmers to lean into our codebase reading and augmentation skills – currently underappreciated anyway as a skill to build. But when the incentives lean towards write-only code, I'm not optimistic.
rglover 22 hours ago [-]
I often wonder what the first agent-driven catastrophe will be. Considering the gold rush (emphasis on rush) going on, it's only a matter of time before a difficult to fix disaster occurs.
_bin_ 1 days ago [-]
I've found sonnet-3.7 to be incredibly inconsistent. It can do very well but has a strong tendency to get off-track and run off and do weird things.

3.5 is better for this, ime. I hooked claude desktop up to an MCP server to fake claude-code less the extortionate pricing and it works decently. I've been trying to apply it for rust work; it's not great yet (still doesn't really seem to "understand" rust's concepts) but can do some stuff if you make it `cargo check` after each change and stop it if the check fails.

I expect something like o3-high is the best out there (aider leaderboards support this) either alone or in combination with 4.1, but tbh that's out of my price range. And frankly, I can't mentally get past paying a very high price for an LLM response that may or may not be useful; it leaves me incredibly resentful as a customer that your model can fail the task, requiring multiple "re-rolls", and you're passing that marginal cost to me.

agilebyte 1 days ago [-]
I am avoiding the cost of API access by using the chat/ui instead, in my case Google Gemini 2.5 Pro with the high token window. Repomix a whole repo. Paste it in with a standard prompt saying "return full source" (it tends to not follow this instruction after a few back and forths) and then apply the result back on top of the repo (vibe coded https://github.com/radekstepan/apply-llm-changes to help me with that). Else yeah, $5 spent on Cline with Claude 3.7 and instead of fixing my tests, I end up with if/else statements in the source code to make the tests pass.
actsasbuffoon 1 days ago [-]
I decided to experiment with Claude Code this month. The other day it decided the best way to fix the spec was to add a conditional to the test that causes it to return true before getting to the thing that was actually supposed to be tested.

I’m finding it useful for really tedious stuff like doing complex, multi step terminal operations. For the coding… it’s not been great.

nico 1 days ago [-]
I’ve had this in different ways many times. Like instead of resolving the underlying issue for an exception, it just suggests catching the exception and keep going

It also depends a lot on the mix of model and type of code and libraries involved. Even in different days the models seem to be more or less capable (I’m assuming they get throttled internally - this is very noticeable sometimes in how they try to save on output tokens and summarize the code responses as much as possible, at least in the chat/non-api interfaces)

christophilus 1 days ago [-]
Well, that’s proof that it used my GitHub projects in its training data.
nico 1 days ago [-]
Cool tool. What format does it expect from the model?

I’ve been looking for something that can take “bare diffs” (unified diffs without line numbers), from the clipboard and then apply them directly on a buffer (an open file in vscode)

None of the paste diff extension for vscode work, as they expect a full unified diff/patch

I also tried a Google-developed patch tool, but it also wasn't very good at taking in the bare diffs, and it definitely couldn't do clipboard

agilebyte 1 days ago [-]
Markdown format with a comment saying what the file path is. So:

This is src/components/Foo.tsx

```tsx
// code goes here
```

OR

```tsx
// src/components/Foo.tsx
// code goes here
```

These seem to work the best.

I tried diff syntax, but Gemini 2.5 just produced way too many bugs.

I also tried using regex and creating an AST of the markdown doc and going from there, but ultimately settled on calling gpt-4.1-mini-2025-04-14 with the beginning of the code block (```) and 3 lines before and 3 lines after the beginning of the code block. It's fast/cheap enough to work.

Though I still have to make edits sometimes. WIP.
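
For what it's worth, a stripped-down sketch of handling the second format (path as the first comment line inside the block) could look like this; it's not the actual apply-llm-changes code, and the regexes are simplistic assumptions:

  import re
  from pathlib import Path

  FENCE = re.compile(r"```\w*\n(?P<body>.*?)```", re.DOTALL)
  PATH_COMMENT = re.compile(r"^\s*(?://|#)\s*(?P<path>[\w./-]+\.\w+)")

  def apply_markdown(markdown: str, root: str = "."):
      for match in FENCE.finditer(markdown):
          body = match.group("body")
          lines = body.splitlines()
          found = PATH_COMMENT.match(lines[0]) if lines else None
          if not found:
              continue  # no path comment; leave this block for manual review
          target = Path(root) / found.group("path")
          target.parent.mkdir(parents=True, exist_ok=True)
          # Drop the path comment line and write the rest of the block.
          target.write_text("\n".join(lines[1:]) + "\n")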

1 days ago [-]
harvey9 1 days ago [-]
Guess it was trained by scraping thedailywtf.com
layoric 1 days ago [-]
I've been using Mistral Medium 3 last couple of days, and I'm honestly surprised at how good it is. Highly recommend giving it a try if you haven't, especially if you are trying to reduce costs. I've basically switched from Claude to Mistral and honestly prefer it even if costs were equal.
nico 1 days ago [-]
How are you running the model? Mistral’s api or some local version through ollama, or something else?
layoric 24 hours ago [-]
Through OpenRouter; Medium 3 isn't open weights.
kyleee 1 days ago [-]
Is mistral on open router?
nico 1 days ago [-]
Yup https://openrouter.ai/provider/mistral

I guess it can't really be run locally https://www.reddit.com/r/LocalLLaMA/comments/1kgyfif/introdu...

johnsmith1840 1 days ago [-]
I seem to be alone in this but the only methods truly good at coding are slow heavy test time compute models.

o1-pro and o1-preview are the only models I've ever used that can reliably update and work with 1000 LOC without error.

I don't let o3 write any code unless it's very small. Any "cheap" model will hallucinate or fail massively when pushed.

One good tip I've picked up lately: remove all comments in your code before passing it to an LLM, and don't let LLM-generated comments persist under any circumstance.

_bin_ 1 days ago [-]
Interesting. I've never tested o1-pro because it's insanely expensive but preview seemed to do okay.

I wouldn't be shocked if huge, expensive-to-run models performed better and if all the "optimized" versions were actually labs trying to ram cheaper bullshit down everyone's throat. Basically chinesium for LLMs; you can afford them but it's not worth it. I remember someone saying o1 was, what, 200B dense? I might be misremembering.

johnsmith1840 1 days ago [-]
I'm positive they are pushing users to cheaper models due to cost. o1-pro is now in a sub menu for pro users and labeled legacy. The big inference methods must be stupidly expensive.

o1-preview was and possibly still is the most powerful model they ever released. I only switched to pro for coding after months of them improving it and my API bill getting a bit crazy (like $0.50 per question).

I don't think parameter count matters anymore. I think the only thing that matters is how much compute a vendor will give you per question.

doug_durham 18 hours ago [-]
I never have LLMs work on 1000 LOC. I don't think that's the value-add. Instead I use it at the function and class level to accelerate my work. The thought of having any agent, human or computer, run amok in my code makes me uncomfortable. At the end of the day I'm still accountable for the work, and I have to read and comprehend everything. If I do it piecewise, it makes tracking the work easier.
Koshima 10 hours ago [-]
It's fascinating how quickly the ecosystem around LLM agents is evolving. I think a big part of this "unreasonable effectiveness" comes from the fact that most of these tools are essentially chaining high-confidence steps together without requiring perfect outputs at each stage. The trick is finding the right balance between autonomy and supervision. I wonder if we'll soon see an "agent stack" emerge, similar to the full-stack frameworks in web development, where different layers handle prompts, memory, tool calls, and state management.
mtaras 16 hours ago [-]
Just a couple of days ago I discovered this truth myself while building a proactive personal assistant. It boiled down to just giving it access to managing notes and messaging me, and calling it periodically with chat history and its notes provided. It's surprisingly intelligent and helpful, even though I'm using a model that's far from being SOTA (Gemini Flash 2.5).
dandaka 15 hours ago [-]
Would love to learn more! Managing notes — is it your already existing docs? I have been thinking about proactive assistants, but don't really know where to start. I have a few product ideas around that, where this proactivity can deliver a lot of value.

My personal experience with building such agents is kinda frustrating so far. But I was only vibe coding for a small amount of time, maybe I need to invest more.

mtaras 11 hours ago [-]
No, I personally don't have notes that would be valuable enough for the system I was looking for, so notes is just what I'm calling my agent's long-term memory. All I do is provide it with tools to message me and manage its own notes, plus useful context (recent chat history, its saved notes, calendar events, etc.), and it can act upon that info.

The elegance of the system unfolded when I realized that I don't need to specify any interaction rules beforehand — I just talk to the system, it saves notes for itself, and later acts upon them. I've only started testing it, but so far it's been working as intended.
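
Roughly, the periodic invocation looks like the sketch below (heavily simplified; the notes file, the print standing in for the messenger-bot call, and the injected LLM call are placeholders rather than my actual stack):

  import json, datetime
  from pathlib import Path

  NOTES = Path("agent_notes.json")   # stand-in for the agent's long-term memory

  def load_notes():
      return json.loads(NOTES.read_text()) if NOTES.exists() else []

  def save_note(text: str):
      notes = load_notes() + [{"at": datetime.datetime.now().isoformat(), "text": text}]
      NOTES.write_text(json.dumps(notes, indent=2))

  def message_user(text: str):
      print(f"[to user] {text}")     # stand-in for the messenger-bot API call

  TOOLS = {"save_note": save_note, "message_user": message_user}

  def check_in(recent_chat: str, call_llm):
      # call_llm(prompt, TOOLS) should return a list of (tool_name, kwargs);
      # an empty list means the agent decided to do nothing this time.
      prompt = ("Check your notes and the recent chat history. Do you need to do "
                "something, like message the user to remind them of something?\n"
                f"Notes: {load_notes()}\nChat: {recent_chat}")
      for name, kwargs in call_llm(prompt, TOOLS):
          TOOLS[name](**kwargs)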

triangleman 9 hours ago [-]
What do you mean by "message me" and "later" in this context?
mtaras 9 hours ago [-]
"Message me" is through a bot in my main messenger app, so the agent is just a chat window for me. "Later" happens per my requests, and agent has an ability to do something later because I just automatically invoke it on a schedule periodically, and ask it "check notes and chat history — do you need to do something, like message the user to remind them of something", and agent has the message_user tool to do that (that goes to the messenger bot API). Or it can decide to do nothing.
kuahyeow 1 days ago [-]
What protection do people use when enabling an LLM to run `bash` on your machine ? Do you run it in a Docker container / LXC boundary ? `chroot` ?
CGamesPlay 1 days ago [-]
The blog post in question is on the site for Sketch, which appears to use Docker containers. That said, I use Claude Code, which just uses unsandboxed commands with manual approval.

What's your concern? An accident or an attacker? For accidents, I use git and backups and develop in a devcontainer. For an attacker, bash just seems like an ineffective attack vector; I would be more worried about instructing the agent to write a reverse shell directly into the code.

zahlman 20 hours ago [-]
> For an attacker, bash just seems like an ineffective attack vector

All it needs to do is curl and run the actual payload.

Wilder7977 17 hours ago [-]
Or even the standard "echo xxx | base64 -d" or a million other ways. How someone can say that bash is not interesting to an attacker is beyond me.
20 hours ago [-]
cess11 16 hours ago [-]
Bourne-Again SHell is a shell. The "reverse" part is just about network minutiae. /bin/sh is more portable so that's what you'd typically put in your shellcode but /bin/bash or /usr/bin/dash would likely work just as well.

I.e. exposing any of these on a public network is the main target to get a foothold in a non-public network or a computer. As soon as you have that access you can start renting out CPU cycles or use it for distributed hash cracking or DoS-campaigns. It's simpler than injecting your own code and using that as a shell.

Asking a few of my small local models for Forth-like interpreters in x86 assembly, they seem willing to comply and produce code, so if they had access to a shell and package installation I imagine they could also inject such a payload into some process. It would be very hard to discover.

mr_mitm 16 hours ago [-]
I run claude code in a podman container. It only gets access to the CWD. This comes with some downsides though, like your git config or other global configs not being available to the LLM (unless you fine tune the container, obviously).
rbren 1 days ago [-]
If you're interested in hacking on agent loops, come join us in the OpenHands community!

Here's our (slightly more complicated) agent loop: https://github.com/All-Hands-AI/OpenHands/blob/f7cb2d0f64666...

sagarpatil 20 hours ago [-]
I tried OH yesterday. Sorry to say this but your UI/UX experience is horrible.

I don’t understand why you need a separate UI instead of using a local IDE (Cursor/Windsurf), a VS Code extension (Augment) or a CLI (Amazon Q Developer). Please do not reinvent the wheel.

lacker 21 hours ago [-]
I've been using Claude Code, and I really prefer the command line to the IDE-integrated ones. I'm curious about Gemini's increased context size, though. Is anyone successfully using one of the open source CLI agents together with Gemini, and has something to recommend there?
sagarpatil 20 hours ago [-]
Codex CLI with Gemini, yes. It’s not really good at agentic stuff (at least in codex), I’ve had some success with Roo Cline (IDE) but I naturally just end up using Claude Code/Augment Code.
baalimago 10 hours ago [-]
I'm just going to shamelessly self-plug the blog post I wrote about this in August last year: https://lorentz.app/blog-item.html?id=clai&heading=tooling
mukesh610 1 days ago [-]
I built this very same thing today! The only difference is that I pushed the tool call outputs into the conversation history and resent them to the LLM for it to summarize, or perform further tool calls if necessary, automagically.

I used ollama to build this and ollama supports tool calling natively, by passing a `tools=[...]` in the Python SDK. The tools can be regular Python functions with docstrings that describe the tool use. The SDK handles converting the docstrings into a format the LLM can recognize, so my tool's code documentation becomes the model's source of truth. I can also include usage examples right in the docstring to guide the LLM to work closely with all my available tools. No system prompt needed!

Moreover, I wrote all my tools in a separate module, and just use `inspect.getmembers` to construct the `tools` list that i pass to Ollama. So when I need to write a new tool, I just write another function in the tools module and it Just Works™

Paired with qwen 32b running locally, I was fairly satisfied with the output.
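
The core of it ends up being quite small. A rough sketch, assuming a recent ollama-python (which accepts plain Python functions as tools) and a hypothetical my_tools module; the model name is just whatever is pulled locally:

  import inspect
  import ollama
  import my_tools  # hypothetical module: each public function is a documented tool

  TOOLS = [fn for _, fn in inspect.getmembers(my_tools, inspect.isfunction)]
  TOOL_MAP = {fn.__name__: fn for fn in TOOLS}

  def run(prompt: str, model: str = "qwen2.5:32b"):
      messages = [{"role": "user", "content": prompt}]
      while True:
          resp = ollama.chat(model=model, messages=messages, tools=TOOLS)
          messages.append(resp.message)
          if not resp.message.tool_calls:
              return resp.message.content
          for call in resp.message.tool_calls:
              result = TOOL_MAP[call.function.name](**call.function.arguments)
              # Feed the tool output back so the model can summarize or chain calls.
              messages.append({"role": "tool", "content": str(result),
                               "name": call.function.name})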

degamad 1 days ago [-]
> The only difference is that i pushed the tool call outputs into the conversation history and resent it back to the LLM for it to summarize, or perform further tool calls, if necessary, automagically.

It looks like this one does that too.

     msg = [ handle_tool_call(tc) for tc in tool_calls ]
mukesh610 1 days ago [-]
Ah, failed to notice that.

I was so excited because this was exactly what I coded up today that I jumped straight to the comments.

stpedgwdgfhgdd 17 hours ago [-]
Bit off topic, but worth sharing:

Yesterday was a milestone for me: I connected Claude Code through MCP with Jira (SSE). I asked it to create a plan for a specific Jira issue, ah, excuse me, work item.

CC created the plan based on the item’s description and started coding. It created a branch (wrong naming convention, needs fix), made the code changes and pushed. Since the Jira item had a good description, the plan was solid and the code so far as well.

Disclaimer; this was a simple problem to solve, but the code base is pretty large.

jawns 1 days ago [-]
Not only can this be an effective strategy for coding tasks, but it can also be used for data querying. Picture a text-to-SQL agent that can query database schemas, construct queries, run explain plans, inspect the error outputs, and then loop several times to refine. That's the basic architecture behind a tool I built, and I have been amazed at how well it works. There have been multiple times when I've thought, "Surely it couldn't handle THIS prompt," but it does!

Here's an AWS post that goes into detail about this approach: https://aws.amazon.com/blogs/machine-learning/build-a-robust...
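
The skeleton of such a loop is surprisingly small. A simplified sketch (neither my actual tool nor the AWS implementation) using sqlite3, with the LLM call injected as a callable:

  import sqlite3

  conn = sqlite3.connect("app.db")  # placeholder database

  def get_schema() -> str:
      rows = conn.execute(
          "SELECT sql FROM sqlite_master WHERE type='table'").fetchall()
      return "\n".join(r[0] for r in rows if r[0])

  def explain(query: str) -> str:
      try:
          return str(conn.execute(f"EXPLAIN QUERY PLAN {query}").fetchall())
      except sqlite3.Error as e:
          return f"ERROR: {e}"

  def run_sql(query: str):
      try:
          return conn.execute(query).fetchall()
      except sqlite3.Error as e:
          return f"ERROR: {e}"

  def answer(question: str, ask_llm, max_turns: int = 5):
      # ask_llm(prompt) -> a SQL string; any chat-completion client works here.
      context = f"Schema:\n{get_schema()}\nQuestion: {question}"
      for _ in range(max_turns):
          sql = ask_llm(context)
          result = run_sql(sql)
          if not (isinstance(result, str) and result.startswith("ERROR")):
              return sql, result
          # Feed the error and the plan back so the model can refine its query.
          context += f"\nAttempted SQL: {sql}\n{result}\nPlan: {explain(sql)}"
      raise RuntimeError("could not produce a working query")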

jbellis 1 days ago [-]
Yes!

Han Xiao at Jina wrote a great article that goes into a lot more detail on how to turn this into a production quality agentic search: https://jina.ai/news/a-practical-guide-to-implementing-deeps...

This is the same principle that we use at Brokk for Search and for Architect. (https://brokk.ai/)

The biggest caveat: some models just suck at tool calling, even "smart" models like o3. I only really recommend Gemini Pro 2.5 for Architect (smart + good tool calls); Search doesn't require as high a degree of intelligence and lots of models work (Sonnet 3.7, gpt-4.1, Grok 3 are all fine).

sagarpatil 20 hours ago [-]
“Claude Code, better than Sourcegraph, better than Augment Code.”

That’s a pretty bold claim, how come you are not at the top of this list then? https://www.swebench.com/

“Use frontier models like o3, Gemini Pro 2.5, Sonnet 3.7” Is this unlimited usage? Or number of messages/tokens?

Why do you need a separate desktop app? Why not CLI or VS Code extension.

crawshaw 1 days ago [-]
I'm curious about your experiences with Gemini Pro 2.5 tool calling. I have tried using it in agent loops (in fact, sketch has some rudimentary support I added), and compared with the Anthropic models I have had to actively reprompt Gemini regularly to make tool calls. Do you consider it equivalent to Sonnet 3.7? Has it required some prompt engineering?
jbellis 1 days ago [-]
Confession time: litellm still doesn't support parallel tool calls with Gemini models [https://github.com/BerriAI/litellm/issues/9686] so we wrote our own "parallel tool calls" on top of Structured Output. It did take a few iterations on the prompt design but all of it was "yeah I can see why that was ambiguous" kinds of things, no real complaints.

GP2.5 does have a different flavor than S3.7 but it's hard to say that one is better or worse than the other [edit: at tool calling -- GP2.5 is definitely smarter in general]. GP2.5 is I would say a bit more aggressive at doing "speculative" tool execution in parallel with the architect, e.g. spawning multiple search agent calls at the same time, which for Brokk is generally a good thing but I could see use cases where you'd want to dial that back.

Quenby 16 hours ago [-]
Although AI can help us with many repetitive tasks, can it always remain rational and handle complex situations? As mentioned in the article, simple loops and tool integration are effective, but once the complexity increases, the limitations of AI may become apparent. So, we should continue to improve these systems, ensuring they can reliably work in more scenarios.
danjc 19 hours ago [-]
We built tools to give context to an ai chat help embedded in our product. Included is the ability for it to see recent activity logs, the definition of the current object and the ability to search and read help articles.

The quality of the chats still amazes me months later.

Where we find it got something wrong, we add more detail to the relevant help articles.

neumann 23 hours ago [-]
This is great, and I like seeing all the implementations people are making for themselves.

Anyone using any opensource tooling that bundles this effectively to allow different local models to be used in this fashion?

I am thinking this would be nice to run fully locally to access my code or my private GitHub repos from my command line and switch models out (assuming through llama.cpp or Ollama)?

sagarpatil 20 hours ago [-]
All IDEs support an OpenAI-compatible endpoint, so you can host whatever model you like locally and use it. Check out Roo Code.
BrandiATMuhkuh 1 days ago [-]
That's really cool. One week ago I implemented an SQL tool and it works really well. But sometimes it still just makes up table/column names. Luckily it can read the error and correct itself.

But today I went to the next level. I gave the LLM two tools. One web search tool and one REST tool.

I told it at what URL it can find API docs. Then I asked it to perform some tasks for me.

It was really cool to watch an AI read docs, make api calls and try again (REPL) until it worked

tom_m 8 hours ago [-]
Definitely going to have some "super worms" out there in the future.
SafeDusk 22 hours ago [-]
Sharing an agent framework (from scratch) that works very well with just 7 composable tools: read, write, diff, browse, command, think and ask.

Just pushed an update this week for OpenAI-compatibility too!

https://github.com/aperoc/toolkami

bicepjai 17 hours ago [-]
Which agent is token hungry? I notice Cline is top of the list. Roo eats less than Cline. Are there agents where we can configure how the interactions go? How does Claude Code compare to other agents?
hbbio 24 hours ago [-]
Yes, agent loops are simple, except, as the article says, a bit of "pump and circumstance"!

If anyone is interested, I tried to put together a minimal library (no dependency) for TypeScript: https://github.com/hbbio/nanoagent

themichaellai 23 hours ago [-]
I've also been defining agents as "LLM call in a while loop with tools" to my co-workers as well — I'd add that if you provide it something like a slack tool, you can enable the LLM to ask for help (human in the (while) loop).
amunozo 11 hours ago [-]
Has anyone tried to build something like this with a local model such as Qwen 3?
15 hours ago [-]
bhouston 1 days ago [-]
I found this out too - it is quite easy and effective:

https://benhouston3d.com/blog/building-an-agentic-code-from-...

andes314 21 hours ago [-]
This is what the no-code API-to-MCP creator uses at usetexture.com! I was surprised to find out this is not what the Claude client uses (as of May 2025)
bdbenton5255 23 hours ago [-]
Woke up this morning to start on a new project.

Started with a math visualizer for machine learning, saw an HN post for this soon after and scrapped it. It was better done by someone else.

Started on an LLM app that looped outputs, saw this post soon after and scrapped it. It was better done by someone else.

It is like every single original notion I have is immediately done by someone else at the exact same time.

I think I will just move on to rudimentary systems programming stuff and avoid creative and original thinking, just need basic and low profile employment.

soulofmischief 23 hours ago [-]
What is your motive for creating things? Does it really matter if there is competition?
morkalork 22 hours ago [-]
Everything has been done before, sometimes before you were even born. The trick is to just not google it first and do it anyways.
Lionga 16 hours ago [-]
No. DO google it, learn from it and make it better
throwaway314155 23 hours ago [-]
> Started on an LLM app that looped outputs, saw this post soon after and scrapped it. It was better done by someone else.

If it helps, "TFA" was not the originator here and is merely simplifying concepts from fairly established implementations in the wild. As simonw mentions elsewhere, it goes back to at least the ReAct paper and maybe even more if you consider things like retrieval-augmented generation.

amelius 1 days ago [-]
Huh, isn't this already built-in, in most chat UIs?
bkyan 23 hours ago [-]
This is for running locally in the terminal (or in WSL2 if you are on Windows).

Here is an example session from a similar script that I <cough>ChatGPT</cough> wrote in Perl, illustrating the same concept:

$ ~/agentloop.pl

YOUR PROMPT:

Is Python installed on this system? If so, which version?

BASH STDIN:

python --version 2>&1 || python3 --version 2>&1

BASH STDOUT:

/bin/bash: python: command not found

Python 3.6.7

LLM RESPONSE:

Python is installed on this system. The version available is Python 3.6.7. However, it looks like the python executable (python) is not directly available, but python3 (or a direct version path) exists and resolves to Python 3.6.7.

If you need to use Python, you should use python3 as the command.

cheema33 23 hours ago [-]
Which chat UI lets me use my local tools? Like git, find, pnpm, curl...?
insin 22 hours ago [-]
Has anybody written an Electron app yet which injects tools into existing chat UIs, letting you expose whatever you want from your machine to them via Node? I was planning to create something like that for the BigCo internal chat UI I work on.
NitpickLawyer 19 hours ago [-]
Careful, this might break the ToS on some products. At least for those that offer all you can prompt in a browser vs. API.
energy123 19 hours ago [-]
Someone wrote a browser extension that does that for Gemini web chat
polishdude20 21 hours ago [-]
I'd love to try this to help me handle a complex off by one cent problem I'm having.
stavros 22 hours ago [-]
Unfortunately, I haven't had a good experience with tool use. I built a simple agent that can access my calendar and add events, and GPT-4.1 regularly gaslights me, saying "I've added this to your calendar" when it hasn't. It takes a lot of prompting for it to actually add the event, but it will insist it has, even though it never called the tool.

Does anyone know of a fix? I'm using the OpenAI agents SDK.

pawanjswal 21 hours ago [-]
Wild how something so powerful boils down to just 9 lines of code.
subtlesoftware 22 hours ago [-]
Link to the full script is broken?
jonstewart 11 hours ago [-]
Maybe I’m just writing code differently than many people, but I don’t spend much time executing complicated or unique shell commands. I write some code, I run make check (alias “mkc”), I run git add --update (alias “gau”), I review my commit with git diff --cached (“gdc”), and I commit (“gcm”).

I can see how an LLM is useful when needing to research which tool arguments to use for a particular situation, but particular situations are infrequent. And given how frequently coding assistants are wrong with their suggestions, I am leery of letting them run commands against my filesystem.

What am I missing?
