Hacker News — vinext + Cloudflare Workers

▲Taming LLMs: Using Executable Oracles to Prevent Bad Code (john.regehr.org)

32 points by mad44 15 hours ago | 18 comments

dktoao 13 hours ago [-]

"Our goal should be to give an LLM coding agent zero degrees of freedom"

Wouldn't that just be called inventing a new language, with all the overhead of the languages we already have? Are we getting to the point where getting LLMs to be productive and also write good code is going to require so much overhead and so many additional procedures and tools that we might as well write the code ourselves? Hmmm...

virgilp 12 hours ago [-]

Actually, no. We always needed good checks - that's why you have techniques like automated canary analysis, extensive testing, checking for coverage - these are forms of "executable oracles". If you wanted to be able to do continuous deployment - you had to be very thorough in your validation.

LLMs just take this to the extreme. You can no longer rely on human code reviews (well, you can, but you give away all the LLM advantages), so if you take out "human judgement" *from validation*[1], you have to resort to very sophisticated automated validation. This is it - it's not about "inventing a new language", it's about being much more thorough (and innovative, and efficient) in the validation process.

[1] never from design or specification - you shouldn't outsource that to AI. I don't think we're close to an AI that can do that even moderately effectively without human help.
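
For a concrete picture of an "executable oracle", here is a minimal sketch of one common form, differential testing against a trusted reference. Everything in it is an assumption for illustration: `candidate_sort` stands in for generated code, `reference_sort` for the trusted implementation, and `oracle_check` is a hypothetical helper name, not something from the linked post.

```python
import random


def reference_sort(xs):
    # Trusted oracle: Python's built-in sort.
    return sorted(xs)


def candidate_sort(xs):
    # Stand-in for generated code under test (hypothetical): insertion sort.
    out = list(xs)
    for i in range(1, len(out)):
        j = i
        while j > 0 and out[j - 1] > out[j]:
            out[j - 1], out[j] = out[j], out[j - 1]
            j -= 1
    return out


def oracle_check(candidate, reference, trials=500, seed=0):
    """Return an input on which the two functions disagree, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        if candidate(xs) != reference(xs):
            return xs  # counterexample to hand back to the agent
    return None
```

A `None` result only means no disagreement was found in the sampled inputs; it is evidence, not a proof of correctness.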

nitwit005 12 hours ago [-]

If the LLM generates code exactly matching a specification, the specification becomes a conventional programming language. The LLM is just transforming from one language to another.

sanxiyn 11 hours ago [-]

Yes, but a programming language with a proverbial sufficiently smart compiler. That is very useful.

Quekid5 11 hours ago [-]

Try writing an exhaustive spec for anything non-trivial and you might see the problem.

scuff3d 5 hours ago [-]

Been saying this for a while now. I work in aerospace, and I can tell you from firsthand experience that software engineers don't know what designing a spec is.

Aero, mechanical, and electrical engineers spend years designing a system. Design, requirements, reviews, redesign, more reviews, more requirements. Every single corner of the system is well understood before anything gets made. It's a detailed, time-consuming, arduous process.

Software engineers think they can duplicate that process with a few skills and a weekend planning session with Claude Code. Because implementation is cheaper, we don't have to go as hard as the mechanical and electrical folks, but properly speccing a system is still a massive amount of up-front effort.

whattheheckheck 8 hours ago [-]

LLM boys discover the halting problem!

seanw444 13 hours ago [-]

Yeah, precision LLM coding is kind of an oxymoron. English language -> codebase is essentially lossily-compressed logic by definition. The less lossy the compression becomes, the more you probably approach re-inventing programming languages. Which then means that in order to use LLMs to code, you're accepting some degree of imprecision.

amelius 10 hours ago [-]

Zero degrees of freedom is a step too far.

What you want is correctness-preserving transformations. Add to this some metrics, such as code size and execution speed.
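
A rough sketch of that idea, under loud assumptions: "correctness-preserving" is approximated here by agreement on a finite set of inputs (which is weaker than a real proof), and the only metric is source length. The names `load`, `accept_rewrite`, `OLD`, `GOOD`, and `BAD` are made up for the example.

```python
def load(source, name="f"):
    # Compile a candidate from source text and return the named function.
    namespace = {}
    exec(source, namespace)
    return namespace[name]


def accept_rewrite(old_src, new_src, inputs):
    """Accept a rewrite only if it behaves the same on all sample inputs
    and does not increase the metric (here: source length)."""
    old_f, new_f = load(old_src), load(new_src)
    if any(old_f(x) != new_f(x) for x in inputs):
        return False  # behavior changed: reject regardless of the metric
    return len(new_src) <= len(old_src)


OLD = (
    "def f(n):\n"
    "    total = 0\n"
    "    for i in range(n + 1):\n"
    "        total += i\n"
    "    return total\n"
)
GOOD = "def f(n):\n    return n * (n + 1) // 2\n"  # same behavior, smaller
BAD = "def f(n):\n    return n * n\n"              # smaller, wrong behavior
```

With these definitions, `accept_rewrite(OLD, GOOD, range(50))` accepts the closed-form version, while `accept_rewrite(OLD, BAD, range(50))` rejects the shorter but incorrect one.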

rco8786 9 hours ago [-]

Yea this feels like saying “if you give them good enough specs they’ll produce the code you want” which reduces to…writing the code yourself. Just with more steps.

shubhamintech 7 hours ago [-]

The oracle problem is tractable when the output is code: you can compile it, run tests, diff the output. For conversational AI it's much harder. We've seen teams use LLM-as-judge as their validation layer and it works until the judge starts missing the same failure modes as the generator.

JSR_FDED 9 hours ago [-]

> JustHTML was effectively tested into existence using a large, existing test suite.

I love the phrase “tested into existence”.

RS-232 12 hours ago [-]

Has anyone had success using 2 agents, with one as the creator and one as an adversarial "reviewer"? Is the output usually better or worse?

mapontosevenths 11 hours ago [-]

This is how it's meant to be done. Usually with the reviewer being the stronger model.

That said, with both the test-driven development this post describes and the reviewer model (it's best to do both), you have to provide an escape hatch for the model. If you let the model get inescapably stuck with an impossible test or set of constraints, it will just start deleting tests or rewriting the entire codebase in Rust or something.

My escape hatch is "expert advice". I let the weak LLM phone a friend when it's stuck and ask a smarter LLM for assistance. It's since stopped going crazy and replacing all my tests with gibberish... mostly.
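
Roughly, the control flow looks like the sketch below. This is only an illustration: `generate`, `review`, and `ask_expert` are hypothetical callables you would wire up to your own models, and `max_attempts` is an arbitrary choice.

```python
def run_with_escape_hatch(generate, review, ask_expert, max_attempts=3):
    """Creator/reviewer loop with an escalation path.

    generate(feedback)   -> candidate  (weaker "creator" model)
    review(candidate)    -> (ok, feedback)  (reviewer model or test suite)
    ask_expert(feedback) -> candidate  (stronger model, the escape hatch)
    """
    feedback = None
    for _ in range(max_attempts):
        candidate = generate(feedback)
        ok, feedback = review(candidate)
        if ok:
            return candidate, "accepted"

    # Escape hatch: escalate instead of letting the creator thrash
    # (deleting tests, rewriting everything, and so on).
    candidate = ask_expert(feedback)
    ok, feedback = review(candidate)
    return candidate, ("accepted_after_escalation" if ok else "needs_human")
```

Bounding the attempts and returning an explicit `needs_human` status is the point: the loop always has a way out that does not involve weakening the tests.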

sanxiyn 11 hours ago [-]

That works well. Anthropic wrote a writeup on it.

https://www.anthropic.com/engineering/harness-design-long-ru...

esafak 12 hours ago [-]

This is routine. We have Gemini (which is not our coding model) review our PRs and it genuinely catches mistakes. Even using the same model as the creator, without its context to bias it, would probably catch many mistakes.

ReptileMan 10 hours ago [-]

Now is Haskell's time to shine.

voxaai 13 hours ago [-]

[flagged]

CrazyStat 11 hours ago [-]

Wow, a sudden change in writing style that’s not at all intended to disguise the fact that you’re an LLM!
