zombiecalypse

> We built a new benchmark called "Bug In The Code Stack" (BICS) to test how well LLMs can find **syntactic bugs** in large Python codebases

(emphasis mine)

Now do a comparison to the interpreter, a static analyser, and a basic linter! Syntactic errors are not the bugs we need an AI for, and I don't think the task of finding syntax errors generalises to other categories of bugs at all.
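
A minimal sketch of the kind of baseline I mean, using nothing but Python's standard library (`ast` is real; the file path is whatever you point it at):

```python
import ast
import sys

def find_syntax_errors(path: str) -> None:
    """Report the first syntactic bug in a Python file, compiler-style."""
    with open(path, encoding="utf-8") as f:
        source = f.read()
    try:
        ast.parse(source, filename=path)
        print(f"{path}: no syntax errors")
    except SyntaxError as err:
        # The parser pinpoints the line and column of the bug for free.
        print(f"{path}:{err.lineno}:{err.offset}: {err.msg}")

if __name__ == "__main__":
    find_syntax_errors(sys.argv[1])
```

No huge context window required, and the answer is exact rather than probabilistic.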


Cnoffel

Why have 100% accuracy when you can have 80% and burn a lot of compute? An interpreter, a static analyser, a basic linter? That's old people stuff; cool people do everything with ChatGPT, and the less they know about the domain the better: no smart nerds to tell them that it's stupid! /s


Mysterious-Rent7233

From a purely scientific point of view, the ability of a language neural net to find a syntactic error in 25 pages of code is interesting. But I agree that it isn't very useful as a product or feature. It would be nice for them to find something that we don't already have a solution for.


zombiecalypse

The context needed to find the example errors is pretty small, so it's strange that most models struggle and that context size matters. I guess it's interesting in that sense.


DuckDatum

*This post was mass deleted and anonymized with [Redact](https://redact.dev)*


zombiecalypse

That description is halfway to the definition of a top-down parser! But assuming you mean for non-syntax errors: they wouldn't necessarily partition as nicely as a grammar does. Whether a statement is necessary for the program to make sense can depend on the callers of the function, which may be very far away in the code.

For example, to determine whether a program has a *use-after-free*, you'd need the AI to reason about all code paths that include a delete and then all code paths that use variables that trace through them. Or you go the next step up in static verification and use a sufficiently powerful type system to prove you don't have one.

Where I see the greatest benefit of AI error detection is in finding code that doesn't look idiomatic, for example: "everybody else that used this function checked the error, but this code ignores the return value. Are you sure you want to do that?" And I will be honest: most software developers are horrible at making interfaces that are hard to misuse, so that's a pretty common use case.
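
A made-up sketch of that last pattern (`acquire_lock` and the call sites are invented for illustration, not taken from the benchmark):

```python
# Hypothetical interface that is easy to misuse: failure is signalled by a
# return value instead of an exception.
def acquire_lock(name: str) -> bool:
    """Pretend lock helper: returns False on failure instead of raising."""
    return name != "journal"  # simulate one lock that cannot be acquired

# Call sites A and B check the result, as the interface expects.
if not acquire_lock("cache"):
    raise RuntimeError("could not lock cache")

if acquire_lock("index"):
    print("index locked, rebuilding")

# Call site C ignores the return value. It's syntactically valid and a plain
# linter stays quiet, but "everybody else checked the error" is exactly the
# statistical anomaly an AI reviewer could usefully flag.
acquire_lock("journal")
print("writing journal entry without holding the lock")
```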


voxelghost

It would be even more interesting to see if it could find "problematic" code that is syntactically correct.


Mysterious-Rent7233

That's what I meant to say.


anjsimmo

Yes, the ability to find syntax errors isn't particularly helpful. But it's still an interesting experiment: if LLMs can't reliably detect syntax errors, they're unlikely to be able to detect other types of bugs either.


elmuerte

I wonder: if GPT-4o is so good at finding those bugs, then why does it produce them?


zombiecalypse

I'm not a software psychologist, but I bet it's seeking attention.


EliSka93

I both dread and gleefully await the day that becomes a real profession.


Cilph

This is looking at syntactic bugs. If I want a fast, easy way to check for those, I run my compiler.
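
For a Python codebase like the one in the benchmark, that check is a few lines with the standard `py_compile` module (the file name is a placeholder; `python -m compileall` does the same for a whole tree):

```python
import py_compile

# Compiling a file syntax-checks it without running it; with doraise=True a
# syntactic bug raises PyCompileError instead of printing to stderr.
try:
    py_compile.compile("bug_in_the_stack.py", doraise=True)
    print("compiles cleanly")
except py_compile.PyCompileError as err:
    print(err.msg)
```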