Gabriel Orlanski

Posts

All the articles I've posted.

GPT-5.4 Writes Clean Code That Fails More Tests

Published: 6 Mar, 2026 at 06:00 AM
[code]

GPT-5.4 writes the cleanest code of any GPT model we've tested, yet it fails more tests than 5.3. Erosion drops, duplication drops, but pass rate and core both regress. We dig into why.

Opus 4.6 and GPT-5.3 Codex Score Higher, but the Code Is Still a Mess

Published: 11 Feb, 2026 at 06:00 AM
[code]

Anthropic's Opus 4.6 copy-pastes. OpenAI's GPT-5.3 Codex over-abstracts. Both miss edge cases at the same rate. New SCBench results with a guide for when to trust each model.

Coding Agents Are Lazy Patchers

Published: 12 Jan, 2026 at 07:40 AM
[code]

AI coding agents become lazy patchers under iterative changes, copying code instead of refactoring. The result is massive, unmaintainable god functions, which explains the gap between benchmark scores and real-world experience.

SlopCodeBench: Measuring Code Erosion Under Iterative Specification Refinement

G. Orlanski, D. Roy, A. Yun, C. Shin, A. Gu, A. Ge, D. Adila, A. Albarghouthi, and F. Sala

Technical Report, 2025. Code on GitHub; paper on arXiv.

SlopCodeBench evaluates AI coding agents under iterative specification updates. Unlike single-shot benchmarks, SCBench reveals verbosity and structural erosion that make agent-written code unmaintainable over time.