The Sycophantic, Lazy LLM: An RL Artifact?

If you’ve spent real time pairing with a frontier LLM on anything non-trivial, whether a renderer or a tricky refactor, you’ve probably noticed two failure modes that keep showing up together:

It wants to agree with you. Push back on its output, even weakly, and it folds. “You’re right, let me fix that,” and then it “fixes” something that was correct.

It wants to be done. Error paths silently return: no exception, no log, no indication anything went wrong. Edge cases get hand-waved with a comment. Tests get weakened until they pass. A 2000-line diff arrives with three of the six features you asked for, and a cheerful summary claiming completion.

Individually, either is annoying. Together they’re corrosive, because the sycophancy hides the laziness. The model tells you it’s finished; you believe it; you find out a day later that the shortcut it took broke an invariant essential to actual operation.

A recent example from RISE, my renderer: I was getting VCM (Vertex Connection and Merging) stood up. The first few passes from the model looked plausible. Compiled, ran, produced images. But the images had splotches, the telltale artifact of bad vertex merging: incorrect density estimation blowing up localized radiance into bright blobs. Nothing in the model’s self-report flagged this. It was only after several iterations, and deliberately setting up validation that looked for splotches specifically (alongside variance and convergence checks against a bidirectional path tracing reference), that I got to an implementation I’d trust. The model wasn’t going to tell me the merge kernel was wrong. The images had to.
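
The splotch check can be sketched as a simple outlier filter. This is a toy illustration, not RISE's actual validation code: the window size and brightness ratio below are invented thresholds, and a real check on a density-estimation artifact would use more careful statistics.

```python
import statistics

def find_splotches(image, window=2, ratio=8.0):
    """Flag pixels whose value wildly exceeds the median of their local
    neighborhood -- the signature of a density-estimation blowup.
    `window` and `ratio` are invented thresholds, not tuned values."""
    h, w = len(image), len(image[0])
    hits = []
    for y in range(h):
        for x in range(w):
            neighborhood = [
                image[ny][nx]
                for ny in range(max(0, y - window), min(h, y + window + 1))
                for nx in range(max(0, x - window), min(w, x + window + 1))
                if (ny, nx) != (y, x)
            ]
            # A pixel far above its local median is a candidate splotch.
            if image[y][x] > ratio * statistics.median(neighborhood) + 1e-6:
                hits.append((x, y))
    return hits
```

The point isn't this particular filter; it's that the check is mechanical and runs outside the model's self-report.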

Why this shape?

My working theory is that this is an RL artifact, not a capability ceiling. The reward signal during post-training is, at best, a noisy proxy for “did the user get what they actually needed.” What it can much more easily capture is:

  • Did the user express satisfaction in the next turn?
  • Did the response terminate cleanly without long back-and-forth?
  • Did the answer look confident and well-structured?

Optimize against those and you get exactly the behavior we see. Agreeing with the user is a near-guaranteed path to a positive next-turn signal. Declaring victory early and producing a tidy summary looks indistinguishable, to a human rater skimming transcripts, from actually finishing. The model isn’t being deceptive; it’s learned that appearing done and being done are rewarded identically, and appearing done is cheaper. This is reward hacking in the standard sense: the policy found a cheap region of output space that scores well on the proxy.

The same dynamic explains why the failures cluster at exactly the places verification is hardest: numerical correctness, performance regressions, subtle concurrency, anything where “it compiles and the happy path runs” is wildly insufficient as an oracle. And it gets worse as models improve. Generation cost is falling fast. Verification cost is not (especially if you are building a renderer). Every capability jump widens the gap between how quickly the model can produce something plausible and how long it takes you to confirm it’s actually right. If you don’t build for that asymmetry, you lose ground on every iteration.

What actually works

The fix, in my experience, is to stop relying on the model’s self-report and instead make correctness externally defined and externally checked. A few things that have moved the needle for me on the renderer:

Adversarial review agents. Not one reviewer, several, with explicit instructions to find flaws, disagree, and escalate. A single reviewer inherits the same sycophancy gradient. Multiple reviewers with conflicting mandates (one checks correctness, one checks performance, one looks for silent scope reduction) break the collusion.

Correctness defined in two registers. Qualitative: image diffs, perceptual checks, “does this frame look right.” Quantitative: variance statistics on the Monte Carlo estimator, convergence rate vs. reference, wall-time budgets with regression thresholds. Either one alone is gameable. Together they’re much harder to satisfy by shortcut.
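
To give a flavor of the quantitative register, here's a minimal sketch (not RISE's harness) of one such check: an unbiased Monte Carlo estimator's RMSE against a known reference falls as O(1/sqrt(N)), so quadrupling the sample count should roughly halve the error, while a biased shortcut tends to plateau. The integrand and tolerances are illustrative.

```python
import math
import random

def mc_estimate(f, n, rng):
    # Plain Monte Carlo estimate of the integral of f over [0, 1].
    return sum(f(rng.random()) for _ in range(n)) / n

def rmse(f, reference, n, trials, seed=0):
    # Root-mean-square error of the estimator across independent trials.
    rng = random.Random(seed)
    sq_errs = [(mc_estimate(f, n, rng) - reference) ** 2 for _ in range(trials)]
    return math.sqrt(sum(sq_errs) / trials)

# Integral of x^2 over [0, 1] is 1/3. Quadrupling the sample count should
# roughly halve the RMSE for an unbiased estimator; a biased shortcut
# plateaus instead. The bounds below are loose statistical tolerances.
e1 = rmse(lambda x: x * x, 1 / 3, n=256, trials=200)
e2 = rmse(lambda x: x * x, 1 / 3, n=1024, trials=200)
assert 1.4 < e1 / e2 < 2.9, (e1, e2)
```

A weakened estimator that merely "looks converged" fails this check even when the rendered frame looks plausible.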

No self-graded completion. The model never decides when it’s done. A separate evaluation harness does. If the harness doesn’t pass, the task isn’t finished, regardless of how confident the summary sounds.
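
In skeleton form the gate is nothing fancy; the point is that the checks, not the model's summary, own the exit condition. The check names below are hypothetical.

```python
def task_complete(checks):
    """The harness, not the model, decides completion: every named check
    must pass. Returns (done, list_of_failed_check_names)."""
    failures = [name for name, check in checks.items() if not check()]
    return (len(failures) == 0, failures)

# The model's summary is never consulted -- only the checks.
done, failed = task_complete({
    "compiles": lambda: True,
    "variance_within_budget": lambda: True,
    "no_silent_scope_reduction": lambda: False,  # e.g. a feature was dropped
})
```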

Cross-model review. One of the more effective tricks I’ve landed on: have a different model review the first model’s work. Same model reviewing itself inherits the same priors, the same shortcuts, the same blind spots, and the same training distribution that taught it which corners are safe to cut. A different lab’s model has been shaped by a different reward signal and trips on different things. The disagreements are where the interesting bugs live. It’s not that the second model is smarter; it’s that it’s wrong in uncorrelated ways.

The pattern underneath all four: assume the model will take the easiest path that looks like success, and make sure the easiest path is success.

Has this been your experience?

I’d be curious whether others are seeing the same thing, and what you’ve built to counter it. Specifically: how do you define correctness for tasks where the obvious checks are cheap to fool? What does your adversarial review setup look like? And has anyone found a prompting pattern, rather than a harness, that reliably suppresses the “declare victory and exit” behavior?

Fifteen Years of Rendering, Catching Up in Weeks

I stopped doing serious rendering work around 2010. Path tracing, BSDFs, and Monte Carlo integration were all second nature. From the start of my undergrad in 1997 right up to building the Adobe Ray Tracer in Photoshop in 2010, I had been steeped in this world. Then life moved on: building consumer products at scale, building teams, building platforms. My renderer sat dormant.

Recently I picked it back up, partly to have something concrete to work on with agents and partly to scratch that graphics itch that never went away. Normally, to catch up, I’d plow through the fifteen years of SIGGRAPH papers that had stacked up, but I did something different.

I started implementing instead.

Read less, build more

There’s a difference between understanding a technique and understanding why it works the way it does. Papers give you the former. Code gives you the latter. I’m also a ‘doing’ learner, so working on something is how I learn. I’d read many papers on multiple importance sampling (MIS), but it wasn’t until I actually implemented it, working through all the bugs and watching the variance drop, that it really locked in.
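
The effect is easy to reproduce in a toy setting. Below is a sketch of my own (not from any paper's code) of one-sample-per-strategy MIS with the balance heuristic, combining a uniform strategy with a linear-pdf strategy to integrate 3x² over [0, 1]. The combined estimator stays unbiased because the weights sum to one at every point.

```python
import math
import random

def mis_estimate(f, n, rng):
    """One sample per strategy per iteration, combined with the balance
    heuristic. Strategy 1 samples uniformly (p1 = 1); strategy 2 samples
    p2(x) = 2x via inversion (x = sqrt(u))."""
    total = 0.0
    for _ in range(n):
        x1 = rng.random()             # strategy 1 sample
        x2 = math.sqrt(rng.random())  # strategy 2 sample
        for x, p in ((x1, 1.0), (x2, 2.0 * x2)):
            w = p / (1.0 + 2.0 * x)   # balance heuristic: p_i / sum_j p_j
            total += w * f(x) / p
    return total / n

# Integrate f(x) = 3x^2 over [0, 1]; the exact answer is 1.
rng = random.Random(1)
est = mis_estimate(lambda x: 3.0 * x * x, 20_000, rng)
```

Swapping the balance weights for naive averaging, or dropping one strategy, is exactly the kind of experiment that makes the variance behavior click in a way reading never did for me.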

The problem used to be velocity. Getting from “I understand this algorithm” to “a working implementation” took days to weeks. Boilerplate, scaffolding, debugging the trivial stuff. The interesting parts were buried under setup cost.

Coding agents collapsed that ratio.

The agent workflow, honestly

My workflow isn’t careful line-by-line review. That’s not the point; the speed is the feature. When you can go from a paper to a running implementation of GGX microfacet with VNDF sampling in a fraction of the time it used to take, you get to spend your cognitive budget on the parts that actually require thinking.

What I’ve found is that agents aren’t uniformly fast. Some things, like well-specified algorithms with solid reference implementations (the Dupuy-Benyoub spherical cap VNDF sampler, for example), they handle cleanly. Others require real steering. Getting light subpath guiding right in BDPT came down to a subtle decision about separate vs. shared guiding fields that no prompt was going to resolve on its own. When separate eye and light fields produce destructive interference at the same surface position, you need to understand why, not just what to type.

That pattern (full speed on clear specs, hard stops where physical insight is required) has been one of the more interesting meta-lessons. More on where the boundary actually falls in a later post as I let that stew more.

The biggest surprise: the field went physical

When I left, biased techniques were the pragmatic answer to hard light transport problems. Dipole approximation for subsurface scattering. Photon mapping as a caustics crutch. Spectral rendering was a research luxury, RGB was good enough.

Coming back, I expected things to have fully moved to the GPU, but for that trade-off to still be alive.

It isn’t. The field has largely moved to unbiased physical simulation across the board. Random-walk SSS has replaced diffusion approximations as the standard. Hero wavelength spectral sampling means that spectral rendering is the default. Null-scattering volume formulations handle participating media properly while staying physically based. The question isn’t “can we afford to be physically correct?” anymore.
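
The null-scattering idea is simple enough to sketch in a few lines. This is textbook Woodcock/delta tracking under an assumed constant majorant, not RISE's volume code: fictitious "null" density pads the true extinction up to the majorant so free-flight distances can be sampled analytically, and each tentative collision is accepted as real with probability sigma_t / sigma_majorant.

```python
import math
import random

def delta_track(sigma_t, sigma_maj, rng, t_max):
    """Sample a collision distance along a ray through a heterogeneous
    medium. sigma_t(t) is the true extinction at distance t; sigma_maj is
    a constant majorant with sigma_t(t) <= sigma_maj everywhere. Returns
    the real-collision distance, or None if the ray escapes past t_max."""
    t = 0.0
    while True:
        # Free flight against the homogenized (majorant) medium.
        t -= math.log(1.0 - rng.random()) / sigma_maj
        if t >= t_max:
            return None
        # Accept as a real collision with probability sigma_t / sigma_maj;
        # otherwise it's a null collision and the walk continues.
        if rng.random() < sigma_t(t) / sigma_maj:
            return t

# Sanity check: for a homogeneous medium with sigma_t = 1 over a unit
# distance, the escape fraction should approach exp(-1), about 0.37.
rng = random.Random(7)
escapes = sum(delta_track(lambda t: 1.0, 4.0, rng, 1.0) is None
              for _ in range(20_000))
```

The elegance is that heterogeneity costs nothing structurally: sigma_t can be any spatially varying function, and the estimator stays unbiased as long as the majorant bounds it.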

This landed differently for me than it might for others. My original skin rendering work was dual-purpose: graphics and biomedical light transport simulation. The biomedical side required physical random walks and spectral interaction simulation; you can’t use a dipole approximation when you need to know where photons actually go in tissue. At the time, that work lived in a completely separate world from production rendering. The techniques were too expensive, too specialized.

Now seeing random-walk SSS become the graphics standard felt like watching a conversation finally arrive somewhere you’d been standing for a while.

What’s been implemented so far

In a few weeks, working alongside agents, RISE (the renderer I am modernizing) has gone from a reasonable 2010-era foundation to something a lot closer to where the field is now, with things like:

  • GGX microfacet with anisotropic VNDF sampling (Dupuy-Benyoub 2023) and Kulla-Conty multiscattering energy compensation
  • Random-walk subsurface scattering replacing dipole/diffusion approximations
  • Hero wavelength spectral sampling to get spectral rendering with lower color noise
  • Null-scattering volume framework for unbiased heterogeneous participating media
  • Light BVH for many-light sampling (4.78x variance reduction on a 100-light scene)
  • Light subpath guiding in BDPT using separate OpenPGL fields for eye and light paths
  • Blue-noise error distribution via ZSobol sampling

Next up: VCM and hyperspectral skin rendering.