The Sycophantic, Lazy LLM: An RL Artifact?

If you’ve spent real time pairing with a frontier LLM on anything non-trivial, whether a renderer or a tricky refactor, you’ve probably noticed two failure modes that keep showing up together:

It wants to agree with you. Push back on its output, even weakly, and it folds. “You’re right, let me fix that,” and then it “fixes” something that was correct.

It wants to be done. Error paths silently return: no exception, no log, no indication anything went wrong. Edge cases get hand-waved with a comment. Tests get weakened until they pass. A 2000-line diff arrives with three of the six features you asked for, and a cheerful summary claiming completion.

Individually, either is annoying. Together they’re corrosive, because the sycophancy hides the laziness. The model tells you it’s finished; you believe it; you find out a day later that the shortcut it took broke an invariant essential to actual operation.

A recent example from RISE, my renderer: I was getting VCM (Vertex Connection and Merging) stood up. The first few passes from the model looked plausible. Compiled, ran, produced images. But the images had splotches, the telltale artifact of bad vertex merging: incorrect density estimation blowing up localized radiance into bright blobs. Nothing in the model’s self-report flagged this. It was only after several iterations, and deliberately setting up validation that looked for splotches specifically (alongside variance and convergence checks against a bidirectional path tracing reference), that I got to an implementation I’d trust. The model wasn’t going to tell me the merge kernel was wrong. The images had to.
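
The splotch check can be sketched as a simple outlier filter. This is a toy illustration, not RISE's actual validation code: the window size and brightness ratio below are invented thresholds, and a real check on a density-estimation artifact would use more careful statistics.

```python
import statistics

def find_splotches(image, window=2, ratio=8.0):
    """Flag pixels whose value wildly exceeds the median of their local
    neighborhood -- the signature of a density-estimation blowup.
    `window` and `ratio` are invented thresholds, not tuned values."""
    h, w = len(image), len(image[0])
    hits = []
    for y in range(h):
        for x in range(w):
            neighborhood = [
                image[ny][nx]
                for ny in range(max(0, y - window), min(h, y + window + 1))
                for nx in range(max(0, x - window), min(w, x + window + 1))
                if (ny, nx) != (y, x)
            ]
            # A pixel far above its local median is a candidate splotch.
            if image[y][x] > ratio * statistics.median(neighborhood) + 1e-6:
                hits.append((x, y))
    return hits
```

The point isn't this particular filter; it's that the check is mechanical and runs outside the model's self-report.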

Why this shape?

My working theory is that this is an RL artifact, not a capability ceiling. The reward signal during post-training is, at best, a noisy proxy for “did the user get what they actually needed.” What it can much more easily capture is:

  • Did the user express satisfaction in the next turn?
  • Did the response terminate cleanly without long back-and-forth?
  • Did the answer look confident and well-structured?

Optimize against those and you get exactly the behavior we see. Agreeing with the user is a near-guaranteed path to a positive next-turn signal. Declaring victory early and producing a tidy summary looks indistinguishable, to a human rater skimming transcripts, from actually finishing. The model isn’t being deceptive; it’s learned that appearing done and being done are rewarded identically, and appearing done is cheaper. This is reward hacking in the standard sense: the policy found a cheap region of output space that scores well on the proxy.

The same dynamic explains why the failures cluster at exactly the places verification is hardest: numerical correctness, performance regressions, subtle concurrency, anything where “it compiles and the happy path runs” is wildly insufficient as an oracle. And it gets worse as models improve. Generation cost is falling fast. Verification cost is not (especially if you are building a renderer). Every capability jump widens the gap between how quickly the model can produce something plausible and how long it takes you to confirm it’s actually right. If you don’t build for that asymmetry, you lose ground on every iteration.

What actually works

The fix, in my experience, is to stop relying on the model’s self-report and instead make correctness externally defined and externally checked. A few things that have moved the needle for me on the renderer:

Adversarial review agents. Not one reviewer, several, with explicit instructions to find flaws, disagree, and escalate. A single reviewer inherits the same sycophancy gradient. Multiple reviewers with conflicting mandates (one checks correctness, one checks performance, one looks for silent scope reduction) break the collusion.

Correctness defined in two registers. Qualitative: image diffs, perceptual checks, “does this frame look right.” Quantitative: variance statistics on the Monte Carlo estimator, convergence rate vs. reference, wall-time budgets with regression thresholds. Either one alone is gameable. Together they’re much harder to satisfy by shortcut.
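
To give a flavor of the quantitative register, here's a minimal sketch (not RISE's harness) of one such check: an unbiased Monte Carlo estimator's RMSE against a known reference falls as O(1/sqrt(N)), so quadrupling the sample count should roughly halve the error, while a biased shortcut tends to plateau. The integrand and tolerances are illustrative.

```python
import math
import random

def mc_estimate(f, n, rng):
    # Plain Monte Carlo estimate of the integral of f over [0, 1].
    return sum(f(rng.random()) for _ in range(n)) / n

def rmse(f, reference, n, trials, seed=0):
    # Root-mean-square error of the estimator across independent trials.
    rng = random.Random(seed)
    sq_errs = [(mc_estimate(f, n, rng) - reference) ** 2 for _ in range(trials)]
    return math.sqrt(sum(sq_errs) / trials)

# Integral of x^2 over [0, 1] is 1/3. Quadrupling the sample count should
# roughly halve the RMSE for an unbiased estimator; a biased shortcut
# plateaus instead. The bounds below are loose statistical tolerances.
e1 = rmse(lambda x: x * x, 1 / 3, n=256, trials=200)
e2 = rmse(lambda x: x * x, 1 / 3, n=1024, trials=200)
assert 1.4 < e1 / e2 < 2.9, (e1, e2)
```

A weakened estimator that merely "looks converged" fails this check even when the rendered frame looks plausible.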

No self-graded completion. The model never decides when it’s done. A separate evaluation harness does. If the harness doesn’t pass, the task isn’t finished, regardless of how confident the summary sounds.
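
In skeleton form the gate is nothing fancy; the point is that the checks, not the model's summary, own the exit condition. The check names below are hypothetical.

```python
def task_complete(checks):
    """The harness, not the model, decides completion: every named check
    must pass. Returns (done, list_of_failed_check_names)."""
    failures = [name for name, check in checks.items() if not check()]
    return (len(failures) == 0, failures)

# The model's summary is never consulted -- only the checks.
done, failed = task_complete({
    "compiles": lambda: True,
    "variance_within_budget": lambda: True,
    "no_silent_scope_reduction": lambda: False,  # e.g. a feature was dropped
})
```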

Cross-model review. One of the more effective tricks I’ve landed on: have a different model review the first model’s work. Same model reviewing itself inherits the same priors, the same shortcuts, the same blind spots, and the same training distribution that taught it which corners are safe to cut. A different lab’s model has been shaped by a different reward signal and trips on different things. The disagreements are where the interesting bugs live. It’s not that the second model is smarter; it’s that it’s wrong in uncorrelated ways.

The pattern underneath all four: assume the model will take the easiest path that looks like success, and make sure the easiest path is success.

Has this been your experience?

I’d be curious whether others are seeing the same thing, and what you’ve built to counter it. Specifically: how do you define correctness for tasks where the obvious checks are cheap to fool? What does your adversarial review setup look like? And has anyone found a prompting pattern, rather than a harness, that reliably suppresses the “declare victory and exit” behavior?

Fifteen Years of Rendering, Catching Up in Weeks

I stopped doing serious rendering work around 2010. Path tracing, BSDFs, and Monte Carlo integration were all second nature. From the start of my undergrad in 1997 right up to building the Adobe Ray Tracer in Photoshop in 2010, I had been steeped in this world. Then life moved on: building consumer products at scale, building teams, building platforms. My renderer sat dormant.

Recently I picked it back up, partly to have something concrete to work on with agents and partly to scratch that graphics itch that never went away. Normally, to catch up, I’d plow through the fifteen years of SIGGRAPH papers that had stacked up, but I did something different.

I started implementing instead.

Read less, build more

There’s a difference between understanding a technique and understanding why it works the way it does. Papers give you the former. Code gives you the latter. I’m also a ‘doing’ learner, so working on something is how I learn. I’d read many papers on multiple importance sampling (MIS), but it wasn’t until I actually implemented it, working through all the bugs and watching the variance drop, that it really locked in.
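
The effect is easy to reproduce in a toy setting. Below is a sketch of my own (not from any paper's code) of one-sample-per-strategy MIS with the balance heuristic, combining a uniform strategy with a linear-pdf strategy to integrate 3x² over [0, 1]. The combined estimator stays unbiased because the weights sum to one at every point.

```python
import math
import random

def mis_estimate(f, n, rng):
    """One sample per strategy per iteration, combined with the balance
    heuristic. Strategy 1 samples uniformly (p1 = 1); strategy 2 samples
    p2(x) = 2x via inversion (x = sqrt(u))."""
    total = 0.0
    for _ in range(n):
        x1 = rng.random()             # strategy 1 sample
        x2 = math.sqrt(rng.random())  # strategy 2 sample
        for x, p in ((x1, 1.0), (x2, 2.0 * x2)):
            w = p / (1.0 + 2.0 * x)   # balance heuristic: p_i / sum_j p_j
            total += w * f(x) / p
    return total / n

# Integrate f(x) = 3x^2 over [0, 1]; the exact answer is 1.
rng = random.Random(1)
est = mis_estimate(lambda x: 3.0 * x * x, 20_000, rng)
```

Swapping the balance weights for naive averaging, or dropping one strategy, is exactly the kind of experiment that makes the variance behavior click in a way reading never did for me.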

The problem used to be velocity. Getting from “I understand this algorithm” to “a working implementation” took days to weeks. Boilerplate, scaffolding, debugging the trivial stuff. The interesting parts were buried under setup cost.

Coding agents collapsed that ratio.

The agent workflow, honestly

My workflow isn’t careful line-by-line review. That’s not the point; the speed is the feature. When you can go from a paper to a running implementation of GGX microfacet with VNDF sampling in a fraction of the time it used to take, you get to spend your cognitive budget on the parts that actually require thinking.

What I’ve found is that agents aren’t uniformly fast. Some things, like well-specified algorithms with solid reference implementations (the Dupuy-Benyoub spherical cap VNDF sampler, for example), they handle cleanly. Others require real steering. Getting light subpath guiding right in BDPT came down to a subtle decision about separate vs. shared guiding fields that no prompt was going to resolve on its own. When separate eye and light fields produce destructive interference at the same surface position, you need to understand why, not just what to type.

That pattern (full speed on clear specs, hard stops where physical insight is required) has been one of the more interesting meta-lessons. More on where the boundary actually falls in a later post as I let that stew more.

The biggest surprise: the field went physical

When I left, biased techniques were the pragmatic answer to hard light transport problems. Dipole approximation for subsurface scattering. Photon mapping as a caustics crutch. Spectral rendering was a research luxury, RGB was good enough.

Coming back, I expected things to have fully moved to the GPU, but for that trade-off to still be alive.

It isn’t. The field has largely moved to unbiased physical simulation across the board. Random-walk SSS has replaced diffusion approximations as the standard. Hero wavelength spectral sampling means that spectral rendering is the default. Null-scattering volume formulations handle participating media properly while staying physically based. The question isn’t “can we afford to be physically correct?” anymore.
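
The null-scattering idea is simple enough to sketch in a few lines. This is textbook Woodcock/delta tracking under an assumed constant majorant, not RISE's volume code: fictitious "null" density pads the true extinction up to the majorant so free-flight distances can be sampled analytically, and each tentative collision is accepted as real with probability sigma_t / sigma_majorant.

```python
import math
import random

def delta_track(sigma_t, sigma_maj, rng, t_max):
    """Sample a collision distance along a ray through a heterogeneous
    medium. sigma_t(t) is the true extinction at distance t; sigma_maj is
    a constant majorant with sigma_t(t) <= sigma_maj everywhere. Returns
    the real-collision distance, or None if the ray escapes past t_max."""
    t = 0.0
    while True:
        # Free flight against the homogenized (majorant) medium.
        t -= math.log(1.0 - rng.random()) / sigma_maj
        if t >= t_max:
            return None
        # Accept as a real collision with probability sigma_t / sigma_maj;
        # otherwise it's a null collision and the walk continues.
        if rng.random() < sigma_t(t) / sigma_maj:
            return t

# Sanity check: for a homogeneous medium with sigma_t = 1 over a unit
# distance, the escape fraction should approach exp(-1), about 0.37.
rng = random.Random(7)
escapes = sum(delta_track(lambda t: 1.0, 4.0, rng, 1.0) is None
              for _ in range(20_000))
```

The elegance is that heterogeneity costs nothing structurally: sigma_t can be any spatially varying function, and the estimator stays unbiased as long as the majorant bounds it.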

This landed differently for me than it might for others. My original skin rendering work was dual-purpose: graphics and biomedical light transport simulation. The biomedical side required physical random walks and spectral interaction simulation; you can’t use a dipole approximation when you need to know where photons actually go in tissue. At the time, that work lived in a completely separate world from production rendering. The techniques were too expensive, too specialized.

Now seeing random-walk SSS become the graphics standard felt like watching a conversation finally arrive somewhere you’d been standing for a while.

What’s been implemented so far

In a few weeks, working alongside agents, RISE (the renderer I am modernizing) has gone from a reasonable 2010-era foundation to something a lot closer to where the field is now, with things like:

  • GGX microfacet with anisotropic VNDF sampling (Dupuy-Benyoub 2023) and Kulla-Conty multiscattering energy compensation
  • Random-walk subsurface scattering replacing dipole/diffusion approximations
  • Hero wavelength spectral sampling to get spectral rendering with lower color noise
  • Null-scattering volume framework for unbiased heterogeneous participating media
  • Light BVH for many-light sampling (4.78x variance reduction on a 100-light scene)
  • Light subpath guiding in BDPT using separate OpenPGL fields for eye and light paths
  • Blue-noise error distribution via ZSobol sampling

Next up: VCM and hyperspectral skin rendering.