For the last few days I’ve been building UI for a renderer I’m working on. It’s a project no one else will care about (just me, my code, and my own pixels). And still, getting AI agents to produce a UI that actually feels right has been a real chore. Endless iteration on cursor transition states, on button states, on panel animations, on the hundred small things that separate “functional” from “good.” That grind is a big part of why I’ve been thinking about a pattern I’ve seen repeat over the years, across projects, teams, and companies.
It tends to go something like this. Leaders observe, with growing frustration, that the same papercuts are still there, the test suite is still flaky, the retention curves haven’t budged. The consoling thought used to be “if only we had more headcount.” Now it’s: “Once everyone’s fully on Claude or Cursor or Copilot, we’ll have the time to clean all of this up.”
The premise is that the gap between the product you have and the product you wish you had is a capacity problem. Your team would have written better tests if they’d had more hours. They’d have polished the empty states if the feature backlog hadn’t been so long. They’d have shipped a tighter v1 if the deadline hadn’t been breathing down their neck. AI hands back hours, the logic goes, so AI fills the gap.
Take that argument seriously and look at what it actually implies. It implies that the people on your team know what good looks like, want to do it, and have been blocked only by the clock. On some teams that’s true. On most teams where the polish is missing or the tests are bad, it isn’t the clock. It’s the culture.
Polish doesn’t come from time. Polish comes from a person on the team who cannot let a panel-slide animation ship until the easing curve feels right (who’ll spend an hour tweaking cubic-bezier parameters because the default ease-out is technically fine but the motion doesn’t quite land under your thumb). When we built the first version of Google Photos, the PM and designer leading it were obsessive about exactly this kind of detail: animation curves, timings, whether photos were missing when you rapidly scrolled through a thousand of them. It’s a big part of why that first version felt so fast. Not because it was actually faster than its competitors on every dimension, but because every motion it made had been tuned until it felt responsive. Over time that kind of obsession hardens into systems (design reviews, performance budgets, the QA bars nobody argues with), but the systems exist only because someone cared first. That kind of obsession isn’t unlocked by capacity. It’s a trait the team has or doesn’t have, and it propagates because leadership rewards it, or withers because leadership rewards shipping the next thing instead.
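To make that kind of tweaking concrete, here’s a rough TypeScript sketch of how a CSS-style cubic-bezier timing function works. The curve values are invented for illustration (not the actual parameters from any project mentioned here); the point is that two curves that are both technically “ease-out” can feel completely different, and the difference lives in two pairs of numbers someone has to care enough to tune.

```typescript
// A CSS-style cubic-bezier timing function: P0=(0,0) and P3=(1,1) are
// fixed; (x1,y1) and (x2,y2) are the control points you tweak.
function cubicBezier(x1: number, y1: number, x2: number, y2: number) {
  // One coordinate of the Bézier polynomial at parameter t.
  const coord = (a: number, b: number, t: number) =>
    3 * a * (1 - t) ** 2 * t + 3 * b * (1 - t) * t ** 2 + t ** 3;
  return (x: number): number => {
    // Invert x(t) by bisection (x(t) is monotonic here), then evaluate y(t).
    let lo = 0;
    let hi = 1;
    for (let i = 0; i < 50; i++) {
      const mid = (lo + hi) / 2;
      if (coord(x1, x2, mid) < x) lo = mid;
      else hi = mid;
    }
    return coord(y1, y2, (lo + hi) / 2);
  };
}

const easeOut = cubicBezier(0, 0, 0.58, 1); // the CSS default ease-out
const snappy = cubicBezier(0.22, 1, 0.36, 1); // a "tuned" curve (illustrative)

// A quarter of the way through the duration, the tuned curve has covered
// far more of the distance -- that early commitment is what reads as
// "responsive under your thumb", even though both curves take the same
// total time.
console.log(easeOut(0.25), snappy(0.25));
```

Nothing here is hard; the hard part is having someone who notices that 0.58 feels mushy and keeps nudging it until it doesn’t.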
I notice a related pattern in my own building right now. I’m targeting both Mac and Windows for this renderer’s UI, and the contrast is striking. On Mac, agents produce something that feels close to right surprisingly fast. The platform’s conventions are narrow, opinionated, and have been refined for decades, and the model can pattern-match into them; the HIG acts as an implicit loss function on the model’s output, because decades of apps conforming to it mean the work of caring about polish is already baked into nearly every example the model has seen. On Windows, where the surface is wider and the conventions more permissive, I have to be the one sweating every detail. The agent produces something functional but generic, and I iterate on it for hours to get it to feel right.
The lesson generalizes uncomfortably. AI agents do their best work on top of a foundation that has already encoded taste, one that has already made the hard decisions about what good looks like. When the foundation is opinionated about quality, the agent inherits that opinion. When the foundation is permissive, the agent’s output is permissive, and the only thing standing between you and mediocre is your own iteration. Your team’s culture is that foundation. If the culture has already decided that flaky tests are unacceptable and that animations have to feel right, AI will produce work inside those constraints. If it hasn’t, AI will produce work that doesn’t, and you’ll be back to iterating it to good yourself, at the same rate you always did.
The reverse is also worth saying clearly. AI can help build the foundation, not just operate inside one: a leader who encodes quality into tooling (performance budgets, visual diffing in CI, a non-negotiable test bar, a design system the agent must conform to) lowers the cost of meeting the standard until meeting it becomes the default. But this still requires someone first deciding what good looks like and making it concrete; AI can scale a culture of quality, it cannot supply the will to define one.
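As one illustration of what “encoding the standard into tooling” can look like, here’s a minimal sketch of a performance-budget gate of the kind a CI pipeline might run. The artifact names and byte budgets are invented; the shape is what matters: the standard is written down once, and the build fails mechanically whenever it’s violated, so nobody has to argue the case per-PR.

```typescript
// A hypothetical CI gate: fail the build when a built artifact exceeds
// its performance budget. Names and numbers are illustrative only.
interface Artifact {
  name: string;
  bytes: number;
}

// The budgets ARE the encoded standard: a leader decided these once.
const budgets: Record<string, number> = {
  "app.js": 200_000,
  "panel.css": 20_000,
};

function checkBudgets(artifacts: Artifact[]): string[] {
  return artifacts
    .filter((a) => budgets[a.name] !== undefined && a.bytes > budgets[a.name])
    .map((a) => `${a.name}: ${a.bytes} bytes exceeds budget of ${budgets[a.name]}`);
}

const violations = checkBudgets([
  { name: "app.js", bytes: 215_000 },
  { name: "panel.css", bytes: 18_000 },
]);
if (violations.length > 0) {
  console.error(violations.join("\n"));
  // In a real pipeline this would be `process.exit(1)`: the build fails,
  // and meeting the bar stops being a negotiation.
}
```

The gate doesn’t create the standard; it only makes an existing one cheap to enforce, which is exactly the division of labor the paragraph above describes.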
A weak test suite tells the same story. Teams with a culture of testing don’t write tests because they have spare cycles. They write tests because the alternative (shipping code without them) feels unacceptable, the way operating without gloves would feel unacceptable to a surgeon. Sometimes a particular schedule really is impossible, and a test gets skipped. But over months and quarters, the repeated absence of tests is rarely a scheduling accident; it’s a revealed priority. “We didn’t have time” is the explanation. “We didn’t value it enough to displace something else” is the cause. Hand that team a productivity boost from AI and most of it will go where the priorities already point.
This is the part leaders find hardest to look at, because the culture is mostly a reflection of what they reward. If demos that look great but buckle under load get celebrated, you’re going to get more demos that look great and buckle under load. If the engineer who quietly fixes the flaky test gets less recognition than the engineer who launches a flashy new feature, your test suite is going to keep rotting. AI doesn’t change any of that. It just lets the same incentive structure produce more output, faster.
There’s a related fallacy worth naming. Leaders often convince themselves their product lacks product-market fit because they haven’t been able to iterate fast enough; AI will let them iterate faster, so PMF must be just around the corner. But teams that found PMF didn’t get there primarily through speed of iteration. They got there through quality of observation: watching real users, sitting with the data, being honest about what wasn’t working. A team that’s bad at honest observation will simply iterate faster toward the wrong things. The bottleneck on PMF was never typing speed.
The thread connecting all three is the same. Polish is attention made visible. In UI it shows up as motion and easing curves and the way a panel slides under your thumb; in product strategy it shows up as noticing the user behavior everyone else rounds away; in engineering it shows up as refusing to normalize a flaky test. Different surfaces, same muscle. And the muscle either gets exercised or atrophies based on what the team’s environment rewards.
The uncomfortable implication is that AI widens variance. Teams with a deep culture of quality, of polish, of genuine care about the user pour AI capacity into more depth, more rigor, more of the things they were already doing well. Teams that shipped sloppy work before will ship more of it, faster. The gap between the two grows, because AI is a multiplier and the thing it multiplies is whatever the team already values. But variance widening isn’t destiny. The same logic that makes this dangerous for inattentive leaders makes the work available to any leader willing to make the team’s standards explicit, and to recognize the work of meeting them.
So if you’re a leader staring at your product and seeing problems you’d quietly hoped AI would absorb (the brittle tests, the rough edges, the features that don’t quite land), the work in front of you isn’t really about adopting tools. It’s about looking hard at what you celebrate in reviews, what you promote people for, what you tolerate at ship time, and which details you yourself notice and care about in the product. The team takes its cues from that, not from the model.
AI is the most powerful capacity multiplier we’ve had in a generation. It can give you a thousand more hours; it cannot give you the heart to spend one of them on a button state.
