Anthropic released Claude 4.7 yesterday. Before reading what the internet thought, I sat down with Can Hoskan in the office this morning and ran our own test. Same screenshot, same prompt, two different models. We wanted to see what actually changed for designers using Claude to generate UI.

The short version: 4.7's output looks more Apple than 4.6's, and it also has more UI flaws. If you are using these tools in your workflow, that trade-off matters more than any benchmark score.

The setup for the Claude 4.7 test

We used a Google Analytics screenshot as the source. Nothing fancy, a standard dashboard with cards, typography, and data. The prompt asked both models to redesign it using Apple’s Human Interface Guidelines. Same prompt. Same image. Only the model changed.

We did not ask for dark mode. We did not ask for a specific screen size. We did not tell it what platform to design for. The point was to see what each model chose on its own.

How Claude 4.6 performed on Apple-style UI

4.6 came back with a layout that was only slightly Apple. Some of the colour choices felt right: the specific shades and tints of blue you see in Apple products. The typography helped. The surfaces were quiet enough to feel considered.

But the icons were invented. The spacing was inconsistent. And overall, if you were new to design, you probably would not have guessed this was meant to reference Apple. I rated it 4 or 5 out of 10 on the Apple feel. Safe. Grounded. Not broken, but not exciting either.

How Claude 4.7 performed on the same brief

4.7 was a different story. The output was more Apple at first glance. Cleaner surfaces, better typographic hierarchy, more of that Apple-style polish you can feel before you can name it. It also took initiative we did not ask for. It switched to dark mode. It built for iPad resolution. At one point it referred to its own output as a “native macOS app.”

On feel, I rated it 6.5 to 7 out of 10. A real jump from 4.6.

But then we looked at the details. Contrast ratios were off. Spacing was less consistent than in the 4.6 output. The icons were still invented, and arguably worse than before. So the prettier output was also the buggier output.

What this means for designers using AI to generate UI

Polish and correctness are not the same thing. 4.7 gives you a better-looking starting point, which is valuable. But it also hides more mistakes behind its polish, which is dangerous if you are not looking closely.

Treat every output from these models as a sketch, not a final. Fix the contrast yourself. Audit the spacing yourself. Replace the invented icons with real ones. The tool is getting better at feel, but the designer still owns the detail.
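
If you want to check contrast with numbers rather than by eye, here is a minimal sketch of one way to do it. It assumes the WCAG 2.x contrast formula and its AA thresholds (4.5:1 for normal text, 3:1 for large text), which is a common yardstick and not something we measured in this test. The function names are our own, for illustration only, and it handles six-digit hex colours only.

```python
def _linear_channel(value_8bit: int) -> float:
    """Convert one 0-255 sRGB channel to linear light (WCAG 2.x definition)."""
    c = value_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_colour: str) -> float:
    """Relative luminance of a colour written as '#RRGGBB'."""
    h = hex_colour.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return (0.2126 * _linear_channel(r)
            + 0.7152 * _linear_channel(g)
            + 0.0722 * _linear_channel(b))


def contrast_ratio(foreground: str, background: str) -> float:
    """Contrast ratio between two colours, from 1.0 (identical) up to 21.0."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(foreground: str, background: str, large_text: bool = False) -> bool:
    """True if the pair meets the assumed AA threshold for its text size."""
    threshold = 3.0 if large_text else 4.5
    return contrast_ratio(foreground, background) >= threshold


if __name__ == "__main__":
    # Hypothetical colour pairs; swap in values sampled from the generated UI.
    pairs = [("#000000", "#FFFFFF"), ("#767676", "#FFFFFF"), ("#777777", "#FFFFFF")]
    for fg, bg in pairs:
        print(fg, "on", bg, round(contrast_ratio(fg, bg), 2), passes_aa(fg, bg))
```

Sample the text and background colours from the generated design, run each pair through `passes_aa`, and fix whatever fails before the design goes any further.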

The honest takeaway

Nobody knows where AI UI generation is going. We are all predicting. What we do know is that 4.7 is closer than 4.6 on Apple feel and further off on execution. If you are building real products with these outputs, the uglier model might actually save you more time than the prettier one. That is a strange conclusion to arrive at, but that is what the test showed us this morning.

We will keep running these tests on the channel. Tell us what you want tested next.