Launch HN: Design Arena (YC S25) – Head-to-head AI benchmark for aesthetics

henriquegodoy · 2025-08-12T17:08:59 1755018539

This is actually really needed, current ai design tools are so predictable and formulaic, like every output feels like the same purple gradients with rounded corners and that one specific sans serif font that every model seems obsessed with, it's gotten to the point where you can spot ai-generated designs from a mile away because they all have this weird sterile aesthetic that screams "made by a model"

grace77 · 2025-08-12T17:17:59 1755019079

Exactly - we think the ai design tools are in the equivalent of the 'uncanny valley' territory that a lot of the diffusion models were stuck in just 1-2 months ago; most average diffusion models are still in this local optimum, but the best of the best seem to have escaped it.

BoorishBears · 2025-08-12T19:13:28 1755026008

I don't think this works right now tbh.

It has the same problem as LMArena (which already had webarena): better aesthetics are so far out of distribution you can't even train on the feedback you get here.

You just get a new form of turbo-slop as some hidden preference takes over. With text output that ended up being extensive markdown and emojis. Here that might be people accidentally associating frosted surfaces with relatively better aesthetics, for example.

The problem is so bad LMArena maintains a seperate ranking where they strip away styling entirely

Michelangelo11 · 2025-08-12T17:00:34 1755018034

This is interesting but, speaking frankly, I see many seemingly insurmountable issues. Here are some:

- Contests will often be won not by the entry that best adhered to the prompt, but the best-looking one. This happened in the contest "Input Prompt Build a brutalist website to a typeface maker," which I got as a recent example. The winning entry had megawatt-bright magenta and yellow, which shouldn't appear anywhere near brutalism, and in other design aspects had almost no connection to brutalism either -- but it was the most attractive of the bunch.

- The approach only gets you to a local maximum. Current LLMs aren't very good designers, as you say, so contests will involve picking between mostly middling entries. You'd want a design that's, say, a 9 or a 10 on a 10-point scale -- but some 95% of the entry distribution will probably be between 5.5 and 7.5 or so, and that's what users will get to pick from.

j_da · 2025-08-12T17:07:57 1755018477

All great points. A limitation with human feedback is that once you start asking for more than binary preferences (e.g. multiple rankings or written feedback), the quality of the feedback does decrease. For instance, many times humans can give a quick answer on preference, but when asked "why" they prefer one thing over the other, they might not be able to full explain it in language. This in general is very much an open area of research on collecting and incorporating the most optimal types of feedback.

I definitely agree with your second point. One idea we're experimenting with is adding a human baseline, in which the models are benchmarked against human generated designs as well.

grace77 · 2025-08-12T17:14:50 1755018890

yes! to the second point, someone in our show HN proposed encouraging human designers to compete in submissions as well - we tried implementing this and found that, at least right now, LLMs are still so bad at design that asking a human to beat them is trivial - our plan right now is to focus more on this once it becomes more of challenges and therefore hopefully more interesting/entertaining

willbeddow · 2025-08-13T01:58:48 1755050328

Hmm idk about the focus on aesthetics. GPT image is your top image model, a model which is famously poor on aesthetics (though excellent on prompt adherence). I admit it's a difficult thing to eval, though, as in most side by side comparisons users will always pick the image with better prompt adherence regardless of instructions.

koakuma-chan · 2025-08-13T04:48:37 1755060517

What incentive is there for real users to continuously evaluate models for you for free?

refrigerator · 2025-08-12T16:55:00 1755017700

Great concept — definitely needed and will hopefully push the labs to improve design abilities of models!

j_da · 2025-08-12T16:58:58 1755017938

Yes, exactly. We want to be a forcing function for better design models and agents.

transformi · 2025-08-12T16:19:45 1755015585

Cool - do you train model that will be the proxy from the votes of persons?

grace77 · 2025-08-12T16:28:15 1755016095

we're not training models or proxying human votes with models

ryhanshannon · 2025-08-12T16:20:13 1755015613

Is this an area that is not yet covered by other user rating benchmark sites like LLMarena?

grace77 · 2025-08-12T16:27:27 1755016047

yes! LMArena recently started pushing "webdev" arena, but there was no explicit emphasis on design or aesthetics, just web-based content

doctorpangloss · 2025-08-12T17:47:03 1755020823

Can you write what you imagine is a good “game dev” prompt?

grace77 · 2025-08-12T18:11:06 1755022266

We keep our system prompts across the board as bare bones as possible: https://www.designarena.ai/system-prompts

As for good game dev prompts, here's one from a user that made a pretty fun game: Make asteroids with 2 computers playing against each other on one screen. There should be asteroids flying and 2 ships being controlled by 2 computers. Pay attention to thoroughly implementing the logic to make the ships avoid asteroids at all costs. Absolutely no user input should be necessary, no click to start, no click to restart. The game starts automatically on load and automatically restarts when either computer is dead. The ships should survive as long as possible. The ships should fly around, avoid asteroids as a priority, but also shoot asteroids and each other. Make ships and asteroids positions random each time. Asteroids should split when shot. The goal is to create a robust algorithm for ships so they can survive as long as possible. The game should be playable at 500x500 screen resolution.

andrewstuart · 2025-08-12T16:46:30 1755017190

AI is terrible at making nice looking design layout with font selections.

Sure it can make great looking images but nothing can make a nice looking poster or basic page layout.

I’m waiting for someone to solve this. I’m not even sure it takes AI it might just be programmatic.

rovmut · 2025-08-12T17:53:06 1755021186

You've perfectly articulated the gap in the market. The solution isn't just a better image generator. I built a tool called LayoutCraft to solve this exact problem. It focuses entirely on creating a great layout with good font choices automatically. It uses AI to understand the request, but then applies a structured, programmatic 'blueprint' to build the layout. This is how it handles fonts and spacing properly, resulting in a clean design, not a chaotic image.

grace77 · 2025-08-12T16:51:22 1755017482

yes - we're trying to figure out why that is

ChrisArchitect · 2025-08-12T17:46:22 1755020782

So this went from a Show HN: to a Launch HN in a month?

(Show HN: https://news.ycombinator.com/item?id=44542578)

grace77 · 2025-08-12T18:14:25 1755022465

yes - the changes since Show HN have been builders (https://www.designarena.ai/builder), audio (https://www.designarena.ai/audio), video (https://www.designarena.ai/diffusion), and compare (https://www.designarena.ai/studio). We also had a feed at some point, but ripped that out because it looked messy

KaoruAoiShiho · 2025-08-12T16:35:07 1755016507

Curious if you guys got into YC for this idea or something else?

j_da · 2025-08-12T16:39:17 1755016757

We started out building a platform to one-shot games (single-player and multi-player), but realized that the model you used under the hood really made a difference in functionality and graphics. We started out building the benchmark as an internal tool for ourselves to see which model was the best, but found that benchmarking models on visual "taste" was something that people were generally interested in currently.

neonate · 2025-08-12T16:36:59 1755016619

Post says they were making an AI game engine, so that's probably what they got in with.