We tested style completion and recoloring on template families. Structural fidelity is high (position and area preservation near 100% for structural generation), but palette coverage lags at 77.6%. Worse, SSIM and LPIPS actively mislead: a structurally valid, style-consistent output scores lower than a hallucinated one that happens to agree more on pixels.
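To make the palette number concrete, here is a minimal sketch of one plausible reading of palette coverage: the fraction of reference palette colors that reappear in the generated output within a per-channel tolerance. The function name `palette_coverage`, the `tol` parameter, and the definition itself are assumptions for illustration, not necessarily the definition behind the 77.6% figure.

```python
def palette_coverage(reference, generated, tol=0):
    """Fraction of reference colors matched by some generated color.

    Colors are (r, g, b) tuples. Two colors match when every channel
    differs by at most `tol`. This definition is an assumption made
    for illustration only.
    """
    def close(a, b):
        return all(abs(x - y) <= tol for x, y in zip(a, b))

    ref = set(reference)  # ignore duplicate palette entries
    if not ref:
        return 1.0  # nothing to cover
    hits = sum(1 for c in ref if any(close(c, g) for g in generated))
    return hits / len(ref)


reference = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (0, 0, 0)]
generated = [(255, 0, 0), (0, 255, 0), (0, 0, 254)]
coverage = palette_coverage(reference, generated, tol=2)
```

In this toy case three of the four reference colors are recovered, so `coverage` is 0.75. A score like this depends only on extracted color tokens, not on where the pixels fall.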
The takeaway is that pixel metrics are the wrong evaluation substrate for design. The field needs structure-aware metrics that operate on extracted primitives, such as bounding boxes, color tokens, and font properties, instead of raw pixels.
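As a rough illustration of what a metric over extracted primitives could look like, the sketch below scores position and area preservation from bounding boxes alone. The `Box` type, the element ids, and the choice of IoU and area ratio as the scores are all hypothetical; they are a sketch under stated assumptions, not the evaluation used for the results reported here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box (hypothetical extracted primitive)."""
    x: float
    y: float
    w: float
    h: float


def iou(a, b):
    """Intersection over union of two boxes; 0.0 if both are empty."""
    ix = max(0.0, min(a.x + a.w, b.x + b.w) - max(a.x, b.x))
    iy = max(0.0, min(a.y + a.h, b.y + b.h) - max(a.y, b.y))
    inter = ix * iy
    union = a.w * a.h + b.w * b.h - inter
    return inter / union if union > 0 else 0.0


def area_ratio(a, b):
    """Smaller area divided by larger area, in [0, 1]."""
    big = max(a.w * a.h, b.w * b.h)
    return min(a.w * a.h, b.w * b.h) / big if big > 0 else 1.0


def structure_score(reference, generated):
    """Mean IoU and mean area ratio over reference elements.

    Both arguments map element ids to Box. Elements missing from
    `generated` contribute 0 to both scores.
    """
    if not reference:
        return {"position": 1.0, "area": 1.0}
    shared = reference.keys() & generated.keys()
    n = len(reference)
    position = sum(iou(reference[k], generated[k]) for k in shared) / n
    area = sum(area_ratio(reference[k], generated[k]) for k in shared) / n
    return {"position": position, "area": area}


reference = {"title": Box(0, 0, 10, 10), "logo": Box(20, 20, 10, 10)}
generated = {"title": Box(0, 0, 10, 10), "logo": Box(25, 20, 10, 10)}
scores = structure_score(reference, generated)
```

Here the logo keeps its size but shifts by half its width, so the area score stays at 1.0 while the position score drops. A pixel metric would mix that shift with every other rendering difference; scoring the boxes directly isolates it.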