LLM-as-Design-Critic: Empirical Alignment of AI Mobile UI Critiques with Human Graphic Design Judgment

Jun Sun; David Shao

doi:10.66372/JGER.v1i2.1

Authors

Jun Sun, Business Analytics and Project Management, University of Connecticut, CT, USA
David Shao, Information Design and Strategy

Keywords:

Large language models, UI critique, mobile interface design, human-AI alignment, graphic design judgment, usability, aesthetics, reproducible evaluation, UICrit, RICO

Abstract

This paper evaluates whether user-interface critiques generated by large language models (LLMs) align with human graphic design judgment. The experiment uses the public UICrit CSV, which is distributed with source-labeled critique comments, bounding boxes, and five human rating dimensions for mobile screens. The raw file contains 2,981 annotator rows, 1,000 unique RICO screen identifiers, and 11,344 canonical critique source labels. We parsed the complete file, retained the 2,898 rows in which the comment and source-label lists were aligned, excluded three zero-critique rows from the text experiments, and evaluated 11,112 source-aligned comments. The central comparison treats comments labeled "human" or "both" as the human side and comments labeled "llm" or "both" as the LLM side. We measured natural-language similarity, topic-set agreement, bounding-box overlap, and the ability of LLM-side comments to predict human ratings under five-fold cross-validation grouped by RICO screen identifier. The LLM side reached a row-level TF-IDF cosine of 0.412, a token Jaccard of 0.357, and a topic-set Jaccard of 0.660 with human-side critiques. However, when the shared comments labeled "both" were removed, the strict TF-IDF cosine fell to 0.057. Bounding boxes showed the same pattern: the source-inclusive mean best intersection over union (IoU) was 0.759, whereas the strict LLM-only to human-only IoU was 0.323. For overall design quality, the LLM critique TF-IDF model achieved Pearson r = 0.185 and mean absolute error (MAE) = 0.786, matching the mean baseline in error, while LLM topic/count features improved to r = 0.243 and MAE = 0.757. Human critique text remained substantially stronger, with r = 0.625 and MAE = 0.595. The empirical result is therefore specific: LLM critiques align with human designers at the level of broad design topics, but they do not yet reproduce the rating signal or the specific localized judgments of human design critique.
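
As an illustration of two of the overlap measures named in the abstract, the following is a minimal sketch of token Jaccard similarity and a mean best IoU between two sets of bounding boxes. It is not the authors' code. The function names, the lowercase alphanumeric tokenizer, the (x1, y1, x2, y2) box format, and the direction of the best-match search (each box on one side matched to its best counterpart on the other) are all assumptions made for the example.

```python
import re


def token_jaccard(text_a, text_b):
    """Jaccard similarity between the sets of lowercase word tokens of two texts."""
    # Assumed tokenizer: runs of lowercase letters and digits.
    tokens_a = set(re.findall(r"[a-z0-9]+", text_a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9]+", text_b.lower()))
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes contribute no intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_best_iou(source_boxes, target_boxes):
    """Average, over source boxes, of the best IoU against any target box."""
    if not source_boxes or not target_boxes:
        return 0.0
    best = [max(box_iou(s, t) for t in target_boxes) for s in source_boxes]
    return sum(best) / len(best)
```

Under these assumptions, a "source-inclusive" comparison would pass every box on each side, including those labeled "both", while a "strict" comparison would pass only LLM-only boxes as `source_boxes` and only human-only boxes as `target_boxes`.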

Author Biography

David Shao, Information Design and Strategy
