Interactive Leaderboard Now Available! Test your models on MiniAppBench by submitting to our leaderboard. Simply provide your LLM API endpoint and let our evaluation framework automatically assess performance across 500 real-world tasks.
Human-AI interaction is evolving from static text responses to dynamic, interactive applications.
MiniAppBench is the first comprehensive benchmark designed to evaluate principle-driven, interactive application generation. While traditional benchmarks focus on static layouts or algorithmic snippets, MiniAppBench shifts the paradigm toward MiniApps—HTML-based applications that require both visual rendering and complex interaction logic.
| Domain | Tasks | Description |
|---|---|---|
| 🔬 Science | 187 | Simulators and virtual laboratories for chemistry, biology, physics, and geometry |
| 🎮 Games | 121 | Logic puzzles, projectile motion games, systemic simulations, and casual/card games |
| 🛠️ Tools | 57 | Practical utilities including schedulers, creative editors, and computational tools |
| 📊 Visualization | 56 | SVG-based graphics, statistical charts, and interactive generative art |
| 📚 Humanities | 47 | Interactive platforms for skill acquisition, concept deconstruction, and cultural study |
| 💚 Lifestyle | 32 | Health and wellness trackers, interactive toys, and roleplay-based applications |
| **Total** | **500** | Comprehensive coverage of interactive application scenarios |
Unlike benchmarks with a single “ground truth,” MiniAppEval (our evaluation framework) addresses the open-ended nature of interactive applications through an Agentic Framework (powered by Gemini 3 Pro) that processes four core inputs: (i) the user query $q_i$, (ii) a structured evaluation reference $r_i$, (iii) the generated source code, and (iv) a live, interactable MiniApp instance.
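For concreteness, the sketch below renders one such task input as a Python record. All field names and values are illustrative assumptions for exposition only, not the benchmark's actual schema.

```python
# Illustrative shape of the four inputs MiniAppEval consumes per task.
# Every field name and value here is a hypothetical stand-in, not the real schema.
task_input = {
    # (i) the user query q_i
    "query": "Build an interactive projectile-motion game with an adjustable launch angle.",
    # (ii) the structured evaluation reference r_i
    "reference": {
        "functional_checks": [
            "angle slider updates the trajectory",
            "score increments when the target is hit",
        ],
        "visual_checks": ["canvas renders without layout overlap"],
    },
    # (iii) the generated source code
    "source_code": "<!DOCTYPE html><html>...</html>",
    # (iv) a live, interactable MiniApp instance for the agentic judge to probe
    "live_instance_url": "http://localhost:8000/task_042/index.html",
}
```

The live instance is what lets the agentic judge exercise interaction logic rather than only inspect static output, consistent with the Playwright/Chromium dependency installed in the quick-start commands below.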
We evaluated 20 state-of-the-art LLMs across 500 tasks, measuring pass rates by difficulty, domain, and overall performance.
| Model | Avg (%) | Easy | Mid | Hard | Games | Science | Tools | Humanities | Viz | Lifestyle |
|---|---|---|---|---|---|---|---|---|---|---|
| **Open-Source Large Language Models** | | | | | | | | | | |
| Qwen3-32B | 0.66 | 1.59 | 0.55 | 0.00 | 0.00 | 0.57 | 0.00 | 0.00 | 2.04 | 3.70 |
| Qwen3-235B-A22B | 2.88 | 6.43 | 2.35 | 0.00 | 0.93 | 0.60 | 4.00 | 4.88 | 7.27 | 10.34 |
| Qwen3-Coder-480B | 1.83 | 6.06 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 9.43 | 11.11 |
| Kimi-K2-Instruct | 6.19 | 14.17 | 5.03 | 0.00 | 3.77 | 3.11 | 4.08 | 4.88 | 17.65 | 18.52 |
| GLM-4.5-Air | 7.09 | 17.60 | 4.07 | 1.44 | 5.66 | 4.27 | 6.98 | 7.32 | 16.98 | 10.34 |
| GLM-4.7 | 18.31 | 36.30 | 15.06 | 4.41 | 12.50 | 10.49 | 20.00 | 17.07 | 35.19 | 48.39 |
| GLM-5 | 61.80 | 68.71 | 68.88 | 46.50 | 57.85 | 57.22 | 64.91 | 55.32 | 76.79 | 81.25 |
| **Closed-Source Large Language Models** | | | | | | | | | | |
| Hunyuan-Turbos | 2.32 | 6.32 | 0.87 | 0.00 | 0.00 | 0.00 | 3.03 | 0.00 | 13.51 | 3.57 |
| Mimo-V2-Flash | 12.48 | 28.68 | 8.33 | 2.22 | 13.46 | 6.02 | 10.87 | 11.63 | 23.53 | 36.36 |
| Grok-4.1-Reasoning | 13.77 | 29.66 | 12.12 | 2.19 | 8.41 | 6.58 | 20.00 | 17.50 | 32.65 | 25.93 |
| MiniMax-M2.1 | 17.12 | 31.46 | 15.62 | 7.08 | 16.25 | 12.50 | 23.33 | 20.00 | 27.27 | 19.23 |
| Gemini-3-Flash | 17.62 | 32.76 | 16.89 | 4.10 | 14.95 | 10.60 | 17.95 | 18.18 | 30.61 | 41.38 |
| Gemini-3-Pro | 27.52 | 61.98 | 20.83 | 1.71 | 26.74 | 19.11 | 13.64 | 28.57 | 52.00 | 55.56 |
| GPT-5.1 | 32.00 | 74.71 | 21.37 | 3.49 | 24.14 | 18.10 | 33.33 | 45.83 | 57.78 | 64.71 |
| GPT-5.2 | 45.46 | 69.77 | 43.08 | 18.64 | 40.32 | 50.38 | 50.17 | 45.45 | 75.00 | 82.35 |
| GPT-5.3-Codex | 36.20 | 56.46 | 38.27 | 14.65 | 37.19 | 22.46 | 54.39 | 29.79 | 55.36 | 56.25 |
| GPT-5.4 | 56.60 | 82.31 | 54.08 | 35.03 | 56.20 | 50.80 | 57.89 | 53.19 | 66.07 | 75.00 |
| Claude-Sonnet-4.5 | 26.36 | 68.22 | 14.86 | 1.79 | 16.13 | 22.30 | 29.27 | 23.81 | 47.73 | 44.83 |
| Claude-Opus-4.5 | 41.14 | 59.09 | 41.18 | 22.33 | 37.18 | 34.59 | 47.50 | 35.71 | 57.45 | 56.52 |
| Claude-Opus-4.6 | 61.60 | 76.19 | 64.29 | 44.59 | 56.20 | 58.29 | 63.16 | 59.57 | 73.21 | 81.25 |
| Average | 28.58 | 43.88 | 27.25 | 12.79 | 22.11 | 21.62 | 29.85 | 24.97 | 39.88 | 45.18 |
*Tokens and Time(s) columns have been omitted for brevity in this view.*
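For readers reproducing these aggregates, here is a minimal sketch of how per-difficulty and per-domain pass rates can be computed. It assumes each result is a flat record with `domain`, `difficulty`, and a boolean `passed`; this record format is a hypothetical stand-in, not the pipeline's actual output schema.

```python
from collections import defaultdict

def pass_rates(results, key):
    """Percentage of passed tasks grouped by `key` (e.g. 'difficulty' or 'domain')."""
    totals, passes = defaultdict(int), defaultdict(int)
    for record in results:
        totals[record[key]] += 1
        passes[record[key]] += bool(record["passed"])
    return {bucket: 100.0 * passes[bucket] / totals[bucket] for bucket in totals}

# Usage with illustrative records:
results = [
    {"domain": "Games", "difficulty": "Easy", "passed": True},
    {"domain": "Games", "difficulty": "Hard", "passed": False},
    {"domain": "Science", "difficulty": "Easy", "passed": True},
]
print(pass_rates(results, "difficulty"))  # {'Easy': 100.0, 'Hard': 0.0}
print(pass_rates(results, "domain"))      # {'Games': 50.0, 'Science': 100.0}
```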
We offer two ways to evaluate your model on MiniAppBench: run the evaluation pipeline locally with the commands below, or submit to the official leaderboard for verified, publicly displayed results.
```bash
# Clone the repository
git clone https://github.com/MiniAppBench/miniappbench.git
cd miniappbench

# Install dependencies
pip install -r requirements.txt
playwright install chromium

# Run evaluation on your model
python -m examples.pipeline \
    --query-file data/query_validation_100.json \
    --model-name "your-model-name" \
    --api-key "your-api-key" \
    --batch "1-5" \
    --parallel \
    --concurrency 5
```
To have your results verified and displayed on the official leaderboard, submit your LLM API endpoint through the leaderboard page; our evaluation framework will run the assessment automatically and publish the verified scores.
📧 Questions? Contact us at zhangzuhao.zzh@antgroup.com.
```bibtex
@article{miniappbench2026,
  title={MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants},
  author={Zuhao Zhang and Chengyue Yu and Yuante Li and Chenyi Zhuang and Linjian Mo and Shuai Li},
  journal={arXiv preprint arXiv:2603.09652},
  year={2026},
  url={https://arxiv.org/abs/2603.09652}
}
```
MiniAppBench — Advancing the Frontier of Interactive Human-AI Collaboration
Paper · Leaderboard · Code · Dataset
Last Updated: 2026-02-28