Interactive Leaderboard Now Available! Test your models on MiniAppBench by submitting to our leaderboard. Simply provide your LLM API endpoint and let our evaluation framework automatically assess performance across 500 real-world tasks.
Human-AI interaction is evolving from static text responses to dynamic, interactive applications.
MiniAppBench is the first comprehensive benchmark designed to evaluate principle-driven, interactive application generation. While traditional benchmarks focus on static layouts or algorithmic snippets, MiniAppBench shifts the paradigm toward MiniApps—HTML-based applications that require both visual rendering and complex interaction logic.
| Domain | Tasks | Description |
|---|---|---|
| 🔬 Science | 187 | Simulators and virtual laboratories for chemistry, biology, physics, and geometry |
| 🎮 Games | 121 | Logic puzzles, projectile motion games, systemic simulations, and casual/card games |
| 🛠️ Tools | 57 | Practical utilities including schedulers, creative editors, and computational tools |
| 📊 Visualization | 56 | SVG-based graphics, statistical charts, and interactive generative art |
| 📚 Humanities | 47 | Interactive platforms for skill acquisition, concept deconstruction, and cultural study |
| 💚 Lifestyle | 32 | Health and wellness trackers, interactive toys, and roleplay-based applications |
| **Total** | **500** | Comprehensive coverage of interactive application scenarios |
Unlike benchmarks with a single “ground truth,” MiniAppEval (our evaluation framework) addresses the open-ended nature of interactive applications through an Agentic Framework (powered by Gemini 3 Pro) that processes four core inputs: (i) the user query $q_i$, (ii) a structured evaluation reference $r_i$, (iii) the generated source code, and (iv) a live, interactable MiniApp instance.
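For concreteness, the sketch below renders one such task input as a Python record. All field names and values are illustrative assumptions for exposition only, not the benchmark's actual schema.

```python
# Illustrative shape of the four inputs MiniAppEval consumes per task.
# Every field name and value here is a hypothetical stand-in, not the real schema.
task_input = {
    # (i) the user query q_i
    "query": "Build an interactive projectile-motion game with an adjustable launch angle.",
    # (ii) the structured evaluation reference r_i
    "reference": {
        "functional_checks": [
            "angle slider updates the trajectory",
            "score increments when the target is hit",
        ],
        "visual_checks": ["canvas renders without layout overlap"],
    },
    # (iii) the generated source code
    "source_code": "<!DOCTYPE html><html>...</html>",
    # (iv) a live, interactable MiniApp instance for the agentic judge to probe
    "live_instance_url": "http://localhost:8000/task_042/index.html",
}
```

The live instance is what lets the agentic judge exercise interaction logic rather than only inspect static output, consistent with the Playwright/Chromium dependency installed in the quick-start commands below.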
We evaluated 20 state-of-the-art LLMs across 500 tasks, measuring pass rates by difficulty, domain, and overall performance.
| Model | Avg (%) | Easy | Mid | Hard | Games | Science | Tools | Humanities | Viz | Lifestyle |
|---|---|---|---|---|---|---|---|---|---|---|
| **Open-Source Large Language Models** | | | | | | | | | | |
| Qwen3-32B | 0.66 | 1.59 | 0.55 | 0.00 | 0.00 | 0.57 | 0.00 | 0.00 | 2.04 | 3.70 |
| Qwen3-235B-A22B | 2.88 | 6.43 | 2.35 | 0.00 | 0.93 | 0.60 | 4.00 | 4.88 | 7.27 | 10.34 |
| Qwen3-Coder-480B | 1.83 | 6.06 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 9.43 | 11.11 |
| Kimi-K2-Instruct | 6.19 | 14.17 | 5.03 | 0.00 | 3.77 | 3.11 | 4.08 | 4.88 | 17.65 | 18.52 |
| GLM-4.5-Air | 7.09 | 17.60 | 4.07 | 1.44 | 5.66 | 4.27 | 6.98 | 7.32 | 16.98 | 10.34 |
| GLM-4.7 | 18.31 | 36.30 | 15.06 | 4.41 | 12.50 | 10.49 | 20.00 | 17.07 | 35.19 | 48.39 |
| GLM-5 | 61.80 | 68.71 | 68.88 | 46.50 | 57.85 | 57.22 | 64.91 | 55.32 | 76.79 | 81.25 |
| **Closed-Source Large Language Models** | | | | | | | | | | |
| Hunyuan-Turbos | 2.32 | 6.32 | 0.87 | 0.00 | 0.00 | 0.00 | 3.03 | 0.00 | 13.51 | 3.57 |
| Mimo-V2-Flash | 12.48 | 28.68 | 8.33 | 2.22 | 13.46 | 6.02 | 10.87 | 11.63 | 23.53 | 36.36 |
| Grok-4.1-Reasoning | 13.77 | 29.66 | 12.12 | 2.19 | 8.41 | 6.58 | 20.00 | 17.50 | 32.65 | 25.93 |
| MiniMax-M2.1 | 17.12 | 31.46 | 15.62 | 7.08 | 16.25 | 12.50 | 23.33 | 20.00 | 27.27 | 19.23 |
| Gemini-3-Flash | 17.62 | 32.76 | 16.89 | 4.10 | 14.95 | 10.60 | 17.95 | 18.18 | 30.61 | 41.38 |
| Gemini-3-Pro | 27.52 | 61.98 | 20.83 | 1.71 | 26.74 | 19.11 | 13.64 | 28.57 | 52.00 | 55.56 |
| GPT-5.1 | 32.00 | 74.71 | 21.37 | 3.49 | 24.14 | 18.10 | 33.33 | 45.83 | 57.78 | 64.71 |
| GPT-5.2 | 45.46 | 69.77 | 43.08 | 18.64 | 40.32 | 50.38 | 50.17 | 45.45 | 75.00 | 82.35 |
| GPT-5.3-Codex | 36.20 | 56.46 | 38.27 | 14.65 | 37.19 | 22.46 | 54.39 | 29.79 | 55.36 | 56.25 |
| GPT-5.4 | 56.60 | 82.31 | 54.08 | 35.03 | 56.20 | 50.80 | 57.89 | 53.19 | 66.07 | 75.00 |
| Claude-Sonnet-4.5 | 26.36 | 68.22 | 14.86 | 1.79 | 16.13 | 22.30 | 29.27 | 23.81 | 47.73 | 44.83 |
| Claude-Opus-4.5 | 41.14 | 59.09 | 41.18 | 22.33 | 37.18 | 34.59 | 47.50 | 35.71 | 57.45 | 56.52 |
| Claude-Opus-4.6 | 61.60 | 76.19 | 64.29 | 44.59 | 56.20 | 58.29 | 63.16 | 59.57 | 73.21 | 81.25 |
| Average | 28.58 | 43.88 | 27.25 | 12.79 | 22.11 | 21.62 | 29.85 | 24.97 | 39.88 | 45.18 |
*Tokens and Time(s) columns have been omitted for brevity in this view.*
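For readers reproducing these aggregates, here is a minimal sketch of how per-difficulty and per-domain pass rates can be computed. It assumes each result is a flat record with `domain`, `difficulty`, and a boolean `passed`; this record format is a hypothetical stand-in, not the pipeline's actual output schema.

```python
from collections import defaultdict

def pass_rates(results, key):
    """Percentage of passed tasks grouped by `key` (e.g. 'difficulty' or 'domain')."""
    totals, passes = defaultdict(int), defaultdict(int)
    for record in results:
        totals[record[key]] += 1
        passes[record[key]] += bool(record["passed"])
    return {bucket: 100.0 * passes[bucket] / totals[bucket] for bucket in totals}

# Usage with illustrative records:
results = [
    {"domain": "Games", "difficulty": "Easy", "passed": True},
    {"domain": "Games", "difficulty": "Hard", "passed": False},
    {"domain": "Science", "difficulty": "Easy", "passed": True},
]
print(pass_rates(results, "difficulty"))  # {'Easy': 100.0, 'Hard': 0.0}
print(pass_rates(results, "domain"))      # {'Games': 50.0, 'Science': 100.0}
```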
We offer two ways to evaluate your model on MiniAppBench: run the evaluation pipeline locally with the commands below, or submit to the official leaderboard for verified, publicly displayed results.
```bash
# Clone the repository
git clone https://github.com/MiniAppBench/miniappbench.git
cd miniappbench

# Install dependencies
pip install -r requirements.txt
playwright install chromium

# Run evaluation on your model
python -m examples.pipeline \
    --query-file data/query_validation_100.json \
    --model-name "your-model-name" \
    --api-key "your-api-key" \
    --batch "1-5" \
    --parallel \
    --concurrency 5
```
To have your results verified and displayed on the official leaderboard, submit your LLM API endpoint through the leaderboard page; our evaluation framework will run the assessment automatically and publish the verified scores.
📧 Questions? Contact us at zhangzuhao.zzh@antgroup.com.
```bibtex
@article{miniappbench2026,
  title={MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants},
  author={Zuhao Zhang and Chengyue Yu and Yuante Li and Chenyi Zhuang and Linjian Mo and Shuai Li},
  journal={arXiv preprint arXiv:2603.09652},
  year={2026},
  url={https://arxiv.org/abs/2603.09652}
}
```
MiniAppBench — Advancing the Frontier of Interactive Human-AI Collaboration
Paper · Leaderboard · Code · Dataset
Last Updated: 2026-02-28