MiniAppBench: Evaluating the Shift
from Text to Interactive HTML Responses
in LLM-Powered Assistants

Zuhao Zhang1,2* Chengyue Yu1* Yuante Li3 Chenyi Zhuang1† Linjian Mo1 Shuai Li2
1Inclusion AI, Ant Group 2Shanghai Jiao Tong University 3Carnegie Mellon University
*Equal Contribution †Corresponding Author
Paper Leaderboard Code Dataset

📢 Latest Update — February 28, 2026

Interactive Leaderboard Now Available! Test your models on MiniAppBench by submitting to our leaderboard. Simply provide your LLM API endpoint and let our evaluation framework automatically assess performance across 500 real-world tasks.

Submit to Leaderboard →

Abstract

Human-AI interaction is evolving from static text responses to dynamic, interactive applications.

MiniAppBench is the first comprehensive benchmark designed to evaluate principle-driven, interactive application generation. While traditional benchmarks focus on static layouts or algorithmic snippets, MiniAppBench shifts the paradigm toward MiniApps—HTML-based applications that require both visual rendering and complex interaction logic.

Key Highlights:

From Text to MiniApps
Figure 1. The shift from text to MiniApps. Unlike static text, MiniApps transform abstract explanations into intuitive visualizations and unlock actionable tasks (e.g., diet tracking) that were previously impossible.

Benchmark Construction and Statistics

MiniAppBench Construction Pipeline
Figure 2. MiniAppBench data construction pipeline from production application (10M+ generations) to curated evaluation benchmark.

Task Distribution by Domain

| Domain | Tasks | Description |
|---|---|---|
| 🔬 Science | 187 | Simulators and virtual laboratories for chemistry, biology, physics, and geometry |
| 🎮 Games | 121 | Logic puzzles, projectile motion games, systemic simulations, and casual/card games |
| 🛠️ Tools | 57 | Practical utilities including schedulers, creative editors, and computational tools |
| 📊 Visualization | 56 | SVG-based graphics, statistical charts, and interactive generative art |
| 📚 Humanities | 47 | Interactive platforms for skill acquisition, concept deconstruction, and cultural study |
| 💚 Lifestyle | 32 | Health and wellness trackers, interactive toys, and roleplay-based applications |
| **Total** | **500** | Comprehensive coverage of interactive application scenarios |
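The per-domain counts in the table above sum to the 500-task total. A quick sanity check in Python (the dictionary simply restates the table):

```python
# Per-domain task counts from the MiniAppBench distribution table.
DOMAIN_TASKS = {
    "Science": 187, "Games": 121, "Tools": 57,
    "Visualization": 56, "Humanities": 47, "Lifestyle": 32,
}

total = sum(DOMAIN_TASKS.values())
print(total)  # 500

# Percentage share of each domain, rounded to one decimal place.
shares = {d: round(100 * n / total, 1) for d, n in DOMAIN_TASKS.items()}
print(shares["Science"])  # 37.4
```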

Methodology: MiniAppEval

Unlike benchmarks with a single “ground truth,” MiniAppEval addresses the open-ended nature of interactive applications through an Agentic Framework (powered by Gemini 3 Pro) that processes four core inputs: (i) the user query $q_i$, (ii) a structured evaluation reference $r_i$, (iii) the generated source code, and (iv) a live, interactable MiniApp instance.

  1. Exploration: An LLM-based agent interacts with the live MiniApp in a browser (clicking, dragging, typing).
  2. Observation: The system captures a comprehensive interaction trajectory, recording DOM states, console logs, and the underlying source code, providing the raw evidence required for deep analysis.
  3. Grading: The agent scores the MiniApp based on the collected evidence. The evaluation reference $r_i$ informs the inspection strategy but does not serve as a rigid oracle.
    • Intention Alignment: Verifies if the MiniApp fulfills the high-level user goal.
    • Static Quality: Evaluates structural and syntactic correctness without execution.
    • Dynamic Logic: Assesses runtime behavior through trajectories, focusing on Sequential Logic and Robustness.
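The three-phase loop above can be sketched in Python. This is a hypothetical illustration only: the class and function names (`FakeMiniApp`, `explore`, `grade`) are placeholders, and the released framework drives a real browser session with an LLM judge rather than these scripted stubs.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Trajectory:
    """Evidence collected while the agent drives the live MiniApp."""
    actions: List[str] = field(default_factory=list)
    dom_states: List[str] = field(default_factory=list)
    console_logs: List[str] = field(default_factory=list)

class FakeMiniApp:
    """Stand-in for a live MiniApp instance: replays scripted interactions."""
    def __init__(self, script: List[str]):
        self.script = list(script)
    def next_action(self, traj: Trajectory) -> Optional[str]:
        # A real agent would choose a click/drag/type action from the DOM.
        return self.script.pop(0) if self.script else None
    def apply(self, action: str) -> str:
        return f"<dom after {action}>"
    def drain_console(self) -> List[str]:
        return []

def explore(app, max_steps: int = 10) -> Trajectory:
    """Phases 1-2: interact with the app and record the trajectory."""
    traj = Trajectory()
    for _ in range(max_steps):
        action = app.next_action(traj)
        if action is None:
            break
        traj.actions.append(action)
        traj.dom_states.append(app.apply(action))
        traj.console_logs.extend(app.drain_console())
    return traj

def grade(traj: Trajectory, static_ok: bool, goal_reached: bool) -> dict:
    """Phase 3: score the three axes named above (toy heuristics here)."""
    return {
        "intention_alignment": 1.0 if goal_reached else 0.0,
        "static_quality": 1.0 if static_ok else 0.0,
        "dynamic_logic": min(1.0, len(traj.actions) / 3),  # toy proxy
    }

app = FakeMiniApp(["click #start", "type 'apple'", "click #add"])
traj = explore(app)
print(grade(traj, static_ok=True, goal_reached=True))
```

The key structural point the sketch preserves is that grading consumes the recorded trajectory (DOM states, console logs) rather than a single static output.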

Experimental Results

We evaluated 20 state-of-the-art LLMs across 500 tasks, measuring pass rates by difficulty, domain, and overall performance.

Performance Leaderboard

| Model | Avg (%) | Easy | Mid | Hard | Games | Science | Tools | Humanities | Viz | Lifestyle |
|---|---|---|---|---|---|---|---|---|---|---|
| **Open-Source Large Language Models** | | | | | | | | | | |
| Qwen3-32B | 0.66 | 1.59 | 0.55 | 0.00 | 0.00 | 0.57 | 0.00 | 0.00 | 2.04 | 3.70 |
| Qwen3-235B-A22B | 2.88 | 6.43 | 2.35 | 0.00 | 0.93 | 0.60 | 4.00 | 4.88 | 7.27 | 10.34 |
| Qwen3-Coder-480B | 1.83 | 6.06 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 9.43 | 11.11 |
| Kimi-K2-Instruct | 6.19 | 14.17 | 5.03 | 0.00 | 3.77 | 3.11 | 4.08 | 4.88 | 17.65 | 18.52 |
| GLM-4.5-Air | 7.09 | 17.60 | 4.07 | 1.44 | 5.66 | 4.27 | 6.98 | 7.32 | 16.98 | 10.34 |
| GLM-4.7 | 18.31 | 36.30 | 15.06 | 4.41 | 12.50 | 10.49 | 20.00 | 17.07 | 35.19 | 48.39 |
| GLM-5 | 61.80 | 68.71 | 68.88 | 46.50 | 57.85 | 57.22 | 64.91 | 55.32 | 76.79 | 81.25 |
| **Closed-Source Large Language Models** | | | | | | | | | | |
| Hunyuan-Turbos | 2.32 | 6.32 | 0.87 | 0.00 | 0.00 | 0.00 | 3.03 | 0.00 | 13.51 | 3.57 |
| Mimo-V2-Flash | 12.48 | 28.68 | 8.33 | 2.22 | 13.46 | 6.02 | 10.87 | 11.63 | 23.53 | 36.36 |
| Grok-4.1-Reasoning | 13.77 | 29.66 | 12.12 | 2.19 | 8.41 | 6.58 | 20.00 | 17.50 | 32.65 | 25.93 |
| MiniMax-M2.1 | 17.12 | 31.46 | 15.62 | 7.08 | 16.25 | 12.50 | 23.33 | 20.00 | 27.27 | 19.23 |
| Gemini-3-Flash | 17.62 | 32.76 | 16.89 | 4.10 | 14.95 | 10.60 | 17.95 | 18.18 | 30.61 | 41.38 |
| Gemini-3-Pro | 27.52 | 61.98 | 20.83 | 1.71 | 26.74 | 19.11 | 13.64 | 28.57 | 52.00 | 55.56 |
| GPT-5.1 | 32.00 | 74.71 | 21.37 | 3.49 | 24.14 | 18.10 | 33.33 | 45.83 | 57.78 | 64.71 |
| GPT-5.2 | 45.46 | 69.77 | 43.08 | 18.64 | 40.32 | 50.38 | 50.17 | 45.45 | 75.00 | 82.35 |
| GPT-5.3-Codex | 36.20 | 56.46 | 38.27 | 14.65 | 37.19 | 22.46 | 54.39 | 29.79 | 55.36 | 56.25 |
| GPT-5.4 | 56.60 | 82.31 | 54.08 | 35.03 | 56.20 | 50.80 | 57.89 | 53.19 | 66.07 | 75.00 |
| Claude-Sonnet-4.5 | 26.36 | 68.22 | 14.86 | 1.79 | 16.13 | 22.30 | 29.27 | 23.81 | 47.73 | 44.83 |
| Claude-Opus-4.5 | 41.14 | 59.09 | 41.18 | 22.33 | 37.18 | 34.59 | 47.50 | 35.71 | 57.45 | 56.52 |
| Claude-Opus-4.6 | 61.60 | 76.19 | 64.29 | 44.59 | 56.20 | 58.29 | 63.16 | 59.57 | 73.21 | 81.25 |
| **Average** | 28.58 | 43.88 | 27.25 | 12.79 | 22.11 | 21.62 | 29.85 | 24.97 | 39.88 | 45.18 |

Tokens and Time(s) columns have been omitted for brevity in this view.

Key Findings


🏆 Leaderboard & Submission

We offer two ways to evaluate your model on MiniAppBench:

Option 1: Run Evaluation Locally

# Clone the repository
git clone https://github.com/MiniAppBench/miniappbench.git
cd miniappbench

# Install dependencies
pip install -r requirements.txt
playwright install chromium

# Run evaluation on your model
python -m examples.pipeline \
  --query-file data/query_validation_100.json \
  --model-name "your-model-name" \
  --api-key "your-api-key" \
  --batch "1-5" \
  --parallel \
  --concurrency 5

Option 2: Submit to Official Leaderboard

To have your results verified and displayed on the official leaderboard:

  1. Prepare Your Submission: Provide your Model Name, Organization, and an OpenAI-compatible API Endpoint.
  2. Automated Evaluation: Our evaluation servers will run all 500 benchmark tasks using the MiniAppEval agent.
  3. Review & Publication: Evaluation typically completes within 6-12 hours. API credentials are used only for evaluation and are deleted immediately afterward.
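Before submitting, it can help to confirm your endpoint speaks the OpenAI-compatible chat-completions format. The sketch below only builds the request; the endpoint URL, model name, and key are placeholders, not official MiniAppBench values.

```python
import json

def build_chat_request(endpoint: str, model: str, prompt: str, api_key: str):
    """Assemble an OpenAI-compatible /chat/completions request (no network call)."""
    url = endpoint.rstrip("/") + "/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, headers, json.dumps(body)

# Placeholder values for illustration only.
url, headers, body = build_chat_request(
    "https://api.example.com/v1",
    "your-model-name",
    "Build a diet-tracking MiniApp in a single HTML file.",
    "your-api-key",
)
print(url)  # https://api.example.com/v1/chat/completions
```

If a POST of this payload to your endpoint returns a `choices[0].message.content` field, the leaderboard harness should be able to query it.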
🚀 Submit Your Model to Leaderboard

📧 Questions? Contact us at zhangzuhao.zzh@antgroup.com.


Citation

@article{miniappbench2026,
  title={MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants},
  author={Zuhao Zhang and Chengyue Yu and Yuante Li and Chenyi Zhuang and Linjian Mo and Shuai Li},
  journal={arXiv},
  year={2026},
  url={https://arxiv.org/abs/2603.09652}
}

MiniAppBench: Advancing the Frontier of Interactive Human-AI Collaboration

Paper  ·  Leaderboard  ·  Code  ·  Dataset

Last Updated: 2026-02-28