Five AI Models Run the Same Agent Task — The Gap Between Hype and Reality

By 轻口味 · Jul 3, 2026

Original article: juejin.cn

Benchmark scores and single-turn demos hide the real cost and reliability of running agent workflows. A model that looks strong on paper can fail to search, verify, or self-correct across a long chain — or cost 45x more to produce a functionally identical page. Choosing the wrong model for agent tasks burns budget on either rework or overpriced polish.

Summary

A hands-on test ran five Chinese and Western models through the same multi-step agent task: search for 20 AI tools, organize them into structured data, generate a full interactive HTML page, and self-check for errors. Every model completed the page, but how they got there diverged sharply. MiniMax-M3 proved the most stable and affordable workhorse at ¥1.33 per run. DeepSeek-V4-flash delivered the fastest, cheapest output at ¥0.20 but skimped on data verification and polish. Step-3.7-flash stood out for aggressive tool calling and the richest data coverage, making it the closest to a production-grade agent. GLM5.2 produced balanced results at a steep ¥3.66, while Gemini 3.5 Flash produced the best-looking pages at ¥9 but collected less data and called tools less proactively.

The test reveals a clear segmentation: visual polish costs several times more and often comes at the expense of data thoroughness. Tool-calling diligence, not benchmark scores, determines whether a model can run a long agent task without human intervention. Cost differences are extreme: the most expensive run (¥9) cost 45x the cheapest (¥0.20), yet all five pages were functionally complete.
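The cost spread can be checked with a few lines of arithmetic. The sketch below uses only the per-run prices quoted in the summary; Step-3.7-flash is left out because no price is given for it, and the function name `cost_multiples` is purely illustrative.

```python
# Per-run costs in CNY as quoted in the summary.
# Step-3.7-flash is omitted because the summary gives no price for it.
COST_PER_RUN_CNY = {
    "DeepSeek-V4-flash": 0.20,
    "MiniMax-M3": 1.33,
    "GLM5.2": 3.66,
    "Gemini 3.5 Flash": 9.00,
}


def cost_multiples(costs: dict[str, float]) -> dict[str, float]:
    """Return each model's cost as a multiple of the cheapest run."""
    cheapest = min(costs.values())
    return {name: round(cost / cheapest, 2) for name, cost in costs.items()}


if __name__ == "__main__":
    ranked = sorted(cost_multiples(COST_PER_RUN_CNY).items(), key=lambda kv: kv[1])
    for name, multiple in ranked:
        print(f"{name}: {multiple:.2f}x the cheapest run")
```

Expressing every price as a multiple of the cheapest run makes the procurement question concrete: the middle of the field sits at roughly 7x and 18x, not only at the 45x extreme.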

Takeaways

— All five models achieved 100% task completion on building a full AI tools directory HTML page.

— DeepSeek-V4-flash cost ¥0.20 per run, 1/45 of the ¥9 that Gemini 3.5 Flash cost.

— Step-3.7-flash made the most aggressive tool calls and produced the richest, most complete dataset.

— MiniMax-M3 was the most stable: tool call failures did not derail the final output.

— Gemini 3.5 Flash produced the most visually polished pages but collected less data and called tools less proactively.

— GLM5.2 cost ¥3.66 per run with no clear advantage over cheaper models in tool-calling diligence.

— Tool call success rates ranged from 96% (Gemini) to 99% (DeepSeek, Step), but failure impact varied by model resilience.

— Models default to dark tech-style pages unless the prompt explicitly requests a light, clean design.

Conclusions

Tool-calling diligence is a better predictor of agent reliability than any published benchmark score.

In this test, higher cost and visual polish went with thinner data: Gemini 3.5 Flash, the most expensive run at ¥9, produced the best-looking pages but collected less data. You pay more for looks and get less substance.

A 45x cost spread for functionally equivalent output means model selection is now a procurement decision, not just a technical one.

Model resilience to tool call failures matters more than the failure rate itself: MiniMax-M3's tool call failures did not derail its final output.
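The same idea can be applied on the harness side. The following is a minimal sketch, not the test's actual setup and not a description of how any of the five models works internally: a hypothetical `run_steps` helper retries a failing tool step, records the failure, and lets the rest of the chain continue. All names here (`RunLog`, `run_steps`, `flaky_search`) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunLog:
    """Outcome of one agent run: successful results plus recorded failures."""
    results: dict[str, str] = field(default_factory=dict)
    failures: list[str] = field(default_factory=list)


def run_steps(steps: dict[str, Callable[[], str]], max_attempts: int = 2) -> RunLog:
    """Run each named step; retry on error, then log the failure and move on."""
    log = RunLog()
    for name, step in steps.items():
        for attempt in range(1, max_attempts + 1):
            try:
                log.results[name] = step()
                break  # step succeeded, stop retrying
            except Exception as exc:
                if attempt == max_attempts:
                    # Record the failure instead of aborting the whole run.
                    log.failures.append(f"{name}: {exc}")
    return log


def flaky_search() -> str:
    """Hypothetical tool that always times out, to exercise the failure path."""
    raise TimeoutError("search timed out")


if __name__ == "__main__":
    log = run_steps({"search": flaky_search, "render": lambda: "<html>ok</html>"})
    print(log.results)
    print(log.failures)
```

Isolating each step this way means one failed search costs a data point, not the whole page.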

Agent tasks expose a model's 'personality': some rush to finish, others verify and enrich — and the prompt alone doesn't change this tendency.

Dark-mode default bias in AI-generated UIs is strong enough that teams should add a style constraint to every frontend prompt.
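One way to make such a constraint routine is to append it in code before a prompt is sent. This is a sketch under assumptions: the constraint wording and the helper name `with_style_constraint` are illustrative and do not come from the original test.

```python
# Illustrative wording; adjust to the team's own design guidelines.
LIGHT_STYLE_CONSTRAINT = (
    "Style constraint: use a light, clean design with a white or near-white "
    "background and dark text. Do not use a dark theme."
)


def with_style_constraint(prompt: str, constraint: str = LIGHT_STYLE_CONSTRAINT) -> str:
    """Append the style constraint unless the prompt already contains it."""
    if constraint in prompt:
        return prompt
    return f"{prompt.rstrip()}\n\n{constraint}"


if __name__ == "__main__":
    print(with_style_constraint("Build an interactive HTML directory of 20 AI tools."))
```

Because the helper leaves a prompt unchanged when the constraint is already present, it can be applied to every frontend prompt without duplicating the instruction.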

Concepts & terms

Agent task

A multi-step workflow where an AI model autonomously uses tools — search, code generation, file operations, self-checking — to complete a goal without turn-by-turn human prompting.

Tool call

An API-level action where the model invokes an external capability like web search, file reading, or code execution as part of completing a task.
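As a rough illustration, a tool call can be pictured as a structured request that the host program routes to a function. The JSON shape (`name`, `arguments`), the `TOOLS` registry, and the `web_search` stand-in below are all assumptions for this sketch; real model APIs each define their own formats.

```python
import json
from typing import Callable


def web_search(query: str) -> str:
    """Stand-in for a real search tool; returns canned text."""
    return f"results for {query!r}"


# Hypothetical registry mapping tool names to Python callables.
TOOLS: dict[str, Callable[..., str]] = {"web_search": web_search}


def dispatch(tool_call_json: str) -> str:
    """Parse a tool call emitted by a model and run the matching tool."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        # Return the error as text so the model can see it and adapt.
        return f"error: unknown tool {name!r}"
    return TOOLS[name](**call.get("arguments", {}))


if __name__ == "__main__":
    print(dispatch('{"name": "web_search", "arguments": {"query": "AI tools"}}'))
```

Returning unknown-tool errors as ordinary text, instead of raising, gives the model a chance to correct itself on the next step.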

Long-chain task

A task requiring many sequential steps and tool invocations, where failure or drift at any step can cascade into an unusable final result.
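A small calculation shows why per-call reliability compounds over a long chain. The sketch assumes tool calls fail independently and uses a hypothetical chain length of 30 calls; neither assumption comes from the test. The 96% and 99% figures are the success rates quoted in the takeaways.

```python
def clean_run_probability(per_call_success: float, num_calls: int) -> float:
    """Probability that all num_calls tool calls succeed,
    assuming each call fails independently (a simplifying assumption)."""
    if not 0.0 <= per_call_success <= 1.0:
        raise ValueError("per_call_success must be between 0 and 1")
    if num_calls < 0:
        raise ValueError("num_calls must be non-negative")
    return per_call_success ** num_calls


if __name__ == "__main__":
    NUM_CALLS = 30  # hypothetical chain length, not a figure from the test
    for rate in (0.96, 0.99):
        p = clean_run_probability(rate, NUM_CALLS)
        print(f"{rate:.0%} per call -> {p:.1%} chance of a failure-free run")
```

Under these assumptions, a 96% per-call rate gives roughly a 29% chance of a run with no failures at all, against roughly 74% at 99%. Most long runs will therefore hit at least one failed call, which is why recovery behavior matters as much as the raw rate.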

Flash-tier model

A smaller, faster, cheaper variant of a full-scale model, optimized for low latency and high throughput at the cost of some reasoning depth.

From the discussion

The discussion centers on a gap between benchmark hype and real-world performance. GPT and Claude are held up as the gold standard, while MiniMax-M3 is called out for high benchmark scores that don't translate into good practical results. GLM5.2 gets a nod as potentially usable, with a general wish for faster progress from domestic models.

— GPT and Claude remain the most effective models in practice.

— MiniMax-M3's strong benchmark scores do not reflect its real-world performance, which the commenter found poor.

— GLM5.2 is reportedly reaching a usable level.

— Domestic models still need to close the gap with leading Western models.

Featured comments

目标艾泽拉斯

GPT and Claude models still deliver the best results. When MiniMax-M3 first came out, its benchmark scores were said to be quite high, but the real-world performance isn't good. I haven't used the other models; I've heard GLM5.2 has reached a usable level. Hope domestic models develop quickly.
