Five AI Models Run the Same Agent Task — The Gap Between Hype and Reality

Don't Be Fooled by Model Hype — You'll Know Once You Run a Real Agent Task

More and more models are available now, each with its own strengths and focus areas. Promotional materials and benchmark scores alone make it hard to judge which one actually suits you, and the picture gets more complicated once tasks extend from single-turn conversations to multi-step operations.

So I picked a few mainstream models and ran them through the same real task to see how they actually perform and get a feel for them myself.

Evaluation Note: This is not a strict benchmark evaluation, but more of an experiential observation record around a single long-chain Agent task. The results are for reference only and do not constitute a comprehensive verdict on the models.

The models used in this test are MiniMax-M3, DeepSeek-V4-flash, Step-3.7-flash, GLM5.2, and Gemini3.5 flash.

The test task is to create a single-page HTML site for an "AI Tool Navigation Site," a page that aggregates AI tools. There are three core observation points:

Whether it can continuously call tools to complete the task
Whether it can stably generate a runnable page
Whether it proactively checks and fixes issues after the page is complete

The WorkBuddy Agent tool was used for every run. Costs are estimates of what the WorkBuddy platform consumed for this task and do not represent official API pricing.

The general prompt is as follows:

Please complete the full development task of an 'AI Tool Navigation Site', requiring independent completion from requirement understanding to page generation, data organization, code implementation, run checks, and issue fixing.
Task Objective:
Create a complete, runnable HTML single-page website with the theme 'AI Tool Navigation Site'. The page is used to display different types of AI tools, suitable for web demos, course materials, or official account long images.
Task Requirements:
1. Information Collection
Search online and organize 20 mainstream AI tools, covering categories such as AI Writing, AI Coding, AI Image, AI Video, AI Search, AI Office, etc. Each tool needs to include: tool name, owning company, main purpose, target audience, and official website link.
2. Data Organization
Group tools by category and organize them into structured data. Information must be accurate, avoid duplicate tools, and cover both domestic and international tools.
3. Page Design
Generate a clean, modern, tech-feel HTML page. The page needs to include a top title area, category filter area, tool card area, recommended tools area, comparison table area, and summary description area.
4. Interactive Features
The page needs to support filtering by tool category, keyword search, expanding tool card details, and a back-to-top button.
5. Code Implementation
Implement using a single HTML, CSS, and JavaScript file, without backend dependencies. Public CDN icon libraries or lightweight chart libraries can be used, but the page must run directly.
6. Run and Check
After completion, self-check the page for code errors, style misalignment, invalid buttons, missing links, filter failures, etc. If problems are found, proactively fix them.
7. Output Result
Finally output the complete runnable HTML file content, along with a brief description: what data sources were used, what modules the page contains, and what interactive features it has.
Special Requirements:
Please try to complete the entire task in one go. If tools such as search, web reading, code generation, file modification, run checks, and error fixing are needed during the process, please complete them continuously in a reasonable order without skipping steps. The final result is based on the runnable page.
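Requirements 1 and 2 of the prompt (20 tools, five fields each, no duplicates, category coverage) are the easiest parts to verify mechanically. The sketch below is one way to do that, not something the tested models produced: the `Tool` class, the `check_tools` function, and the placeholder records are all hypothetical, and the category names are taken from the prompt.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical record type mirroring the fields the prompt asks for.
@dataclass(frozen=True)
class Tool:
    name: str
    company: str
    purpose: str
    audience: str
    url: str
    category: str

# Categories named explicitly in the prompt (the prompt also says "etc.").
REQUIRED_CATEGORIES = {
    "AI Writing", "AI Coding", "AI Image",
    "AI Video", "AI Search", "AI Office",
}

def check_tools(tools, expected_count=20):
    """Return a list of problems; an empty list means the data passes."""
    problems = []
    if len(tools) != expected_count:
        problems.append(f"expected {expected_count} tools, got {len(tools)}")
    seen = set()
    for t in tools:
        key = t.name.strip().lower()
        if key in seen:
            problems.append(f"duplicate tool: {t.name}")
        seen.add(key)
        parsed = urlparse(t.url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"bad link for {t.name}: {t.url!r}")
        for field in ("company", "purpose", "audience"):
            if not getattr(t, field).strip():
                problems.append(f"{t.name}: missing {field}")
    missing = REQUIRED_CATEGORIES - {t.category for t in tools}
    if missing:
        problems.append(f"uncovered categories: {sorted(missing)}")
    return problems

# Placeholder records, not real tools: a duplicate name and a link without a scheme.
sample = [
    Tool("Tool A", "Example Co", "drafting", "writers", "https://example.com/a", "AI Writing"),
    Tool("tool a", "Example Co", "drafting", "writers", "example.com/a", "AI Writing"),
]
for problem in check_tools(sample, expected_count=2):
    print(problem)
```

A check like this only covers structure. Whether a link points to the right official site, or whether the company name is correct, still needs a human or a live lookup.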

MiniMax-M3

MiniMax-M3 performs relatively stably in this type of long-chain task.

It proactively makes multiple rounds of tool calls: searching for information, organizing data, generating page code, checking files, and fixing issues. The whole process looks like a properly functioning Agent rather than a model that stops at "here is a piece of code."

During testing, tool call failures were rare but not absent. One tool call failed in my run, but the model kept executing and still produced the page.

This is the finished page.

Judging from the final page, MiniMax-M3 does well on data completeness, page structure, and interactive features. It doesn't chase visual flash; its strength is process stability and a clear understanding of the task.

WorkBuddy points consumed: around 27, which converts to about 1.33 RMB.

Estimated by API unit price, MiniMax-M3 sits in the low-to-mid cost range, which suits Agent workflows that are run repeatedly.

Across multiple runs, MiniMax-M3's task completion rate was 100% and its tool call success rate was about 98%. The few failed tool calls did not affect the final result.
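To make the two rates concrete, here is a minimal sketch of how they could be computed from a run log. The article does not publish raw counts, so the `RunRecord` type, the `summarize` function, and the sample numbers are all hypothetical; they only illustrate the arithmetic.

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    completed: bool      # did the run end with a runnable page?
    tool_calls: int      # total tool calls attempted in the run
    tool_failures: int   # tool calls that returned an error

def summarize(runs):
    """Return (task completion rate, tool call success rate) as fractions."""
    if not runs:
        raise ValueError("need at least one run")
    completion = sum(r.completed for r in runs) / len(runs)
    calls = sum(r.tool_calls for r in runs)
    failures = sum(r.tool_failures for r in runs)
    success = (calls - failures) / calls if calls else 0.0
    return completion, success

# Made-up sample: 5 runs of 10 tool calls each, with a single failed call.
sample_runs = [RunRecord(True, 10, 0) for _ in range(4)] + [RunRecord(True, 10, 1)]
completion, success = summarize(sample_runs)
print(f"completion {completion:.0%}, tool call success {success:.0%}")
```

With these sample numbers, one failed call out of 50 gives a 98% tool call success rate while every run still completes, which is the pattern described above: a failed call does not have to mean a failed task.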

Simply put, MiniMax-M3's advantages are stability, low cost, and the ability to run to completion. It is suitable for batch page generation, data organization, code drafts, and lightweight Agent tasks.

Speaking of cheap, let's test the cheapest model in this lineup, DeepSeek-V4-flash, to see how it performs.

DeepSeek-V4-flash

Using the same prompt, I also tested DeepSeek-V4-flash.

DeepSeek-V4-flash's overall speed is relatively fast, and its responses are crisp. It performs well in understanding requirements, breaking down page modules, and generating HTML structures.

However, in long-chain tool calls, its style leans more towards "quickly completing the task." That is, it generates code very quickly, but in terms of data searching, data verification, and detail fixing, it is not as meticulous as MiniMax-M3 and Step-3.7-flash.

From the results, the page completes normally and the basic modules are all there: categories, cards, search, details, and tables are all covered.

DeepSeek-V4-flash is more suitable for speed-sensitive tasks. If you just want to quickly get a runnable HTML Demo, its efficiency is high.

But if the task requires a lot of data verification, page detail polishing, and multiple run fixes, it sometimes needs a manual reminder. For example, ask it to check links again, optimize styles again, or supplement data fields.
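The "check links again" step can also be done outside the model. Below is a minimal sketch, using only Python's standard library, that scans generated HTML for anchors with missing or placeholder `href` values. The `LinkCollector` class and `find_suspect_links` function are hypothetical names, and the check is offline only: it does not confirm that a URL actually resolves.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkCollector(HTMLParser):
    """Collect the href value (or None) of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append(dict(attrs).get("href"))

def find_suspect_links(html):
    """Return hrefs that are missing, placeholders, or not absolute http(s) URLs."""
    collector = LinkCollector()
    collector.feed(html)
    suspects = []
    for href in collector.links:
        if not href or href.strip() in ("#", "javascript:void(0)"):
            suspects.append(href)   # missing or placeholder link
            continue
        if href.startswith("#"):
            continue                # in-page anchor such as #top is fine
        parsed = urlparse(href)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            suspects.append(href)   # e.g. a URL written without a scheme
    return suspects

page = (
    '<a href="https://example.com">ok</a>'
    '<a href="#">placeholder</a>'
    '<a>no href</a>'
    '<a href="#top">back to top</a>'
    '<a href="example.com">no scheme</a>'
)
print(find_suspect_links(page))
```

Running a check like this after a fast first draft covers part of the "slight manual check" that this kind of workflow otherwise leaves to the reader.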

WorkBuddy points consumed: around 4, which converts to about 0.2 RMB.

On price, DeepSeek-V4-flash has an obvious cost advantage and suits high-frequency calls.

Across multiple runs, DeepSeek-V4-flash's task completion rate was 100% and its tool call success rate was about 99%.

My feeling is that DeepSeek-V4-flash is very suitable for a "quick generation + slight manual check" workflow. Speed and cost are good, but the detail stability of long-chain Agents still depends on the specific platform's tool environment.

Step-3.7-flash

Step-3.7-flash is the model in this test that best fits the "production-grade Agent" positioning.

It calls multiple tools readily, working through search, reading, organizing, generating, modifying, and checking without stopping. The whole process feels like fully executing a task rather than simply answering a question.

The resulting page has a typical dark tech style.

AI really likes this color scheme. Without specific instructions, many models default to generating dark-themed website pages. This isn't necessarily bad, but if you want a clean, bright, official-account-long-image style page, it's best to specify it clearly in the prompt.

Step-3.7-flash stands out in data organization. The AI tool data is relatively complete and the categorization is clear. It tries to cover the different categories (writing, coding, images, video, search, office, and so on) rather than just listing a few common tools.

From the perspective of page completeness, Step-3.7-flash has the highest content density. It tries to include all the modules required by the task, including the top title area, category filter, tool cards, recommended tools, comparison table, and summary description.

The test cost for this round was about 0.7 RMB.

From the unit price perspective, Step-3.7-flash is a medium-to-low priced contender. Its advantage isn't low price, but "can run continuously, few interruptions, high completion rate."

Across multiple runs, Step-3.7-flash's task completion rate was 100% and its tool call success rate was about 99%.

If your task is high-frequency, multi-turn, low-latency, and includes tool chains like search, files, code, and fixes, Step-3.7-flash is a model worth putting on your candidate list.

GLM5.2

Now let's look at the page GLM5.2 generated.

GLM5.2 performs well in code generation and page structure. It understands that this task requires a complete AI tool navigation site and can break down the page modules quite clearly.

From the results, the overall page completeness is acceptable. Categories, cards, search, and description areas are all covered.

GLM5.2's capabilities are relatively balanced, and it performs to its strengths in Agent tasks. Its biggest drawback is that it is expensive.

The test cost for this round was about 74 points, which converts to about 3.66 RMB.
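The point figures and RMB figures quoted so far can be cross-checked against each other. The sketch below uses only the three (points, RMB) pairs reported in this article and derives the conversion rate each one implies. The function name is hypothetical, and the actual WorkBuddy pricing rule is not stated here, so treat the result as a consistency check rather than an official rate.

```python
# (points, RMB) pairs as reported in this article; nothing else is assumed.
REPORTED = {
    "MiniMax-M3": (27, 1.33),
    "DeepSeek-V4-flash": (4, 0.2),
    "GLM5.2": (74, 3.66),
}

def implied_rates(reported):
    """Map each model to the RMB-per-point rate implied by its reported pair."""
    return {model: rmb / points for model, (points, rmb) in reported.items()}

rates = implied_rates(REPORTED)
for model, rate in rates.items():
    print(f"{model}: {rate:.4f} RMB per point")
```

All three pairs imply roughly 0.05 RMB per point, so the conversions are consistent with one another.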

Finally, let's test a foreign model, Gemini3.5 flash, to see how it performs.

Gemini3.5 flash

For frontend pages, Gemini's aesthetic sense has always been relatively on point. So here I used the Gemini3.5 flash model.

Below is the AI tool navigation page it produced.

Gemini3.5 flash's biggest advantage is that its pages are comfortable to look at.

The frontend pages it generates are more refined, with more comfortable layouts, and better whitespace and layering. Compared to the previous models, Gemini3.5 flash understands frontend design a bit better.

However, Gemini3.5 flash also has obvious problems.

Its visuals are better, but its data collection is less extensive than the previous models'. The gap is clearest against Step-3.7-flash, which collected more complete data, covered more categories, and called tools more proactively.

The test cost for this round was about 9 RMB.

Gemini3.5 flash is significantly more expensive, especially for tasks with many output tokens, tool calls, and code generation, where the cost runs well above that of domestic Flash-tier models.

If you have high requirements for page quality, you can try Gemini3.5 flash. It is suitable for display pages, official website demos, product introduction pages, and course material pages. But if you care more about cost and high-frequency calls, you still need to be cautious.

Test Result Comparison

Model	Task Completion Rate	Tool Call Success Rate	This Round Cost	Main Advantages	Main Disadvantages
MiniMax-M3	100%	~98%	~1.33 RMB	Stable, low cost, can run the complete process	Page aesthetics are mediocre, visual impact is average
DeepSeek-V4-flash	100%	~99%	~0.2 RMB	Fast, low cost, suitable for quick first drafts	Detail checking and page polishing sometimes need manual reminders
Step-3.7-flash	100%	~99%	~0.7 RMB	Proactive tool calls, complete data coverage, strong long-chain execution feel	Page easily defaults to dark tech style, needs style constraints upfront
GLM5.2	100%	~97%	~3.66 RMB	Balanced overall capability, good page structure and code completion	Proactive search, verification, and fixing execution feel is not the strongest
Gemini3.5 flash	100%	~96%	~9 RMB	Best page aesthetics, more mature layout, whitespace, and visual hierarchy	Cost is significantly higher, data collection and tool call proactiveness are not as good as Step-3.7-flash
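The cost column is easier to compare as multiples of the cheapest run. The sketch below copies the per-round costs from the table and ranks them; the function name is hypothetical, and the figures remain single-task WorkBuddy estimates rather than official API prices.

```python
# Per-round cost in RMB, copied from the comparison table.
COST_RMB = {
    "MiniMax-M3": 1.33,
    "DeepSeek-V4-flash": 0.2,
    "Step-3.7-flash": 0.7,
    "GLM5.2": 3.66,
    "Gemini3.5 flash": 9.0,
}

def relative_costs(costs):
    """Return (model, cost, multiple of cheapest) tuples, cheapest first."""
    cheapest = min(costs.values())
    ranked = sorted(costs.items(), key=lambda item: item[1])
    return [(model, cost, cost / cheapest) for model, cost in ranked]

for model, cost, multiple in relative_costs(COST_RMB):
    print(f"{model:<18} {cost:>5.2f} RMB  {multiple:>5.1f}x")
```

On these figures the most expensive run (Gemini3.5 flash) cost about 45 times the cheapest (DeepSeek-V4-flash), while all five models finished the task.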

Summary

In this test, what I paid more attention to was not single-turn answering ability, but whether the model could run a real task from start to finish.

On page aesthetics alone, Gemini3.5 flash is indeed stronger. The pages it generates look more like a finished product demo and are more comfortable to view.

On tool calls and data completeness, Step-3.7-flash stands out. It searches, organizes, generates, and checks more proactively, which suits long-chain Agent tasks.

On cost and stability, MiniMax-M3 is a very steady choice. It isn't flashy, but it completed the task in every run, and its occasional tool call failures did not noticeably affect the results.

DeepSeek-V4-flash's advantage is speed and low cost, suitable for quickly generating first drafts. GLM5.2 is relatively balanced, suitable for comprehensive tasks.

So model selection still depends on the scenario.

For display-oriented pages, prioritize Gemini. For production-grade Agent workflows, focus on Step-3.7-flash. For high-frequency, low-cost tasks, consider MiniMax-M3 and DeepSeek-V4-flash.

Comments

Top 1 from juejin.cn, machine-translated. The original thread is authoritative.

目标艾泽拉斯

GPT and Claude models still deliver the best results. When Minimax-M3 first came out, they said its benchmark scores were quite high, but the real-world performance isn't good. Haven't used the other models; I've heard GLM5.2 has reached a usable level. Hope domestic models develop quickly.