AI Coding · VibeCoding · Flutter

Benchmark Scores Are Lying to You: A Flutter Dev's Three-Tier Strategy for Picking the Right AI Model

By 程序员老刘 · Jun 24, 2026

For Western developers paying for AI coding plans, this is a reality check: benchmark scores are marketing, not engineering metrics. The three-tier strategy and the emphasis on context management offer a practical, cost-effective framework that challenges the assumption that you need the most expensive model for every task.

Summary

Public AI coding benchmarks are flooding social media, but a senior Flutter developer argues they are largely meaningless for real-world work. Most benchmarks use public test sets that models can overfit to, and they fail to capture the messy, context-rich environment of actual enterprise development — where a PRD document or a UI design carries months of implicit project history.

The developer recommends three closed-source benchmarks that are harder to game: CursorBench (for evaluating top non-Chinese models on coding), LiveBench (for overall ability, with regularly rotated questions), and DeepSWE (for complex, long-cycle engineering tasks). LiveBench is singled out as the most trustworthy for comparing the models offered in Chinese ("domestic") coding plans, along four key dimensions: Reasoning, Coding, Agentic Coding, and Instruction Following.

Beyond benchmarks, the developer's core insight is a three-tier model selection strategy: use cheap, fast models (like deepseek-v4-flash or agnes-2.0-flash) for simple tasks; mid-tier models (like glm-5.2 or gemini-3.5-flash) for core development; and top-tier models (GPT-5.5 Thinking or Claude 4.8 Opus) only for difficult problems. The real differentiator, however, is context management — a skill that can make a cheap model perform like a top-tier one.

Takeaways

— Public benchmarks are easy to game because their test sets are often open; models can overfit to them.

— Closed-source benchmarks like CursorBench, LiveBench, and DeepSWE are more reliable because their test sets are not public.

— LiveBench is recommended for comparing the models offered in Chinese ("domestic") coding plans, along four key dimensions: Reasoning, Coding, Agentic Coding, and Instruction Following.

— A three-tier model selection strategy saves money: cheap/fast models for simple tasks, mid-tier for core tasks, top-tier only for difficult problems.

— Context management — providing clear, undiluted context and using subagents to isolate tool calls — can make a cheap model perform like a top-tier one.

— GLM-5.2 is highlighted as the most cost-effective open-source model for programming tasks.

— GPT-5.5 Thinking and Claude 4.8 Opus are recommended for difficult problems.

— Agnes-2.0-flash is noted as a free model that can handle core tasks.

— DeepSWE v1.1 evaluates models on originality and long-cycle engineering tasks with very difficult, non-public questions.

Conclusions

The most valuable benchmark is the one you build yourself: a private test set of your own past requirements and bugs.
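
A private test set like this can start very small. The sketch below is one possible shape, not the author's tooling: `Case`, `pass_rate`, and `fake_model` are hypothetical names, and `fake_model` is a stand-in for whatever real model call you would plug in.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    name: str
    prompt: str                    # a past requirement or bug report from your project
    check: Callable[[str], bool]   # returns True if the model's output is acceptable


def pass_rate(generate: Callable[[str], str], cases: list[Case]) -> float:
    """Run every case through `generate` and return the fraction that pass."""
    if not cases:
        return 0.0
    passed = sum(1 for case in cases if case.check(generate(case.prompt)))
    return passed / len(cases)


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call, used only to exercise the harness.
    return "Text('hello')" if "label" in prompt else ""


cases = [
    Case("label widget", "Add a label widget", lambda out: "Text(" in out),
    Case("null crash", "Fix the null crash on logout", lambda out: "?." in out),
]
score = pass_rate(fake_model, cases)
```

Running the same `cases` against each candidate model gives a score that no vendor could have tuned for, because the prompts never leave your project.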

Benchmark scores track marketing hype more closely than they predict real-world coding performance.

The three-tier strategy implicitly acknowledges that AI coding is not a single capability but a spectrum of task difficulties requiring different model economics.

Context management is emerging as the true 'harness engineering' skill that separates effective AI users from those who waste money on top-tier models.

The recommendation of GLM-5.2 as a cost-effective open-source model signals that Chinese AI labs are competitive in the coding domain.

The emphasis on 'Instruction Following' as a key dimension reflects the reality of enterprise development where code must conform to strict project standards.

Although the article is tagged 'VibeCoding', its approach to AI coding is pragmatic and engineering-driven rather than hype-driven.

Concepts & terms

Closed-source benchmark

A benchmark whose test questions are not publicly available, making it much harder for AI models to overfit to them or 'game' the scores. Examples include CursorBench, LiveBench, and DeepSWE.

Context management

The practice of providing an AI model with clear, undiluted context, configuring efficient tools, using subagents to isolate information from tool calls, and cleaning/compressing information in multi-turn loops. It is considered the key skill for getting top-tier results from any model.
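
The "cleaning/compressing information in multi-turn loops" part can be illustrated with a small history compactor. This is a minimal sketch under stated assumptions: `Message`, `estimate_tokens`, and `compact_history` are hypothetical names, the 4-characters-per-token estimate is a crude guess, and real tools would usually summarize rather than truncate.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # "system", "user", "assistant", or "tool"
    content: str


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def compact_history(messages, budget_tokens, tool_output_limit=200):
    """Truncate long tool outputs, then drop the oldest non-system
    messages (never the latest one) until the history fits the budget."""
    compacted = []
    for m in messages:
        if m.role == "tool" and len(m.content) > tool_output_limit:
            # Keep only the head of a noisy tool result.
            compacted.append(Message(m.role, m.content[:tool_output_limit] + " [truncated]"))
        else:
            compacted.append(m)

    def total(msgs):
        return sum(estimate_tokens(m.content) for m in msgs)

    while total(compacted) > budget_tokens:
        # Candidates exclude system messages and the most recent message.
        droppable = [i for i, m in enumerate(compacted[:-1]) if m.role != "system"]
        if not droppable:
            break
        del compacted[droppable[0]]
    return compacted
```

The function returns a new list and leaves the input untouched, so the full history can still be logged or replayed elsewhere.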

Three-tier model selection strategy

A cost-optimization approach where different AI models are used for different task difficulties: cheap/fast models for simple tasks, mid-tier models for core tasks, and top-tier models only for difficult problems.
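
As a rough illustration, the strategy amounts to a routing function. The model names below are the examples mentioned in the article; everything else (the `Task` fields, the thresholds, the escalation rule) is a hypothetical heuristic chosen for the sketch, not something the author specifies.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    FAST = "fast"   # cheap, quick models for simple tasks
    CORE = "core"   # mid-tier models for core development
    TOP = "top"     # top-tier models, reserved for difficult problems


# Example model names from the article; the mapping is illustrative only.
MODEL_BY_TIER = {
    Tier.FAST: "deepseek-v4-flash",
    Tier.CORE: "glm-5.2",
    Tier.TOP: "Claude 4.8 Opus",
}


@dataclass
class Task:
    description: str
    files_touched: int     # hypothetical difficulty signal
    failed_attempts: int   # how many times a cheaper tier has already failed


def pick_tier(task: Task) -> Tier:
    """Hypothetical heuristic: escalate only when cheaper tiers look insufficient."""
    if task.failed_attempts >= 2 or task.files_touched > 10:
        return Tier.TOP
    if task.failed_attempts == 1 or task.files_touched > 1:
        return Tier.CORE
    return Tier.FAST


def pick_model(task: Task) -> str:
    return MODEL_BY_TIER[pick_tier(task)]
```

Counting failed attempts means the expensive model is reached by escalation rather than by default, which is where the cost saving comes from.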

Harness engineering

The discipline of building the infrastructure (context management, tool configuration, subagent isolation) that allows a developer to get the best performance out of an AI coding model, regardless of the model's inherent capability.
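
Subagent isolation, one piece of that infrastructure, can be sketched as follows. All names here (`run_subagent`, `main_agent`, the `tools` and `summarize` callables) are hypothetical and do not correspond to any particular framework's API; the point is only that raw tool output stays in the subagent's scratch space and the main context receives a short summary.

```python
from typing import Callable


def run_subagent(
    task: str,
    tools: dict[str, Callable[[str], str]],
    summarize: Callable[[str, list[str]], str],
) -> str:
    """Run every tool in a private scratch context and return only a summary."""
    scratch: list[str] = []   # raw tool output lives here and is discarded afterwards
    for name, tool in tools.items():
        scratch.append(f"{name}: {tool(task)}")
    return summarize(task, scratch)


def main_agent(task, tools, summarize) -> list[str]:
    """Build the main context; it never sees raw tool output."""
    context = [f"task: {task}"]
    context.append(f"subagent summary: {run_subagent(task, tools, summarize)}")
    return context


# Hypothetical tool that returns a large, noisy result.
tools = {"grep": lambda task: "match\n" * 500}
summarize = lambda task, scratch: (
    f"{len(scratch)} tool call(s), {sum(len(s) for s in scratch)} chars reviewed"
)
context = main_agent("find usages", tools, summarize)
```

In practice `summarize` would itself be a model call, often to a cheap fast-tier model, since condensing tool output is a simple task.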

Source: juejin.cn