Benchmark Scores Are Lying to You: A Flutter Dev's Three-Tier Strategy for Picking the Right AI Model

Hello everyone, I'm Lao Liu.

In a previous article, "I canceled Trae and switched to this AI toolchain", I explained why I canceled my Trae subscription and switched to Copilot plus a third-party model plan.

In that article, I mentioned my three-tier model selection strategy, and many people privately asked me: "Lao Liu, what benchmarks do you look at when selecting large models for actual programming?"

So let's talk about which evaluations are actually worth using to judge how well a large model works, in this era of benchmark-chasing.

Note: Lao Liu's strategy and actual choices are based primarily on Flutter projects. This is for reference only.

No single benchmark can fully represent a model's performance in real work

Every benchmark is, at bottom, a fixed test set. It scores a large model on how it answers those questions inside a specific agent, or against fixed expected results.

So a benchmark score only represents the model's performance on specific types of problems within a specific agent.

Your actual work scenario is likely quite different from the test question set.

Take Lao Liu's work scenario as an example:

When we develop a requirement, the input is a PRD document from the product manager and UI/UX designs from the designers.

These documents lean on a lot of context that the team compresses into shorthand in daily work.

For example, a "single product page" or "campaign page" mentioned in the document might specifically refer to products or campaigns from this year's 618 or Double 11.

So in actual development we rely on a large number of skills or rules that make the large model clarify unclear concepts with the programmer instead of guessing.
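One form such a rule might take is a small glossary check that runs before the PRD reaches the model. The sketch below is only an illustration: `clarify_terms`, the watched-term list, and the glossary entry are hypothetical and do not come from any tool mentioned in this article.

```python
def clarify_terms(prd_text: str, watched_terms: list[str], glossary: dict[str, str]):
    """Split watched shorthand found in a PRD into resolved notes and open questions."""
    lowered = prd_text.lower()
    notes, questions = [], []
    for term in watched_terms:
        if term.lower() not in lowered:
            continue  # this shorthand is not used in the PRD
        if term in glossary:
            notes.append(f'"{term}" means: {glossary[term]}')
        else:
            # No definition on file: the model should ask instead of guessing.
            questions.append(f'What does "{term}" refer to in this requirement?')
    return notes, questions


# Hypothetical project data, for illustration only.
glossary = {"campaign page": "the landing page for this year's Double 11 campaign"}
watched = ["single product page", "campaign page"]
prd = "Add a share button to the Campaign Page and link it from the single product page."

notes, questions = clarify_terms(prd, watched, glossary)
print(notes)
print(questions)
```

The notes can be prepended to the prompt, while the questions go back to the programmer before any code is generated.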

If you want to truly evaluate a large model's performance in your actual work environment, the correct approach is to organize multiple past requirements and bugs into a private test set.

Then manually evaluate how the large model performs on your private test set in your work environment. This includes not only whether the functionality is completed or the bug is fixed, but also whether the generated code conforms to your entire project's coding standards and contextual habits.
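A private test set does not need heavy tooling. As a minimal sketch, assuming a human reviewer records three yes/no verdicts per case (task done, coding standards followed, project conventions respected), the results could be tallied like this. The `Review` fields, the case IDs, and the model names are placeholders, not real results.

```python
from dataclasses import dataclass


@dataclass
class Review:
    """One reviewer verdict for one model on one past requirement or bug."""
    case_id: str
    model: str
    task_done: bool          # feature completed or bug fixed
    follows_standards: bool  # matches the project's coding standards
    fits_context: bool       # respects the project's contextual habits


def pass_rates(reviews: list[Review]) -> dict[str, dict[str, float]]:
    """Per model, the share of cases passing each criterion and all three at once."""
    by_model: dict[str, list[Review]] = {}
    for review in reviews:
        by_model.setdefault(review.model, []).append(review)

    rates = {}
    for model, rs in by_model.items():
        n = len(rs)
        rates[model] = {
            "task_done": sum(r.task_done for r in rs) / n,
            "follows_standards": sum(r.follows_standards for r in rs) / n,
            "fits_context": sum(r.fits_context for r in rs) / n,
            "all_three": sum(
                r.task_done and r.follows_standards and r.fits_context for r in rs
            ) / n,
        }
    return rates


# Placeholder verdicts, for illustration only.
reviews = [
    Review("REQ-1", "model-a", True, True, True),
    Review("BUG-7", "model-a", True, False, True),
    Review("REQ-1", "model-b", True, True, False),
    Review("BUG-7", "model-b", False, False, False),
]
rates = pass_rates(reviews)
print(rates["model-a"])
print(rates["model-b"])
```

Tracking `all_three` separately matters because a model that finishes the task but ignores project conventions still creates review work.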

So are benchmarks meaningless?

Someone asked: "Lao Liu, I don't understand what you're saying. I just want to know which scores to look at."

They are not meaningless.

They do reflect a large model's ability within a specific problem domain: logical reasoning, mathematical derivation, writing a function for a specific feature, or delivering a specific requirement in long-horizon coding.

All of this, however, assumes the model has not overfitted to the test set.

In reality, new models top the charts every few days, yet their vendors publish scores for only a few benchmarks and stay silent about the rest.

So, if you want to use benchmark scores to preliminarily evaluate whether a large model is suitable for your workflow, my suggestion is to choose closed-source benchmarks.

That is, benchmarks where the test questions are not publicly available.

Here are the benchmarks Lao Liu refers to most often:

1. CursorBench

CursorBench: https://cursor.com/cn/cursorbench

This is a test set compiled by Cursor based on a large number of their cases. The biggest advantage is that this test set is not public.

Moreover, this benchmark is mainly aimed at development scenarios, so it has high reference value for us.

The problem is that this benchmark is not updated frequently, and many newly released Chinese models are missing from it.

So it is better suited to evaluating the true coding ability of the top models from outside China.

Additionally, Cursor's own model ranks very high in this benchmark, so it should be viewed with caution.

2. LiveBench

LiveBench: https://livebench.ai/

LiveBench's test set is also not public. It covers more test dimensions and adds newly released mainstream models promptly.

So if you subscribe to a Chinese coding plan and are struggling to decide which large model to use as your daily workhorse, I suggest looking at this benchmark.

Where do you think the models you consider strong rank?

For daily work, I recommend referring to the overall score. For code development, refer to the following four dimensions:

| Dimension | Description |
| --- | --- |
| Reasoning Average | Reasoning ability. Very important when building complex business logic. |
| Coding Average | Roughly, the ability to implement a single function-sized piece of functionality. In enterprise development, if you have broken the work down into very fine-grained pieces, this is a key dimension. |
| Agentic Coding Average | The ability to complete a functional module independently. Very important for independent developers. |
| IF Average | Instruction following. Important in every development scenario, and especially in enterprise development, where it directly determines whether generated code conforms to the project's coding standards. |
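How much each dimension matters depends on how you work, so one way to read the leaderboard is a weighted average of the four dimensions. This is a sketch only: the scores are placeholders rather than real LiveBench numbers, and both weightings are assumptions to be tuned to your own situation.

```python
DIMENSIONS = ("reasoning", "coding", "agentic_coding", "instruction_following")


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of the four dimension scores."""
    missing = [d for d in DIMENSIONS if d not in scores or d not in weights]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight


# Placeholder scores for an unnamed model, not real LiveBench numbers.
scores = {"reasoning": 80, "coding": 70, "agentic_coding": 50, "instruction_following": 90}

# Hypothetical weightings: enterprise work stresses coding and instruction
# following, while an independent developer stresses agentic coding.
enterprise = {"reasoning": 1, "coding": 2, "agentic_coding": 1, "instruction_following": 2}
independent = {"reasoning": 1, "coding": 1, "agentic_coding": 3, "instruction_following": 1}

print(weighted_score(scores, enterprise))   # 75.0
print(weighted_score(scores, independent))  # 65.0
```

The same model lands ten points apart under the two weightings, which is why the overall score alone can mislead.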

GLM-5.2 ranks very high on this leaderboard and is indeed a capable choice for programming tasks.

Do not confuse LiveBench with LiveCodeBench. The core differences are as follows:

| Comparison Dimension | LiveBench | LiveCodeBench |
| --- | --- | --- |
| Evaluation Scope | Comprehensive ability (coding, math, reasoning, writing, etc.) | Focuses on programming ability |
| Test Set Publicity | Closed, questions are regularly updated and replaced | Open source, test set is public |
| Anti-cheating Mechanism | Yes, prevents model overfitting to the test set | No, can be locally reproduced and verified |
| Question Source | Comprehensive question bank maintained by the LiveBench team | Competitive programming + real codebase problems |
| Reference Value | More suitable for evaluating overall model strength | Suitable for targeted evaluation of programming ability |
| Score Credibility | Higher, difficult to game | Relatively lower, may be overfitted |

In short: LiveBench is harder to "game" and has higher reference value; LiveCodeBench is suitable for targeted evaluation of programming ability.

3. DeepSWE

DeepSWE: https://deepswe.datacurve.ai/

Version 1.1, updated on June 20 this year, evaluates how frontier coding models perform on original, long-horizon engineering tasks.

The benchmark targets complex programming tasks. The questions are very difficult, and because it was only just released, no model has had the chance to memorize the answers.

I think this benchmark also reflects the true ability of models to some extent.

Among the top models, pick from the top three; of these, GPT-5.5 offers the best value for money.

In the second tier, GLM-5.2 is the most cost-effective, and it is also the most capable open-source model.
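The "top three, then best value" rule is simple arithmetic. Here is a minimal sketch, assuming value means benchmark score divided by cost. The model names and all numbers are placeholders, not real scores or prices.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    score: float  # benchmark score, higher is better
    cost: float   # what you pay per unit of work, in any consistent unit


def best_value_among_top(entries: list[Entry], top_n: int = 3) -> Entry:
    """Keep the top_n entries by score, then pick the best score per unit cost."""
    top = sorted(entries, key=lambda e: e.score, reverse=True)[:top_n]
    return max(top, key=lambda e: e.score / e.cost)


# Placeholder numbers, not real benchmark results or prices.
entries = [
    Entry("model-a", score=62, cost=30),
    Entry("model-b", score=60, cost=10),
    Entry("model-c", score=58, cost=12),
    Entry("model-d", score=40, cost=1),
]
print(best_value_among_top(entries).name)  # model-b
```

Filtering by score first keeps a very cheap but weak model, such as `model-d` here, from winning on price alone.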

Lao Liu's Three-Tier Model Selection Method: Save Money and Be Efficient

Lao Liu mentioned at the beginning of the article that benchmarks are only for reference; you need to evaluate models in your actual work scenario.

So here is a basic template. You can replace the models based on your actual work situation.

Lao Liu divides daily work into three tiers: Simple Tasks, Core Tasks, and Difficult Tasks.

1. Simple Tasks

For example, organizing the basic functions of a GitHub repository, aggregating data from several websites, developing a simple button, or modifying text prompts.

Model selection: deepseek-v4-flash, agnes-2.0-flash, mimo-v2.5, qwen3.6-plus

These tasks are usually simple and don't require top-tier large models, so lightweight models with low cost and high speed are the best choice.

Many platforms offer free usage quotas for these lightweight models, which is basically sufficient for daily work.

In fact, agnes-2.0-flash's ability is also basically sufficient for core tasks, but because it is currently completely free, it is recommended here for this tier.

2. Core Tasks

For example, writing a core page, writing the business logic for a module, comparing and selecting technical solutions.

Model selection:
Development tasks: glm-5.2
Non-development tasks: deepseek-v4-pro, qwen-3.7-max, gemini-3.5-flash

This tier actually offers many choices. Basically every mainstream first-tier model, Chinese or international, is capable. What matters is picking the best cost-performance ratio among the coding plans available to you.

Additionally, you need to test the effects of different models based on your actual work scenario.

For example, Lao Liu found that in this tier, whether for development tasks (Flutter projects) or non-development tasks, gemini-3.5-flash provides a very good experience, with fast speed and relatively accurate results.

3. Difficult Tasks

There are no typical examples for this tier. Generally, if a model can't handle a core task, Lao Liu directly switches to the top-tier large model and tries again.

Model selection: GPT-5.5 Thinking xHigh, Claude 4.8 Opus

When encountering difficult problems, top-tier models are likely to perform better, but they may not necessarily solve the problem.

It still requires cooperation between the developer and the model. For example, you provide doubts and ideas, and let the large model help you test and verify.

Lao Liu suggests that if you encounter a difficult problem, you should quickly switch to a top-tier model for analysis and processing.

Otherwise, you may burn a lot of time, energy, and tokens on one far-fetched workaround after another, only to end up with the problem unsolved and no money saved.
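The three tiers and the "escalate quickly" rule can be sketched as a small router. This is an illustration under assumptions: it uses one model per tier picked from the lists above, and `run` is a hypothetical callback standing in for whatever client your coding plan provides.

```python
from typing import Callable, Optional

# Tier order and one model per tier, taken from the lists above; swap in your own.
TIER_ORDER = ["simple", "core", "difficult"]
MODEL_FOR_TIER = {
    "simple": "deepseek-v4-flash",
    "core": "glm-5.2",
    "difficult": "GPT-5.5 Thinking xHigh",
}


def solve(task: str, tier: str, run: Callable[[str, str], Optional[str]]) -> tuple[str, str]:
    """Try the model for `tier`; on failure, move up one tier instead of retrying.

    `run(model, task)` is a hypothetical callback that returns the result,
    or None when the model could not handle the task.
    """
    for current in TIER_ORDER[TIER_ORDER.index(tier):]:
        model = MODEL_FOR_TIER[current]
        result = run(model, task)
        if result is not None:
            return model, result
    raise RuntimeError("top-tier model failed too; the developer has to step in")


def fake_run(model: str, task: str) -> Optional[str]:
    """Stand-in for a real model call: only the top-tier model succeeds."""
    return "patched" if model == "GPT-5.5 Thinking xHigh" else None


print(solve("fix scroll jank on the order list", "core", fake_run))
```

Each tier gets exactly one attempt, which reflects the advice to stop experimenting on a cheaper model once it has failed.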

Model Selection Isn't Everything

Of course, the quality of the model directly affects how well and how quickly tasks are completed.

But ultimately, for some tasks, even the best model may not achieve ideal results. Conversely, with the right approach, a non-top-tier model can achieve results comparable to a top-tier model.

This depends on context management: giving the model clear, undiluted context; configuring clear and efficient tools; using subagents to isolate the flood of information that tool calls bring in; and, across multi-turn loops, maintaining state while cleaning out and compressing unimportant information.

Essentially, this is what harness engineering aims to solve.
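As a rough illustration of the "clean and compress" part, the sketch below keeps the most recent turns verbatim and clips long tool output in older turns. It is an assumption-laden stand-in: a real harness would more likely summarize with a model or hand tool-heavy work to subagents, and `Turn` and `compact` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str  # "user", "assistant", or "tool"
    text: str


def compact(history: list[Turn], keep_recent: int = 4, tool_limit: int = 200) -> list[Turn]:
    """Keep the last `keep_recent` turns verbatim; clip long tool output in older turns."""
    cutoff = max(len(history) - keep_recent, 0)
    compacted = []
    for index, turn in enumerate(history):
        is_old_tool_dump = (
            index < cutoff and turn.role == "tool" and len(turn.text) > tool_limit
        )
        if is_old_tool_dump:
            dropped = len(turn.text) - tool_limit
            clipped = turn.text[:tool_limit] + f" ...[{dropped} chars dropped]"
            compacted.append(Turn(turn.role, clipped))
        else:
            compacted.append(turn)
    return compacted


history = [
    Turn("user", "find the crash"),
    Turn("tool", "x" * 1000),  # old, oversized log dump
    Turn("assistant", "looking"),
    Turn("tool", "y" * 300),   # recent tool output, kept as-is
    Turn("assistant", "found it"),
    Turn("user", "fix it"),
]
compacted = compact(history)
print(len(compacted[1].text), len(compacted[3].text))
```

User and assistant turns are never clipped here, on the assumption that they carry the state the model needs to stay on task.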

One thing is always worth keeping in mind: whether you can manage your context decides whether you get third-rate results from a top-tier model, or finish work that supposedly only top-tier models can do on a very cheap coding plan.

Summary

Benchmarks are just references. What truly determines efficiency is whether you have a model selection strategy that suits you.

A good model-matching strategy saves money and still solves difficult problems.

And with the right approach, a cheap model can match the results of a top-tier model. This is the core message Lao Liu wants to convey.

Feel free to leave a comment and share which models you usually use and how they perform.

🤝 If any readers are interested in client-side or Flutter development, feel free to contact Lao Liu. Let's learn from each other.

🎁 Send a private message to get Lao Liu's "Flutter Development Handbook" for free, covering 90% of application development scenarios. It can serve as a knowledge map for learning Flutter.

💬 : laoliu_dev

📂 Lao Liu has also organized his historical articles in a GitHub repository for easy reference.

🔗 https://github.com/lzt-code/blog