The Three Tiers of AI Skill Design: Correct, Robust, Adaptive

How to Design a High-Quality Skill

Three tiers: first write it correctly, then write it robustly, and finally write it adaptively. Plus a benchmark closed loop.

I've written quite a few Skills in real projects, and I've stepped in more pitfalls than the number of Skills I've written 😅

Looking back, most problems weren't "I don't understand how to write the syntax." The real issue was that, from the very beginning, I didn't take the Skill seriously. If you treat it as a prompt, it delivers at the level of a prompt. If you build it as a small system, it can handle complex scenarios.

⚠️ Disclaimer: Everything below is purely my personal experience. If you don't buy it, that's perfectly normal too 😅

1. First, Write It Correctly

This tier solves only one thing: Can the AI execute the instructions you wrote without deviation?

Sounds simple, but at least half of the Skill files I've reviewed fail at this stage 🤷

Instructions Are Not Explanations

SKILL.md is an operational instruction set, not a knowledge base document. Explanations are only meaningful when they "define behavioral boundaries."

Let's first talk about why this distinction matters. Documents written for humans need to lay out background, explain motivation, and discuss design rationale. AI doesn't work that way: feeding it backstory doesn't make it "understand more deeply"; it just gets distracted by irrelevant information.

So, can explanatory content never be written? No. Keep only one type: content that defines the scope of behavior. For example, "This field only takes effect when the document source is a wiki"—this is defining conditions and must stay. But explanations like "Why this step was designed this way"—try deleting them. If the AI's behavior doesn't change, they should be cut.

The standard is one sentence: If you delete this section, will the AI not know what to do in some scenario? No? Then delete it.

Prioritize Key Points

Each instruction: **Keyword/Action Name**: Concise description, allowing the AI to locate key information at a glance.

The reason: AI reads files differently than humans. A human can scan a whole page and summarize the key points themselves. AI relies more on explicit formatting signals—bold text, beginnings of lines, list items—these are its signposts for extracting information. If you make it find key information in regular paragraphs on its own, it's prone to miss or misinterpret things.

Comparing two writing styles makes it clear. Bad style:

"When processing Feishu documents, please note that if you need to read the document content, you cannot directly read the full text. You need to first use the document link to get something called a block tree..."

Good style:

**Document Reading**: First call the block tree API and parse the content recursively; direct full-text requests are forbidden.

**Block Parsing**: Route by type: text/header/list go to text extraction, table goes to the table parser.

**Capacity Limit**: A maximum of 500 blocks per single fetch; pagination required if exceeded.

In the first style, the AI has to summarize for itself, "Oh, that means get the block tree first." The summarization step itself can go wrong. In the second style, the AI scans once and locates the information it needs.

Keep the same structure for every instruction: bold text is the anchor, what follows the colon is the action and constraint. The AI forms a habit: seeing bold text means this is something to execute.
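One way to keep this format from drifting is to lint it. Below is a minimal sketch, assuming one instruction per line in the form `**Name**: description`; the regular expression and the `lint_instructions` helper are my own illustration, not part of any Skill tooling.

```python
import re

# Assumed convention: "**Name**: description", one instruction per line.
INSTRUCTION_RE = re.compile(r"^\*\*(?P<name>[^*]+)\*\*:\s+(?P<body>\S.*)$")


def lint_instructions(lines):
    """Return (line_number, text) pairs for lines that break the pattern."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # blank lines between instructions are allowed
        if not INSTRUCTION_RE.match(text):
            problems.append((number, text))
    return problems


sample = [
    "**Document Reading**: First call the block tree API; direct full-text requests are forbidden.",
    "When processing documents, please note that you cannot read the full text directly.",
    "**Capacity Limit**: A maximum of 500 blocks per single fetch; paginate if exceeded.",
]
for number, text in lint_instructions(sample):
    print(f"line {number} has no bold anchor: {text}")
```

Only the second line is reported, because it is the one written as a free-form paragraph.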

Structured Expression

Arrange instructions in four layers: role boundaries, main flow, constraints, exceptions. The position of information determines the AI's understanding priority.

The arrangement order of instructions in the file determines the AI's understanding priority.

Position is weight. If a forbidden item is buried in the sixth paragraph of the third layer, the AI will likely ignore it. Place constraints directly in a prominent position, and the AI remembers the boundaries first before starting work.

I layer it like this:

| Layer | Content | How AI Sees It |
| --- | --- | --- |
| ① | Role and Boundaries: What I manage, what I don't | Build the framework first |
| ② | Main Flow: How the normal path goes | Then recognize the path |
| ③ | Constraints and Prohibitions: What cannot be touched | Then remember the red lines |
| ④ | Exception Handling: What to do when the path is blocked | Finally remember the fallback |

And within the same layer, there must also be priority. Group constraints of the same type together; don't scatter them across different paragraphs. For example—capacity limits, authorization checks, timeout handling—these three belong to the same constraint layer and should be written together:

**Capacity Limit**: A maximum of 500 blocks per single fetch; pagination required if exceeded.

**Authorization Check**: Only process documents the current user has permission for; return E002 for unauthorized ones.

**Timeout Handling**: Single operations exceeding 30 seconds are considered failed; return E003 and retry.

Placed together, the AI knows "Oh, these are all constraint-type rules," rather than encountering one scattered in some paragraph.
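The four-layer order can also be checked mechanically. Here is a rough sketch, assuming the SKILL.md uses markdown headings whose titles match the four layer names exactly; both the heading names and `check_layer_order` are hypothetical and only illustrate the idea.

```python
# Assumed heading titles, taken from the four layers described above.
EXPECTED_ORDER = [
    "Role and Boundaries",
    "Main Flow",
    "Constraints and Prohibitions",
    "Exception Handling",
]


def check_layer_order(skill_text):
    """Report which layers are missing and whether the present ones are in order."""
    headings = [line.lstrip("#").strip()
                for line in skill_text.splitlines() if line.startswith("#")]
    found = [title for title in headings if title in EXPECTED_ORDER]
    missing = [title for title in EXPECTED_ORDER if title not in found]
    expected_subset = [title for title in EXPECTED_ORDER if title in found]
    return {"missing": missing, "in_order": found == expected_subset}


draft = """# Main Flow
Fetch the block tree, then parse it.
# Role and Boundaries
Handles document reading only.
# Constraints and Prohibitions
At most 500 blocks per fetch.
"""
print(check_layer_order(draft))
```

For this draft the check reports that "Exception Handling" is missing and that the layers are out of order, since the main flow comes before the role definition.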

Mermaid First

Draw conditional branching logic directly as diagrams; don't use text to go in circles. The only exception is logic that can be stated clearly in one sentence.

Describing branching logic like "If A then B, else if C then D" in text has three hard flaws: deep nesting inevitably creates ambiguity, edge cases are easily missed, and AI's reasoning over natural language conditions is unstable.

Mermaid's advantage: it has no room for ambiguity. What a three-line diagram can make clear, text might circle around for ages and still get wrong.

graph TD
    A[Receive Request] --> B{Is Document Operation?}
    B -->|Yes| C[Get Block Tree]
    B -->|No| D[Route to Corresponding Skill]
    C --> E{Block Count > 500?}
    E -->|Yes| F[Paginated Fetch]
    E -->|No| G[Single Fetch]
    F --> H[Recursive Parsing]
    G --> H

This rule also applies to logic like "one condition leading to multiple branches" and "loops/fallbacks." The judgment standard is one thing: if describing the branch in text takes more than three sentences to make clear, put it in Mermaid.
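If you'd rather not hand-write the diagram, the branches can live in a small table and be rendered from it. This is a sketch under my own assumptions: `to_mermaid` and `decisions_missing_branches` are illustrative helpers, and the node table simply restates the diagram above. The second helper flags decision nodes with fewer than two exits, which is the "edge case easily missed" problem in checkable form.

```python
def to_mermaid(nodes, edges):
    """Render nodes {id: (text, is_decision)} and edges [(src, dst, label)]."""
    seen = set()

    def render(node_id):
        if node_id in seen:
            return node_id  # already declared, so reference it by id only
        seen.add(node_id)
        text, is_decision = nodes[node_id]
        return f"{node_id}{{{text}}}" if is_decision else f"{node_id}[{text}]"

    lines = ["graph TD"]
    for src, dst, label in edges:
        arrow = f"-->|{label}|" if label else "-->"
        lines.append(f"    {render(src)} {arrow} {render(dst)}")
    return "\n".join(lines)


def decisions_missing_branches(nodes, edges):
    """Return ids of decision nodes that have fewer than two outgoing edges."""
    outgoing = {}
    for src, _dst, _label in edges:
        outgoing[src] = outgoing.get(src, 0) + 1
    return [node_id for node_id, (_text, is_decision) in nodes.items()
            if is_decision and outgoing.get(node_id, 0) < 2]


NODES = {
    "A": ("Receive Request", False),
    "B": ("Is Document Operation?", True),
    "C": ("Get Block Tree", False),
    "D": ("Route to Corresponding Skill", False),
    "E": ("Block Count > 500?", True),
    "F": ("Paginated Fetch", False),
    "G": ("Single Fetch", False),
    "H": ("Recursive Parsing", False),
}
EDGES = [
    ("A", "B", ""), ("B", "C", "Yes"), ("B", "D", "No"), ("C", "E", ""),
    ("E", "F", "Yes"), ("E", "G", "No"), ("F", "H", ""), ("G", "H", ""),
]
print(to_mermaid(NODES, EDGES))
print(decisions_missing_branches(NODES, EDGES))
```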

When the material format is clean, the AI is less likely to go off track.

2. Then, Write It Robustly

After writing correctly comes the next hurdle: when the task gets complex and instructions multiply, the AI itself gets confused.

At this point, piling instructions into one file is useless. Architecture is needed.

Orchestrator

In a complex Skill, SKILL.md only does scheduling. Stage definitions, I/O contracts, gate declarations—details are pushed down.

Simple Skills, such as translation or text completion, are fine with a few dozen lines in a single file. Complex Skills are different: "meeting minutes organization" has five stages (querying, extracting, structuring, generating, and sending). Crammed into one file, the AI forgets the beginning by the time it reads the end, and execution quality actually drops.

The orchestrator manages only three things:

How many stages the process has, and their sequence.
The input and output of each stage.
Where the hard constraints and gates are placed.

It does not carry the details of each stage on its own back.

Details are pushed down to sub-files or scripts. Keeping the orchestrator around 300 lines is a rule of thumb—fewer lines suggest the stages aren't granular enough; more lines suggest the orchestrator is carrying too many details it shouldn't.
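To make "stage definitions, I/O contracts, gate declarations" concrete, here is a minimal sketch of what the orchestrator's view of the meeting-minutes Skill could look like as data. Every field name (`meeting_id`, `raw_transcript`, and so on) and every gate text is an assumption made up for illustration; the point is only that the contract can be checked before anything runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    inputs: tuple
    outputs: tuple
    gate: str  # human-readable gate declaration; details live in sub-files


# Hypothetical contract for the five stages mentioned above.
STAGES = [
    Stage("query", ("meeting_id",), ("raw_transcript",), "transcript is non-empty"),
    Stage("extract", ("raw_transcript",), ("decisions", "action_items"), "at least one item found"),
    Stage("structure", ("decisions", "action_items"), ("outline",), "all required sections present"),
    Stage("generate", ("outline",), ("minutes_doc",), "document has a top-level heading"),
    Stage("send", ("minutes_doc",), ("delivery_receipt",), "recipient list confirmed"),
]


def check_contracts(stages, initial_inputs):
    """Verify that every stage's inputs are produced by an earlier stage."""
    available = set(initial_inputs)
    errors = []
    for stage in stages:
        missing = [name for name in stage.inputs if name not in available]
        if missing:
            errors.append(f"{stage.name}: missing inputs {missing}")
        available.update(stage.outputs)
    return errors


print(check_contracts(STAGES, ["meeting_id"]))      # full pipeline
print(check_contracts(STAGES[1:], ["meeting_id"]))  # pipeline with "query" removed
```

Dropping the first stage makes `extract` fail the check, because nothing upstream produces its input any more.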

Mini-Loop

Each key step internally self-validates in a loop, rather than running a straight line to the end. Do one step, validate one step, only proceed after passing.

If a Workflow is a straight line of A → B → C → Done, any problem in the middle step will cause the entire chain to break or pass dirty data downstream. A better approach is for each step to loop back to validation itself.

Global Flow
  ├─ Collection Stage
  │   └─ Mini-Loop: Pull Data → Check Completeness → Fill if Missing → Re-check → Pass, then Proceed
  ├─ Processing Stage
  │   └─ Mini-Loop: Process → Format Validation → Fix if Wrong → Re-validate → Pass, then Proceed
  └─ Output Stage
      └─ Mini-Loop: Output → Quality Judgment → Optimize if Not Good Enough → Re-judge → Pass, then Deliver

A global loop means redoing the entire task from scratch if it fails. A mini-loop means fixing a step within that step if it's not done well, without passing dirty data to the next step.

Mini-loop design must focus on three points. First, the validation condition must be decidable: it can't be a vague standard like "good quality"; it must be checkable conditions like correct format, complete fields, correct quantity. Second, there's a cap on the number of retries: generally, if it fails after three rounds of fixes, degrade or throw to the orchestrator. Third, there must be an exit when it can't be fixed: if a step can't handle it, how the orchestrator takes over must be thought out in advance.
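The three points map almost one-to-one onto code. Below is a minimal sketch, assuming the step is given as three callables; `run_mini_loop`, the required fields, and the "TBD" placeholder fix are all hypothetical and only show the shape of the loop.

```python
def run_mini_loop(produce, validate, fix, max_fixes=3):
    """Run one step; validate and fix in place, at most max_fixes times.

    Returns (status, artifact, reasons). Status is "pass", or "escalate"
    when the cap is reached and the orchestrator has to take over.
    """
    artifact = produce()
    reasons = validate(artifact)  # decidable check: a list of concrete failures
    fixes = 0
    while reasons and fixes < max_fixes:
        artifact = fix(artifact, reasons)
        fixes += 1
        reasons = validate(artifact)
    status = "pass" if not reasons else "escalate"
    return status, artifact, reasons


REQUIRED = ("title", "owner", "due")


def validate_record(record):
    return [f"missing field: {name}" for name in REQUIRED if name not in record]


def fill_one(record, reasons):
    # Toy fix: fill the first missing field with a placeholder.
    name = reasons[0].split(": ")[1]
    return {**record, name: "TBD"}


status, record, reasons = run_mini_loop(
    lambda: {"title": "Weekly sync"}, validate_record, fill_one)
print(status, record)

# A fix that changes nothing hits the cap and escalates instead of looping forever.
stuck = run_mini_loop(lambda: {}, validate_record, lambda rec, why: rec)
print(stuck[0], stuck[2])
```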

Hard/Soft Layering

Deterministic behavior → solidified in scripts; flexible judgment → guided by instructions. Scripts set the floor, instructions raise the ceiling.

If you look closely at what a Skill needs to do, it actually falls into two types of work.

One type is deterministic work: assembling URLs, calculating dates, validating formats. No ambiguity; whoever does it gets the same result. This should not be left for the AI to reason about—using the AI's brain to calculate a date in a fixed format wastes tokens and has a probability of error.

The other type is non-deterministic work: judging text types, choosing strategies for exceptions, adjusting result styles. There's no standard answer; it relies on context for flexible decisions. This is what should be left to the AI, with instructions telling it the principles and priorities, letting it judge on the spot.

Take Feishu document operations as an example. The block tree parsing logic is fixed: a 50-line script encapsulates it, and the AI doesn't need to derive it, it just calls it. But how do you lay out a nested table with messy formatting? The AI judges on the spot; instructions give the principles.

Deterministic things anchored by scripts—the floor can't collapse. Non-deterministic things given judgment space by instructions—the ceiling can emerge.
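For the deterministic side, the script can be tiny. Here is a sketch of the "calculate dates, assemble URLs" kind of work, using only the Python standard library; the base URL and the parameter names are placeholders I made up, not a real Feishu endpoint.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def week_range(today):
    """Return (monday, sunday) of the week containing `today` as ISO strings."""
    monday = today - timedelta(days=today.weekday())
    sunday = monday + timedelta(days=6)
    return monday.isoformat(), sunday.isoformat()


def build_query_url(base_url, doc_id, today):
    """Assemble a query URL; the parameter names are illustrative only."""
    start, end = week_range(today)
    query = urlencode({"doc_id": doc_id, "start": start, "end": end})
    return f"{base_url}?{query}"


# Placeholder host: replace with whatever API the Skill actually calls.
print(build_query_url("https://example.invalid/api/blocks", "abc123", date(2024, 5, 15)))
```

The model only decides that it needs "this week's blocks"; the arithmetic and the escaping never pass through its head.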

HARD Gates

Set hard checkpoints at four types of positions: entry, between steps, exit, and security. If you can't pass, stop. A gate is a stop, not a reminder followed by carrying on.

Writing "forbidden" or "must" in instructions is a soft constraint. It works most of the time, but in high-stakes scenarios, the AI might be led astray by context and still step on the red line.

HARD gates mean: If you can't pass, you can't pass. It's not a reminder; it's a stop.

Four types of positions must have checkpoints:

| Position | What It Does | Example |
| --- | --- | --- |
| Entry | Preconditions not met → Reject, state reason | Document link must start with `https://`, otherwise block and return E001 |
| Between Steps | Previous step's output unqualified → Block at the boundary | Block parsing result empty → Do not enter content extraction, return 'Document has no content' |
| Exit | Final result not up to standard → Block from output | Extracted text missing a top-level heading → Send back for correction, do not output directly |
| Security | Sensitive operations → Secondary confirmation | Delete operation → Must have user confirmation |

Each gate is written as a decidable condition—not "looks good quality," but specific check logic. After triggering, it's not "remind and continue"; there are only three paths: go back and redo, degrade to a fallback, or directly block and return the reason.
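Written as code, the four example gates from the table look roughly like this. It's a sketch, not a fixed interface: `GateResult` and the function names are hypothetical, and treating a line that starts with `# ` as a top-level heading assumes the extracted text is markdown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    passed: bool
    action: str  # "continue", "redo", "fallback", or "block"
    reason: str = ""


def entry_gate(doc_url):
    if not doc_url.startswith("https://"):
        return GateResult(False, "block", "E001: document link must start with https://")
    return GateResult(True, "continue")


def between_steps_gate(blocks):
    if not blocks:
        return GateResult(False, "block", "Document has no content")
    return GateResult(True, "continue")


def exit_gate(text):
    # Assumption: output is markdown, so "# " marks a top-level heading.
    has_heading = any(line.startswith("# ") for line in text.splitlines())
    if not has_heading:
        return GateResult(False, "redo", "extracted text is missing a top-level heading")
    return GateResult(True, "continue")


def security_gate(operation, user_confirmed):
    if operation == "delete" and not user_confirmed:
        return GateResult(False, "block", "delete requires explicit user confirmation")
    return GateResult(True, "continue")


print(entry_gate("http://example.invalid/doc"))
print(exit_gate("Meeting notes without a heading"))
```

None of the failing results carries the action "continue", which is the whole difference between a gate and a reminder.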

The second tier boils down to this: single-file instructions are "doing this one thing well"; after adding an orchestrator, mini-loops, script layering, and gates, it's "no matter how complex, it won't break." Every added layer has a clear division of labor—fixing what the previous layer couldn't handle.

3. Finally, Write It Adaptively

The first two tiers operate within a pre-defined box. Task boundaries are known, processes are set in advance.

But there's a class of scenarios: when the AI starts, the information is incomplete. It has to find things itself, take one step and look at the next, adjusting strategy based on intermediate results 🤔

At this point, it no longer relies on one or two scripts, but on a family of scripts defined on demand, each with its own role.

Defined On Demand

Scripts are named with English verbs; the name reveals the function. validate is the baseline; the rest are added as needed.

The name directly reflects what the script does. Write whatever your Skill needs; the only requirement is: the model sees the name and knows when to call it.

Each type of script has its own positioning. Let's illustrate with a few common ones:

validate: Checks the correctness of the previous step's output. Takes the output artifact and checks it item by item against preset conditions, returning pass/fail, with specific reasons for failure. No matter how simple or complex the Skill, validate should exist—it's the baseline. This is the executable version of the HARD gates mentioned earlier: gates written in the orchestrator are paper rules; written in a validate script, they become automated checkpoints.

search: Searches for information inside the Skill itself. Internally indexed data, cached results, preset knowledge entries. The model calls search to find things in its own data pool; it has nothing to do with the outside.

research: Searches for external information. Data is not inside the Skill; it needs to call external APIs, query databases, pull real-time data. The essential difference from search: search rummages through its own drawer; research knocks on someone else's door.

audit: Keeps logs. Records what script was run at each step, what the input and output were. When something goes wrong, you don't need to flip through dozens of pages of conversation logs; audit strings the key nodes into a timeline.

grade: Assigns a score. validate is "pass or fail" (binary judgment); grade is "how good" (level judgment). Two different concepts.

flow: Manages where to go next. The orchestrator defines key branching nodes; the flow script is the navigator—call it, and it returns the possible directions from the current state. It doesn't choose for the model, but ensures the model doesn't invent a path that doesn't exist.

Add whatever is needed; don't pad the numbers. But validate is the foundation.
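To show how differently these scripts behave, here is a small sketch of three of them: validate (binary), grade (level), and audit (timeline). The field names, the A/B/C thresholds, and the `AuditLog` class are assumptions of mine for illustration; a real Skill would define its own conditions.

```python
def validate(artifact, required_fields):
    """Binary check: pass/fail plus a specific reason for every failure."""
    reasons = [f"missing field: {name}" for name in required_fields
               if not artifact.get(name)]
    return {"passed": not reasons, "reasons": reasons}


def grade(artifact, optional_fields):
    """Level judgment: how complete is the artifact? (thresholds are arbitrary)"""
    filled = sum(1 for name in optional_fields if artifact.get(name))
    ratio = filled / len(optional_fields) if optional_fields else 1.0
    if ratio >= 0.8:
        return "A"
    if ratio >= 0.5:
        return "B"
    return "C"


class AuditLog:
    """Keeps an ordered timeline of which script ran and what it returned."""

    def __init__(self):
        self._entries = []

    def record(self, script, given, returned):
        self._entries.append({"step": len(self._entries) + 1, "script": script,
                              "input": given, "output": returned})

    def timeline(self):
        return [f"{e['step']}. {e['script']}: {e['output']}" for e in self._entries]


audit = AuditLog()
draft = {"title": "Weekly sync", "summary": "Roadmap review", "owner": "", "due": ""}
check = validate(draft, ("title", "summary"))
audit.record("validate", draft, check)
level = grade(draft, ("summary", "owner", "due", "links"))
audit.record("grade", draft, level)
print("\n".join(audit.timeline()))
```

The draft passes validate and still gets a low grade, which is exactly why the two are separate scripts.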

Shuttle Interaction

After scripts are combined, the model no longer "reads instructions and works in silence," but shuttles back and forth between scripts.

Specifically, the execution chain becomes like this:

graph TD
    Start[Receive Task] --> Search[search: Check Internal Info]
    Search --> NeedExternal{Internal Info Sufficient?}
    NeedExternal -->|No| Research[research: Check External Info]
    Research --> NeedExternal
    NeedExternal -->|Yes| Exec[Execute One Step]
    Exec --> Audit1[audit: Record Execution]
    Audit1 --> Validate[validate: Check]
    Validate --> Pass{Pass?}
    Pass -->|No| Fix[Fix]
    Fix --> Audit2[audit: Record Fix]
    Audit2 --> Exec
    Pass -->|Yes| Audit3[audit: Record Pass]
    Audit3 --> Flow[flow: Next Step?]
    Flow --> Done{Done?}
    Done -->|No| Search
    Done -->|Yes| Grade[grade: Score]
    Grade --> Output[Output Result + Score]

Each step is backed by a corresponding script. The model first rummages through its own drawer; if insufficient, goes out to find; does, checks, records, evaluates, moves—each step has deterministic assurance, and each step's decision power is returned to the model.

How this relates to the mini-loops from earlier: a mini-loop is self-correction within a single step; the script family is adaptation across multiple steps. One is nested inside the other.

Dual Constraints

The more scripts, the greater the risk of deviation. Flow constraints manage direction; validation constraints manage quality. Both must pull simultaneously.

But scripts alone aren't enough. Scripts provide the model with an operating space—but the larger the operating space, the greater the risk of deviation.

Both constraints are indispensable.

Flow constraints rely on the flow script and the orchestrator state machine: you can move freely, but only along the drawn paths.

Validation constraints rely on the validate script: each step must pass a hard standard before continuing.
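Both constraints can be enforced in one small function. The sketch below assumes a state machine that mirrors the shuttle diagram above with the audit nodes left out; `TRANSITIONS`, `flow`, and `advance` are hypothetical names, not an existing API.

```python
# Allowed moves, read off the shuttle diagram (audit steps omitted for brevity).
TRANSITIONS = {
    "search": ("research", "execute"),
    "research": ("research", "execute"),
    "execute": ("validate",),
    "validate": ("fix", "flow"),
    "fix": ("execute",),
    "flow": ("search", "grade"),
    "grade": ("output",),
    "output": (),
}


def flow(state):
    """Return the directions available from `state`; it never picks one."""
    if state not in TRANSITIONS:
        raise ValueError(f"unknown state: {state}")
    return TRANSITIONS[state]


def advance(state, requested, validation_passed=False):
    """Let the model move, but only along drawn paths and past passed checks."""
    if requested not in flow(state):
        raise ValueError(f"no path from {state} to {requested}")  # flow constraint
    if state == "validate":
        expected = "flow" if validation_passed else "fix"
        if requested != expected:  # validation constraint
            raise ValueError(f"validation result requires moving to {expected}")
    return requested


print(flow("validate"))
print(advance("validate", "flow", validation_passed=True))
```

The model still chooses between `research` and `execute` on its own; what it cannot do is jump from `execute` straight to `grade`, or walk past a failed check.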

Research by Clark et al. also points to the same conclusion: while giving Agents greater freedom in information acquisition, structured process contracts must be used to constrain boundaries[^1]. How well you balance freedom and constraint is the measure of your design quality.

[^1]: Clark et al., arXiv:2605.19604v1.

The third tier ultimately comes down to one sentence: A Skill is not one file commanding the AI. The orchestrator schedules, scripts provide the floor, and the model actively shuttles between them in the middle. An interactive system.

Between Done and Done Right, One Step of Benchmark

The Skill is written. How do you confirm it's not just "I think it's good"?

Use benchmark data to speak, not feelings.

skill-creator comes with built-in benchmark capabilities. Prepare a set of test inputs for typical scenarios, run a round, and watch a few core metrics: trigger accuracy, task completion rate, whether exception paths are handled. Once the data is out, tune wherever the score drops. After changes, run again and compare with the previous round—better or worse, the numbers say.

This is the same logic as writing unit tests for code. Who dares to deploy code that hasn't passed tests? Don't rush to say a Skill is finished if it hasn't passed a benchmark.

After getting the data, look back at the three tiers to see if anything was missed. Low trigger rate → Go back to Tier 1 and check if the description is clear. Completion rate collapses at a certain stage → Go back to Tier 2 and check the mini-loop and gates for that stage. Large-scale failures in edge scenarios → Go back to Tier 3 and check if the script family coverage is sufficient.
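The bookkeeping behind this loop is simple enough to sketch. The code below is independent of skill-creator and does not reproduce its output format; `CaseResult`, the metric names, and the 0.9 threshold are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    triggered_correctly: bool
    completed: bool
    is_exception_case: bool = False
    exception_handled: bool = False


def rate(hits, total):
    return hits / total if total else 0.0


def summarize(results):
    """Compute the three core metrics for one benchmark round."""
    exceptions = [r for r in results if r.is_exception_case]
    return {
        "trigger_accuracy": rate(sum(r.triggered_correctly for r in results), len(results)),
        "completion_rate": rate(sum(r.completed for r in results), len(results)),
        "exception_handling_rate": rate(sum(r.exception_handled for r in exceptions), len(exceptions)),
    }


def compare(previous, current):
    """Per-metric change between two rounds (positive means better)."""
    return {name: round(current[name] - previous[name], 4) for name in current}


REVISIT = {
    "trigger_accuracy": "Tier 1: check whether the description is clear",
    "completion_rate": "Tier 2: check the mini-loop and gates of the failing stage",
    "exception_handling_rate": "Tier 3: check script family coverage",
}


def diagnose(metrics, threshold=0.9):
    """Map every metric below the (arbitrary) threshold to the tier to revisit."""
    return [REVISIT[name] for name, value in metrics.items() if value < threshold]


round_1 = summarize([
    CaseResult(True, True), CaseResult(True, False), CaseResult(False, False),
    CaseResult(True, True, is_exception_case=True, exception_handled=False),
])
round_2 = summarize([
    CaseResult(True, True), CaseResult(True, True), CaseResult(True, True),
    CaseResult(True, True, is_exception_case=True, exception_handled=True),
])
print(diagnose(round_1))
print(compare(round_1, round_2))
```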

The entire closed loop:

graph LR
    A[Design] --> B[Implement]
    B --> C[Benchmark]
    C --> D{Data Satisfactory?}
    D -->|No| E[Locate Problem]
    E --> A
    D -->|Yes| F[Deliver]

Benchmark is not a formality; it's the engine that makes the loop turn. Without it, your judgment of quality is all based on feeling. With it, the data speaks for you.

Summary

graph LR
    A[Simple Instructions] --> B[Structured Instructions]
    B --> C[Orchestrator + Workflow]
    C --> D[Hard/Soft Layering + Gates]
    D --> E[Lightweight Agent]

    A -.-> S1[Tier 1: Write Correctly]
    B -.-> S1
    C -.-> S2[Tier 2: Write Robustly]
    D -.-> S2
    E -.-> S3[Tier 3: Write Adaptively]

| Stage | Specific Methods | What It Fixes | When It's Enough |
| --- | --- | --- | --- |
| Simple Instructions | Write directly in one file | (nothing yet; this is the baseline) | Small tasks, clear boundaries |
| Structured Instructions | Prioritize key points + bold prefixes, information layering, Mermaid for flows | AI understanding doesn't go off track, doesn't mis-summarize | Many rules but straight flow |
| Orchestrator + Workflow | Stage splitting, mini-loop self-correction | Execution doesn't get chaotic when tasks are complex | Multi-step with sequential dependencies |
| Hard/Soft Layering + Gates | Scripts solidify determinism, instructions keep flexibility, HARD gates guard boundaries | Floor doesn't collapse, boundaries aren't crossed | Many deterministic operations |
| Lightweight Agent | On-demand script family definition (validate as baseline, search/research/audit/grade/flow added as needed), model shuttles between scripts | When presets are insufficient, autonomously search/check/record/evaluate/move | Information is uncertain, must judge while doing |

The three tiers are new layers stacked on old, not new replacing old. Each layer added fixes the previous layer's problem of "usable but not stable enough."

Tier 1 fixes AI understanding deviation. Tier 2 fixes collapse under complexity. Tier 3 fixes insufficient presets.

What's important isn't which tier you reach. It's that you clearly know why you stopped here, and when you should move forward.