跪拜 Guibai

Ponytail Forces AI Coding Agents to Climb a 7-Step 'Laziness Ladder' Before Writing Anything

TL;DR: Ponytail gained 60,000 stars in two weeks and ranked first on GitHub Trending. It turns "write concise code" from a vague suggestion into a 7-level yes/no decision ladder. The project's own benchmark reports 54% less code, 20% lower cost, and a perfect security score. It is fundamentally different from memory-adding projects: Ponytail is a subtraction rule (failures are immediately obvious), while memory is an addition rule (failures go unnoticed). This article dissects its architecture, three designs worth stealing, and scenarios where it shouldn't be used.


60,000 Stars in Two Weeks: What is Ponytail

Recently, a project called Ponytail appeared on GitHub. At the time of writing, it has over 60,000 stars and is number one on the trending list.

Its core function is one thing: stuffing the decision tree of a "lazy senior engineer" into an AI Agent's system prompt. Not vague advice like "write clean code," but a 7-level ladder—before writing any code, the Agent must start climbing from the first level and stop at whichever level it can stand on:

  1. Does this really need to be done? → Skip if not
  2. Does it already exist in the codebase? → Reuse
  3. Can the standard library do it? → Use the standard library
  4. Can native platform capabilities cover it? → Use native
  5. Can an already installed dependency solve it? → Use existing dependency
  6. Can it be done in one line? → One line
  7. Last resort: Write the minimum working code

It sounds like a prompt to "teach an Agent to write concise code." But after going through its 80 files, I found the problem it solves goes much deeper than "conciseness."

Before this, I had reservations about such projects. Projects that add memory to Agents—like Claude Mem, which compresses context into memories at the end of each session and injects them at the next startup—share a common problem: memory rots. Three months ago you said you "liked React Query," three months later you switched teams to tRPC, and the Agent is still writing code based on old preferences. You spend half an hour before realizing it's making a decision you stopped making long ago.

This concern applies to all projects that stuff things into an Agent's system prompt. Ponytail falls into this category—it injects an "anti-over-engineering" rule into the system prompt. My first impression wasn't "awesome," but "here we go again."

But after going through the code, my judgment is: Ponytail and memory-adding projects like Claude Mem are two different things. Adding memory means adding information you don't know when will become outdated; Ponytail adds a decision bias that doesn't expire. But that doesn't mean it can be mindlessly enabled globally.


Core Mechanism: The 7-Level Ladder

Ponytail's core is not a prompt, but a decision tree. Before writing any code, the Agent must start climbing from the first level and stop at whichever level it can stand on:

1. Does this really need to be done? → No: Skip (YAGNI)
2. Does it already exist in the codebase? → Reuse, don't rewrite
3. Can the standard library do it? → Use the standard library
4. Can native platform capabilities cover it? → Use platform capabilities (<input type="date"> instead of a datepicker library)
5. Can an already installed dependency solve it? → Use existing dependency, don't add new ones
6. Can it be done in one line? → One line
7. Last resort: Write the minimum working code

The judgment criteria for each level are extremely specific—"Does the standard library have it?" is a question answerable with a grep. "Can a native <input> be used?" is a question answerable by checking MDN. The Agent doesn't need "taste," it just needs to eliminate options step by step.

The ladder runs after understanding, not instead of it. The SKILL.md explicitly states: the Agent must first read all relevant code, trace the real call chain, and only then climb the ladder. Laziness applies to the solution, not the reading.
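
To make the "stop at the first level that holds" idea concrete, here is a minimal sketch of the ladder as a lookup. The `Rung` type and the `climb` function are my own illustration, not code from the Ponytail repository; in the real project the ladder lives in prompt text, not in a program.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rung:
    level: int
    stop_if: str  # condition under which the Agent stops at this rung
    action: str   # what it does when it stops here


# Rungs 1-6 in the order the article lists them (names are illustrative).
LADDER = [
    Rung(1, "the work is not actually needed", "skip it (YAGNI)"),
    Rung(2, "it already exists in the codebase", "reuse it"),
    Rung(3, "the standard library covers it", "use the standard library"),
    Rung(4, "a native platform capability covers it", "use the platform"),
    Rung(5, "an installed dependency covers it", "use the existing dependency"),
    Rung(6, "it fits in one line", "write one line"),
]
# Rung 7 has no condition: it is what remains when nothing above applies.
LAST_RESORT = Rung(7, "nothing above applies", "write the minimum working code")


def climb(holds: Callable[[Rung], bool]) -> Rung:
    """Return the first rung whose stop condition holds, else the last resort."""
    for rung in LADDER:
        if holds(rung):
            return rung
    return LAST_RESORT


# Example: both the standard library (3) and a one-liner (6) would work,
# so the climb stops at the lower rung.
chosen = climb(lambda rung: rung.level in {3, 6})
print(chosen.level, chosen.action)
```

The order matters more than the individual checks: a one-liner is never considered while a lower rung still answers yes.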


Three Intensity Levels

It's not an "on/off" switch; it's progressive:

| Level | Behavior | When to use |
| --- | --- | --- |
| lite | Writes code normally, casually mentions "there's actually a lazier way" | Daily development, don't want to risk it |
| full (default) | Strictly climbs the ladder, outputs the shortest solution | When code starts to bloat |
| ultra | YAGNI extremism, even questions the requirements themselves | "Nuclear deterrent" before refactoring |

# Same requirement "add a cache," outputs from three levels:
# lite:   "Done, cache added. FYI: lru_cache covers this in one line."
# full:   "@lru_cache(maxsize=1000) on the fetch. Skipped custom cache class."
# ultra:  "No cache until a profiler says so. When it does: @lru_cache."
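
For reference, the "full" answer above corresponds to something like the following. This is a minimal sketch: the `fetch` function and the call counter are stand-ins I made up for an expensive lookup, and only `functools.lru_cache` itself comes from the standard library.

```python
from functools import lru_cache

calls = 0  # counts how often the slow path actually runs


@lru_cache(maxsize=1000)
def fetch(key: str) -> str:
    """Stand-in for an expensive lookup (hypothetical)."""
    global calls
    calls += 1
    return key.upper()


fetch("a")
fetch("a")  # served from the cache, the function body does not run again
fetch("b")

info = fetch.cache_info()
print(calls, info.hits, info.misses)
```

One decorator replaces the custom cache class the Agent would otherwise have written, which is the whole point of stopping at the standard-library rung.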

Architecture: One Rule, 16 Adapters

First, look at the file structure. The entire project has 80 files, but only two core threads run through them:

ponytail/
├── AGENTS.md                    ← Core rules (static version readable by all Agents)
├── skills/ponytail/SKILL.md      ← Full version rules (with lite/full/ultra levels)
├── hooks/
│   ├── ponytail-instructions.js  ← Instruction builder (single source of truth)
│   ├── ponytail-config.js        ← Config parsing (env > config.json > default)
│   ├── ponytail-runtime.js       ← Mode state read/write (flag file + output protocol)
│   ├── ponytail-activate.js      ← SessionStart hook: inject rules + status bar
│   ├── ponytail-subagent.js      ← SubagentStart hook: also inject into sub-Agents
│   └── ponytail-mode-tracker.js  ← UserPromptSubmit hook: track /ponytail commands
├── skills/                       ← 6 skills (ponytail / review / audit / debt / gain / help)
├── .cursor/rules/                ← Static rule files (Cursor / Windsurf / Cline / Kiro / Copilot)
├── .opencode/plugins/            ← OpenCode plugin
├── pi-extension/                 ← Pi extension
├── ponytail-mcp/                 ← MCP server
├── gemini-extension.json         ← Gemini CLI extension
├── plugin.yaml                   ← Hermes Agent plugin
└── __init__.py                   ← Hermes Python plugin implementation


Instruction Builder: The Single Source of Truth

ponytail-instructions.js reads SKILL.md (6400 characters), filters out rule lines that are irrelevant to the current mode, and outputs a uniformly formatted injection text. Claude Code's hooks call it, Pi's extension calls it, OpenCode's plugin calls it, and the MCP server calls it. Change one sentence in SKILL.md, and the change takes effect for all 16 Agents on their next startup.
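
The filtering step can be sketched as follows. This is an illustration under a loud assumption: I am pretending mode-specific lines carry a `[lite]` / `[full]` / `[ultra]` prefix. The real SKILL.md markup is not shown in this article, and the real builder is JavaScript, not Python.

```python
import re

MODES = ("lite", "full", "ultra")
# Hypothetical tag format: a line starting with "[mode]" applies to that mode only.
_TAG = re.compile(r"^\[(lite|full|ultra)\]\s*")


def build_instructions(skill_text: str, mode: str) -> str:
    """Keep untagged lines plus lines tagged for `mode`; drop everything else."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    kept = []
    for line in skill_text.splitlines():
        match = _TAG.match(line)
        if match is None:
            kept.append(line)  # untagged: applies in every mode
        elif match.group(1) == mode:
            kept.append(line[match.end():])  # tagged for this mode: keep, strip the tag
    return "\n".join(kept)


SAMPLE = "\n".join([
    "Climb the ladder before writing code.",
    "[lite] Mention the lazier option, do not enforce it.",
    "[full] Output the shortest working solution.",
    "[ultra] Question the requirement itself.",
])

print(build_instructions(SAMPLE, "full"))
```

Because every adapter calls the same function on the same file, there is only one place where a rule can drift.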

In CI, check-rule-copies.js checks whether all static copies (.cursor/rules/, .clinerules/, .github/copilot-instructions.md) are synchronized with AGENTS.md. It's not "write a rule for each Agent," but "one rule, multiple injection methods."
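
A sync check of this kind is small. The sketch below is my own guess at the idea, not a port of check-rule-copies.js: it assumes a copy counts as "in sync" when it contains the canonical rules verbatim (so a copy may carry its own header), and the `stale_copies` name is hypothetical.

```python
from pathlib import Path
from typing import Iterable, List


def _normalize(text: str) -> str:
    """Ignore trailing whitespace and line-ending differences."""
    return "\n".join(line.rstrip() for line in text.strip().splitlines())


def stale_copies(canonical: Path, copies: Iterable[Path]) -> List[Path]:
    """Return the copies that are missing or no longer contain the canonical rules."""
    rules = _normalize(canonical.read_text(encoding="utf-8"))
    stale = []
    for copy in copies:
        if not copy.is_file():
            stale.append(copy)
        elif rules not in _normalize(copy.read_text(encoding="utf-8")):
            stale.append(copy)
    return stale
```

In CI you would call it with AGENTS.md and the list of static copies, and fail the build when the returned list is not empty.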

Hooks Pipeline

Agents with hook capabilities (Claude Code, Codex) inject at three lifecycle nodes:

User starts session ──→ SessionStart hook ──→ Write flag file (.ponytail-active)
                      ├─ Inject rules into system prompt
                      └─ Detect status bar, prompt if not configured

User sends message ────→ UserPromptSubmit hook ──→ Detect /ponytail lite|full|ultra|off
                           └─ Update flag file

Agent creates subtask ──→ SubagentStart hook ──→ Read flag file, inject rules for sub-Agent

SubagentStart is an easily overlooked but critical link. Without it, the main thread has rules, but the sub-Agent doesn't: the sub-Agent writes code with default behavior, unconstrained by the ladder. Issue #252 specifically documents this problem.
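
The three hooks only cooperate because they share one flag file. Here is a minimal sketch of that contract. Everything in it is an assumption for illustration: the handler names, the idea that the flag file stores the mode as plain text, and the `rules_for` placeholder. The real hooks are JavaScript and their file format is not described in this article.

```python
import re
from pathlib import Path
from typing import Optional

# Matches "/ponytail lite|full|ultra|off" at the start of a user message.
COMMAND = re.compile(r"^/ponytail\s+(lite|full|ultra|off)\b")


def rules_for(mode: str) -> str:
    """Placeholder for the real instruction text."""
    return f"[ponytail rules, mode={mode}]"


def on_session_start(flag: Path, default_mode: str = "full") -> str:
    """Write the flag file and return the text to inject into the system prompt."""
    flag.write_text(default_mode, encoding="utf-8")
    return rules_for(default_mode)


def on_user_prompt(flag: Path, prompt: str) -> None:
    """Update the flag file when the message is a /ponytail command."""
    match = COMMAND.match(prompt.strip())
    if match is None:
        return  # ordinary message: leave the mode untouched
    if match.group(1) == "off":
        flag.unlink(missing_ok=True)  # requires Python 3.8+
    else:
        flag.write_text(match.group(1), encoding="utf-8")


def on_subagent_start(flag: Path) -> Optional[str]:
    """Give a sub-Agent the same rules as the main thread, or nothing if off."""
    if not flag.is_file():
        return None
    return rules_for(flag.read_text(encoding="utf-8").strip())
```

Remove `on_subagent_start` from this picture and you get exactly the gap described above: the flag file says "ultra", but nobody reads it on behalf of the sub-Agent.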

Mode State Machine

┌─────────────┐  /ponytail lite   ┌──────────┐
│     off     │ ────────────────→ │   lite   │
│ (no inject) │ ←──────────────── │ (remind) │
└─────────────┘  stop ponytail    └────┬─────┘
                                       │ /ponytail full
                                       ▼
                                  ┌──────────┐  /ponytail ultra  ┌───────────┐
                                  │   full   │ ────────────────→ │   ultra   │
                                  │ (default)│ ←──────────────── │ (extreme) │
                                  └──────────┘  /ponytail lite   └───────────┘

State is persisted in ~/.claude/.ponytail-active (a flag file), shared by all hook nodes. The PONYTAIL_DEFAULT_MODE environment variable or ~/.config/ponytail/config.json controls the default mode for new sessions.
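
The precedence "environment variable, then config.json, then built-in default" can be sketched like this. The environment variable name comes from the project; the `defaultMode` JSON key, the function name, and the set of accepted values are my assumptions.

```python
import json
from pathlib import Path
from typing import Mapping

VALID_MODES = {"off", "lite", "full", "ultra"}  # assumed set of accepted values


def resolve_default_mode(env: Mapping[str, str], config_path: Path,
                         fallback: str = "full") -> str:
    """Resolve the mode for a new session: env > config.json > default."""
    candidate = env.get("PONYTAIL_DEFAULT_MODE", "").strip().lower()
    if candidate in VALID_MODES:
        return candidate

    try:
        config = json.loads(config_path.read_text(encoding="utf-8"))
    except (OSError, ValueError):  # missing file or invalid JSON
        config = {}
    if isinstance(config, dict):
        # "defaultMode" is a hypothetical key name.
        candidate = str(config.get("defaultMode", "")).strip().lower()
        if candidate in VALID_MODES:
            return candidate

    return fallback
```

In real use you would pass `os.environ` and the path `~/.config/ponytail/config.json`. An invalid value at one level falls through to the next instead of failing the session, which seems the safer choice for something that runs on every startup.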

Skills System

Five auxiliary skills, all of them read-only (none writes to the repository):

| Skill | What it does |
| --- | --- |
| /ponytail-review | Reviews diffs for over-engineering, one finding per line |
| /ponytail-audit | Full repository scan, ranks what should be deleted most |
| /ponytail-debt | Greps all ponytail: comments, generates a tech debt ledger |
| /ponytail-gain | Shows benchmark effect data |
| /ponytail-help | Command quick reference |

The output format of ponytail-review is particularly worth learning—one finding per line, each finding containing only location, label, and replacement:

L12-38: stdlib: 27-line validator class. "@" in email, 1 line.
L4: native: moment.js for one format call. Intl.DateTimeFormat, 0 deps.
L52-71: delete: retry wrapper around idempotent call. Nothing replaces it.
net: -43 lines possible.

No fluff like "This class might be more complex than necessary." Each finding directly tells you what to cut and what to replace it with.
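
A format this rigid is also machine-readable. As a rough sketch, the finding lines above can be parsed with one regular expression. The `Finding` type and the pattern are mine, inferred only from the four sample lines, so treat the grammar as an assumption.

```python
import re
from typing import NamedTuple, Optional


class Finding(NamedTuple):
    start: int  # first line of the flagged span
    end: int    # last line (equals start for single-line findings)
    label: str  # e.g. "stdlib", "native", "delete"
    note: str   # what to cut and what replaces it


# "L12-38: stdlib: ..." or "L4: native: ..."
_FINDING = re.compile(r"^L(\d+)(?:-(\d+))?:\s*([a-z]+):\s*(.+)$")


def parse_finding(line: str) -> Optional[Finding]:
    """Parse one review line; return None for non-finding lines such as the net total."""
    match = _FINDING.match(line.strip())
    if match is None:
        return None
    start = int(match.group(1))
    end = int(match.group(2) or start)
    return Finding(start, end, match.group(3), match.group(4))


print(parse_finding("L52-71: delete: retry wrapper around idempotent call. Nothing replaces it."))
```

If your own review prompts produce output you cannot parse this easily, that is usually a sign the output contains filler.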

Adapter Panorama

| Adapter Layer | Injection Method | Mode Switching | Covered Agents |
| --- | --- | --- | --- |
| Lifecycle Hooks | SessionStart + SubagentStart + UserPromptSubmit | /ponytail command | Claude Code, Codex, Copilot CLI |
| System Prompt Injection | Append per-turn via system.transform | Same as above | OpenCode, Pi |
| MCP Protocol | ponytail_instructions tool + prompt | Pass mode on call | Any Agent supporting MCP |
| Static Rule Files | Copy .cursor/rules/ etc. | None | Cursor, Windsurf, Cline, Kiro, CodeWhale |
| Hermes Plugin | pre_llm_call + gateway rewrite | Same as Hooks | Hermes Agent |

Effect Data

On a real project (FastAPI + React template, Haiku 4.5, 12 feature tasks, each run 4 times and averaged):

| Metric | Result vs. no-skill baseline |
| --- | --- |
| Lines of Code | -54% |
| Token Consumption | -22% |
| Cost | -20% |
| Time | -27% |
| Security | 100% (absolute score, not a delta) |

Compared to other solutions: caveman (minimalist expression) cut 20% of code but tokens and cost actually increased—saying less doesn't mean thinking less; the prompt "YAGNI + one-liners" cut 33% but security dropped to 95%. Ponytail is the only one that was positive across all metrics simultaneously.

Benchmark Honesty

Ponytail's benchmark contains a passage worth noting:

Our early agentic benchmark showed Ponytail had only a ~4% difference, and we almost released it. Later we found it was a bug—Ponytail's SessionStart hook triggered on every arm, including the baseline. So the baseline was actually secretly running Ponytail. Only after fixing it did we see the real -54%.

Moreover, the author himself admitted the old benchmark was biased. The old version was single-shot (one prompt, one completion): the baseline would output a bunch of options and explanations, so the "lines" count mixed code with prose. The new version switched to real Claude Code sessions running on a real repo, with lines counted from git diff. This willingness to overturn one's own old data is uncommon in open-source projects. (Colin Eberhardt pointed out the old benchmark's bias in issue #126, and the Ponytail author accepted the criticism and redid the entire benchmark.)


Three Designs Worth Stealing


1. The Ladder Model: Turning Taste into a Decision Process

Most prompts for writing concise code are vague—"write clean code," "keep it simple." The Agent doesn't know what conciseness means.

Ponytail's solution: Don't give a definition, give a decision tree. The judgment criteria for each level are extremely specific—"Does the standard library have it?" is a question answerable with a grep. "Can a native <input> be used?" is a question answerable by checking MDN. The Agent doesn't need taste, it just needs to eliminate options step by step.

This idea can be transferred to any scenario where you want to constrain Agent behavior. Don't tell the Agent "be secure," tell it "every user input first passes whitelist validation, then parameterized SQL, finally set CSP header." Don't tell the Agent "be testable," tell it "every non-trivial function leaves an assert-style self-check, if there's input write one normal case + one edge case."
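
As an illustration of "give a judgment process" for the security case, here are the first two steps (allow-list validation, then parameterized SQL) using the standard-library sqlite3 module. The table, the column allow-list, and the function name are hypothetical, and the CSP header step is left out because it belongs to the HTTP layer.

```python
import sqlite3
from typing import List

ALLOWED_SORT_COLUMNS = {"name", "created_at"}  # hypothetical allow-list


def find_users(conn: sqlite3.Connection, name_prefix: str, sort_by: str) -> List[str]:
    # Step 1: allow-list check. Column names cannot be bound as SQL parameters,
    # so the only safe source for them is a fixed set.
    if sort_by not in ALLOWED_SORT_COLUMNS:
        raise ValueError(f"unsupported sort column: {sort_by!r}")

    # Step 2: parameterized query. The user-supplied value travels as a bound
    # parameter and is never spliced into the SQL text.
    query = f"SELECT name FROM users WHERE name LIKE ? ORDER BY {sort_by}"
    rows = conn.execute(query, (name_prefix + "%",)).fetchall()
    return [row[0] for row in rows]
```

Each step is a yes/no check the Agent can apply mechanically, which is what makes it a ladder and not a slogan.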

The key is not "giving instructions," but "giving a judgment process." Good Agent rules are not a list of do's/don'ts, but a decision tree that allows the Agent to make a yes/no judgment at each step.

2. ponytail: Comments: Leaving an Audit Trail for Simplification

# ponytail: global lock, per-account locks if throughput matters

The comment must contain two pieces of information: the upper limit of the current simplification + when to upgrade. Three months later, seeing this comment, you know why there's only a global lock here, and what signal indicates it's time to switch to a finer-grained lock.

The key to this convention is the "upgrade trigger condition." The point is not to write more code now, but to let future readers know that this simplification is not a bug, it's a decision, and the decision condition is X. Simplification without comments is debt; simplification with comments is a documented decision that is correct under its stated constraint.

Ponytail also comes with a /ponytail-debt skill that greps the entire repository for ponytail: comments and generates a tech debt ledger. Comments marked no-trigger—i.e., those that only state the simplification but not the upgrade path—have the highest risk of rotting.
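
You can approximate the ledger without the skill. The sketch below is my own: it assumes the comment shape shown earlier ("ponytail: <ceiling>, <upgrade path>") and treats a comment with nothing after the first comma as no-trigger. The real skill's heuristics are not documented in this article.

```python
import re
from typing import List, NamedTuple


class DebtEntry(NamedTuple):
    line_no: int
    ceiling: str  # what the current simplification is
    upgrade: str  # when to upgrade; empty string means "no-trigger"


# Matches "# ponytail: ..." and "// ponytail: ..." comments.
_MARKER = re.compile(r"(?:#|//)\s*ponytail:\s*(.+)$")


def debt_ledger(source: str) -> List[DebtEntry]:
    """Collect every ponytail: comment in `source`, in file order."""
    entries = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        match = _MARKER.search(line)
        if match is None:
            continue
        ceiling, _, upgrade = match.group(1).partition(",")
        entries.append(DebtEntry(line_no, ceiling.strip(), upgrade.strip()))
    return entries


SAMPLE = """\
lock = threading.Lock()  # ponytail: global lock, per-account locks if throughput matters
cache = {}  # ponytail: unbounded dict
retries = 3  # ordinary comment
"""

ledger = debt_ledger(SAMPLE)
no_trigger = [entry for entry in ledger if not entry.upgrade]
print(no_trigger)
```

The second comment is the one to worry about: it records that a shortcut was taken but not what should end it.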

3. The No-Lazy List: Boundaries of What to Cut and Keep

No matter how high you climb the ladder, these things absolutely cannot be cut: input validation at trust boundaries, error handling that prevents data loss, security measures, accessibility, hardware calibration.

The value of the list is not "what items are listed as uncuttable," but turning the boundary of cutting and keeping from gut feeling into categories. The Agent doesn't need to judge "is this validation important," it just needs to check—is this input validation at a trust boundary? Yes, then it cannot be cut.

This list can be customized by project type. For payment systems, add "idempotency checks" and "audit logs." For data migration, add "rollback path" and "data validation." For frontend components, add "keyboard navigation" and "screen reader labels." The longer the list, the more reliable the Agent is on the boundary of "cannot cut."
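
Expressed as data, the list and its per-project extensions look like this. The category strings come from the text above; the project-type keys and the function names are made up for the example.

```python
from typing import FrozenSet, Optional

# Categories that may never be simplified away, regardless of project.
BASE_NO_LAZY = frozenset({
    "trust-boundary input validation",
    "error handling that prevents data loss",
    "security measures",
    "accessibility",
    "hardware calibration",
})

# Hypothetical per-project additions, following the examples in the text.
PROJECT_EXTRAS = {
    "payments": {"idempotency checks", "audit logs"},
    "data-migration": {"rollback path", "data validation"},
    "frontend": {"keyboard navigation", "screen reader labels"},
}


def no_lazy_list(project_type: Optional[str] = None) -> FrozenSet[str]:
    """Base list plus whatever the project type adds."""
    return BASE_NO_LAZY | frozenset(PROJECT_EXTRAS.get(project_type, ()))


def may_cut(category: str, project_type: Optional[str] = None) -> bool:
    """A membership test, not a judgment call."""
    return category not in no_lazy_list(project_type)


print(may_cut("audit logs"), may_cut("audit logs", "payments"))
```

The Agent's question changes from "is this important?" to "is this on the list?", and only the second one has a stable answer.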


Anti-Patterns

| Usage | Consequence |
| --- | --- |
| Regardless of project, install and turn on full | May miss necessary checks in payment/security scenarios |
| Thinking it only affects formatting like Prettier | It affects architectural decisions, not indentation width |
| Treating "one line if possible" as "can swallow errors" | try { ... } catch {} is also one line |
| Simplifying but not leaving ponytail: comments | Three months later, a colleague sees a global lock and thinks you didn't think of it |
| Not handling conflicts with project CLAUDE.md | Agent randomly picks one, you never know which one it picks next time |

Its Fundamental Difference from Memory-Adding Projects

Projects like Claude Mem are addition rules: they add memory, preferences, and history to the Agent. If these "additions" become outdated, the Agent writes code based on wrong premises, and you can hardly tell, because the behavior looks "reasonable" while resting on old preferences you've forgotten.

Ponytail is a subtraction rule—telling the Agent to write less code. The side effect of subtraction rules is "insufficient code"—the code doesn't run, the functionality is incomplete, visible at a glance. Say "add X," and the Agent fills it in.

The failure of addition rules is implicit; the failure of subtraction rules is explicit. This is the fundamental difference.


Practical Advice

Scenarios Where It Should Not Be Used

In error handling scenarios, "one line if possible" can become "swallow if possible." fetch().catch(() => null) is also one line. Ponytail's no-lazy list states "error handling cannot be cut," but the Agent might misjudge in specific scenarios—when writing a payment callback, thinking "this API rarely fails," compressing retry logic into one line.
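
The same trap exists in Python. Below is a sketch of the two versions of a payment-callback parser: the swallowing one-liner, and a still-short version that keeps the failure visible. The function names and the logger name are hypothetical.

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("payments")  # hypothetical logger name


def parse_callback_swallowing(body: str) -> Optional[dict]:
    """Short, but every failure collapses into an indistinguishable None."""
    try:
        return json.loads(body)
    except Exception:
        return None


def parse_callback(body: str) -> dict:
    """A few lines longer: only the expected failure is caught, and it is surfaced."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as exc:
        logger.warning("malformed payment callback: %s", exc)
        raise ValueError("malformed payment callback") from exc
    if not isinstance(payload, dict):
        raise ValueError("payment callback must be a JSON object")
    return payload
```

The first version passes the "can it be shorter?" test and fails the no-lazy list. The caller of the second version is forced to decide what a bad callback means, which is exactly the decision that must not be cut.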

When conflicting with project rules, the Agent randomly picks a side. Your CLAUDE.md says "all public functions must have docstring + unit tests," Ponytail says "no framework needed, one assert is enough." Two contradictory instructions exist simultaneously, the Agent judges priority itself—sometimes it picks correctly, sometimes not.

Per-turn injection consumes context. The mode-filtered injection text is about 1500 tokens. Not much in a 200K context, but in Haiku or small models, context space is taken away each turn to read rules.
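
The arithmetic is easy to check. The 1500-token figure and the 200K window come from the paragraph above; the two smaller window sizes are hypothetical values I picked to show how the share grows.

```python
INJECTION_TOKENS = 1500  # approximate size of the mode-filtered rule text


def context_share(window_tokens: int, injected: int = INJECTION_TOKENS) -> float:
    """Fraction of the context window occupied by the injected rules."""
    return injected / window_tokens


# 200_000 is the window mentioned above; 32_768 and 8_192 are hypothetical
# small-model windows used only for comparison.
for window in (200_000, 32_768, 8_192):
    print(f"{window:>7} tokens: {context_share(window):.2%}")
```

Under 1% of a 200K window is noise, but close to a fifth of a hypothetical 8K window is a real cost paid on every turn.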

Not suitable for security/payment scenarios. The benchmark scored perfectly on 6 security tasks, but those were carefully designed tasks, not real production environments.

5 Designs You Can Take Without Installing

  1. Ladder Model: Add to AGENTS.md "Before writing code, ask: Is this really needed? Does the standard library have it? Can it be one line?"—3 levels are enough.
  2. Simplification Marking Convention: // lean: <upper limit>, <upgrade path>—any simplification must carry a comment.
  3. No-Lazy List: Explicitly write in project rules "the following things absolutely cannot be simplified: input validation, error handling, security, accessibility."
  4. Multi-Level Intensity: Don't just have "on/off" for Agent rules, at least add a level of "remind but don't enforce."
  5. Output Format: Code first, then at most three lines of "what was skipped, when to add it back." If the explanation is longer than the code, delete the explanation.

Ponytail is a good tool, but it is a brake, not a steering wheel.

If you decide to install it, start from lite. The lesson from memory-adding projects applies here too: any operation that stuffs things into an Agent's system prompt should come with a plan for "when will I turn it off."

But the question Ponytail really prompts us to think about is not "should I install it," but a deeper design principle it hints at: In Agent tool design, subtraction is safer than addition. Adding memory, preferences, rules—the failure cost of these "additions" is implicit, you can't find it. Subtracting code, decision space, optional paths—the failure of these "subtractions" is visible at a glance. This isn't to say subtraction has no risks, but that the risks of subtraction fall in plain sight, and you fix them quickly.

The next time you stuff something into an Agent's system prompt, first ask yourself: Is this addition or subtraction? If it fails, will I find out immediately, or three months later?


Project address: https://github.com/DietrichGebert/ponytail


Off-Topic

After writing about Ponytail, I'd like to mention three projects I'm working on, related to the theme of this article—all tools to make Agents work better.

Archify: Generate architecture diagrams, flowcharts, and sequence diagrams in chat. One sentence produces an HTML file switchable between dark/light themes and exportable as PNG/SVG. No design skills needed, no drag-and-drop tools.

Hive: A multi-Agent collaboration workbench in the browser. One prompt dispatches multiple Agents, with persistent identities, shared task graph, and one-click restart for the whole team. The follow-up project to MCO.

MCO: AI Coding Agent orchestration tool. One prompt simultaneously dispatches Claude Code, Codex, Gemini CLI, OpenCode, Qwen, and other Agents to work in parallel. The core philosophy is the same as Ponytail: let the Agent do what it's good at, and the framework does the rest.