Every AI Skill You Add Is a Tax on the Context Window

By 莪_幻尘
Original article on juejin.cn

Context-window economics are the hard ceiling on agent capability. Every Skill description loaded at startup is working memory the model cannot use for reasoning, tool output, or conversation history. Teams scaling past 50 Skills without a lazy-loading or semantic-routing strategy will hit degradation that looks like model failure but is actually a budget problem.

Summary

A Skill system's index layer is a global, unconditional tax on the context window. At roughly 100 tokens per description, 500 Skills cost about 50K tokens, or 25% of a 200K window, before any real work begins. The damage shows up as forgotten context, missed specifications, and sluggish responses in long conversations. The model has not degraded; its working memory has shrunk.
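The arithmetic behind those percentages can be sketched in a few lines. The 100-token description cost and 200K window are the figures quoted above; the function name `index_share` is only illustrative.

```python
# Figures quoted in the summary; treat them as rough averages, not measurements.
TOKENS_PER_DESCRIPTION = 100
CONTEXT_WINDOW = 200_000


def index_share(skill_count: int,
                tokens_per_description: int = TOKENS_PER_DESCRIPTION,
                context_window: int = CONTEXT_WINDOW) -> float:
    """Fraction of the context window consumed by the Skill index at startup."""
    return skill_count * tokens_per_description / context_window


for n in (50, 100, 500):
    print(f"{n:>3} Skills -> {index_share(n):.1%} of the window")
```

The cost is linear in the number of registered Skills, which is why it goes unnoticed at 17 Skills and dominates at 500.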

In the short term, rewriting descriptions as 30-word routing triggers and adding exclusion words cuts false matches by 67%. In the mid term, a tiered index loads only a domain-level summary at startup and expands into full Skill lists on demand, saving 75% of the index overhead at 100 Skills. Beyond 500 Skills, keyword matching breaks down: trigger collisions multiply and every query requires an O(n) scan. At that scale, a semantic routing engine vectorizes user intent and runs ANN search against a Skill embedding library, while a federation model lets each domain hub operate as an independent microservice.
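A minimal sketch of the tiered index follows, assuming a simple in-memory layout. The `Domain` and `TieredIndex` names, and the example domains and Skills, are hypothetical and not taken from the original article.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    name: str
    summary: str                                # always in context at startup
    skills: dict = field(default_factory=dict)  # Skill name -> short description


class TieredIndex:
    def __init__(self, domains):
        self._domains = {d.name: d for d in domains}

    def startup_context(self):
        """One summary line per domain, loaded unconditionally."""
        return [f"{d.name}: {d.summary}" for d in self._domains.values()]

    def expand(self, domain_name):
        """A single domain's Skill list, loaded only when that domain is needed."""
        domain = self._domains[domain_name]
        return [f"{name}: {desc}" for name, desc in domain.skills.items()]


# Hypothetical example data.
index = TieredIndex([
    Domain("frontend", "UI, styling, and build tooling",
           {"css-lint": "lint stylesheets", "bundle-check": "inspect bundle size"}),
    Domain("data", "queries, migrations, and exports",
           {"db-migrate": "apply schema migrations"}),
])

print(index.startup_context())  # two summary lines, regardless of Skill count
print(index.expand("data"))     # only the data domain's Skills
```

The startup cost now scales with the number of domains rather than the number of Skills, which is where the savings come from.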

A real-world audit of 17 Skills cut total token consumption by 40% and dropped the route mismatch rate from 12% to 4% using compression and exclusion intents alone. The companion `skill-token-audit.sh` script produces a health report in one command.
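The contents of `skill-token-audit.sh` are not reproduced here. As a rough illustration of what such an audit might check, the sketch below estimates tokens per description and flags descriptions over the 30-word budget. The 4-characters-per-token heuristic and every name in the code are assumptions, not the script's actual logic.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def audit(descriptions: dict, word_budget: int = 30) -> dict:
    """Summarize the estimated index cost and list descriptions over the word budget."""
    rows = []
    for name, desc in descriptions.items():
        words = len(desc.split())
        rows.append({
            "skill": name,
            "words": words,
            "est_tokens": estimate_tokens(desc),
            "over_budget": words > word_budget,
        })
    # Most expensive descriptions first, so the worst offenders lead the report.
    rows.sort(key=lambda r: r["est_tokens"], reverse=True)
    return {
        "total_est_tokens": sum(r["est_tokens"] for r in rows),
        "over_budget": [r["skill"] for r in rows if r["over_budget"]],
        "rows": rows,
    }


# Hypothetical descriptions for illustration only.
report = audit({
    "verbose-skill": "word " * 40,
    "tight-skill": "short trigger",
})
print(report["total_est_tokens"], report["over_budget"])
```

For real numbers, the estimate should be replaced with the tokenizer of the model actually in use.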

Takeaways
Skill descriptions are a global, unconditional tax on the context window—every registered Skill consumes tokens at startup whether or not it is used in the current session.
At 100 Skills, the index layer alone consumes roughly 5% of a 200K context window; at 500 Skills, it reaches 25% and user input starts getting truncated.
Rewriting descriptions as 30-word routing triggers instead of functional documentation saves roughly 60% of tokens per Skill.
Adding exclude_intents to each Skill reduces false-trigger collisions by 67% when the keyword pool grows large.
A tiered index with domain-level lazy loading cuts index overhead by 75% at 100 Skills by loading only a domain summary at startup and expanding on demand.
Keyword matching hits practical limits beyond 500 Skills: collision probability rises sharply as the keyword pool grows, and every query pays an O(n) traversal cost.
Semantic routing with vector embeddings and ANN search routes correctly even when user phrasing shares zero words with trigger keywords.
A Skill federation architecture treats each domain hub as an independent microservice, with a meta-hub that holds only domain-level metadata and acts as a pure routing gateway.
An audit of 17 real Skills reduced total token consumption by 40% and dropped the route mismatch rate from 12% to 4% using compression and exclusion intents alone.

Conclusions

The token-tax problem is architectural, not cosmetic. Compressing descriptions buys time, but the real scaling break comes from changing when Skills are loaded, not how they are described.

Exclusion words are an underused primitive. Most routing systems optimize for recall; adding a deny-list before the match step is a cheap way to raise precision without touching the model.
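A deny-list check before the match step might look like the sketch below. The `exclude_intents` field name comes from the takeaways; the `Skill` class, the `route` function, and the example Skills are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    triggers: frozenset
    exclude_intents: frozenset = frozenset()


def route(query: str, skills: list) -> list:
    """Return the names of Skills whose triggers match and whose deny-list does not."""
    words = set(query.lower().split())
    matches = []
    for skill in skills:
        if words & skill.exclude_intents:
            continue  # deny-list runs first: an excluded word vetoes the Skill
        if words & skill.triggers:
            matches.append(skill.name)
    return matches


# Hypothetical Skills: both react to "release", but only one should handle rollbacks.
SKILLS = [
    Skill("deploy-app", frozenset({"deploy", "release"}), frozenset({"rollback"})),
    Skill("rollback-release", frozenset({"rollback"})),
]

print(route("rollback the last release", SKILLS))
print(route("deploy the app", SKILLS))
```

Without the exclusion set, the first query would match both Skills, because "release" is a trigger for `deploy-app`.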

Anthropic's progressive-disclosure principle and Perplexity's 'Every Skill is a Tax' maxim converge on the same engineering conclusion: the cheapest token is the one never loaded.

The jump from keyword matching to semantic search is not just a scale fix—it changes the failure mode from silent misrouting to graceful degradation with confidence scores.
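The confidence-score behavior can be illustrated with a toy router. Two loud assumptions: `embed` is a hand-built concept lexicon standing in for a learned embedding model, and the search is a brute-force cosine scan rather than a real ANN index. All names and the 0.5 threshold are hypothetical.

```python
import math

# Toy stand-in for an embedding model: each known word maps to a concept axis.
CONCEPTS = {
    "deploy": 0, "ship": 0, "release": 0, "publish": 0,
    "test": 1, "verify": 1, "check": 1,
}


def embed(text: str) -> list:
    vec = [0.0, 0.0]
    for word in text.lower().split():
        axis = CONCEPTS.get(word)
        if axis is not None:
            vec[axis] += 1.0
    return vec


def cosine(a, b) -> float:
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def route(query: str, skill_vectors: dict, threshold: float = 0.5):
    """Return (skill_name, score), or (None, score) when no Skill is confident enough."""
    q = embed(query)
    best_name, best_score = None, 0.0
    for name, vec in skill_vectors.items():  # brute force; an ANN index would replace this loop
        score = cosine(q, vec)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score  # low confidence: decline instead of misrouting
    return best_name, best_score


SKILL_VECTORS = {
    "deploy-app": embed("deploy release"),
    "run-tests": embed("test verify"),
}

print(route("ship it to production", SKILL_VECTORS))  # shares no word with "deploy release"
print(route("what is the weather", SKILL_VECTORS))    # nothing relevant: declines
```

The score gives the caller something to act on: below the threshold it can ask a clarifying question or fall back to a default, rather than silently picking the wrong Skill.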

Concepts & terms
Token Tax
The fixed token cost every registered Skill imposes on the context window at agent startup, regardless of whether the Skill is used in the current session. Coined by Perplexity as 'Every Skill is a Tax.'
Tiered Index (L0/L1/L2)
A three-layer loading strategy: L0 holds domain-level summaries always in context; L1 loads a domain's full Skill list on demand; L2 loads a Skill's complete content only after it is triggered.
Progressive Disclosure
An Anthropic design principle stating that agents should receive the minimum context initially and expand it only when needed, directly informing domain-level lazy loading.
Semantic Routing Engine
Replaces keyword-based Skill matching with vector embeddings and approximate nearest neighbor (ANN) search, routing user intent to the correct Skill even when trigger words differ.
Skill Federation
An architecture for 500+ Skills where each domain hub operates as an independent microservice with its own registry and routing, coordinated by a lightweight meta-hub that holds only domain-level metadata.
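The federation idea can be sketched with in-process objects standing in for the microservices. In a real deployment each `DomainHub` would be a separate service behind a network call; the class names, keyword sets, and Skills here are hypothetical, and keyword overlap is used for routing only to keep the example short.

```python
class DomainHub:
    """Stands in for an independent service with its own Skill registry and routing."""

    def __init__(self, name: str, skills: dict):
        self.name = name
        self._skills = skills  # Skill name -> set of trigger keywords

    def route(self, query: str):
        words = set(query.lower().split())
        for skill, triggers in self._skills.items():
            if words & triggers:
                return skill
        return None


class MetaHub:
    """Pure routing gateway: holds domain-level metadata only, never individual Skills."""

    def __init__(self):
        self._domains = {}  # domain name -> (domain keywords, hub)

    def register(self, hub: DomainHub, keywords):
        self._domains[hub.name] = (set(keywords), hub)

    def route(self, query: str):
        words = set(query.lower().split())
        for name, (keywords, hub) in self._domains.items():
            if words & keywords:
                return name, hub.route(query)  # delegate Skill selection to the domain
        return None, None


# Hypothetical domains and Skills.
meta = MetaHub()
meta.register(DomainHub("frontend", {"css-lint": {"css", "style"}}), {"css", "ui", "style"})
meta.register(DomainHub("backend", {"db-migrate": {"migration", "schema"}}), {"schema", "api", "migration"})

print(meta.route("fix the css style"))
print(meta.route("run the schema migration"))
```

Because the meta-hub stores only domain keywords, adding Skills inside a domain does not change what the gateway has to hold.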