Claude Fable 5 Returned After an 18-Day Government Ban—With a Stricter Safety Classifier That Breaks Normal Code

By AGIPlayer

A Category C jailbreak, one that does not reach core harmful behaviors, has now been enough to take a frontier model offline worldwide for 18 days. Developers paying premium prices for Fable 5 will see some ordinary coding and debugging requests silently downgraded to a weaker model, and prompts that ask the model to explain its reasoning are blocked by a dedicated classifier and rerouted the same way.

Summary

Anthropic launched Claude Fable 5 on June 9 as its most capable public model—1M-token context, 128k max output, always-on adaptive thinking, priced at $10/$50. Three days later, the US Commerce Department issued an export control directive after Amazon researchers found a narrow jailbreak that could produce exploit code for known vulnerabilities. Anthropic classified the jailbreak as Category C (minor) and noted that weaker models exhibited the same behavior, but the company shut down access globally because it could not verify user nationality in real time.

The model returned on July 1 with a new safety classifier trained to block the reported bypass method in over 99% of cases. Blocked requests fall back to Opus 4.8, and Anthropic introduced a Fallback Credit to refund the cache-write price difference. Four classifiers now monitor cyber, bio, frontier LLM, and reasoning-extraction risks—the last one rejects any prompt that asks the model to show its thinking, forcing developers to strip reflection instructions from their system prompts and skills.
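
Because blocked requests are served by Opus 4.8 instead of Fable 5, it can help to measure how often that happens and which classifier is responsible. The following is a minimal sketch over an application-side request log. The record shape (`served_model`, `classifier`) and the model identifier string are assumptions made for illustration, and the article does not say whether the API reports which classifier fired. Only the four classifier names come from the article.

```python
from collections import Counter

# Hypothetical identifier for an application-side log; not an official API name.
PRIMARY_MODEL = "fable-5"

# Classifier names as listed in the article.
CLASSIFIERS = {"cyber", "bio", "frontier_llm", "reasoning_extraction"}


def summarize_fallbacks(records):
    """Return (fallback_rate, fallbacks_per_classifier) for a list of log records."""
    # A record counts as a fallback when a model other than the primary served it.
    fallbacks = [r for r in records if r["served_model"] != PRIMARY_MODEL]
    per_classifier = Counter(
        r.get("classifier") if r.get("classifier") in CLASSIFIERS else "unknown"
        for r in fallbacks
    )
    rate = len(fallbacks) / len(records) if records else 0.0
    return rate, per_classifier


# Hypothetical log entries, invented for the example.
log = [
    {"served_model": "fable-5"},
    {"served_model": "opus-4.8", "classifier": "reasoning_extraction"},
    {"served_model": "fable-5"},
    {"served_model": "opus-4.8", "classifier": "cyber"},
]

rate, per_classifier = summarize_fallbacks(log)
print(f"fallback rate: {rate:.0%}")
print(dict(per_classifier))
```

A high share of `reasoning_extraction` fallbacks would point at prompt wording rather than at the task itself.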

Alongside the redeployment, Anthropic published a jailbreak severity framework co-developed with Amazon, Microsoft, and Google, and made four commitments to the US government including pre-release model access for government partners. The same-day release of Sonnet 5 at $3/$15, with looser safety restrictions and near-Opus-4.8 performance, offers a pragmatic alternative for developers who hit Fable 5's classifier walls.

Takeaways
Claude Fable 5 launched June 9, 2026, with 1M-token context, 128k max output, always-on adaptive thinking, and $10/$50 pricing—double Opus 4.8's cost.
The US Commerce Department issued an export control directive on June 12 after Amazon researchers demonstrated a narrow jailbreak that generated exploit code for known, simple vulnerabilities; Anthropic responded by shutting down access globally.
Anthropic classified the jailbreak as Category C (minor)—breaching the safety margin without reaching core harmful behaviors—and noted that Opus 4.8, GPT-5.5, Kimi K2.7, and Haiku 4.5 all exhibited the same capability.
Service was restored globally on July 1 after 18 days, with a new safety classifier that blocks the reported bypass in over 99% of cases and falls back to Opus 4.8 on flagged requests.
Four classifiers now gate Fable 5 requests: cyber, bio, frontier_llm, and reasoning_extraction—the last one rejects any prompt asking the model to show or explain its thinking.
Anthropic introduced a Fallback Credit that refunds the cache-write price difference when a request is downgraded to Opus 4.8.
Sonnet 5 launched on the same day as the redeployment at $3/$15, with performance near Opus 4.8 and significantly looser cybersecurity restrictions; it scored 0.0% on Firefox exploit generation.
Mythos 5 is the same model as Fable 5 but without safety classifiers, available only to vetted cybersecurity defenders through Project Glasswing, which has already surfaced over 10,000 high-risk or critical vulnerabilities.
Anthropic committed to four conditions for the US government: pre-release model access, rapid information sharing on jailbreaks, dedicated joint research resources, and pushing common industry safety standards.
A jailbreak severity framework co-authored with Amazon, Microsoft, and Google scores jailbreaks on four dimensions (capability gain, breadth, weaponization difficulty, and discoverability) and assigns one of three levels: C (minor), D (narrow harmful), or E (general).

Conclusions

Fable 5 and Mythos 5 are the same model with different safety postures: the publicly available version is deliberately crippled by classifiers, while the version without classifiers is reserved for vetted cybersecurity defenders. The result is a two-tier access regime baked into the product line.

The reasoning_extraction classifier creates a direct conflict with prompt engineering best practices: developers who built skills around chain-of-thought transparency must now strip those instructions or get silently downgraded to a weaker model.
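
One way to find the instructions that need stripping is to scan system prompts and skills for reflection phrases before sending them. This is a rough sketch, not a reproduction of the classifier: the two patterns are the example phrases quoted in this article, the real trigger criteria are not public, and the helper names are made up for illustration.

```python
import re

# Example trigger phrases quoted in the article; the actual classifier's
# criteria are not public, so treat this list as an assumption.
REFLECTION_PATTERNS = [
    r"show your thinking",
    r"explain your reasoning",
]

_PATTERN = re.compile("|".join(REFLECTION_PATTERNS), re.IGNORECASE)


def find_reflection_lines(prompt: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that contain a reflection phrase."""
    return [
        (number, line)
        for number, line in enumerate(prompt.splitlines(), start=1)
        if _PATTERN.search(line)
    ]


def strip_reflection_lines(prompt: str) -> str:
    """Drop every line that contains a reflection phrase; keep the rest in order."""
    return "\n".join(
        line for line in prompt.splitlines() if not _PATTERN.search(line)
    )


system_prompt = """You are a code reviewer.
Always explain your reasoning step by step.
Report bugs as a bulleted list."""

print(find_reflection_lines(system_prompt))
print(strip_reflection_lines(system_prompt))
```

Reporting the matching lines first, rather than deleting them blindly, leaves room to rewrite an instruction by hand when dropping it would change the skill's behavior.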

Anthropic's Fallback Credit only refunds the cache-write price delta, not the full request cost, so developers still pay a premium for Fable 5 even when their work gets routed to Opus 4.8.
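
The arithmetic behind that point can be sketched as follows. The article does not give cache-write prices or the exact credit formula, so the per-million-token prices below are hypothetical placeholders and the function is only one plausible reading of "refunds the cache-write price difference".

```python
from decimal import Decimal

TOKENS_PER_MILLION = Decimal(1_000_000)


def fallback_credit(cache_write_tokens: int,
                    fable_price_per_mtok: Decimal,
                    opus_price_per_mtok: Decimal) -> Decimal:
    """Credit = cache-write tokens x (Fable 5 price - Opus 4.8 price).

    Prices are in USD per million tokens. Only the cache-write delta is
    credited; input and output tokens are not part of this calculation.
    """
    delta = fable_price_per_mtok - opus_price_per_mtok
    return Decimal(cache_write_tokens) / TOKENS_PER_MILLION * delta


# HYPOTHETICAL cache-write prices; these numbers are not from the article.
credit = fallback_credit(
    cache_write_tokens=200_000,
    fable_price_per_mtok=Decimal("12.50"),
    opus_price_per_mtok=Decimal("6.25"),
)
print(f"credit: ${credit:.2f}")
```

Under this reading, a request with no cache writes earns no credit at all, however it is billed otherwise.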

The government's trigger for a global shutdown was oral evidence of a narrow, non-generalizable jailbreak—a standard far below what would normally justify infrastructure-level intervention, and one that sets a precedent for future regulatory action against any frontier model.

Sonnet 5's simultaneous release looks less like a coincidence and more like a hedge: a cheaper, less restricted model that keeps developers on the platform if Fable 5's classifiers prove too aggressive for daily work.

The jailbreak severity framework is the most durable outcome of this crisis—four major AI companies now share a common taxonomy, which could accelerate standardized safety evaluations across the industry, but the framework still lacks concrete response thresholds and enforcement mechanisms.
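
To make the shared taxonomy concrete, the three levels and four dimensions can be written down as a small data structure. This is an illustrative sketch only: the class and field names are invented here, the dimensions are stored as free-text notes because the article gives no scoring scale, and no rule for deriving a level from the dimensions is implemented because none is described.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    """The three severity levels named in the article."""
    C = "minor"
    D = "narrow harmful"
    E = "general"


@dataclass(frozen=True)
class JailbreakAssessment:
    # The four dimensions named in the article, kept as free-text notes
    # because no numeric scale is given.
    capability_gain: str
    breadth: str
    weaponization_difficulty: str
    discoverability: str
    # Assigned by a reviewer; the article gives no mapping from dimensions to level.
    severity: Severity


# The Fable 5 incident, using only details reported in the article.
fable_incident = JailbreakAssessment(
    capability_gain="exploit code for known vulnerabilities; weaker models did the same",
    breadth="narrow",
    weaponization_difficulty="not stated in the article",
    discoverability="not stated in the article",
    severity=Severity.C,
)

print(fable_incident.severity.name, fable_incident.severity.value)
```

The missing piece the paragraph above points to would be a function from an assessment to a required response, which the framework does not yet define.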

Pre-release government access to frontier models blurs the line between safety review and de facto licensing; if this becomes normalized, model release timelines become subject to political and bureaucratic calendars rather than engineering readiness.

Concepts & terms
Category C Jailbreak (Minor)
A jailbreak that breaches a model's safety margin but does not unlock core harmful behaviors. In the new four-company framework, it sits below Category D (narrow harmful) and Category E (general jailbreak). The Fable 5 incident was classified at this level.
Reasoning Extraction Classifier
A safety classifier that blocks prompts asking the model to repeat or display its internal chain of thought. It triggers on instructions like 'show your thinking' or 'explain your reasoning,' forcing a fallback to a weaker model.
Fallback Credit
Anthropic's refund mechanism that returns the cache-write price difference when a Fable 5 request is rejected by a safety classifier and retried on Opus 4.8. It does not refund the full request cost.
Project Glasswing
An invite-only program giving vetted cybersecurity defenders access to Mythos 5—the uncensored version of Fable 5—for defensive vulnerability research. It has already surfaced over 10,000 high-risk or critical bugs, including a 27-year-old OpenBSD flaw.
Jailbreak Severity Framework
A classification system proposed by Anthropic, Amazon, Microsoft, and Google that scores jailbreaks across four dimensions (capability gain, breadth, weaponization difficulty, discoverability) and assigns three severity levels: C (minor), D (narrow harmful), E (general).
Source: juejin.cn