跪拜 Guibai
Backend · Programmer · Artificial Intelligence

OpenAI Drops GPT-5.6: Three Tiers, Two New Reasoning Modes, and a Multi-Layer Security Stack

By cxuanAI
Original article: juejin.cn

For Western developers building on OpenAI's API, GPT-5.6 introduces a three-tier lineup with clearly separated pricing, similar in shape to Anthropic's Claude lineup, plus new reasoning modes that could change how complex agent workflows are architected. The multi-layered safety stack, which includes real-time generation pauses and account-level behavioral analysis, signals a new baseline for how frontier models will handle security-sensitive tasks. That directly affects how developers design prompts, handle errors, and manage API costs.

Summary

OpenAI has released GPT-5.6, its most powerful model yet, in three distinct tiers: Sol (preview, top intelligence), Terra (balanced, cost-effective), and Luna (fast and cheap). The naming echoes Claude's Haiku/Sonnet/Opus lineup, but the substance is all OpenAI's own.

On benchmarks, GPT-5.6 Sol set a new record on TerminalBench 2.1, a test of command-line workflow planning and tool coordination. In cybersecurity, Sol approached the level of the Mythos Preview model on ExploitBench while using only about one-third of the tokens. On ExploitGym, all three tiers showed measurable gains in security capability as reasoning effort increased.

Two new capability entry points debut: `max` reasoning effort, which gives Sol more time for deep reasoning, and `ultra` mode, which lets the model spawn sub-agents to decompose complex tasks. OpenAI describes the latter as going beyond the capability boundary of a single agent.

Safety is a major focus. GPT-5.6 introduces real-time cybersecurity and biological abuse classifiers that can pause generation for review by a larger reasoning model. Account-level flagging examines not just individual queries but related conversations and risk signals. The model is trained to refuse cybersecurity assistance when it detects disguised intent or jailbreak attempts. OpenAI explicitly states that no single safety measure is sufficient against determined attackers who constantly change methods.
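To make the "pause for review" idea concrete, here is a minimal local sketch of a gated output stream. It is not OpenAI's implementation, which the source does not describe in detail. The names `gated_stream`, `is_suspicious`, and `review` are hypothetical, and the classifier and reviewer are plain callables standing in for a lightweight classifier and a larger reasoning model.

```python
from typing import Callable, Iterable, Iterator

STOP_NOTICE = "[generation stopped after review]"  # hypothetical placeholder text


def gated_stream(
    chunks: Iterable[str],
    is_suspicious: Callable[[str], bool],
    review: Callable[[str], bool],
) -> Iterator[str]:
    """Yield chunks, pausing on flagged ones until a reviewer decides.

    is_suspicious: cheap per-chunk check (stand-in for a real-time classifier).
    review: slower check on the full text so far (stand-in for a larger model);
            returns True to let generation continue.
    """
    emitted: list[str] = []
    for chunk in chunks:
        if is_suspicious(chunk):
            # Hold the flagged chunk back while the reviewer sees the full context.
            if not review("".join(emitted) + chunk):
                yield STOP_NOTICE
                return
        emitted.append(chunk)
        yield chunk
```

For a client, the practical consequence is that a stream may stall or end early, so code that consumes streamed output should probably treat a truncated response as a normal outcome rather than a transport error.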

Takeaways
GPT-5.6 comes in three tiers: Sol (preview, top intelligence), Terra (balanced, cost-effective), and Luna (fast, cheap).
Sol set a new record on TerminalBench 2.1, a benchmark for command-line workflow planning and tool coordination.
On ExploitBench, Sol approached Mythos Preview's level using about one-third of the tokens.
Two new reasoning modes: `max` (more time for deep reasoning) and `ultra` (spawns sub-agents for parallel task decomposition).
Terra's performance is similar to GPT-5.5 but at half the price; Luna is slightly below GPT-5.5 but cheaper.
Pricing per 1M tokens: Sol $5 input / $30 output; Terra $2.50 input / $15 output; Luna $1 input / $6 output.
Prompt caching now supports explicit cache breakpoints with a minimum 30-minute lifetime; cache writes cost 1.25x the uncached input price, and cache reads get a 90% discount.
Real-time cybersecurity and biological abuse classifiers can pause generation for review by a larger reasoning model.
Account-level flagging examines related conversations and risk signals, not just individual queries.
Under Chromium and Firefox test conditions, Sol did not produce a fully autonomous attack chain and has not crossed OpenAI's Cyber Critical threshold.
Full evaluations are deferred until wider release; only coding, biology, and cybersecurity benchmarks were shared.
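The per-tier prices listed above translate into per-request costs with simple arithmetic. The sketch below assumes the quoted prices apply linearly per token and ignores caching and any other discounts. The lowercase tier keys and the function name `request_cost` are illustrative, not official model identifiers.

```python
# USD per 1M tokens as (input, output), copied from the pricing line above.
# The dictionary keys are illustrative labels, not API model names.
PRICES = {
    "sol": (5.00, 30.00),
    "terra": (2.50, 15.00),
    "luna": (1.00, 6.00),
}


def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at list price, with no caching."""
    input_price, output_price = PRICES[tier]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000
```

For a request with 20,000 input tokens and 2,000 output tokens, this gives $0.16 on Sol, $0.08 on Terra, and $0.032 on Luna.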

Conclusions

The three-tier naming (Sol, Terra, Luna) is a direct competitive response to Anthropic's Claude model lineup, signaling that OpenAI is now segmenting its flagship by cost and capability rather than releasing a single monolithic model.

The `ultra` mode's sub-agent spawning is a significant architectural shift: it moves from single-agent reasoning to multi-agent orchestration within a single API call, which could simplify complex agent workflows for developers.
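For contrast, this is roughly what developers write today when they orchestrate sub-agents on the client side. The source does not say how `ultra` works internally, so the sketch only shows the generic fan-out and fan-in pattern. `decompose` and `run_subagent` are hypothetical stubs, and no OpenAI API is called.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(task: str) -> list[str]:
    """Hypothetical planner: split a task into independent subtasks on ';'."""
    return [part.strip() for part in task.split(";") if part.strip()]


def run_subagent(subtask: str) -> str:
    """Hypothetical worker: stands in for one model call handling one subtask."""
    return f"done: {subtask}"


def orchestrate(task: str, max_workers: int = 4) -> list[str]:
    """Fan subtasks out to parallel workers and collect results in input order."""
    subtasks = decompose(task)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(run_subagent, subtasks))
```

If `ultra` behaves as described, the planning, dispatch, and merge steps would move behind the API, leaving the caller with one request and one response.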

OpenAI's decision to withhold full benchmarks until wider release is a notable departure from past practice, suggesting either caution about unflattering comparisons or a strategic choice to control the narrative around model capability.

The multi-layered safety stack (real-time classifiers, generation pauses, and account-level behavioral analysis) hardens the platform and will likely add friction for developers building security-adjacent or red-teaming applications.

Sol, despite being OpenAI's most capable model, did not produce a fully autonomous attack chain under Chromium and Firefox test conditions. This suggests that the Cyber Critical threshold is a deliberately high bar that may not be crossed for some time.

Concepts & terms
TerminalBench 2.1
A benchmark that evaluates a model's ability to perform command-line workflows, testing planning, iteration, and tool coordination skills.
ExploitBench
A benchmark for evaluating a model's ability to perform vulnerability research and exploitation tasks in cybersecurity.
ExploitGym
A cybersecurity evaluation environment (referenced from arXiv) used to test model capabilities in exploit development and security tasks.
max reasoning effort
A new mode in GPT-5.6 that allocates more computational time to the model for deeper reasoning on complex problems.
ultra mode
A new mode in GPT-5.6 that allows the model to spawn sub-agents to decompose and execute complex tasks in parallel, going beyond single-agent capability.
cache breakpoint
An explicit marker in GPT-5.6's prompt caching system that allows developers to control where caching starts and stops, with a minimum cache lifetime of 30 minutes.
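The caching multipliers quoted in this article (1.25x for writes, 90% off for reads) make it easy to estimate when caching a shared prompt prefix pays off. The sketch below rests on an assumption the source does not spell out: the first request pays the write price for the prefix, and each later request within the cache lifetime pays the read price. The function names are illustrative.

```python
def cached_prefix_cost(
    prefix_tokens: int,
    input_price_per_m: float,
    reads: int,
    write_multiplier: float = 1.25,  # cache write cost relative to uncached input
    read_discount: float = 0.90,     # discount applied to cache reads
) -> float:
    """USD cost of writing a prefix to cache once, then reading it `reads` times."""
    base = prefix_tokens * input_price_per_m / 1_000_000
    return base * write_multiplier + base * (1 - read_discount) * reads


def uncached_prefix_cost(prefix_tokens: int, input_price_per_m: float, reads: int) -> float:
    """USD cost of sending the same prefix uncached on the first and each later request."""
    base = prefix_tokens * input_price_per_m / 1_000_000
    return base * (1 + reads)
```

Under these assumptions a single reuse already pays off: one write plus one read costs 1.35x the base prefix price, against 2x for two uncached requests. A prefix that is never reused within the lifetime costs 25% more than not caching at all.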