OpenAI Drops GPT-5.6: Three Tiers, Two New Reasoning Modes, and a Multi-Layer Security Stack

It's late, but GPT-5.6 is finally here!

Just yesterday, OpenAI released an instant version of GPT-5.5, and I thought 5.6 might not come.

But on June 26th, GPT-5.6 did ship, which is basically consistent with the rumored timeline.

But first, the names: Sol, Terra, Luna. This naming scheme looks just like Claude's Haiku, Sonnet, and Opus! Please don't mess around with these gimmicks.

GPT-5.6 Sol is a limited-preview model and currently OpenAI's most powerful; GPT-5.6 Terra is the more balanced option; GPT-5.6 Luna is the cheap, fast model built for heavy, sustained use.

GPT-5.6 also strengthens protection against high-risk activities, sensitive cyber requests, and repeated abuse, and OpenAI says it spent a lot of time hunting for vulnerabilities and stress-testing the model to make it more resistant to cyber attacks.

Terra performs roughly on par with GPT-5.5 at half the price, and the official statement says Luna is second only to GPT-5.5 in capability while offering powerful features at a low cost.

In other words, Terra is essentially a cheaper GPT-5.5, while Luna is weaker than 5.5 but wins on price.

One sentence to tell them apart: Sol for the intelligence ceiling, Terra for everyday cost-effectiveness, Luna for throughput and cost.

(I mean, this model called Luna, why does it have to share a name with the infamous Luna coin that rug-pulled the crypto community? It's really unlucky...)

However, due to policy issues, it isn't widely available yet, but we can at least look at what the model can do.

OpenAI didn't release all the evaluations at once this time. The announcement says that full evaluations will be released when the model is more widely available.

For now, it has published results in three areas: coding, biology, and cybersecurity.

In coding, GPT-5.6 Sol set a new high for OpenAI's own models on TerminalBench 2.1. This benchmark tests command-line workflows that require planning, iteration, and tool coordination.

[Figure: TerminalBench 2.1]

On TerminalBench 2.1, both effort levels of GPT-5.6 surpassed Mythos 5, and the cost-effective Terra version even surpassed Fable 5.

GPT-5.6 introduces two new options: max reasoning effort and ultra mode.

max means giving Sol more time for deep reasoning.

ultra is more like letting the model call sub-agents, breaking complex tasks into multiple sub-tasks to run in parallel.

OpenAI describes this as going beyond what a single agent can do.
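To make the ultra idea concrete, here is a toy sketch of the fan-out pattern: split a task, run the pieces concurrently, collect the results. This is not OpenAI's implementation. The functions `split_task`, `run_subagent`, and `orchestrate` are names I made up, and the "sub-agent" is a stub that just echoes its input instead of calling a model.

```python
import asyncio


def split_task(task: str) -> list[str]:
    # Hypothetical splitter: treat ';' as the boundary between sub-tasks.
    return [part.strip() for part in task.split(";") if part.strip()]


async def run_subagent(subtask: str) -> str:
    # Stand-in for a real model call; yields control once, then echoes.
    await asyncio.sleep(0)
    return f"done: {subtask}"


async def run_in_parallel(task: str) -> list[str]:
    # Fan out one coroutine per sub-task and wait for all of them.
    # asyncio.gather returns results in the order the coroutines were passed.
    subtasks = split_task(task)
    return list(await asyncio.gather(*(run_subagent(s) for s in subtasks)))


def orchestrate(task: str) -> list[str]:
    # Synchronous entry point for scripts.
    return asyncio.run(run_in_parallel(task))


if __name__ == "__main__":
    print(orchestrate("read the logs; write a patch; run the tests"))
```

In a real system the interesting part is the splitting and the merging of results, which this sketch deliberately leaves trivial.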

In biology, OpenAI mentioned GeneBench v1, which evaluates long-horizon genomics and quantitative biology analysis tasks.

OpenAI claims that GPT-5.6 Sol outperforms GPT-5.5 there while using fewer tokens.

The cybersecurity direction is even more straightforward.

OpenAI says GPT-5.6 Sol is their most capable model for cybersecurity. It shows significant gains in both performance and efficiency on long-horizon security tasks such as vulnerability research and exploitation.

On the ExploitBench leaderboard, GPT-5.6 Sol approached the level of Mythos Preview while using about one third of the tokens.

[Figure: ExploitBench]

On ExploitGym, Sol, Terra, and Luna all show significant improvements in cybersecurity capability as reasoning effort increases.

[Figure: ExploitGym]

I need to clarify this point.

The announcement and the system card both emphasize that Sol is mainly meant for defense: helping network defenders use more appropriate tools to find vulnerabilities, develop patches, and harden their systems.

(I wonder what the difference is between this GPT-5.6 and the previously released Daybreak. Maybe they are the same model?)

Under Chromium and Firefox test conditions, it did not produce a fully autonomous attack chain, so according to OpenAI's current framework it hasn't crossed the Cyber Critical threshold.

Still, its capability is already strong enough to require a phased release.

OpenAI's release repeatedly emphasizes its layered cybersecurity stack. OpenAI says that no single security measure can stop attackers who are determined to jailbreak the model and keep changing their methods.

Therefore, they adopted multi-layered security measures. Different models have different specific configurations, and they conducted stress tests against real-world attacks.

These measures include: protection mechanisms built into model training, real-time checks during generation, account-level signals, differentiated access control, monitoring, enforcement, and continuous testing.

GPT-5.6 is trained to refuse harmful cybersecurity requests, especially when users disguise their intent or attempt to jailbreak the model. This layer defines the model's security boundary: what it can and cannot help with.

During generation, real-time cybersecurity and biological-abuse classifiers also monitor the output. If the risk is judged to be high, generation may be paused while a larger reasoning model reviews the entire conversation and its context. If that review judges the output to be risky, it is blocked.

Do you think that's it? No. Once you trigger a risk assessment, you are likely to be flagged, and a flag in turn triggers an account-level review of related conversations and risk signals.

That is, the system doesn't just look at what you asked in this round; it also looks at related conversations and risk signals to determine whether you are using it legitimately or continuously attempting malicious use.
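The layering is easier to see as control flow. Below is a toy sketch of the three steps described above: a cheap real-time check, an escalation to a reviewer that sees the whole conversation, and a counter on the account. Everything here is my own assumption for illustration: the threshold, the keyword "classifier", the two-risky-turns rule, and all the names. The real classifiers are models, not keyword matches.

```python
from dataclasses import dataclass

RISK_THRESHOLD = 0.8  # hypothetical cutoff, not a published number


@dataclass
class Account:
    flags: int = 0  # account-level risk signal accumulated across conversations


def classifier_score(text: str) -> float:
    # Toy stand-in for a real-time classifier.
    return 0.9 if "exploit" in text.lower() else 0.1


def reviewer_blocks(conversation: list[str]) -> bool:
    # Toy stand-in for the larger reviewing model: it looks at every turn,
    # and blocks only when risky content shows up repeatedly.
    risky_turns = sum(classifier_score(t) >= RISK_THRESHOLD for t in conversation)
    return risky_turns >= 2


def handle_output(draft: str, conversation: list[str], account: Account) -> str:
    # Layer 1: the real-time check on the text being generated.
    if classifier_score(draft) < RISK_THRESHOLD:
        return "allow"
    # Layer 3 bookkeeping: a triggered assessment leaves a flag on the account.
    account.flags += 1
    # Layer 2: pause and review the full conversation, not just this turn.
    return "block" if reviewer_blocks(conversation + [draft]) else "allow"
```

Note that in this sketch a single flagged turn can still be allowed after review; what changes the outcome is the pattern across turns, which matches the point about looking beyond the current round.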

I guess everyone is concerned about pricing:

GPT-5.6 pricing per 1 million tokens:

Sol: $5 input, $0.50 cached input, $30 output.

Terra: $2.50 input, $0.25 cached input, $15 output.

Luna: $1 input, $0.10 cached input, $6 output.
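If you want to plug your own numbers in, the price list above turns into a few lines of Python. The prices are the ones quoted in this post; the lowercase tier keys and the `request_cost` helper are my own naming, not official API model identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    # USD per 1 million tokens, as quoted in this post.
    input: float
    cached_input: float
    output: float


# Keys are illustrative labels, not official model names.
PRICES = {
    "sol": Price(input=5.00, cached_input=0.50, output=30.00),
    "terra": Price(input=2.50, cached_input=0.25, output=15.00),
    "luna": Price(input=1.00, cached_input=0.10, output=6.00),
}


def request_cost(tier: str, input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Return the USD cost of one request; cached_tokens is the part of
    input_tokens that was served from the prompt cache."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    p = PRICES[tier]
    uncached = input_tokens - cached_tokens
    total = (uncached * p.input
             + cached_tokens * p.cached_input
             + output_tokens * p.output)
    return total / 1_000_000


if __name__ == "__main__":
    for tier in PRICES:
        print(tier, round(request_cost(tier, 100_000, 10_000), 4))
```

This ignores cache-write surcharges, so treat it as a first approximation for requests that either miss the cache entirely or only read from it.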

Additionally, GPT-5.6 introduces more predictable prompt caching, supporting explicit cache breakpoints, with a minimum cache lifetime of 30 minutes.

Cache writes are billed at 1.25x the uncached input price, while cache reads keep the 90% discount off the uncached input price.
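Is the 1.25x write surcharge worth it? A quick back-of-the-envelope sketch, assuming the shared prefix is written once, every later request hits the cache within its lifetime, and only the prefix tokens are counted. The function name and the simplified billing model are my assumptions, not OpenAI's billing logic.

```python
CACHE_WRITE_MULTIPLIER = 1.25  # cache write vs. uncached input, from the post
CACHE_READ_DISCOUNT = 0.90     # cache read discount, from the post


def prefix_cost(prefix_tokens: int, requests: int,
                input_price_per_m: float, use_cache: bool) -> float:
    """USD spent on a shared prompt prefix across `requests` calls."""
    if requests <= 0:
        return 0.0
    base = prefix_tokens * input_price_per_m / 1_000_000  # one uncached pass
    if not use_cache:
        return base * requests
    write = base * CACHE_WRITE_MULTIPLIER                  # first request writes
    reads = base * (1 - CACHE_READ_DISCOUNT) * (requests - 1)  # the rest read
    return write + reads


if __name__ == "__main__":
    # 100k-token prefix at Sol's $5 per 1M input tokens.
    for n in (1, 2, 10):
        print(n,
              round(prefix_cost(100_000, n, 5.0, use_cache=False), 3),
              round(prefix_cost(100_000, n, 5.0, use_cache=True), 3))
```

Under these assumptions a single request costs more with caching (you pay the surcharge and never read), but from the second request onward caching is already cheaper: 1.25 + 0.1 = 1.35 units versus 2 units without it.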

I am cxuan, someone who has been tinkering with AI tools and Agent workflows for a long time. For more real usage records, post-mortems, and tool collections, you can search for the WeChat public account 'cxuanAI'.

Reference links:

OpenAI: Previewing GPT-5.6 Sol: a next-generation model
OpenAI Deployment Safety Hub: GPT-5.6 Preview System Card
arXiv: ExploitGym