
DeepSeek V4 Gets an 85% Speed Boost Without Touching the Model Weights

By 蝎子莱莱爱打怪 · Jun 28, 2026

Read original on juejin.cn

Model quality among frontier labs is converging; the next differentiator is inference cost. DSpark shows how to raise throughput on the same hardware, by 51–400% in its reported benchmarks, without touching model weights, turning an engineering optimization into a direct pricing or margin lever.

Summary

DSpark is not a new model. It is the same DeepSeek V4 checkpoint paired with a speculative decoding module that dramatically accelerates text generation. The technique uses a hybrid drafter—a fast parallel backbone corrected by a tiny Markov head—to guess multiple tokens at once, then lets the full model verify them in a single forward pass. A confidence scheduler dynamically adjusts how many tokens to guess, avoiding wasted computation on low-confidence predictions.

Benchmarks against DeepSeek's earlier MTP scheme show single-user generation speed rising 60–85% and throughput climbing 51–400%, all while producing mathematically identical output. The paper, training framework (DeepSpec, MIT license), and model weights are fully open, making the approach reusable even with competing models like Qwen3.

When top-tier model capabilities converge, the competitive edge shifts to inference economics. A 2× throughput gain from speculative decoding converts directly into margin: a provider can halve prices to capture market share, or hold prices and halve serving cost to boost profit.

Takeaways

— DSpark is a speculative decoding module attached to the existing DeepSeek V4 checkpoint; no model weights were changed or retrained.

— Single-user generation speed improved by 60–85%, and server-side throughput rose by 51–400% compared to DeepSeek's prior MTP approach.

— Output is mathematically identical to the original model's: zero accuracy loss, guaranteed because the full model verifies every drafted token.

— The drafter uses a semi-autoregressive design: a parallel backbone for speed plus a lightweight Markov head that corrects local context blind spots.

— A confidence scheduler dynamically varies the number of tokens guessed per step, avoiding wasted compute on low-confidence predictions.

— The DeepSpec codebase, DSpark weights, and paper are fully open under an MIT license; the framework can train drafters for other models, including Qwen3.

— Community tests suggest cost reductions of 5× to 7.6×, though those figures are unofficial; the official data confirms throughput gains that directly lower per-request cost.
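
The link between a throughput gain and per-request cost in the takeaways above is simple arithmetic, sketched below. The function name `relative_cost_per_token` is hypothetical, and the sketch assumes hardware spend per hour stays fixed, so cost per token is inversely proportional to tokens served per hour. Real deployments have other cost components that this ignores.

```python
def relative_cost_per_token(throughput_gain_pct: float) -> float:
    """Cost per token after the gain, as a fraction of the baseline cost.

    Assumption: hardware spend per hour is unchanged, so cost per token
    is inversely proportional to tokens served per hour.
    """
    if throughput_gain_pct <= -100.0:
        raise ValueError("throughput gain must be greater than -100%")
    return 1.0 / (1.0 + throughput_gain_pct / 100.0)


# The two ends of the reported 51-400% range, plus a plain doubling.
for gain in (51.0, 100.0, 400.0):
    cost = relative_cost_per_token(gain)
    print(f"+{gain:.0f}% throughput -> {cost:.2f}x baseline cost per token")
```

Under that assumption, a 51% gain cuts cost per token by about a third, a doubling halves it, and a 400% gain (5× throughput) brings it down to one fifth.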

Conclusions

Speculative decoding turns a serial bottleneck into a parallel verification problem, and DSpark's contribution is making the drafter both fast and accurate enough to capture most of the theoretical gain.

Confidence-based dynamic scheduling is a simple idea with outsized impact—it recovers compute that older fixed-length draft methods simply waste.

Open-sourcing the entire training framework, not just the weights, signals a strategic bet: commoditize inference acceleration so the ecosystem builds on DeepSeek's stack.

Frontier model capabilities are converging; the labs that win on cost per token will dictate API pricing and developer adoption.

Engineering improvements to inference can matter more than a 0.5-point benchmark gain because they compound across every request, directly affecting margins and user experience.
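
To make "most of the theoretical gain" from the first conclusion concrete, the sketch below estimates how many tokens one verification pass commits on average. It uses a common simplification from the speculative decoding literature, not a figure from the DSpark paper: each drafted token is accepted independently with probability `alpha`, and every pass commits the accepted prefix plus one token from the target model. The function name and the sample values are illustrative.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens committed per target-model pass.

    alpha: assumed independent probability that a drafted token is accepted.
    k:     number of tokens drafted per pass.
    The count is 1 + alpha + alpha^2 + ... + alpha^k (a geometric series).
    """
    if not 0.0 <= alpha <= 1.0 or k < 0:
        raise ValueError("need 0 <= alpha <= 1 and k >= 0")
    if alpha == 1.0:
        return float(k + 1)  # perfect drafter: the ceiling of k + 1 tokens
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


for alpha in (0.5, 0.8, 0.95):
    print(alpha, round(expected_tokens_per_pass(alpha, k=4), 3))
```

With four drafted tokens the ceiling is five tokens per pass. In this simplified model a drafter accepted half the time reaches fewer than two, which is why drafter accuracy matters as much as drafter speed.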

Concepts & terms

Speculative Decoding

A technique where a small, fast draft model guesses several future tokens, and the large target model verifies them all in one parallel forward pass. The longest prefix of guesses that matches the target model's own choices is kept; at the first mismatch the remaining guesses are discarded and the target model's token is used instead. Output quality is preserved exactly while latency drops.
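
A single verification step can be sketched as follows, for the greedy-decoding case. This is a minimal illustration, not DSpark's implementation: `verify_draft` and `toy_target` are hypothetical names, and the target is queried once per position for readability, where a real system would score all positions in one batched forward pass.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
NextToken = Callable[[Sequence[int]], int]


def verify_draft(target_next: NextToken, context: List[int], draft: List[int]) -> List[int]:
    """Return the tokens to commit after checking one draft against the target."""
    committed: List[int] = []
    for guess in draft:
        expected = target_next(context + committed)
        if guess != expected:
            # First mismatch: drop the rest of the draft, keep the target's token.
            committed.append(expected)
            return committed
        committed.append(guess)
    # Whole draft accepted: the verification pass also yields one extra token.
    committed.append(target_next(context + committed))
    return committed


def toy_target(prefix: Sequence[int]) -> int:
    """Toy target model: the next token is (last token + 1) mod 10."""
    return (prefix[-1] + 1) % 10


print(verify_draft(toy_target, [3], [4, 5, 9, 0]))  # [4, 5, 6]
print(verify_draft(toy_target, [3], [4, 5]))        # [4, 5, 6]
```

In the first call the third guess is wrong, so two guesses are kept and the target supplies the third token. In the second call both guesses are right and the target adds one more.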

Semi-Autoregressive Drafting

A hybrid generation method where a parallel backbone predicts multiple tokens at once for speed, while a lightweight sequential component (like a Markov head) refines those predictions using local context to improve accuracy.

Markov Head

A minimal prediction layer that conditions only on the immediately preceding token. In DSpark, it corrects the parallel drafter's output by injecting local sequential information without adding significant latency.
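
The two definitions above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the four-word vocabulary, the logit values, and the idea of implementing the Markov head as an additive bias table keyed on the previous token. The source does not describe DSpark's head at this level of detail.

```python
from typing import Dict, List, Optional

VOCAB = ["San", "New", "Francisco", "York"]


def argmax(xs: List[float]) -> int:
    return max(range(len(xs)), key=xs.__getitem__)


def draft_parallel(backbone_logits: List[List[float]]) -> List[int]:
    """Pick each position independently, as a purely parallel drafter would."""
    return [argmax(row) for row in backbone_logits]


def draft_with_markov_head(
    backbone_logits: List[List[float]],
    bigram_bias: Dict[int, List[float]],
    prev_token: Optional[int] = None,
) -> List[int]:
    """Correct each position using only the token drafted just before it."""
    draft: List[int] = []
    for row in backbone_logits:
        bias = bigram_bias.get(prev_token, [0.0] * len(row))
        corrected = [logit + b for logit, b in zip(row, bias)]
        prev_token = argmax(corrected)  # cheap sequential step: a table lookup
        draft.append(prev_token)
    return draft


# Made-up logits for two future positions, produced in one parallel pass.
logits = [[1.0, 1.2, 0.0, 0.0], [0.0, 0.0, 1.1, 1.0]]
# Made-up bias: "Francisco" is likelier after "San", "York" after "New".
bias = {0: [0.0, 0.0, 1.0, 0.0], 1: [0.0, 0.0, 0.0, 1.0]}

print([VOCAB[t] for t in draft_parallel(logits)])                # ['New', 'Francisco']
print([VOCAB[t] for t in draft_with_markov_head(logits, bias)])  # ['New', 'York']
```

The parallel pass alone picks each position in isolation and produces an incoherent pair. Conditioning on the previous drafted token fixes it at the cost of one lookup per position.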

Confidence Scheduling

A dynamic strategy that decides how many tokens to speculate based on the drafter's self-assessed confidence. High-confidence steps guess more tokens; low-confidence steps guess fewer, reducing wasted computation on likely-incorrect predictions.
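
One way such a scheduler could work is sketched below. The stopping rule is an assumption, not the rule from the DSpark paper: the draft is extended while the estimated chance that the whole draft is accepted (the product of per-token confidences) stays above a threshold. The names `scheduled_draft_length`, `threshold`, and `max_len` are hypothetical.

```python
from typing import Sequence


def scheduled_draft_length(
    confidences: Sequence[float],
    threshold: float = 0.5,
    max_len: int = 8,
) -> int:
    """Number of draft tokens to send for verification.

    confidences[i] is the drafter's own estimate that draft token i is right.
    Returns 0 when even the first token is too uncertain, in which case the
    step falls back to ordinary one-token decoding.
    """
    joint = 1.0
    length = 0
    for c in confidences[:max_len]:
        joint *= c
        if joint < threshold:
            break  # further tokens would most likely be thrown away
        length += 1
    return length


print(scheduled_draft_length([0.95, 0.9, 0.9, 0.4, 0.9]))  # 3
print(scheduled_draft_length([0.3, 0.99]))                 # 0
print(scheduled_draft_length([0.99] * 10))                 # 8
```

A fixed-length scheme would verify the same number of tokens in all three cases, spending target-model compute on guesses that are unlikely to survive.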

Zero Accuracy Loss

A guarantee that the output of a speculative decoding system is exactly identical to what the original model would have generated autoregressively, because the target model itself performs final verification on every token.
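
The guarantee can be checked on a toy example for greedy decoding. In the sketch below every committed token is the target model's own choice, so the drafter only affects how many tokens each pass yields, never which tokens. The toy models and all function names are hypothetical; sampling-based decoding needs a probabilistic acceptance rule that this sketch does not cover.

```python
import random
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]


def autoregressive(target_next: NextToken, prompt: List[int], n_tokens: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative(target_next: NextToken, draft_next: NextToken,
                prompt: List[int], n_tokens: int, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        draft: List[int] = []
        for _ in range(k):  # drafter guesses k tokens ahead
            draft.append(draft_next(out + draft))
        for guess in draft:  # target checks each guess in order
            expected = target_next(out)
            out.append(expected)  # always commit the target's token
            if guess != expected:
                break  # remaining guesses are discarded
        else:
            out.append(target_next(out))  # all accepted: one extra token
    return out[len(prompt):len(prompt) + n_tokens]


def toy_target(prefix: Sequence[int]) -> int:
    return (3 * prefix[-1] + prefix[-2] + 1) % 17


rng = random.Random(0)


def noisy_draft(prefix: Sequence[int]) -> int:
    """Drafter that agrees with the target about 70% of the time."""
    return toy_target(prefix) if rng.random() < 0.7 else rng.randrange(17)


reference = autoregressive(toy_target, [1, 2], 50)
print(speculative(toy_target, noisy_draft, [1, 2], 50) == reference)     # True
print(speculative(toy_target, lambda p: -1, [1, 2], 50) == reference)    # True
```

Even a drafter that is always wrong leaves the output unchanged; it just removes the speedup.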
