
DeepSeek V4 Gets an 85% Speed Boost Without Touching the Model Weights

Systems are improved, not designed from scratch. What DeepSeek improved this time isn't the model, but the way the model is run.

On June 27th, DeepSeek and Peking University jointly released something called DSpark.

Overnight, a single phrase spread across the internet—"85% speed increase."

Honestly, I was stunned when I saw it. But after scrolling through discussions, I found an awkward situation: 80% of people are sharing it, but few can clearly explain what this thing actually does. Some say it's a new model, some say it's a new chip, and others say DeepSeek has created another GPT killer.

None of that is correct.

In this article, I'll explain DSpark thoroughly, once and for all. No beating around the bush. After reading, you'll understand: it's neither a new model nor some kind of chip, but a "turbocharger" that DeepSeek installed on its own large model. The engine hasn't changed, but it runs much faster, and the output stays the same, word for word.

Let's break it down in plain language.

1. A Reality Check First: DSpark Is Not a New Model

This is the biggest misconception and the easiest place to stumble, so I have to put it right at the beginning.

Open the DSpark model card on Hugging Face, and the first sentence DeepSeek wrote is:

DeepSeek-V4-Pro-DSpark is NOT a new model. It is the same checkpoint with an additional speculative decoding module attached.

In other words: the weights are the original V4 checkpoint, and the only addition is a speculative decoding module attached on top.

Here's an analogy. Your car is still the same car; the engine and parts haven't changed. But the repair shop added a turbocharger—the intake method changed, combustion efficiency went up, and naturally, the speed increased. You are still you, the car is still the car, only the way it runs has changed.

That's exactly what DSpark does. The brain of DeepSeek-V4—with its 1.6 trillion parameters and ability to handle 1 million tokens of context—remains untouched. A layer of "speculative decoding" is wrapped around it, making it spit out words faster and with higher throughput.

So stop obsessing over "how much stronger DSpark is than V4"—it IS V4, just running differently. As for V4's own hybrid attention mechanism (which reduces the computational and memory overhead of ultra-long contexts to one-tenth of its predecessor V3.2, according to DeepSeek's official data), that's V4's own achievement and has nothing to do with DSpark.

So what exactly did this DSpark "shell" change? That's the truly interesting part.

2. Speculative Decoding: Turning Large Models from "Word-by-Word" to "Wholesale"

To understand DSpark, you first need to understand its core technology—Speculative Decoding.

This term sounds intimidating, but the principle is actually quite simple.

How Do Large Models Normally Generate Text?

When you chat with a large model and it replies with a paragraph, what looks like sentences flowing out is actually generated one word at a time (strictly speaking, one token at a time; depending on the tokenizer, a token is roughly a word, a fragment of a word, or a Chinese character or so).

Every time it generates a word, it has to take all the preceding content into account, calculate probabilities, and pick the next most likely word. One by one, sequentially.

This is called autoregressive generation. The slowness comes precisely from this—it's a complete slowpoke, looking back after every single word.
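To make the "one word at a time" loop concrete, here is a minimal sketch. The function `toy_next_token` is a made-up stand-in for a large model (it is just a deterministic arithmetic rule, not anything from DeepSeek), and the names are illustrative assumptions.

```python
from typing import Callable, List


def toy_next_token(context: List[int]) -> int:
    """Hypothetical stand-in for a large model: picks the next token id from the full context."""
    return (sum(context) * 31 + len(context)) % 50


def generate(next_token: Callable[[List[int]], int], prompt: List[int], n_new: int) -> List[int]:
    """Autoregressive generation: one model call per new token, strictly in sequence."""
    tokens = list(prompt)
    for _ in range(n_new):
        # Each step depends on every token produced so far,
        # so step t+1 cannot start before step t has finished.
        tokens.append(next_token(tokens))
    return tokens


output = generate(toy_next_token, [1, 2, 3], n_new=5)
```

Five new tokens cost five sequential model calls here. That serial dependency is the bottleneck the rest of the article is about.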

How Does Speculative Decoding Solve This?

The idea is particularly clever: Don't let the large model guess from scratch itself; send a "sidekick" to guess in batches first.

Here's how it works:

First, let a small, fast "draft model" guess several words in one go (e.g., 4 words in a row).
Then, let the real large model verify these 4 words all at once: accept every guess up to the first wrong one, replace that wrong word with the large model's own choice, and throw away whatever was drafted after it.

Think about it, what's the key here? The key is that the verification step is parallel. The large model, which originally had to honestly calculate 4 times one by one, now verifies all 4 words in a single forward pass. The correctly guessed parts are essentially free speed gains.
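The draft-then-verify loop can be sketched in a few lines. This is a toy under stated assumptions, not DSpark's code: both "models" are invented deterministic functions, decoding is greedy, and the verification loop stands in for what a real system would do in one batched forward pass.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # context -> greedy next token id


def target_model(ctx: List[int]) -> int:
    """Hypothetical 'large' model (a toy deterministic rule)."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_model(ctx: List[int]) -> int:
    """Hypothetical 'small' model: agrees with the target except when len(ctx) is a multiple of 3."""
    guess = target_model(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 50


def plain_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    tokens = list(prompt)
    goal = len(prompt) + n_new
    verify_passes = 0
    while len(tokens) < goal:
        # 1) The drafter guesses k tokens in a row.
        drafted: List[int] = []
        for _ in range(k):
            drafted.append(draft(tokens + drafted))
        # 2) The target checks all positions. A real model scores them in one
        #    batched forward pass; this loop only emulates that single pass.
        verify_passes += 1
        for i in range(k + 1):
            expected = target(tokens + drafted[:i])
            if i < k and drafted[i] == expected:
                continue  # guess i is correct, keep checking
            # First wrong guess (or all k accepted): keep the accepted prefix
            # plus the target's own token, and drop the rest of the draft.
            tokens += drafted[:i] + [expected]
            break
    return tokens[:goal], verify_passes


spec_tokens, passes = speculative_decode(target_model, draft_model, [1, 2, 3], n_new=12)
plain_tokens = plain_decode(target_model, [1, 2, 3], n_new=12)
```

In this toy, `spec_tokens` matches `plain_tokens` exactly, while `passes` is well below the 12 sequential calls that plain decoding needs. Every pass yields at least one token, because the target always contributes its own choice at the first mismatch.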

Another analogy to make it instantly clear:

An editor-in-chief writing an article has to meticulously craft every word, slow but accurate. Now, let an intern quickly draft a version first; the editor scans it, passes the correct parts directly, and corrects the wrong ones with a red pen. The quality of the final draft is still at the editor-in-chief's level. But the drafting speed is much faster.

This is the entire essence of speculative decoding.

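There's also a particularly solid mathematical guarantee here: because the large model itself is doing the final verification, the output is equivalent to what it would have written word-by-word on its own. With greedy decoding, the text is identical, token for token; with random sampling, the accept-and-correct rule is built so that the output follows exactly the same probability distribution as the large model's. This is "zero accuracy loss": not "almost the same," but mathematically provable equivalence.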
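For the sampling case, the guarantee can be checked by hand for a single position. The sketch below uses the standard accept/reject rule from the speculative decoding literature (accept a drafted token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized leftover max(0, p - q)). Whether DSpark uses exactly this rule is my assumption; the article only states the zero-loss result. The probabilities are made-up numbers.

```python
from fractions import Fraction as F
from typing import List


def output_distribution(p: List[F], q: List[F]) -> List[F]:
    """Exact distribution of the emitted token for one position.

    p: target (large) model probabilities, q: drafter probabilities.
    """
    # Probability that token i is drafted AND accepted: q_i * min(1, p_i / q_i) = min(p_i, q_i).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1 - sum(accepted)
    # On rejection, resample from the part of p that q failed to cover.
    residual = [max(pi - qi, F(0)) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0:  # drafter already matches the target exactly
        return accepted
    return [a + reject_mass * r / total for a, r in zip(accepted, residual)]


p = [F(1, 2), F(3, 10), F(1, 5)]   # hypothetical target distribution
q = [F(1, 5), F(1, 5), F(3, 5)]    # hypothetical (quite wrong) drafter distribution
result = output_distribution(p, q)
```

Even with a drafter this far off, `result` equals `p` exactly. A bad drafter costs speed (more rejections), never correctness.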

Simply put: Speculative decoding trades "smartly calculating a bit more" for "overall much faster speed," with the result being exactly the same, word for word.

So the question is—if it's this good, why wasn't it used earlier? Because the old method had an unsolvable deadlock. This deadlock is precisely what DSpark aims to eliminate.

fig_1 Comparison of Speculative Decoding Principles: Left side is traditional word-by-word generation, calculating one by one serially; right side is speculative decoding, where a small draft model guesses in batches first, and the large model verifies them all in parallel at once, accepting correct ones and correcting wrong ones.

3. DSpark's Two Blades: Fast, Accurate, and Cost-Saving

Traditional speculative decoding is stuck in a dilemma.

To make the draft fast, you must use a parallel method to guess in one go. But parallel has a flaw: each word is guessed without a full view of its neighbors, so accuracy is poor. More wrong guesses mean the large model has to overturn them and recalculate, wasting computational power.

To make the draft accurate, you must use a serial method to guess carefully one by one—but serial is slow. The whole point was to speed things up, but the draft model itself becomes a bottleneck.

Fast isn't accurate, accurate isn't fast. This is the fatal flaw of old-style speculative decoding; the computational power spent on wrong guesses is completely wasted.
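A bit of back-of-the-envelope math shows how much accuracy matters. Assume, purely for illustration, that each drafted word is accepted independently with the same probability (real acceptance rates vary by position and content). Then the expected number of words produced per large-model pass is 1 + a + a^2 + ... + a^k.

```python
def expected_tokens_per_pass(accept_rate: float, k: int) -> float:
    """Expected tokens emitted per verification pass when k tokens are drafted.

    Assumes each drafted token is accepted independently with probability
    accept_rate. The leading 1.0 is the token the large model always
    contributes itself.
    """
    return 1.0 + sum(accept_rate ** i for i in range(1, k + 1))


def expected_wasted_drafts(accept_rate: float, k: int) -> float:
    """Expected number of drafted tokens that get thrown away per pass."""
    return k - sum(accept_rate ** i for i in range(1, k + 1))


sloppy = expected_tokens_per_pass(0.5, k=4)    # fast but inaccurate drafter
careful = expected_tokens_per_pass(0.9, k=4)   # accurate drafter
```

With a 50% acceptance rate and 4 drafted words, each pass yields about 1.94 words and roughly 3 of the 4 drafts are wasted. At 90%, the same pass yields about 4.1 words. That gap is the whole reason drafter accuracy is worth fighting for.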

DSpark wields two blades to cut through this knot.

First Blade: Semi-Autoregressive Draft + Markov Head

DSpark's drafter is a "hybrid": the main body is a parallel backbone, responsible for speed—guessing several words at once. But parallel tends to miss context, so what's the solution?

It attaches a super small Markov head onto the backbone. This head is very lightweight, only looking at the previous word, and uses it to fine-tune the probability of the current word. It's like equipping the careless parallel backbone with a "correction assistant" specifically to fill in the context it missed.

Fast and accurate, that's how it's achieved—the backbone handles speed, the Markov head handles accuracy.
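Here is one way to picture the backbone-plus-Markov-head idea. Everything in this sketch is my own illustration: the tiny vocabulary, the logit tables, and the choice to add the head's output as a bias are all assumptions, since the article only says the head looks at the previous word and adjusts the probability of the current one.

```python
from typing import Dict, List

VOCAB = ["new", "york", "city", "jersey"]


def backbone_logits(k: int) -> List[List[float]]:
    """Hypothetical parallel backbone: a single call scores k positions at once,
    but each position is scored without seeing the other drafted words."""
    table = [
        [2.0, 0.1, 0.3, 0.2],  # position 0: clearly "new"
        [0.1, 1.0, 0.2, 1.1],  # position 1: "york" vs "jersey" is a near tie
        [0.2, 0.1, 1.0, 0.9],  # position 2: "city" vs "jersey" is a near tie
    ]
    return table[:k]


# Hypothetical Markov head: a bias over the current word given ONLY the previous word.
MARKOV_BIAS: Dict[str, List[float]] = {
    "new": [-1.0, 0.5, 0.0, 0.0],
    "york": [0.0, -1.0, 0.5, -0.5],
}
NO_BIAS = [0.0] * len(VOCAB)


def argmax(scores: List[float]) -> int:
    return max(range(len(scores)), key=scores.__getitem__)


def draft(k: int, prev_word: str, use_markov_head: bool = True) -> List[str]:
    logits = backbone_logits(k)          # the expensive part: one parallel call
    words: List[str] = []
    for position_logits in logits:       # the cheap part: a lookup and an add per word
        bias = MARKOV_BIAS.get(prev_word, NO_BIAS) if use_markov_head else NO_BIAS
        scores = [l + b for l, b in zip(position_logits, bias)]
        prev_word = VOCAB[argmax(scores)]
        words.append(prev_word)
    return words


with_head = draft(3, prev_word="<s>")
without_head = draft(3, prev_word="<s>", use_markov_head=False)
```

In this toy, the backbone alone drafts "new jersey city", while the version with the head drafts "new york city". The sequential part is only a table lookup per word, so it adds very little latency on top of the single parallel call.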

Second Blade: Confidence Scheduling

This blade is even more ingenious.

Every time the drafter guesses a word, it estimates a "confidence score" for itself. DSpark lets it dynamically decide how many words to guess this time based on this score:

For passages it's good at, with high scores, guess more words, verifying a large batch at once—a huge win.
For uncertain, error-prone spots, guess fewer words, avoiding guessing a bunch of wrong ones and having to overturn them all.

You know, this is very much like how an expert operates—assess yourself before guessing, go for more when confident, touch less when unsure, rather than blindly applying equal effort.

The old method was "mindlessly guessing a fixed number," wasting all the computation on wrong guesses. With DSpark's scheduling, most of that wasted computation is recovered.
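A minimal sketch of what such a scheduler could look like, with loud caveats: the stopping rule (keep drafting while the product of confidence scores stays above a threshold), the threshold value, and the cap are all my assumptions. The article only says the draft length is chosen dynamically from the drafter's own confidence.

```python
from typing import List


def schedule_draft_length(confidences: List[float],
                          threshold: float = 0.5,
                          max_len: int = 8) -> int:
    """Decide how many drafted tokens to send to the large model.

    confidences[i] is the drafter's own estimate that drafted token i is right.
    Drafting continues while the estimated chance that ALL tokens so far are
    accepted stays at or above `threshold`. Returning 0 means: skip drafting
    for this step and let the large model generate directly.
    """
    joint = 1.0
    length = 0
    for confidence in confidences[:max_len]:
        joint *= confidence
        if joint < threshold:
            break
        length += 1
    return length


easy_passage = schedule_draft_length([0.95, 0.9, 0.9, 0.85, 0.8])
tricky_spot = schedule_draft_length([0.6, 0.5, 0.9, 0.9])
very_easy = schedule_draft_length([0.99] * 20)
```

On the easy passage all five drafted tokens go out for verification, on the tricky spot only one does, and the very easy case is capped at `max_len`. A fixed-length drafter would have sent the same number in all three cases.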

Combined, these two blades are the entire magic of DSpark: making the drafter fast, accurate, and not wasteful.

fig_2 DSpark's Two Blades: Top is the semi-autoregressive draft (parallel backbone + Markov head for accuracy), bottom is confidence scheduling (guess more when confident, less when not), recovering the waste of old-style speculative decoding.

4. Real Results: How Much Faster and How Accurate

The data is right here, all from DeepSeek's official paper, benchmarked against their own previous MTP scheme (MTP is an early version of speculative decoding that could only draft one extra word per step; DSpark is its upgraded evolution):

Single-user generation speed: Improved by 60% to 85%. When you chat with DeepSeek, its replies arrive noticeably faster.
Throughput: Improved by 51% to 400%. This is a server-side metric: the same GPU can serve roughly 1.5x to 5x as many simultaneous users. For high-concurrency API services, this is real money.
Accuracy: Zero loss. As mentioned earlier, the output is mathematically guaranteed to match the original model.

Regarding cost, I have to be honest. There are community test posts on Reddit claiming "5x cheaper, 7.6x cheaper." These numbers were measured by the community, not officially announced by DeepSeek. I'm noting this clearly first, so don't go around boasting and get proven wrong. But the direction is certain: as throughput goes up on the same hardware, the cost per request goes down; that's just arithmetic.
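The throughput-to-cost relationship is simple enough to write down. This sketch assumes the GPU bill per hour stays fixed, the hardware is kept busy, and GPU time dominates the cost of serving; none of that comes from DeepSeek's paper, it is just the arithmetic behind "more throughput means cheaper requests."

```python
def relative_cost_per_token(throughput_gain_pct: float) -> float:
    """Cost per token relative to the baseline (1.0 = unchanged).

    Assumes a fixed hourly hardware cost, so cost per token is inversely
    proportional to tokens served per hour.
    """
    return 1.0 / (1.0 + throughput_gain_pct / 100.0)


low_end = relative_cost_per_token(51)    # official lower bound on throughput gain
high_end = relative_cost_per_token(400)  # official upper bound on throughput gain
```

Under these assumptions, a 51% throughput gain brings cost per token down to about 66% of the baseline, and a 400% gain brings it down to 20%, which is 5x cheaper. Real pricing depends on much more than GPU time, so treat this as a direction, not a quote.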

Another point easily misread: DSpark does not replace the previous MTP; the two are complementary. MTP is the foundation, and DSpark is the skyscraper built on top of it.

In one sentence: The model hasn't changed a single bit, speed has multiplied, and the results are completely unchanged.

fig_3 Performance Comparison: MTP-1 baseline vs DSpark, single-user speed +60-85%, throughput +51-400%, zero accuracy loss.

5. Why This Deserves Your Attention

After seeing the data, you might think: faster is faster, but what does it have to do with me?

It has a lot to do with you. DSpark is essentially an inference engineering task. It doesn't touch the model's capabilities themselves; it tackles "how to run the same model more cheaply."

You might think this isn't that sexy—no new model, no broken records. But think about it from another angle:

The capabilities of large models are rapidly converging. Among the top few players, the gap at the model level is getting smaller and smaller. When the models themselves can't pull away from each other, what's the next battlefield?

It's about "who can run an equally strong model more cheaply."

I saw a pretty harsh comment: Every 2x speedup in speculative decoding directly translates into profit margin. Twice as fast means you can either cut prices in half to grab market share, or cut costs in half to quietly make money—much more practical than fighting over another 0.5 points on a benchmark score.

And the most ruthless part of DeepSeek's move this time isn't the technology—it's the stance. The paper, the codebase (DeepSpec, MIT license), and the model weights are all open-sourced. As a side note, this DeepSpec repository contains not just DSpark, but also DFlash and the industry classic Eagle3; it's a general-purpose speculative decoding training framework that can even use a competitor's Qwen3 as the target model to train a drafter. Open-sourcing to this extent is a real attempt to flip the table and get everyone playing.

For developers: Using DeepSeek's API will be cheaper and faster; for those wanting to self-deploy, the code and weights are ready. For ordinary users: The experience is simply that DeepSeek replies faster, especially for long-form generation, where the speedup is most noticeable.

Systems are improved, not designed from scratch—this time, DeepSeek didn't improve the model, but the entire inference playbook.

AI is an amplifier, not a replacement. Engineering optimizations like DSpark amplify DeepSeek's path of "strong enough models, sold cheaply enough"—a path that's becoming increasingly hard for others to catch up to. In the future, the large model game might not be about whose brain is smarter, but whose brain runs cheaper.

I am Scorpion Lailai Loves Fighting Monsters, same name across the web. Feel free to follow my public account/planet/Juejin/Zhihu.