
Speculative Decoding

Up to 75% Faster Inference · Automatic Acceleration · Zero Configuration

Overview

Speculative decoding makes AI inference dramatically faster on your phone. Instead of generating text one token at a time, TokForge uses a small draft model to quickly propose several tokens ahead, and the main model verifies them all in a single batch. The net result: responses stream up to 75% faster on supported hardware.

The Simple Version

A lightweight draft model rapidly predicts what tokens come next. The powerful main model then verifies all those predictions at once, accepting the ones that are correct and rejecting the others. This parallel verification — instead of generating token-by-token — is what creates the massive speedup.

You don't need to do anything. TokForge automatically detects which draft-target model pairs work best on your device and handles all the heavy lifting behind the scenes.

How Fast Is It?

Real-world speedups vary by device, model, and workload. Here's what you can expect:

Model      Without Spec Decode    With Spec Decode    Speedup
Qwen3 8B   11.41 tokens/sec       19.60 tokens/sec    +72%
Qwen3 9B   9.55 tokens/sec        16.67 tokens/sec    +74%
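The Speedup column is simply the ratio of the two throughput figures. Checking the Qwen3 8B row:

```python
# Speedup is (with / without - 1), expressed as a percentage.
without_sd = 11.41   # tokens/sec, spec decode off
with_sd = 19.60      # tokens/sec, spec decode on

speedup_pct = (with_sd / without_sd - 1) * 100
print(f"+{speedup_pct:.0f}%")   # prints +72%, matching the table
```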

On flagship hardware (2024+): You may see up to 75% speedup. On older devices, spec decode is smart enough to disable itself if it wouldn't help — we don't want slower inference.

Performance varies by:

  • Device hardware (SoC, RAM, thermal conditions)
  • Model size and architecture
  • Input prompt length and context window usage
  • Batch size and inference settings

Some devices benefit more than others. TokForge automatically detects what works best for your specific phone.

How It Works For You

Setup: Automatic

Speculative decoding requires downloading a small draft model alongside your main model. TokForge handles this automatically as an "Acceleration Pack." When you download a compatible model from the app, the draft model downloads right alongside it.

You'll see a download progress bar just like normal. The extra download is modest — draft models are a fraction of the size of the main model.

Enable & Disable

Spec decode is enabled by default on compatible models on supported hardware. But you have full control:

  • Enable or disable spec decode globally in Settings
  • Enable or disable per-model in the model card
  • See a live indicator in the chat toolbar when spec decode is active

Smart Detection

TokForge automatically profiles your device on first launch. It detects:

  • Which hardware features are available
  • Which draft-target model pairs would actually help
  • Whether spec decode would slow you down (and disables it if so)

This profiling happens once, in the background. No configuration needed.
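Conceptually, that run-once check behaves like a cached benchmark: measure once, remember the verdict, never re-run. The sketch below uses made-up names and the throughput figures from the table above — it is an illustration, not TokForge's actual profiler.

```python
# Run-once profiling gate (hypothetical): benchmark spec decode vs.
# plain decoding on first use, cache the verdict, reuse it afterwards.

_cache = {}

def spec_decode_helps(device_id, benchmark):
    # benchmark() returns (plain_tps, spec_tps); called only on first use.
    if device_id not in _cache:
        plain_tps, spec_tps = benchmark()
        _cache[device_id] = spec_tps > plain_tps
    return _cache[device_id]

calls = []
def fake_benchmark():
    calls.append(1)
    return (9.55, 16.67)   # tokens/sec, from the table above

print(spec_decode_helps("phone-1", fake_benchmark))  # True: spec decode helps
print(spec_decode_helps("phone-1", fake_benchmark))  # cached, benchmark not re-run
print(len(calls))                                    # the benchmark ran exactly once
```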

Supported Models

Which Models Work?

Speculative decoding works best with larger language models. Currently, the Qwen3 family has first-class support for spec decode acceleration.

Other model families may be added over time as we optimize draft-target pairings. Check your model card in the app for the Spec Decode badge — if it's there, spec decode is available for that model.

What the Badge Means

When you see the Spec Decode badge on a model card, it means:

  • TokForge has verified this model works well with speculative decoding
  • A compatible draft model has been identified and will download automatically
  • Spec decode will be enabled by default if your device's hardware supports it
  • You can toggle it on/off in settings at any time

Device Compatibility

Hardware Support

Speculative decoding is designed for flagship mobile chips (Snapdragon 8 series and similar), which have the parallel processing power to handle draft model inference. Some devices benefit more than others, depending on:

  • Processor generation and architecture
  • Available compute units for parallel execution
  • Memory bandwidth and thermal design

Automatic Enable/Disable

TokForge automatically detects compatibility. On first launch, the app profiles your device to determine whether spec decode helps. On supported hardware, it's enabled by default; on devices where it wouldn't provide a benefit, it's automatically disabled so it never slows inference down.

You can always manually override this in Settings if you want to experiment.

Multiple Backends

Speculative decoding works with both:

  • MNN backend: Optimized for on-device inference
  • GGUF backend: Alternative quantization format with spec decode support

The backend is selected automatically based on your model and device configuration.
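At its simplest, that selection is a dispatch on the model's format. The sketch below is hypothetical — TokForge's real heuristic also weighs your device configuration, not just the file extension.

```python
# Hypothetical backend dispatch by model format (illustration only).
def pick_backend(model_path):
    if model_path.endswith(".gguf"):
        return "GGUF"   # GGUF-quantized models use the GGUF backend
    return "MNN"        # everything else runs on the MNN backend

print(pick_backend("qwen3-8b.gguf"))  # GGUF
print(pick_backend("qwen3-8b.mnn"))   # MNN
```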

Settings & Controls

Global Settings

In Settings → Performance, you'll find:

  • Enable Speculative Decoding: Toggle spec decode on/off globally
  • Auto-Enable on Compatible Hardware: Automatically enable spec decode on devices where TokForge detects it would help

Per-Model Configuration

On each model card, you can:

  • See the Spec Decode badge if the model supports it
  • Toggle spec decode for that specific model
  • View which draft model is being used (shown in the pairing card)

Draft-Target Pairing Card

When a model has spec decode enabled, you'll see a pairing card showing:

  • Target Model: The large, powerful model (e.g., Qwen3 9B)
  • Draft Model: The small, fast model used to predict tokens
  • Speedup Range: Expected performance gain on your device (e.g., "+23-63%")

Live Speed Indicator

When you're chatting and spec decode is active, the chat toolbar shows a live icon. This indicates speculative decoding is currently accelerating your inference. The icon appears during token generation and disappears when the response is complete.

FAQ

Does speculative decoding affect quality?

No. Quality is identical. The draft model is only used to propose tokens; the main model always makes the final decision on which tokens to accept. If the draft model predicts wrong, those tokens are rejected. You get the exact same output quality, just faster.

Does it use more RAM?

Yes, a small amount. The draft model stays in memory alongside the main model. Typically, this adds 10-20% to your memory footprint, depending on draft model size. On devices with memory constraints, you can disable spec decode to reclaim that RAM.
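As a rough worked example of where that 10-20% figure comes from (the sizes below are illustrative, not measured):

```python
# Illustrative RAM math: the draft model adds its own weights on top
# of the target model's footprint. Sizes are hypothetical.
target_gb = 5.0   # e.g. a quantized ~8B target model
draft_gb = 0.7    # e.g. a small quantized draft model

overhead_pct = draft_gb / target_gb * 100
print(f"{overhead_pct:.0f}% extra RAM")  # 14% — within the 10-20% range
```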

Can I disable speculative decoding?

Absolutely. You can disable it globally in Settings, or disable it per-model on the model card. If you find spec decode isn't helping on your device, turn it off.

What if my device doesn't support spec decode?

TokForge automatically detects this. If your device doesn't have the hardware for efficient spec decode, it will be disabled by default. Models will still download and run normally — spec decode is purely optional acceleration. You can always manually enable it in Settings if you want to experiment.

How much disk space do draft models take?

Draft models are significantly smaller than full models — typically 25-50% of the target model size. They download automatically alongside your main model, and you only need to store one draft model per target model.

Does spec decode work with custom quantizations?

Speculative decoding works with models quantized in MNN and GGUF formats. If you're using a custom model, spec decode may not be available, but the model will run normally without it.

Can I use different draft-target pairs?

Not manually. TokForge automatically selects the best draft-target pairing for each model based on rigorous testing. Custom pairings could result in poor performance or incorrect output. Trust the automatic pairing for the best results.

Is there a performance penalty if spec decode fails to predict correctly?

No. Spec decode only helps if it succeeds. If the draft model predicts incorrectly, those tokens are rejected and re-computed correctly by the main model. There's no penalty — you just don't get a speedup for that particular token.
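This "never worse, often better" property has a simple back-of-envelope form. With k drafted tokens and a per-token acceptance probability a, each verification pass yields on average (1 - a^(k+1)) / (1 - a) tokens — a standard result for greedy speculative decoding. The numbers below are illustrative:

```python
# Expected tokens produced per verification pass, given k drafted
# tokens and per-token acceptance probability a. The worst case
# (a = 0, every guess wrong) is still 1 token per pass, never less.
def expected_tokens_per_pass(a, k):
    if a == 0:
        return 1.0
    return (1 - a ** (k + 1)) / (1 - a)

print(round(expected_tokens_per_pass(0.7, 4), 2))  # 2.77 tokens per pass
print(round(expected_tokens_per_pass(0.0, 4), 2))  # 1.0: no speedup, no penalty
```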

Do I need to download anything extra?

No. Draft models download automatically as "Acceleration Packs" when you download a compatible target model. You don't need to take any extra steps — it's all built into the normal download flow.

What's the battery impact?

Speculative decoding actually reduces battery consumption because inference is faster. Less time computing means less power draw. In most cases, spec decode will improve battery life compared to running without it.

Can I see metrics on how much spec decode is helping?

Yes. In the chat history, you can view detailed metrics for each response, including tokens generated, spec decode acceptance rate, and effective speedup for that particular response. Tap the metrics icon next to any response to see the details.
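The acceptance rate shown there is just the fraction of draft-proposed tokens the main model kept. A minimal sketch with hypothetical field names (not TokForge's metrics schema):

```python
# Acceptance rate from raw counters: how many of the draft model's
# proposals the target model accepted.
def acceptance_rate(drafted, accepted):
    return accepted / drafted if drafted else 0.0

print(f"{acceptance_rate(drafted=200, accepted=150):.0%}")  # 75%
```

A higher acceptance rate means longer runs of accepted tokens per verification pass, and therefore a larger effective speedup for that response.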