Private AI Chat
On Your Phone.

Run full AI models on your phone — fast. Up to 57 tok/s on small models and 23+ tok/s on 8B with speculative decoding. Search your documents, hear responses with offline TTS, and your AI remembers you across every conversation. No cloud. No subscription.

100% offline · Zero telemetry · Free during beta · Android 8.0+
TokForge — Mobile LLM Tuning

See It Running Offline

Airplane mode on. No Wi-Fi. No data. Still generating.

✈ Fully offline · 10+ tok/s on-device
Galaxy S24 · Snapdragon 8 Gen 3 · Qwen3-8B

Custom AI Personalities

Import character cards with full backstories, alternate greetings, and lore. Each character feels different because each one is.

Auto-Optimized for Your Device

Three inference backends and five GPU paths. TokForge detects your hardware and picks the fastest config automatically — no tuning required.

Blazing Small-Model Speed HOT

46–57 tok/s on small models with TQ4 TurboQuant — aggressive GPU quantization that makes lightweight models fly.

Chat With Your Documents NEW

Attach PDFs, DOCX, or EPUB files. TokForge summarizes, indexes, and searches them so your AI can answer grounded in your documents — all on-device.

Hear Your AI Speak NEW

Offline text-to-speech with 11 natural voices and adjustable speed. Powered by Kokoro TTS — no internet, no latency, no data sent anywhere.

Unique Settings Per Character

Each character gets its own creativity, sampling, and style settings. Your creative writer stays wild while your analyst stays precise.

Cross-Device Benchmark Matrix

Real devices. Real tok/s. Reproducible configs.

Updated: 2026-04-06 · v3.4.7 · Methodology →

MNN Vulkan — Autoregressive (AR) Decode

Device              | SoC   | Model     | tok/s | vs OpenCL | vs CPU
OnePlus Ace 5 Ultra | D9400 | Qwen3-8B  | 11.88 | +56%      | +166%
OnePlus Ace 5 Ultra | D9400 | Qwen3-14B | 11.22 | N/A       | +151%

MNN Vulkan with tuned NHWC4 GEMV kernels for the Mali G925. Decode peaks verified in the same session via the in-app benchmark harness. Note: speculative decoding currently hurts Vulkan performance because the verify batch (M>1) hits a slow sliding-window path on Mali. Use Vulkan for AR decode only; use OpenCL for speculative decoding.

GGUF Vulkan — Cooperative Matrix (CoopMat)

Device              | SoC   | Model     | tok/s | vs CPU
OnePlus Ace 5 Ultra | D9400 | 3B Q4_K_M | 16.85 | 3.4x
OnePlus Ace 5 Ultra | D9400 | 8B Q4_K_M | 8.07  | ~2x

GGUF Vulkan with ggml-vulkan cooperative matrix support. Complements MNN Vulkan for quantized GGUF models.

Mid-Range vs Flagship

OnePlus Ace 5 Ultra (D9400, ~$400) hits 11.88 tok/s on 8B and 11.22 tok/s on 14B via Vulkan AR — competitive with flagships costing 3–4x more. Samsung's One UI memory overhead limits what the S26 can run comfortably; the OnePlus runs 14B with headroom.

Speculative Decoding — Draft-Verified Acceleration NEW

Device              | Target Model | Baseline                | With Spec Decode | Speedup
RedMagic 11 Pro     | Qwen3-8B     | 14.05 tok/s             | 23.5 tok/s       | +67%
RedMagic 11 Pro     | Qwen3-14B    | 8.25 tok/s              | 16.4 tok/s       | +99%
Lenovo TB520FU      | Qwen3-8B     | 10.10 tok/s             | 10.99 tok/s      | +9%
OnePlus Ace 5 Ultra | Qwen3-8B     | 11.88 tok/s (Vulkan AR) | N/A*             | N/A*
OnePlus Ace 5 Ultra | Qwen3-14B    | 11.22 tok/s (Vulkan AR) | N/A*             | N/A*

A small draft model proposes candidate tokens; the target model verifies them in a single batched forward pass. SM8850 results are verified single-packet peaks (500 decode tokens). Automatically enabled on supported devices and model pairings.

* Speculative decoding currently hurts Vulkan performance: the verify batch (M>1) falls back to a slow sliding-window path. D9400 baselines therefore use the faster Vulkan AR-only decode instead.
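
For the curious, here is the draft-and-verify loop in sketch form. This is illustrative Kotlin, not TokForge's actual API; the Model interface, nextToken, and verifyBatch are placeholder names.

```kotlin
// Minimal sketch of draft-then-verify speculative decoding.
// All names here (Model, nextToken, verifyBatch) are illustrative.
interface Model {
    fun nextToken(context: List<Int>): Int               // greedy next-token
    // Returns the accepted prefix of `draft`, plus the target model's own
    // correction token when a draft token is rejected.
    fun verifyBatch(context: List<Int>, draft: List<Int>): List<Int>
}

fun speculativeStep(draft: Model, target: Model, ctx: MutableList<Int>, k: Int = 4): Int {
    // 1. The small draft model proposes k tokens autoregressively (cheap).
    val proposed = mutableListOf<Int>()
    repeat(k) { proposed += draft.nextToken(ctx + proposed) }

    // 2. The target model checks all k proposals in ONE batched forward pass;
    //    this M>1 batch is exactly the case that is slow on the Mali Vulkan path.
    val accepted = target.verifyBatch(ctx, proposed)

    // 3. Keep the verified tokens; up to k+1 tokens per large-model call.
    ctx += accepted
    return accepted.size
}
```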

Cross-Device Fleet Results (v3.4.7)

Device              | SoC    | Model      | Backend | Decode tok/s
Galaxy S26 Ultra    | SM8850 | Qwen3-8B   | OpenCL  | 21.0
RedMagic 11 Pro     | SM8850 | Qwen3-4B   | OpenCL  | 20.68
Galaxy S26 Ultra    | SM8850 | Qwen3.5-4B | CPU     | 21.30
OnePlus Ace 5 Ultra | D9400  | Qwen3-8B   | Vulkan  | 11.88
OnePlus Ace 5 Ultra | D9400  | Qwen3-14B  | Vulkan  | 11.22
RedMagic 11 Pro     | SM8850 | Qwen3-8B   | OpenCL  | 14.05
Galaxy S24 Ultra    | SM8650 | Qwen3-4B   | OpenCL  | 13.58
Xiaomi Pad 7 Pro    | SM8635 | Qwen3-4B   | CPU     | 11.81
Lenovo TB520FU      | SM8650 | Qwen3-8B   | OpenCL  | 10.10

BackendCapabilityResolver auto-routes each device: Snapdragon uses MNN OpenCL for standard attention (Qwen3), CPU for linear attention (Qwen3.5). Dimensity 9400 Mali uses MNN Vulkan with tuned NHWC4 GEMV kernels — first production Vulkan LLM on ARM Mali.
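
A sketch of that routing, following the rules just described. The enum values, data classes, and function signature are assumptions for illustration; the real BackendCapabilityResolver is not shown here.

```kotlin
// Illustrative routing logic mirroring the prose above. Names and
// signatures are placeholders, not TokForge's internal API.
enum class Backend { MNN_OPENCL, MNN_VULKAN, CPU }

data class DeviceInfo(val soc: String, val gpuName: String)
data class ModelInfo(val linearAttention: Boolean)

fun resolveBackend(device: DeviceInfo, model: ModelInfo): Backend = when {
    // Linear-attention models (Qwen3.5-style) run fastest on CPU today.
    model.linearAttention -> Backend.CPU
    // Snapdragon (SM-prefixed SoCs, Adreno GPUs): MNN OpenCL.
    device.soc.startsWith("SM") -> Backend.MNN_OPENCL
    // Dimensity 9400 / Mali G925: tuned MNN Vulkan path.
    device.gpuName.contains("Mali") -> Backend.MNN_VULKAN
    else -> Backend.CPU
}
```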

MNN vs GGUF — Backend Comparison (RedMagic 11 Pro, SM8850)

Model      | MNN OpenCL (tok/s) | GGUF CPU (tok/s) | MNN Advantage
Qwen3-0.6B | 34.8               | 42.7             | −18%
Qwen3-1.7B | 27.4               | 16.3             | +68%
Qwen3-4B   | 20.68              | 9.0              | +130%
Qwen3-8B   | 14.05              | 5.4              | +160%
Qwen3-14B  | 8.25               | 2.7              | +206%

MNN OpenCL overtakes GGUF CPU at 1.7B+ parameters; at 14B, MNN is 3x faster. GGUF CPU wins only on tiny models (<1B), where the GPU's per-token dispatch overhead outweighs its compute advantage. GGUF uses KleidiAI i8mm + futex barrier threading (2 threads optimal on Snapdragon 8 Elite).

GGUF Decode Speed by Model Size

Model        | Quant  | Threads | Decode tok/s | Prefill tok/s
Qwen3-0.6B   | Q4_K_M | 2T      | 42.7         | 113.0
Qwen3-1.7B   | Q4_K_M | 2T      | 16.3         | 43.9
Llama-3.2-3B | Q4_K_M | 2T      | 10.1         | 26.6
Qwen3-4B     | Q4_K_M | 2T      | 9.0          | 20.7
Qwen3-8B     | Q4_K_M | 2T      | 5.4          | 12.0
Qwen3-14B    | Q4_K_M | 2T      | 2.7          | 5.8

GGUF uses llama.cpp with KleidiAI i8mm acceleration and futex barrier threading. A 2-thread configuration consistently outperforms 4 threads on Snapdragon 8 Elite.
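
For reference, a decode tok/s figure like those above can be reproduced with a simple timing harness. A minimal sketch, assuming a generateToken callback wrapping whatever engine call actually decodes; this is not TokForge's benchmark code.

```kotlin
// Generic decode-throughput measurement of the kind behind the table
// above. `generateToken` is a stand-in for the real engine call
// (llama.cpp via JNI, MNN, etc.).
fun measureDecodeTokS(generateToken: () -> Unit, tokens: Int = 500): Double {
    // Warm up so first-call overhead (JIT, cache fills) doesn't skew the peak.
    repeat(16) { generateToken() }

    val start = System.nanoTime()
    repeat(tokens) { generateToken() }
    val seconds = (System.nanoTime() - start) / 1e9
    return tokens / seconds   // decode tokens per second
}
```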

Key Findings

  • Vulkan GPU makes Mali competitive — 11.88 tok/s on 8B models, 56% faster than OpenCL and 166% faster than CPU on the Dimensity 9400.
  • $400 phone, flagship performance — OnePlus Ace 5 Ultra runs 8B at 11.88 tok/s and 14B at 11.22 tok/s, matching devices that cost 3–4x more.
  • Spec decode nearly doubles large-model speed — 14B goes from 8.25 to 16.4 tok/s (+99%) on Snapdragon 8 Elite. 8B gets a 67% boost.
  • GPU acceleration is up to 3x faster — On models 1.7B and above, GPU-accelerated inference dominates CPU-only. The gap widens with model size.
  • Conversations stay fast after the first message — Delta prefill cuts follow-up latency by up to 34x (58s down to 1.7s); see the sketch after this list.
  • Your device picks the best path — Five GPU acceleration paths are auto-selected per chipset. No manual config needed.
  • Every result is reproducible — All benchmarks are exportable via the 120+ endpoint API. Compare across devices and share configs.
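
Delta prefill, referenced in the list above, in sketch form: keep the KV cache across turns and prefill only the tokens the cache has not yet seen. Engine, newKvCache, prefill, and decodeNext are hypothetical names for illustration.

```kotlin
// Illustrative delta-prefill: the KV cache survives across turns, so
// each follow-up only prefills the new suffix. Types are hypothetical.
class ChatSession(private val engine: Engine) {
    private val cache = engine.newKvCache()
    private var cachedTokens = 0

    fun sendTurn(fullPrompt: List<Int>): Int {
        // Naive prefill re-runs all fullPrompt.size tokens every turn.
        // Delta prefill runs only the unseen suffix, which is how a 58 s
        // follow-up prefill can drop to ~1.7 s.
        val delta = fullPrompt.subList(cachedTokens, fullPrompt.size)
        engine.prefill(cache, delta)
        cachedTokens = fullPrompt.size
        return engine.decodeNext(cache)   // first response token
    }
}

// Hypothetical engine surface, just enough to make the sketch compile.
interface Engine {
    fun newKvCache(): Any
    fun prefill(cache: Any, tokens: List<Int>)
    fun decodeNext(cache: Any): Int
}
```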

Everything you need for local AI chat.

Built for privacy-conscious users, roleplay enthusiasts, and developers who want full control.

2x Faster With Spec Decode NEW

A small draft model predicts ahead and the main model verifies in one pass, nearly doubling speed on large models: 23+ tok/s on 8B, 16+ on 14B. TokForge auto-detects the best model pairings and picks the right GPU path per device.

Your Phone, Optimized Automatically

Three inference engines and five GPU acceleration paths. TokForge profiles your hardware on first launch and picks the fastest config — Snapdragon, Dimensity, or Exynos. You can also connect to a remote server for bigger models.

Hear Your AI Talk Back NEW

11 natural-sounding voices with adjustable speed, fully offline via Kokoro TTS. Two quality tiers: fast or premium. Voice input too — talk to your AI and hear it respond without ever touching a server.

Your AI Remembers You

Facts, preferences, and context carry across every conversation. TokForge learns in the background while you chat, building a memory that makes each character feel like it actually knows you.

TurboQuant — 57 tok/s HOT

TQ4 aggressive GPU quantization makes small models absurdly fast — 46–57 tok/s. Two modes trade off quality vs speed. Ideal for quick questions, brainstorming, and real-time back-and-forth where latency matters most.
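
TQ4's exact format is not documented here, but block-wise 4-bit quantization schemes generally follow the same pattern. A minimal sketch, assuming a 32-weight block with one shared scale; this is not TQ4's actual layout.

```kotlin
import kotlin.math.abs

// Generic block-wise 4-bit quantization sketch: 32 float weights collapse
// to 16 packed bytes plus one scale. Dequant: w ≈ (code - 8) * scale.
fun quantizeBlock4(weights: FloatArray): Pair<Float, ByteArray> {
    require(weights.size == 32)
    val amax = weights.maxOf { abs(it) }
    val scale = if (amax == 0f) 1f else amax / 7f   // map weights into [-7, 7]
    val packed = ByteArray(16)
    for (i in 0 until 32 step 2) {
        val a = (weights[i] / scale).toInt().coerceIn(-8, 7) + 8
        val b = (weights[i + 1] / scale).toInt().coerceIn(-8, 7) + 8
        packed[i / 2] = ((a shl 4) or b).toByte()   // two 4-bit codes per byte
    }
    return scale to packed
}
```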

Chat With Your Documents NEW

Attach PDFs, Word docs, EPUBs, or plain text. TokForge indexes and summarizes them, then your AI answers questions grounded in the actual content — all processed on-device, nothing uploaded anywhere.
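
The document flow above, sketched end to end: chunk, embed, retrieve, then answer grounded in the retrieved text. The embed and generate callbacks are placeholders for local models, not TokForge's API.

```kotlin
// Illustrative on-device retrieval flow. `embed` and `generate` stand in
// for a local embedding model and a local LLM respectively.
fun answerFromDocument(
    documentText: String,
    question: String,
    embed: (String) -> FloatArray,
    generate: (String) -> String
): String {
    // 1. Index: split into fixed-size chunks and embed each one.
    val chunks = documentText.chunked(1000)
    val index = chunks.map { it to embed(it) }

    // 2. Retrieve: rank chunks by cosine similarity to the question.
    val q = embed(question)
    fun cos(a: FloatArray, b: FloatArray): Double {
        var dot = 0.0; var na = 0.0; var nb = 0.0
        for (i in a.indices) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i] }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-9)
    }
    val top = index.sortedByDescending { cos(q, it.second) }.take(3)

    // 3. Ground: the model answers from the retrieved chunks only.
    val context = top.joinToString("\n---\n") { it.first }
    return generate("Answer using only this context:\n$context\n\nQ: $question")
}
```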

Read the docs

Engineered for speed and control.

  • Character cards + persona — Import characters with full backstories, then TokForge assembles the prompt for you
  • Three inference engines — GPU-accelerated and CPU-optimized paths, automatically selected for your device
  • Hardware profiler — Detects your chipset, GPU, and RAM to recommend the best config
  • 120+ API endpoints — Full remote control from any device on your network: run benchmarks, manage models, change settings
  • Benchmark database — Save results, compare across devices, export and share your configs
1. Pick a character or start a blank chat
2. TokForge detects your hardware & picks the fastest config
3. GPU-accelerated inference → real-time token streaming
4. Rich rendering with reasoning blocks & markdown
5. Memory learns from the conversation in the background

Chat + inference pipeline

Local-first by design.
Transparent by default.

Get TokForge — Open Testing

TokForge is free and available on Google Play open testing. Install directly or join the community to help shape the future of private mobile AI.

v3.4.7 — TurboQuant (up to 57 tok/s), document search, Vulkan GPU acceleration, speculative decoding, offline TTS, persistent memory, saved configs, 120+ API endpoints, and more. Free on Google Play.
Get it on Google Play
Join the Community

Request Access

No spam. We'll only email you about beta access.

No telemetry. No background reporting. Your data stays on your device unless you explicitly opt in.