
ForgeLab & Methodology

TokForge v3.4.7 — Benchmarking system, optimization tiers, TQ4 TurboQuant, configuration management, and reproducible inference performance measurement.

Overview

ForgeLab is TokForge's integrated benchmarking and optimization system. It helps you systematically measure and optimize model performance on your device.

Every benchmark result is stored with its complete configuration, making all performance measurements reproducible and comparable across devices.

ForgeLab UI

The Forge screen is a three-tab interface for benchmarking and optimization:

Tab 1: Config (Optimization Control)

Purpose: Configure and run AutoForge optimization sweeps.

What you can do:

Typical workflow: Select your loaded model → Pick "Medium" tier → Watch as TokForge tests different thread counts, KV cache types, batch sizes, and spec decode configurations → Get the optimal config in less than 15 minutes.
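As a rough illustration of what such a sweep enumerates, here is a minimal Kotlin sketch of a parameter grid over the dimensions named above. The `SweepParams` type and the specific candidate values are assumptions, not TokForge's actual API or defaults.

```kotlin
// Hypothetical parameter grid for an AutoForge-style sweep. The type name and
// candidate values are illustrative; TokForge's real sweep space is determined
// by the chosen tier and the detected hardware profile.
data class SweepParams(
    val threads: Int,
    val kvCacheType: String,  // e.g. "f16" or "q8_0" (assumed candidates)
    val batchSize: Int,
    val specDecode: Boolean,
)

fun sweepGrid(maxThreads: Int): List<SweepParams> =
    (1..maxThreads).flatMap { t ->
        listOf("f16", "q8_0").flatMap { kv ->
            listOf(128, 256, 512).flatMap { batch ->
                listOf(false, true).map { spec -> SweepParams(t, kv, batch, spec) }
            }
        }
    }
```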

[Screenshot: AutoForge Sweep Complete, showing the four optimization stages: Threads, CPU vs OpenCL, Precision, and Context]

Tab 2: Report (Results & History)

Purpose: View benchmark results and historical performance data.

What you can see:

The benchmark card shows your inference speed at a glance. Tap it to see full details like prefill time, delta prefill latency, total latency, spec decode results, and the exact configuration used. Share results via WhatsApp, Telegram, or email with a single button tap.

Tab 3: Profiles (Configuration Management)

Purpose: Manage saved configurations and apply them to inference.

What you can do:

Profiles are named by device SoC, model, and backend. When you apply a profile, TokForge loads those exact settings and returns you to chat with the new configuration active. No manual config editing needed.

Optimization Tiers

AutoForge offers four time-based optimization tiers. Each tier automatically sweeps different parameter combinations to find the best configuration for your hardware and model:

Instant Tier

Use case: Quick baseline check or validate a config change works.

Fast Tier (<1 minute)

Use case: Quick sweep to find obvious wins in threading.

Medium Tier (<15 minutes)

Use case: Production optimization that covers most parameter interactions.

Long Tier (~30 minutes)

Use case: Exhaustive testing for scientific validation or fleet-wide optimization.
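The four tiers above and their stated time budgets could be modeled as a simple enum. This is a sketch, not TokForge's actual types, and the Instant tier's budget is assumed since no figure is given.

```kotlin
import kotlin.time.Duration
import kotlin.time.Duration.Companion.minutes
import kotlin.time.Duration.Companion.seconds

// The four documented tiers with their approximate time budgets.
enum class OptimizationTier(val budget: Duration) {
    INSTANT(30.seconds),  // budget assumed; the docs give no figure
    FAST(1.minutes),      // threading sweep, under a minute
    MEDIUM(15.minutes),   // production optimization
    LONG(30.minutes),     // exhaustive testing
}
```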

How Benchmarks Run

Warmup & Measured Runs

Every benchmark follows a standardized structure to ensure reproducible results:

Metrics collected per run: Prefill latency (ms), decode latency (ms), tokens per second (tok/s), and token count.
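A minimal sketch of that structure, assuming warmup runs are discarded and only measured runs are recorded; the run counts here are placeholders, not TokForge's actual defaults.

```kotlin
// Per-run metrics, as listed above.
data class RunMetrics(
    val prefillMs: Double,
    val decodeMs: Double,
    val tokPerSec: Double,
    val tokenCount: Int,
)

// Warmup-then-measure loop; warmup results are thrown away so cache and
// scheduling effects don't pollute the measured runs.
fun benchmark(runOnce: () -> RunMetrics, warmup: Int = 2, measured: Int = 5): List<RunMetrics> {
    repeat(warmup) { runOnce() }
    return List(measured) { runOnce() }
}
```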

Single Benchmark

A manual benchmark run sends a prompt to your loaded model and measures inference speed. You can customize:

Auto-Matrix

The auto-matrix feature benchmarks all combinations of installed models and available backends in one async operation. This produces a complete device performance matrix — useful for understanding which model/backend combination is fastest on your hardware.
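Conceptually, the auto-matrix is a cross product of models and backends, benchmarked one at a time. The sketch below assumes string IDs and a caller-supplied `runBenchmark`, neither of which is TokForge's real API.

```kotlin
// Illustrative auto-matrix: every installed model crossed with every
// available backend, run sequentially so results don't interfere.
suspend fun autoMatrix(
    models: List<String>,
    backends: List<String>,
    runBenchmark: suspend (model: String, backend: String) -> Double,  // tok/s
): Map<Pair<String, String>, Double> {
    val matrix = mutableMapOf<Pair<String, String>, Double>()
    for (model in models)
        for (backend in backends)
            matrix[model to backend] = runBenchmark(model, backend)
    return matrix
}
```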

TokForge Score & Metrics

TokForge Score: A composite metric computed as decode_tok/s × 0.7 + prefill_tok/s × 0.3. This weighted formula prioritizes decode speed (70%) while accounting for prefill latency (30%), reflecting real-world inference where generation throughput dominates user perception.
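The formula translates directly to code, shown here with a worked example:

```kotlin
// TokForge Score, exactly as documented: 70% decode, 30% prefill.
fun tokForgeScore(decodeTokPerSec: Double, prefillTokPerSec: Double): Double =
    decodeTokPerSec * 0.7 + prefillTokPerSec * 0.3

// Example: 20 tok/s decode, 150 tok/s prefill
// -> 20 * 0.7 + 150 * 0.3 = 14.0 + 45.0 = 59.0
```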

| Metric | Description | Unit |
|---|---|---|
| tok/s | End-to-end tokens per second (prompt + decode) | tokens/sec |
| Decode tok/s | Decode-only throughput (excludes prefill) | tokens/sec |
| Prefill tok/s | Prompt processing throughput | tokens/sec |
| Prefill latency | Time to process the input prompt before generating | milliseconds |
| Delta prefill | Latency for new messages only in multi-turn conversations (excludes cached context) | milliseconds |
| Decode latency | Total time spent generating output tokens | milliseconds |
| Token count | Number of tokens generated in the run | count |

Speculative Decoding Optimization

ForgeLab optimizes speculative decoding (spec decode) performance by systematically testing draft model configurations. The system generates optimized configs by testing different draft model backends and prediction lengths on CPU while the target model runs on GPU.

Spec Decode Sweep Configuration:

Config profiles with spec decode: When an optimization tier tests spec decode, the resulting profile includes optimal draft backend, prediction length, draft thread count, and measured uplift percentage. Apply a profile to activate both the base inference configuration and its paired spec decode settings.

Typical workflow: Run a Medium or Long tier optimization, which automatically tests spec decode variants. Check results to see measured uplift percentages for each configuration. Apply a high-uplift profile to activate spec decode for your inference.
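A minimal sketch of such a sweep, assuming a caller-supplied measurement function; `DraftConfig` and the uplift arithmetic are illustrative, not TokForge's internals.

```kotlin
// Candidate draft-model configuration: which backend runs the draft model
// and how many tokens it predicts ahead.
data class DraftConfig(val draftBackend: String, val predictionLength: Int)

// Measure each candidate against a no-spec-decode baseline and return the
// best config with its uplift percentage (null if there are no candidates).
fun sweepSpecDecode(
    baselineTokPerSec: Double,
    candidates: List<DraftConfig>,
    measureTokPerSec: (DraftConfig) -> Double,
): Pair<DraftConfig, Double>? =
    candidates
        .map { it to (measureTokPerSec(it) - baselineTokPerSec) / baselineTokPerSec * 100.0 }
        .maxByOrNull { (_, upliftPercent) -> upliftPercent }
```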

Hardware Profiling & Thermal Management

Before benchmarking, TokForge auto-detects your device's hardware profile:

This hardware profile informs recommended starting configurations and sets upper bounds for thread counts and context sizes. It also determines which compute paths (CPU, OpenCL, Vulkan MNN, Vulkan GGUF CoopMat, QNN) are available and which parameters are swept during optimization.
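A sketch of what such a profile might contain; the types and field names are assumptions, not TokForge's schema.

```kotlin
// The five compute paths named above. Availability is device-dependent.
enum class ComputePath { CPU, OPENCL, VULKAN_MNN, VULKAN_GGUF_COOPMAT, QNN }

// Illustrative hardware profile shape.
data class HardwareProfile(
    val soc: String,
    val maxThreads: Int,        // upper bound for thread sweeps
    val maxContextTokens: Int,  // upper bound for context-size sweeps
    val availablePaths: Set<ComputePath>,
)
```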

Thermal Management: AutoTuneThermalGate continuously monitors device temperature during optimization. When temperature thresholds are exceeded (moderate → severe → critical → emergency), the system automatically pauses benchmarking and waits for the device to cool before continuing. Thermal status and battery level are recorded with each benchmark result for context about measurement conditions.
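In the spirit of AutoTuneThermalGate, a gate might look like the following sketch. The threshold levels mirror the documented escalation; the polling interval and pause logic are assumptions.

```kotlin
import kotlinx.coroutines.delay

// Documented escalation order: moderate -> severe -> critical -> emergency.
enum class ThermalLevel { NORMAL, MODERATE, SEVERE, CRITICAL, EMERGENCY }

// Block until the device has cooled below the pause threshold. Kotlin enums
// are Comparable by declaration order, so >= acts as a severity check.
suspend fun awaitCool(readThermal: () -> ThermalLevel) {
    while (readThermal() >= ThermalLevel.MODERATE) {
        delay(5_000)  // polling interval is an assumed value
    }
}
```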

Backend Comparison

ForgeLab tests across five compute paths for comprehensive device coverage:

| Aspect | MNN | GGUF (llama.cpp) |
|---|---|---|
| Speed | 2-8x faster on mobile | Baseline (reference) |
| Format | .mnn directory | Single .gguf file |
| Quantization | 4-bit (Q4_0, Q4_1) | Q4_K_M through Q8_0 |
| GPU acceleration | OpenCL (Adreno/Mali) + Vulkan | Vulkan (where available) + GPU layers |
| Vulkan support | MNN Vulkan with custom kernels | GGUF Vulkan CoopMat, auto-tuned in Medium/Long tiers |
| Thinking models | Not yet supported | Full support |
| Model coverage | Qwen3 primarily | Broader ecosystem |

Configuration Profiles

A configuration profile is a complete set of inference settings saved per (SoC, model, backend, quantization) tuple. Profiles capture:

Profile sources: Profiles are tagged by how they were created. Manual benchmarks create "benchmark" source profiles (highest priority). AutoForge creates "auto-tune" profiles. Device auto-detection creates "auto_profile" (lowest priority). Profiles from other devices are tagged "imported". This hierarchy ensures manually-tuned configs are never overwritten by automatic sweeps.

Applying a profile: When you tap Apply on a profile, TokForge loads those exact settings, switches backends if needed, unloads the current model, and reloads it with the new configuration. You return to chat automatically with the profile active.
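Putting the keying tuple and source hierarchy together, a profile might be shaped like the sketch below. All field names are assumptions, and the rank of "imported" relative to the others is not specified above, so its position here is a guess.

```kotlin
// Source priority, highest first. IMPORTED's exact rank is assumed; the docs
// only fix BENCHMARK as highest and AUTO_PROFILE as lowest.
enum class ProfileSource { BENCHMARK, AUTO_TUNE, IMPORTED, AUTO_PROFILE }

data class ConfigProfile(
    val soc: String,
    val model: String,
    val backend: String,
    val quantization: String,
    val source: ProfileSource,
    val settings: Map<String, Any>,  // threads, KV cache type, batch size, ...
)

// A new profile only overwrites an existing one for the same tuple if its
// source ranks strictly higher, so manual benchmarks are never clobbered.
fun shouldReplace(existing: ConfigProfile, candidate: ConfigProfile): Boolean =
    candidate.source.ordinal < existing.source.ordinal
```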

Benchmark Matrix & Auto-Matrix

Benchmark Matrix: All results are organized into a SoC × Model × Backend grid for cross-model and cross-backend comparison.

Auto-Matrix: Run a comprehensive automated sweep across all installed models and available backends in a single operation, producing the complete device performance matrix described under Auto-Matrix above.

Export/Import: Benchmark results and configuration profiles are exportable as JSON for cross-device sharing and fleet analysis. Deduplication ensures imported results don't create duplicates if the exact same configuration was already benchmarked locally.
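Deduplication can be as simple as a set keyed on a configuration fingerprint. The sketch below assumes such a fingerprint string exists on each result; this is an illustration rather than TokForge's schema.

```kotlin
data class ImportedResult(val configFingerprint: String, val tokPerSec: Double)

// Merge imported results, skipping any whose exact configuration was
// already benchmarked locally.
fun mergeImported(local: MutableList<ImportedResult>, imported: List<ImportedResult>) {
    val seen = local.mapTo(mutableSetOf()) { it.configFingerprint }
    for (result in imported) {
        if (seen.add(result.configFingerprint)) local.add(result)
    }
}
```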

Benchmark Cards & Sharing

Each benchmark result generates a card via BenchmarkCardRenderer showing key metrics: headline tok/s, decode tok/s, prefill latency, total runtime, model name, device SoC, and thermal data. For speculative decoding configurations, cards display the measured uplift percentage.

Sharing options:

Imported benchmarks are merged into your local database and appear in your benchmark history. You can compare your device's results directly against results from colleagues running the same models.

Key Findings

Reproducibility & Benchmark Entity

Every benchmark result is stored with its complete configuration profile and runtime fingerprint. This means any result can be reproduced on identical hardware by loading the same profile. Each benchmark entity records:

Results that show unusual performance (e.g., heavy throttling) are tagged with thermal and battery data so you can identify when a device was under stress. Cross-device comparisons use exact runtime fingerprint matching to ensure accurate performance attribution.
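An illustrative shape for that entity, with field names as assumptions grounded in the metrics and measurement-condition data described above:

```kotlin
data class BenchmarkEntity(
    val profileId: String,           // links to the full configuration profile
    val runtimeFingerprint: String,  // exact-match key for cross-device comparison
    val prefillMs: Double,
    val decodeMs: Double,
    val tokPerSec: Double,
    val tokenCount: Int,
    val thermalStatus: String,       // e.g. "moderate" if throttling occurred
    val batteryPercent: Int,
)
```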

API Access

All ForgeLab functionality is available programmatically via the MetricsService API with 90+ endpoints. See the API documentation for complete details. Sample endpoints include:

Benchmark tips: Rotating tips are displayed while an optimization runs, guiding users through longer sweeps.