ForgeLab & Methodology
Overview
ForgeLab is TokForge's integrated benchmarking and optimization system. It helps you systematically measure and optimize model performance on your device:
- Benchmark — Run standardized speed tests on any model configuration
- AutoForge — Automatically sweep configuration parameters across four optimization tiers (Instant, Short, Medium, Long)
- Profiles — Save best-performing configs and apply them instantly
- Matrix — Compare results across devices, models, and backends
- Sharing — Export benchmark cards and JSON profiles for fleet analysis
Every benchmark result is stored with its complete configuration, making all performance measurements reproducible and comparable across devices.
ForgeLab UI
The Forge screen is a three-tab interface for benchmarking and optimization:
Tab 1: Config (Optimization Control)
Purpose: Configure and run AutoForge optimization sweeps.
What you can do:
- Select a model to optimize
- Choose a backend (MNN, GGUF, or Remote API)
- Pick an optimization tier (Instant, Short, Medium, or Long)
- Start the AutoForge sweep with a single tap
- Watch real-time progress showing which parameters are being tested
- See the best result found so far and estimated completion time
Typical workflow: Select your loaded model → Pick "Medium" tier → Watch as TokForge tests different thread counts, KV cache types, and batch sizes → Get the optimal config in 15-45 minutes
Tab 2: Report (Results & History)
Purpose: View benchmark results and historical performance data.
What you can see:
- Benchmark Card — Latest result with headline tok/s, prefill latency, and model/device info
- Benchmark History — Scrollable list of past runs with timestamps, backends, and performance metrics
- Manual Run — Run a single benchmark with custom prompt and token limits
- Export/Import — Share benchmarks as JSON or import results from other devices
The benchmark card shows your inference speed at a glance. Tap it to see full details like prefill time, total latency, and the exact configuration used. Share results via WhatsApp, Telegram, or email with a single button tap.
Tab 3: Profiles (Configuration Management)
Purpose: Manage saved configurations and apply them to inference.
What you can do:
- View all saved profiles with their best tok/s scores
- See where each profile came from (manual benchmark, auto-tune, or imported)
- Apply a profile instantly to switch your inference config
- View the full parameter list of any profile
- Export a profile as JSON for sharing or backup
- Import profiles from files or paste JSON directly
Profiles are named by device SoC, model, and backend. When you apply a profile, TokForge loads those exact settings and returns you to chat with the new configuration active. No manual config editing needed.
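The naming scheme described above can be sketched as a small helper. The underscore separator and field order here are illustrative assumptions, not TokForge's actual format:

```python
def profile_name(soc: str, model: str, backend: str) -> str:
    """Build a profile name from device SoC, model, and backend.
    The underscore-joined format is an assumption for illustration."""
    # Normalize spaces so names stay filesystem- and JSON-friendly
    return "_".join(p.strip().replace(" ", "-") for p in (soc, model, backend))

print(profile_name("Snapdragon 8 Elite", "Qwen3-4B", "MNN"))
# Snapdragon-8-Elite_Qwen3-4B_MNN
```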
Optimization Tiers
AutoForge offers four time-based optimization tiers. Each tier automatically sweeps different parameter combinations to find the best configuration for your hardware and model:
Instant Tier (~2-5 minutes)
Use case: Quick baseline check or validate a config change works.
- What it tests: Single config (your current settings as baseline)
- Parameters swept: None — just benchmarks your current state
- Output: Baseline tok/s number to compare future runs against
- Time: 2-5 minutes
Short Tier (~5-15 minutes)
Use case: Find obvious wins in threading and KV cache configuration.
- What it tests: 3-6 different configurations combining thread counts and memory options
- Parameters swept: Thread count (2-12), KV cache type (f16, q8_0)
- Output: Best configuration for your device, typically 5-10% faster than baseline
- Time: 5-15 minutes (depending on device speed)
Medium Tier (~15-45 minutes)
Use case: Production optimization that covers most parameter interactions.
- What it tests: 40-80 different configurations
- Parameters swept: Thread counts, batch sizes (256/512/1024), KV cache types, precision modes (for MNN), flash attention on/off
- Output: Comprehensive optimal config, typically a further 2-5% uplift beyond Short-tier results
- Time: 15-45 minutes
Long Tier (45 min - 2+ hours)
Use case: Exhaustive testing for scientific validation or publication-quality results.
- What it tests: 100-200+ configurations
- Parameters swept: All Medium parameters plus fine-grained threading ranges and GPU layer allocations
- Output: Complete performance landscape, optimal config, and detailed parameter sensitivity analysis
- Time: 45 minutes to 2+ hours, depending on device speed and model size
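A tier sweep is essentially a Cartesian product over its parameter axes. A minimal sketch of a Medium-style grid (the specific candidate values are illustrative assumptions; the real sweep derives them from the detected hardware profile):

```python
from itertools import product

# Illustrative candidate values, not TokForge's actual sweep ranges
threads = [4, 6, 8]
batch_sizes = [256, 512, 1024]
kv_cache = ["f16", "q8_0"]
precision = ["low", "normal"]          # MNN only
flash_attention = [False, True]

grid = [
    {"threads": t, "batch": b, "kv": k, "precision": p, "flash_attn": f}
    for t, b, k, p, f in product(threads, batch_sizes, kv_cache,
                                 precision, flash_attention)
]
print(len(grid))  # 3 * 3 * 2 * 2 * 2 = 72 configurations
```

With these axes the grid lands at 72 configurations, inside the 40-80 range quoted for the Medium tier; the Long tier simply widens the axes.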
How Benchmarks Run
Warmup & Measured Runs
Every benchmark follows a standardized structure to ensure reproducible results:
- Warmup run (1 run): Primes CPU caches, GPU memory, and JIT compilation. Results are discarded.
- Measured runs (3 runs): Actual performance measurements. The median value is selected to reduce variance from outliers.
Metrics collected per run: Prefill latency (ms), decode latency (ms), tokens per second (tok/s), and token count.
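The warmup-then-median structure above can be sketched as follows (function names are illustrative; `measure_once` stands in for a real inference pass returning tok/s):

```python
from statistics import median

def run_benchmark(measure_once, warmup=1, runs=3):
    """One discarded warmup pass, then `runs` measured passes;
    the median tok/s damps outlier variance."""
    for _ in range(warmup):
        measure_once()                      # primes caches/JIT; discarded
    return median(measure_once() for _ in range(runs))

# Example with a stub in place of real inference (1 warmup + 3 measured):
fake = iter([18.0, 21.5, 19.2, 20.1])
print(run_benchmark(lambda: next(fake)))   # 20.1
```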
Single Benchmark
A manual benchmark run sends a prompt to your loaded model and measures inference speed. You can customize:
- Prompt: Any text input (default: standardized benchmark prompt)
- Max tokens: Length of generated response (default: 128)
- Number of runs: How many measured runs to take; the median is reported (default: 3)
Auto-Matrix
The auto-matrix feature benchmarks all combinations of installed models and available backends in one async operation. This produces a complete device performance matrix — useful for understanding which model/backend combination is fastest on your hardware.
What We Measure
| Metric | Description | Unit |
|---|---|---|
| tok/s | End-to-end tokens per second (prompt + decode) | tokens/sec |
| Decode tok/s | Decode-only throughput (excludes prefill) | tokens/sec |
| Prefill latency | Time to process the input prompt before generating | milliseconds |
| Decode latency | Total time spent generating output tokens | milliseconds |
| Token count | Number of tokens generated in the run | count |
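The two throughput metrics follow directly from the latencies and token count in the table above. A sketch of the arithmetic:

```python
def throughput(tokens: int, prefill_ms: float, decode_ms: float):
    """Compute end-to-end and decode-only tok/s from one run's raw numbers."""
    end_to_end = tokens / ((prefill_ms + decode_ms) / 1000.0)
    decode_only = tokens / (decode_ms / 1000.0)
    return end_to_end, decode_only

e2e, dec = throughput(tokens=128, prefill_ms=400.0, decode_ms=6000.0)
print(round(e2e, 1), round(dec, 1))  # 20.0 21.3
```

Decode-only tok/s is always at least as high as end-to-end tok/s, since it excludes prefill time from the denominator.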
Hardware Profiling
Before benchmarking, TokForge auto-detects your device's hardware profile:
- SoC model — Snapdragon 8 Elite, MediaTek Dimensity, etc.
- CPU topology — Performance cores vs efficiency cores, maximum frequency
- GPU — Adreno, Mali, or other GPU with compute capabilities
- RAM — Total available memory for model loading
This hardware profile informs recommended starting configurations and sets upper bounds for thread counts and context sizes. It also determines which parameters are swept during optimization (e.g., older devices may have fewer core options to test).
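One way the hardware profile can bound the sweep is to cap candidate thread counts at the performance-core count. This is a simplified sketch, not necessarily the exact heuristic TokForge uses:

```python
def thread_candidates(perf_cores: int, eff_cores: int, cap: int = 12):
    """Candidate thread counts for the sweep. Efficiency cores are
    deliberately excluded (`eff_cores` is ignored here) because they
    tend to hurt inference throughput (see Key Findings)."""
    upper = min(perf_cores, cap)
    return list(range(2, upper + 1))

print(thread_candidates(perf_cores=6, eff_cores=2))  # [2, 3, 4, 5, 6]
```

An older 4-core device would therefore get only three candidates to test, which is part of why lower tiers finish faster on modest hardware.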
Backend Comparison
| Aspect | MNN | GGUF (llama.cpp) |
|---|---|---|
| Speed | 2-8x faster on mobile | Baseline (reference) |
| Format | .mnn directory | Single .gguf file |
| Quantization | 4-bit (Q4_0, Q4_1) | Q4_K_M through Q8_0 |
| GPU | OpenCL (Adreno/Mali) | GPU layers via llama.cpp |
| Thinking models | Not yet supported | Full support |
| Model coverage | Qwen3 primarily | Broader ecosystem |
Configuration Profiles
A configuration profile is a named set of inference settings specific to a hardware/model/backend combination. Profiles include:
- Thread counts (CPU parallelism)
- Batch sizes (context length and processing batches)
- KV cache type (f16, q8_0, q4_0 — memory vs. speed tradeoff)
- Precision mode (low = int8, normal = fp32)
- Backend choice (MNN CPU, MNN OpenCL, GGUF, Remote)
- Flash attention (on/off)
- GPU layer allocation (for MNN OpenCL)
Profile sources: Profiles are tagged by how they were created. Manual benchmarks create "benchmark" source profiles (highest priority). AutoForge creates "auto-tune" profiles. Device auto-detection creates "auto_profile" (lowest priority). Profiles from other devices are tagged "imported". This hierarchy ensures manually-tuned configs are never overwritten by automatic sweeps.
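The overwrite hierarchy can be sketched as a priority lookup. The source names come from the paragraph above; the numeric ranks, and the exact placement of "imported" between "auto-tune" and "auto_profile", are assumptions for illustration:

```python
# Higher number = higher priority; "imported" rank is an assumption
PRIORITY = {"benchmark": 3, "auto-tune": 2, "imported": 1, "auto_profile": 0}

def should_replace(existing_source: str, new_source: str) -> bool:
    """A new profile replaces an existing one for the same
    device/model/backend only if its source ranks strictly higher."""
    return PRIORITY[new_source] > PRIORITY[existing_source]

print(should_replace("benchmark", "auto-tune"))     # False: manual tuning wins
print(should_replace("auto_profile", "auto-tune"))  # True
```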
Applying a profile: When you tap Apply on a profile, TokForge loads those exact settings, switches backends if needed, unloads the current model, and reloads it with the new configuration. You return to chat automatically with the profile active.
Cross-Device Comparison
Benchmark results can be exported as JSON and imported on other devices. The matrix view organizes all results by SoC × Model × Backend for easy comparison:
- See which device performs best with each model
- Compare MNN vs GGUF performance side-by-side
- Identify which models are well-optimized on your hardware
- Flatten results across your entire device fleet
Deduplication ensures imported results don't create duplicates if the exact same configuration was already benchmarked locally.
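Grouping and deduplication together can be sketched by keying each result on its SoC × Model × Backend cell plus a stable hash of its full configuration. Field names here are assumptions about the result schema:

```python
import json

def merge_results(local, imported):
    """Organize results into a SoC x Model x Backend matrix, skipping
    entries whose exact configuration was already seen."""
    matrix, seen = {}, set()
    for res in local + imported:
        cell = (res["soc"], res["model"], res["backend"])
        # Canonical JSON of the full config identifies exact duplicates
        key = cell + (json.dumps(res["config"], sort_keys=True),)
        if key in seen:
            continue
        seen.add(key)
        matrix.setdefault(cell, []).append(res)
    return matrix

a = {"soc": "SD8E", "model": "Qwen3-4B", "backend": "MNN",
     "config": {"threads": 6}, "tok_s": 21.0}
b = dict(a, tok_s=20.5)  # same config measured elsewhere -> deduplicated
m = merge_results([a], [b])
print(len(m[("SD8E", "Qwen3-4B", "MNN")]))  # 1
```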
Benchmark Cards & Sharing
Each benchmark result creates a card showing your key metrics: headline tok/s, prefill latency, total runtime, model name, and device SoC.
Sharing options:
- Text share: Copy to clipboard or send via WhatsApp, Telegram, Signal, Email, Discord, Slack
- JSON export: Export the benchmark result and full configuration as JSON for archival or sharing with team members
- PNG export: Screenshot the benchmark card for social media or presentations
Imported benchmarks are merged into your local database and appear in your benchmark history. You can compare your device's results directly against results from colleagues running the same models.
Key Findings
- MNN consistently outperforms GGUF on mobile SoCs, often by 2-8x — inference is typically memory-bandwidth bound on ARM, and MNN's mobile-optimized kernels reduce overhead.
- Speculative decoding can be slower on mobile due to draft model overhead and bandwidth contention between main and draft models.
- Thread count matters: Performance cores only (not efficiency cores) should be used for inference threads. AutoForge detects the optimal split.
- Flash attention provides measurable speedups on devices with sufficient memory bandwidth (16GB+ RAM recommended).
- Short and Medium tiers capture most optimization gains for typical use cases. Long tier is valuable for scientific validation or fleet-wide baseline establishment.
Reproducibility
Every benchmark result is stored with its complete configuration profile. This means any result can be reproduced on identical hardware by loading the same profile. Stored with each result:
- Device SoC, CPU cores, GPU, RAM, and Android version
- Model name, quantization, parameter count
- Backend type (MNN / GGUF / Remote) and backend-specific settings
- Thread counts, batch size, context length, KV cache type
- Temperature, top-p, top-k sampling parameters
- Battery percent, thermal status, and available memory at test time
Results that show unusual performance (e.g., heavy throttling) are tagged with thermal and battery data so you can identify when a device was under stress.
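A sketch of how a result might be flagged as stress-affected using the stored battery and thermal fields. The thermal status values and thresholds here are illustrative assumptions:

```python
def stress_tags(battery_percent: int, thermal_status: str):
    """Tag results captured under conditions likely to skew performance."""
    tags = []
    if thermal_status in ("severe", "critical"):
        tags.append("thermal-throttled")
    if battery_percent < 20:
        tags.append("low-battery")  # power-saving governors may kick in
    return tags

print(stress_tags(battery_percent=15, thermal_status="severe"))
# ['thermal-throttled', 'low-battery']
```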
API Access
All ForgeLab functionality is available programmatically via the MetricsService API. See the API documentation for complete details. Key endpoints:
- POST /control/auto-tune — Start an AutoForge sweep with a specified tier
- GET /control/auto-tune/status — Check progress of a running optimization
- POST /benchmark/run — Run a single benchmark
- GET /benchmark/results — Query stored benchmark results with filtering
- GET /benchmark/matrix — Get the SoC × Model × Backend matrix
- POST /benchmark/auto-matrix — Run a comprehensive fleet benchmark
- GET /benchmark/optimal-config — Get the best profile for a device/model/backend combo
- GET /benchmark/export — Export all results as JSON
- POST /benchmark/import — Import results from another device
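A minimal sketch of calling the auto-tune endpoint from Python. The host, port, and JSON field names are assumptions; consult the API documentation for the real request schema:

```python
import json
import urllib.request

BASE = "http://localhost:8080"  # assumed MetricsService address

def auto_tune_request(model: str, backend: str, tier: str,
                      base: str = BASE) -> urllib.request.Request:
    """Build the POST /control/auto-tune request; send it with
    urllib.request.urlopen(req) when the service is reachable."""
    body = json.dumps({"model": model, "backend": backend,
                       "tier": tier}).encode("utf-8")
    return urllib.request.Request(
        f"{base}/control/auto-tune",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = auto_tune_request("Qwen3-4B", "MNN", "medium")
print(req.get_method(), req.full_url)
# POST http://localhost:8080/control/auto-tune
```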