
ForgeLab & Methodology

Benchmarking system, optimization tiers, configuration management, and reproducible inference performance measurement.

Overview

ForgeLab is TokForge's integrated benchmarking and optimization system. It helps you systematically measure and optimize model performance on your device.

Every benchmark result is stored with its complete configuration, making all performance measurements reproducible and comparable across devices.

ForgeLab UI

The Forge screen is a three-tab interface for benchmarking and optimization:

Tab 1: Config (Optimization Control)

Purpose: Configure and run AutoForge optimization sweeps.

What you can do:

Typical workflow: Select your loaded model → Pick "Medium" tier → Watch as TokForge tests different thread counts, KV cache types, and batch sizes → Get the optimal config in 15-45 minutes.
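The sweep loop behind that workflow can be sketched as follows. The function names, config fields, and candidate values here are illustrative assumptions, not TokForge's actual API:

```python
from itertools import product

# Hypothetical AutoForge-style sweep: try every combination of thread
# count, KV cache type, and batch size, and keep the fastest config.
def sweep(thread_counts, kv_types, batch_sizes, benchmark):
    best_cfg, best_tok_s = None, float("-inf")
    for threads, kv, batch in product(thread_counts, kv_types, batch_sizes):
        cfg = {"threads": threads, "kv_cache": kv, "batch_size": batch}
        tok_s = benchmark(cfg)  # measured tokens/sec for this config
        if tok_s > best_tok_s:
            best_cfg, best_tok_s = cfg, tok_s
    return best_cfg, best_tok_s
```

Deeper tiers simply enlarge the candidate lists, which is why the sweep time grows multiplicatively.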

Tab 2: Report (Results & History)

Purpose: View benchmark results and historical performance data.

What you can see:

The benchmark card shows your inference speed at a glance. Tap it to see full details like prefill time, total latency, and the exact configuration used. Share results via WhatsApp, Telegram, or email with a single button tap.

Tab 3: Profiles (Configuration Management)

Purpose: Manage saved configurations and apply them to inference.

What you can do:

Profiles are named by device SoC, model, and backend. When you apply a profile, TokForge loads those exact settings and returns you to chat with the new configuration active. No manual config editing needed.

Optimization Tiers

AutoForge offers four time-based optimization tiers. Each tier automatically sweeps different parameter combinations to find the best configuration for your hardware and model:

Instant Tier (~2-5 minutes)

Use case: Quick baseline check or validate a config change works.

Short Tier (~5-15 minutes)

Use case: Find obvious wins in threading and KV cache configuration.

Medium Tier (~15-45 minutes)

Use case: Production optimization that covers most parameter interactions.

Long Tier (45 min - 2+ hours)

Use case: Exhaustive testing for scientific validation or publication-quality results.
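Choosing a tier is essentially a time-budget decision. The sketch below encodes the advertised ranges; the selection rule (and the 120-minute stand-in for "2+ hours") is my assumption, not documented behavior:

```python
# Tier time budgets in minutes, mirroring the ranges above.
# "long" is capped at 120 here only as an illustrative stand-in for "2+ hours".
TIERS = {"instant": (2, 5), "short": (5, 15), "medium": (15, 45), "long": (45, 120)}

def largest_tier_within(budget_minutes):
    """Pick the deepest tier whose upper bound fits the time budget."""
    fitting = [(hi, name) for name, (lo, hi) in TIERS.items() if hi <= budget_minutes]
    return max(fitting)[1] if fitting else None
```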

How Benchmarks Run

Warmup & Measured Runs

Every benchmark follows a standardized structure to ensure reproducible results:

Metrics collected per run: Prefill latency (ms), decode latency (ms), tokens per second (tok/s), and token count.
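The warmup/measured split can be sketched like this: warmup runs prime caches and are discarded, and only measured runs contribute to the reported average. The run counts and the `run_once` hook are assumptions, not TokForge internals:

```python
# run_once returns (prefill_ms, decode_ms, tokens) for a single inference.
def run_benchmark(run_once, warmups=2, measured=3):
    for _ in range(warmups):
        run_once()  # warmup runs: executed but discarded
    runs = [run_once() for _ in range(measured)]
    # end-to-end tokens/sec per measured run, then averaged
    tok_s = [t / ((p + d) / 1000.0) for p, d, t in runs]
    return sum(tok_s) / len(tok_s)
```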

Single Benchmark

A manual benchmark run sends a prompt to your loaded model and measures inference speed. You can customize:

Auto-Matrix

The auto-matrix feature benchmarks all combinations of installed models and available backends in one async operation. This produces a complete device performance matrix — useful for understanding which model/backend combination is fastest on your hardware.
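Conceptually, the auto-matrix is a cross product of installed models and available backends, benchmarked pairwise. A minimal sketch, with hypothetical names:

```python
# Benchmark every model x backend pair and return the full matrix.
def auto_matrix(models, backends, bench):
    return {(m, b): bench(m, b) for m in models for b in backends}
```

The resulting dictionary can then be sorted by tokens/sec to find the fastest combination for the device.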

What We Measure

| Metric | Description | Unit |
| --- | --- | --- |
| tok/s | End-to-end tokens per second (prompt + decode) | tokens/sec |
| Decode tok/s | Decode-only throughput (excludes prefill) | tokens/sec |
| Prefill latency | Time to process the input prompt before generating | milliseconds |
| Decode latency | Total time spent generating output tokens | milliseconds |
| Token count | Number of tokens generated in the run | count |
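The difference between the two throughput metrics is whether prefill time is counted. A worked example from the definitions above:

```python
# End-to-end throughput includes prompt processing; decode tok/s does not.
def end_to_end_tok_s(prefill_ms, decode_ms, tokens):
    return tokens / ((prefill_ms + decode_ms) / 1000.0)

def decode_tok_s(decode_ms, tokens):
    return tokens / (decode_ms / 1000.0)
```

For a run with 250 ms prefill, 1000 ms decode, and 20 generated tokens, end-to-end throughput is 16 tok/s while decode-only throughput is 20 tok/s, so long prompts depress the headline number even when generation speed is unchanged.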

Hardware Profiling

Before benchmarking, TokForge auto-detects your device's hardware profile:

This hardware profile informs recommended starting configurations and sets upper bounds for thread counts and context sizes. It also determines which parameters are swept during optimization (e.g., older devices may have fewer core options to test).
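As a sketch of how a detected core count bounds the sweep, consider the following; the candidate ladder and clamping policy are assumptions, not TokForge's actual profiler:

```python
import os

# Clamp the thread-count candidates to the detected core count.
def thread_candidates(cores):
    return [t for t in (1, 2, 4, 6, 8) if t <= cores]

detected = os.cpu_count() or 1  # stand-in for TokForge's hardware detection
```

On a 6-core device this yields `[1, 2, 4, 6]`, so the sweep never wastes time testing thread counts the hardware cannot use.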

Backend Comparison

| Aspect | MNN | GGUF (llama.cpp) |
| --- | --- | --- |
| Speed | 2-8x faster on mobile | Baseline (reference) |
| Format | .mnn directory | Single .gguf file |
| Quantization | 4-bit (Q4_0, Q4_1) | Q4_K_M through Q8_0 |
| GPU | OpenCL (Adreno/Mali) | GPU layers via llama.cpp |
| Thinking models | Not yet supported | Full support |
| Model coverage | Qwen3 primarily | Broader ecosystem |

Configuration Profiles

A configuration profile is a named set of inference settings specific to a hardware/model/backend combination. Profiles include:

Profile sources: Profiles are tagged by how they were created. Manual benchmarks create "benchmark" source profiles (highest priority). AutoForge creates "auto-tune" profiles. Device auto-detection creates "auto_profile" (lowest priority). Profiles from other devices are tagged "imported". This hierarchy ensures manually-tuned configs are never overwritten by automatic sweeps.
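That hierarchy can be sketched as a simple priority lookup. The relative rank of "imported" is my assumption; the text only fixes benchmark above auto-tune above auto_profile:

```python
# Higher number = higher priority; "imported" placement is an assumption.
PRIORITY = {"auto_profile": 0, "imported": 1, "auto-tune": 2, "benchmark": 3}

def should_overwrite(existing, incoming):
    """An incoming profile replaces an existing one only at equal or higher priority."""
    return PRIORITY[incoming] >= PRIORITY[existing]
```

Under this rule an AutoForge sweep can never clobber a manually benchmarked profile, matching the guarantee stated above.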

Applying a profile: When you tap Apply on a profile, TokForge loads those exact settings, switches backends if needed, unloads the current model, and reloads it with the new configuration. You return to chat automatically with the profile active.

Cross-Device Comparison

Benchmark results can be exported as JSON and imported on other devices. The matrix view organizes all results by SoC × Model × Backend for easy comparison.

Deduplication ensures imported results don't create duplicates if the exact same configuration was already benchmarked locally.
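A plausible shape for that deduplication: treat a result as a duplicate when SoC, model, backend, and configuration all match, and let local results win on collision. The key fields and merge rule here are assumptions:

```python
# Keys are (soc, model, backend, config_hash) tuples; values are tok/s.
def merge_results(local, imported):
    merged = dict(local)
    for key, tok_s in imported.items():
        merged.setdefault(key, tok_s)  # keep the local result on collision
    return merged
```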

Benchmark Cards & Sharing

Each benchmark result creates a card showing your key metrics: headline tok/s, prefill latency, total runtime, model name, and device SoC.

Sharing options:

Imported benchmarks are merged into your local database and appear in your benchmark history. You can compare your device's results directly against results from colleagues running the same models.

Key Findings

Reproducibility

Every benchmark result is stored with its complete configuration profile. This means any result can be reproduced on identical hardware by loading the same profile. Stored with each result:

Results that show unusual performance (e.g., heavy throttling) are tagged with thermal and battery data so you can identify when a device was under stress.

API Access

All ForgeLab functionality is available programmatically via the MetricsService API. See the API documentation for complete details. Key endpoints: