ForgeLab & Methodology
Overview
ForgeLab is TokForge's integrated benchmarking and optimization system. It helps you systematically measure and optimize model performance on your device:
- Benchmark — Run standardized speed tests on any model configuration
- AutoForge — Automatically sweep configuration parameters across four optimization tiers (Instant, Short, Medium, Long)
- Profiles — Save best-performing configs and apply them instantly
- Matrix — Compare results across devices, models, and backends
- Sharing — Export benchmark cards and JSON profiles for fleet analysis
Every benchmark result is stored with its complete configuration, making all performance measurements reproducible and comparable across devices.
ForgeLab UI
The Forge screen is a three-tab interface for benchmarking and optimization:
Tab 1: Config (Optimization Control)
Purpose: Configure and run AutoForge optimization sweeps.
What you can do:
- Select a model to optimize
- Choose a backend (MNN, GGUF, or Remote API)
- Pick an optimization tier (Instant, Short, Medium, or Long)
- Start the AutoForge sweep with a single tap
- Watch real-time progress showing which parameters are being tested
- See the best result found so far and estimated completion time
Typical workflow: Select your loaded model → Pick "Medium" tier → Watch as TokForge tests different thread counts, KV cache types, and batch sizes → Get the optimal config in 15-45 minutes
Tab 2: Report (Results & History)
Purpose: View benchmark results and historical performance data.
What you can see:
- Benchmark Card — Latest result with headline tok/s, prefill latency, and model/device info
- Benchmark History — Scrollable list of past runs with timestamps, backends, and performance metrics
- Manual Run — Run a single benchmark with custom prompt and token limits
- Export/Import — Share benchmarks as JSON or import results from other devices
The benchmark card shows your inference speed at a glance. Tap it to see full details like prefill time, total latency, and the exact configuration used. Share results via WhatsApp, Telegram, or email with a single button tap.
Tab 3: Profiles (Configuration Management)
Purpose: Manage saved configurations and apply them to inference.
What you can do:
- View all saved profiles with their best tok/s scores
- See where each profile came from (manual benchmark, auto-tune, or imported)
- Apply a profile instantly to switch your inference config
- View the full parameter list of any profile
- Export a profile as JSON for sharing or backup
- Import profiles from files or paste JSON directly
Profiles are named by device SoC, model, and backend. When you apply a profile, TokForge loads those exact settings and returns you to chat with the new configuration active. No manual config editing needed.
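The naming scheme described above can be sketched as a small helper. The underscore separator and field order here are illustrative assumptions, not TokForge's actual format:

```python
def profile_name(soc: str, model: str, backend: str) -> str:
    """Build a profile name from device SoC, model, and backend.
    The underscore-joined format is an assumption for illustration."""
    # Normalize spaces so names stay filesystem- and JSON-friendly
    return "_".join(p.strip().replace(" ", "-") for p in (soc, model, backend))

print(profile_name("Snapdragon 8 Elite", "Qwen3-4B", "MNN"))
# Snapdragon-8-Elite_Qwen3-4B_MNN
```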
Optimization Tiers
AutoForge offers four time-based optimization tiers. Each tier automatically sweeps different parameter combinations to find the best configuration for your hardware and model:
Instant Tier (~2-5 minutes)
Use case: Quick baseline check or validate a config change works.
- What it tests: Single config (your current settings as baseline)
- Parameters swept: None — just benchmarks your current state
- Output: Baseline tok/s number to compare future runs against
- Time: 2-5 minutes
Short Tier (~5-15 minutes)
Use case: Find obvious wins in threading and KV cache configuration.
- What it tests: 3-6 different configurations combining thread counts and memory options
- Parameters swept: Thread count (2-12), KV cache type (f16, q8_0)
- Output: Best configuration for your device, typically 5-10% faster than baseline
- Time: 5-15 minutes (depending on device speed)
Medium Tier (~15-45 minutes)
Use case: Production optimization that covers most parameter interactions.
- What it tests: 40-80 different configurations
- Parameters swept: Thread counts, batch sizes (256/512/1024), KV cache types, precision modes (for MNN), flash attention on/off
- Output: Comprehensive optimal config, typically a further 2-5% uplift beyond Short-tier results
- Time: 15-45 minutes
Long Tier (45 min - 2+ hours)
Use case: Exhaustive testing for scientific validation or publication-quality results.
- What it tests: 100-200+ configurations
- Parameters swept: All Medium parameters plus fine-grained threading ranges and GPU layer allocations
- Output: Complete performance landscape, optimal config, and detailed parameter sensitivity analysis
- Time: 45 minutes to 2+ hours, depending on device speed and model size
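A tier sweep is essentially a Cartesian product over its parameter axes. A minimal sketch of a Medium-style grid (the specific candidate values are illustrative assumptions; the real sweep derives them from the detected hardware profile):

```python
from itertools import product

# Illustrative candidate values, not TokForge's actual sweep ranges
threads = [4, 6, 8]
batch_sizes = [256, 512, 1024]
kv_cache = ["f16", "q8_0"]
precision = ["low", "normal"]          # MNN only
flash_attention = [False, True]

grid = [
    {"threads": t, "batch": b, "kv": k, "precision": p, "flash_attn": f}
    for t, b, k, p, f in product(threads, batch_sizes, kv_cache,
                                 precision, flash_attention)
]
print(len(grid))  # 3 * 3 * 2 * 2 * 2 = 72 configurations
```

With these axes the grid lands at 72 configurations, inside the 40-80 range quoted for the Medium tier; the Long tier simply widens the axes.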
How Benchmarks Run
Warmup & Measured Runs
Every benchmark follows a standardized structure to ensure reproducible results:
- Warmup run (1 run): Primes CPU caches, GPU memory, and JIT compilation. Results are discarded.
- Measured runs (3 runs): Actual performance measurements. The median value is selected to reduce variance from outliers.
Metrics collected per run: Prefill latency (ms), decode latency (ms), tokens per second (tok/s), and token count.
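The warmup-then-median structure above can be sketched as follows (function names are illustrative; `measure_once` stands in for a real inference pass returning tok/s):

```python
from statistics import median

def run_benchmark(measure_once, warmup=1, runs=3):
    """One discarded warmup pass, then `runs` measured passes;
    the median tok/s damps outlier variance."""
    for _ in range(warmup):
        measure_once()                      # primes caches/JIT; discarded
    return median(measure_once() for _ in range(runs))

# Example with a stub in place of real inference (1 warmup + 3 measured):
fake = iter([18.0, 21.5, 19.2, 20.1])
print(run_benchmark(lambda: next(fake)))   # 20.1
```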
Single Benchmark
A manual benchmark run sends a prompt to your loaded model and measures inference speed. You can customize:
- Prompt: Any text input (default: standardized benchmark prompt)
- Max tokens: Length of generated response (default: 128)
- Number of runs: How many measured runs to take; the median is reported (default: 3)
Auto-Matrix
The auto-matrix feature benchmarks all combinations of installed models and available backends in one async operation. This produces a complete device performance matrix — useful for understanding which model/backend combination is fastest on your hardware.
What We Measure
| Metric | Description | Unit |
|---|---|---|
| tok/s | End-to-end tokens per second (prompt + decode) | tokens/sec |
| Decode tok/s | Decode-only throughput (excludes prefill) | tokens/sec |
| Prefill latency | Time to process the input prompt before generating | milliseconds |
| Decode latency | Total time spent generating output tokens | milliseconds |
| Token count | Number of tokens generated in the run | count |
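The two throughput metrics follow directly from the latencies and token count in the table above. A sketch of the arithmetic:

```python
def throughput(tokens: int, prefill_ms: float, decode_ms: float):
    """Compute end-to-end and decode-only tok/s from one run's raw numbers."""
    end_to_end = tokens / ((prefill_ms + decode_ms) / 1000.0)
    decode_only = tokens / (decode_ms / 1000.0)
    return end_to_end, decode_only

e2e, dec = throughput(tokens=128, prefill_ms=400.0, decode_ms=6000.0)
print(round(e2e, 1), round(dec, 1))  # 20.0 21.3
```

Decode-only tok/s is always at least as high as end-to-end tok/s, since it excludes prefill time from the denominator.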
Hardware Profiling
Before benchmarking, TokForge auto-detects your device's hardware profile:
- SoC model — Snapdragon 8 Elite, MediaTek Dimensity, etc.
- CPU topology — Performance cores vs efficiency cores, maximum frequency
- GPU — Adreno, Mali, or other GPU with compute capabilities
- RAM — Total available memory for model loading
This hardware profile informs recommended starting configurations and sets upper bounds for thread counts and context sizes. It also determines which parameters are swept during optimization (e.g., older devices may have fewer core options to test).
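One way the hardware profile can bound the sweep is to cap candidate thread counts at the performance-core count. This is a simplified sketch, not necessarily the exact heuristic TokForge uses:

```python
def thread_candidates(perf_cores: int, eff_cores: int, cap: int = 12):
    """Candidate thread counts for the sweep. Efficiency cores are
    deliberately excluded (`eff_cores` is ignored here) because they
    tend to hurt inference throughput (see Key Findings)."""
    upper = min(perf_cores, cap)
    return list(range(2, upper + 1))

print(thread_candidates(perf_cores=6, eff_cores=2))  # [2, 3, 4, 5, 6]
```

An older 4-core device would therefore get only three candidates to test, which is part of why lower tiers finish faster on modest hardware.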
Backend Comparison
| Aspect | MNN | GGUF (llama.cpp) |
|---|---|---|
| Speed | 2-8x faster on mobile | Baseline (reference) |
| Format | .mnn directory | Single .gguf file |
| Quantization | 4-bit (Q4_0, Q4_1) | Q4_K_M through Q8_0 |
| GPU | OpenCL (Adreno/Mali) | GPU layers via llama.cpp |
| Thinking models | Not yet supported | Full support |
| Model coverage | Qwen3 primarily | Broader ecosystem |
Configuration Profiles
A configuration profile is a named set of inference settings specific to a hardware/model/backend combination. Profiles include:
- Thread counts (CPU parallelism)
- Batch sizes (context length and processing batches)
- KV cache type (f16, q8_0, q4_0 — memory vs. speed tradeoff)
- Precision mode (low = int8, normal = fp32)
- Backend choice (MNN CPU, MNN OpenCL, GGUF, Remote)
- Flash attention (on/off)
- GPU layer allocation (for MNN OpenCL)
Profile sources: Profiles are tagged by how they were created. Manual benchmarks create "benchmark" source profiles (highest priority). AutoForge creates "auto-tune" profiles. Device auto-detection creates "auto_profile" (lowest priority). Profiles from other devices are tagged "imported". This hierarchy ensures manually-tuned configs are never overwritten by automatic sweeps.
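The overwrite hierarchy can be sketched as a priority lookup. The source names come from the paragraph above; the numeric ranks, and the exact placement of "imported" between "auto-tune" and "auto_profile", are assumptions for illustration:

```python
# Higher number = higher priority; "imported" rank is an assumption
PRIORITY = {"benchmark": 3, "auto-tune": 2, "imported": 1, "auto_profile": 0}

def should_replace(existing_source: str, new_source: str) -> bool:
    """A new profile replaces an existing one for the same
    device/model/backend only if its source ranks strictly higher."""
    return PRIORITY[new_source] > PRIORITY[existing_source]

print(should_replace("benchmark", "auto-tune"))     # False: manual tuning wins
print(should_replace("auto_profile", "auto-tune"))  # True
```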
Applying a profile: When you tap Apply on a profile, TokForge loads those exact settings, switches backends if needed, unloads the current model, and reloads it with the new configuration. You return to chat automatically with the profile active.
Cross-Device Comparison
Benchmark results can be exported as JSON and imported on other devices. The matrix view organizes all results by SoC × Model × Backend for easy comparison:
- See which device performs best with each model
- Compare MNN vs GGUF performance side-by-side
- Identify which models are well-optimized on your hardware
- Flatten results across your entire device fleet
Deduplication ensures imported results don't create duplicates if the exact same configuration was already benchmarked locally.
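Grouping and deduplication together can be sketched by keying each result on its SoC × Model × Backend cell plus a stable hash of its full configuration. Field names here are assumptions about the result schema:

```python
import json

def merge_results(local, imported):
    """Organize results into a SoC x Model x Backend matrix, skipping
    entries whose exact configuration was already seen."""
    matrix, seen = {}, set()
    for res in local + imported:
        cell = (res["soc"], res["model"], res["backend"])
        # Canonical JSON of the full config identifies exact duplicates
        key = cell + (json.dumps(res["config"], sort_keys=True),)
        if key in seen:
            continue
        seen.add(key)
        matrix.setdefault(cell, []).append(res)
    return matrix

a = {"soc": "SD8E", "model": "Qwen3-4B", "backend": "MNN",
     "config": {"threads": 6}, "tok_s": 21.0}
b = dict(a, tok_s=20.5)  # same config measured elsewhere -> deduplicated
m = merge_results([a], [b])
print(len(m[("SD8E", "Qwen3-4B", "MNN")]))  # 1
```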
Benchmark Cards & Sharing
Each benchmark result creates a card showing your key metrics: headline tok/s, prefill latency, total runtime, model name, and device SoC.
Sharing options:
- Text share: Copy to clipboard or send via WhatsApp, Telegram, Signal, Email, Discord, Slack
- JSON export: Export the benchmark result and full configuration as JSON for archival or sharing with team members
- PNG export: Screenshot the benchmark card for social media or presentations
Imported benchmarks are merged into your local database and appear in your benchmark history. You can compare your device's results directly against results from colleagues running the same models.
Key Findings
- MNN consistently outperforms GGUF on mobile SoCs, often by 2-8x — inference is typically memory-bandwidth bound on ARM, and MNN's mobile-optimized kernels reduce overhead.
- Speculative decoding can be slower on mobile due to draft model overhead and bandwidth contention between main and draft models.
- Thread count matters: Performance cores only (not efficiency cores) should be used for inference threads. AutoForge detects the optimal split.
- Flash attention provides measurable speedups on devices with sufficient memory bandwidth (16GB+ RAM recommended).
- Short and Medium tiers capture most optimization gains for typical use cases. Long tier is valuable for scientific validation or fleet-wide baseline establishment.
Reproducibility
Every benchmark result is stored with its complete configuration profile. This means any result can be reproduced on identical hardware by loading the same profile. Stored with each result:
- Device SoC, CPU cores, GPU, RAM, and Android version
- Model name, quantization, parameter count
- Backend type (MNN / GGUF / Remote) and backend-specific settings
- Thread counts, batch size, context length, KV cache type
- Temperature, top-p, top-k sampling parameters
- Battery percent, thermal status, and available memory at test time
Results that show unusual performance (e.g., heavy throttling) are tagged with thermal and battery data so you can identify when a device was under stress.
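A sketch of how a result might be flagged as stress-affected using the stored battery and thermal fields. The thermal status values and thresholds here are illustrative assumptions:

```python
def stress_tags(battery_percent: int, thermal_status: str):
    """Tag results captured under conditions likely to skew performance."""
    tags = []
    if thermal_status in ("severe", "critical"):
        tags.append("thermal-throttled")
    if battery_percent < 20:
        tags.append("low-battery")  # power-saving governors may kick in
    return tags

print(stress_tags(battery_percent=15, thermal_status="severe"))
# ['thermal-throttled', 'low-battery']
```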
API Access
All ForgeLab functionality is available programmatically via the MetricsService API. See the API documentation for complete details. Key endpoints:
- POST /control/auto-tune — Start an AutoForge sweep with a specified tier
- GET /control/auto-tune/status — Check progress of a running optimization
- POST /benchmark/run — Run a single benchmark
- GET /benchmark/results — Query stored benchmark results with filtering
- GET /benchmark/matrix — Get the SoC × Model × Backend matrix
- POST /benchmark/auto-matrix — Run a comprehensive fleet benchmark
- GET /benchmark/optimal-config — Get the best profile for a device/model/backend combo
- GET /benchmark/export — Export all results as JSON
- POST /benchmark/import — Import results from another device
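A minimal sketch of calling the auto-tune endpoint from Python. The host, port, and JSON field names are assumptions; consult the API documentation for the real request schema:

```python
import json
import urllib.request

BASE = "http://localhost:8080"  # assumed MetricsService address

def auto_tune_request(model: str, backend: str, tier: str,
                      base: str = BASE) -> urllib.request.Request:
    """Build the POST /control/auto-tune request; send it with
    urllib.request.urlopen(req) when the service is reachable."""
    body = json.dumps({"model": model, "backend": backend,
                       "tier": tier}).encode("utf-8")
    return urllib.request.Request(
        f"{base}/control/auto-tune",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = auto_tune_request("Qwen3-4B", "MNN", "medium")
print(req.get_method(), req.full_url)
# POST http://localhost:8080/control/auto-tune
```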