LLM Inference: Establishing a Baseline
In the previous post, I outlined the plan to build a high-performance LLM inference engine from scratch in C++. Today, I’m sharing the results of Phase 1: The Baseline.
The goal of this phase was not to optimize yet, but to build a stable, measurable system using llama.cpp as the backend and establish hard data points for where we stand.
The Setup
- Hardware: Intel Core Ultra 9 275HX (24 Cores, 64GB RAM)
- Model: TinyLlama 1.1B Chat (Q4_K_M)
- Engine: `LL_LLM` (custom C++ harness)
- Backend: `llama.cpp` (low-level C API)
We ran three distinct benchmark scenarios to capture different performance characteristics:
- Short Prompt (“Hello”): Measures pure system overhead and dispatch latency.
- Long Prompt (~500 tokens): Measures prefill performance (compute-bound).
- Reasoning Prompt: Measures sustained performance and stability over a mixed workload.
Key Metrics
Here is what the raw data looks like:
| Metric | Short Prompt | Long Prompt | Reasoning |
|---|---|---|---|
| First Token Latency | 112.16 ms | 2,983.71 ms | 896.90 ms |
| Generation Speed | 40.91 tok/s | 35.31 tok/s | 38.39 tok/s |
| Peak RAM | 1,118 MB | 1,168 MB | 1,146 MB |
| Model Load Time | 722 ms | 436 ms | 574 ms |
1. The Prefill Wall
The most glaring number here is the 2.98-second first-token latency on the long prompt. This is the prefill phase, where the model processes every input token before generating a single new one.
Processing ~500 tokens takes ~3 seconds. This confirms our hypothesis: prefill is compute-bound. It is dominated by large matrix multiplications, with the attention matrix growing as $O(N^2)$ in the prompt length.
2. Throughput is Healthy
Once the first token is out, the engine settles into a comfortable 35-40 tokens/second. This is faster than human reading speed. It suggests that even without advanced optimizations, the llama.cpp backend is quite efficient at sequential decoding on this hardware.
3. RAM Usage
The 1.1B model takes up just over 1.1 GB of RAM. This is extremely lightweight. When we move to the 7B Mistral model in Phase 2, we expect this to jump to ~4.5 GB, which will put more pressure on memory bandwidth.
Visualizing the Latency
```mermaid
xychart-beta
    title "First Token Latency by Prompt Type (Lower is Better)"
    x-axis ["Short (Hello)", "Reasoning (Logic)", "Long (Article)"]
    y-axis "Time (ms)" 0 --> 3500
    bar [112, 897, 2984]
```
What’s Next?
Now that we have a baseline, we can start profiling. We are not just going to guess what makes it slow; we are going to measure it.
Phase 2 Goals:
- Profile with Visual Studio Performance Profiler: See exactly which functions burn those 3 seconds during prefill.
- Analyze Memory Bandwidth: Is the decode phase limited by compute or RAM speed?
- Optimize Threading: We used 4 threads for this test. Is that optimal? What happens at 8, 12, or 24 threads?
Stay tuned for the profiling deep dive.