Low-Latency LLM Inference on CPU: A Systems Engineering Study
Most conversations about running LLMs assume you have a GPU. But what if you don’t? What if your deployment target is a CPU-only laptop, an edge device, or a cost-constrained server?
This is the question behind LL_LLM — a systematic, profiling-driven study of LLM inference optimization in pure C++. No blind tricks. Every optimization is backed by measurement data.
You can find the full source and documentation on GitHub: ch09/Low_Latency_Inference_optimization_LLM-LL_LLM-.
What is an LLM? A Large Language Model is a neural network with billions of parameters, trained on massive text datasets to predict the next word in a sequence. Models like LLaMA, GPT, and Mistral learn statistical patterns of language during training. At runtime, they generate text one token at a time by repeatedly predicting “given everything so far, what comes next?” The “large” part matters: a 7-billion-parameter model stores 7 billion floating-point numbers as weights — that is ~14 GB in FP16, and it all needs to fit in memory just to run.
What is GGUF? GGUF (GPT-Generated Unified Format) is a binary file format designed for fast, efficient loading of quantized LLM weights. Created by the llama.cpp project, it stores model architecture metadata, tokenizer data, and tensor weights in a single file. Unlike older formats (GGML, PyTorch .bin), GGUF supports memory-mapping (mmap), meaning the OS can load model weights directly from disk into RAM without copying — critical for fast cold starts on CPU. A file name like tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf tells you: model name, parameter count (1.1B), variant (chat), and quantization level (Q4_K_M = 4-bit).
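To see why mmap matters for cold starts, here is a minimal POSIX sketch of mapping a weights file read-only. On Windows (the build target later in this post) the equivalent calls are CreateFileMapping and MapViewOfFile. The `MappedFile` struct and `map_weights` function are illustrative names, not part of llama.cpp:

```cpp
// Sketch: map a weights file into memory read-only via POSIX mmap.
// The OS pages data in lazily on first access instead of copying the
// whole file up front. Error handling is kept minimal for clarity.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <cstdio>

struct MappedFile {
    const std::byte* data = nullptr;
    size_t size = 0;
};

MappedFile map_weights(const char* path) {
    MappedFile mf;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return mf;
    struct stat st{};
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        void* p = mmap(nullptr, static_cast<size_t>(st.st_size),
                       PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            mf.data = static_cast<const std::byte*>(p);
            mf.size = static_cast<size_t>(st.st_size);
        }
    }
    close(fd);  // the mapping stays valid after the fd is closed
    return mf;
}
```

Tensor data in a GGUF file can then be read directly out of `mf.data` at the offsets recorded in the file's metadata, with no intermediate buffer.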
The Pipeline at a Glance
What is LLM Inference? Inference is the process of using a trained model to generate output — in this case, text. Unlike training (which adjusts billions of parameters over weeks on GPU clusters), inference is what happens every time you send a prompt to ChatGPT, Copilot, or any local model. It is the hot path: the runtime cost you pay per query. And on a CPU, every millisecond counts.
At its core, LLM inference is a pipeline: take a user prompt, tokenize it, load a model, run the inference engine, sample tokens, and produce text.
flowchart LR
subgraph Input
A[User Prompt]
end
subgraph Engine["Our Optimized Engine"]
B[Tokenizer]
C[Model Loader]
D[Inference Engine]
E[Token Sampler]
end
subgraph Output
F[Generated Text]
end
A --> B --> C --> D --> E --> F
style Engine fill:#1a1a2e,stroke:#00d4ff,stroke-width:2px,color:#fff
style A fill:#0f3460,stroke:#00d4ff,color:#fff
style F fill:#0f3460,stroke:#00d4ff,color:#fff
Simple enough. The complexity — and the performance opportunities — lie inside each of these boxes.
How LLM Inference Actually Works
Before optimizing anything, you must understand what happens when a model generates text. There are two distinct phases:
sequenceDiagram
participant User
participant Tokenizer
participant Prefill as Prefill Phase
participant Decode as Decode Phase
participant Output
User->>Tokenizer: "What is the capital of France?"
Tokenizer->>Prefill: [Token IDs: 1734, 338, 278, ...]
Note over Prefill: PREFILL PHASE<br/>Process ALL input tokens<br/>in ONE forward pass<br/>(compute-bound)
Prefill->>Decode: KV Cache populated
loop For each output token
Note over Decode: DECODE PHASE<br/>Generate ONE token<br/>per forward pass<br/>(memory-bound)
Decode->>Decode: Token → Attention → FFN → Logits → Sample
end
Decode->>Output: "The capital of France is Paris."
The prefill phase processes all input tokens at once — it’s compute-bound and benefits from parallelism. The decode phase generates one token at a time — it’s memory-bandwidth-bound and is where most of the wall-clock time goes.
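To make the decode loop concrete, here is a toy autoregressive loop in C++. Every component is a stand-in (the "model" is a fixed bigram table), purely to show the shape of tokenize, build context, then generate one token per iteration — not real inference code:

```cpp
// Toy sketch of autoregressive decoding: tokenize -> forward -> sample,
// one token per iteration, appending each new token to the context.
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Token = int;

std::vector<Token> tokenize(const std::string& text) {
    // Stand-in tokenizer: one token per character.
    return std::vector<Token>(text.begin(), text.end());
}

Token forward_and_sample(const std::vector<Token>& ctx,
                         const std::map<Token, Token>& bigram) {
    // Stand-in for running all transformer layers and sampling from logits.
    auto it = bigram.find(ctx.back());
    return it != bigram.end() ? it->second : Token{0};  // 0 = end-of-sequence
}

std::string generate(const std::string& prompt,
                     const std::map<Token, Token>& bigram, int max_new) {
    std::vector<Token> ctx = tokenize(prompt);   // "prefill": whole prompt at once
    std::string out;
    for (int i = 0; i < max_new; ++i) {          // "decode": one token per pass
        Token t = forward_and_sample(ctx, bigram);
        if (t == 0) break;
        ctx.push_back(t);
        out.push_back(static_cast<char>(t));     // stream the token out
    }
    return out;
}
```

In a real engine the `forward_and_sample` call is where all the cost lives, and the growing `ctx` is what the KV cache exists to avoid reprocessing.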
Inside a Transformer Layer
Each layer of the model performs these operations:
flowchart TB
subgraph Layer["Transformer Layer (repeated N times)"]
direction TB
IN[Input Hidden State] --> NORM1[RMSNorm]
NORM1 --> ATTN
subgraph ATTN["Multi-Head Self-Attention"]
direction LR
QKV["Q, K, V\nProjections\n(MatMul)"] --> ROPE["RoPE\nPositional\nEncoding"]
ROPE --> DOT["Q × Kᵀ\n(Dot Product)"]
DOT --> SOFT["Softmax"]
SOFT --> VMUL["× V\n(MatMul)"]
VMUL --> PROJ["Output\nProjection\n(MatMul)"]
end
ATTN --> ADD1["+ Residual"]
IN --> ADD1
ADD1 --> NORM2[RMSNorm]
NORM2 --> FFN
subgraph FFN["Feed-Forward Network"]
direction LR
GATE["Gate Proj\n(MatMul)"] --> SILU["SiLU\nActivation"]
UP["Up Proj\n(MatMul)"] --> MUL["Element-wise\nMultiply"]
SILU --> MUL
MUL --> DOWN["Down Proj\n(MatMul)"]
end
FFN --> ADD2["+ Residual"]
ADD1 --> ADD2
end
ADD2 --> OUT[Output Hidden State]
style Layer fill:#1a1a2e,stroke:#00d4ff,stroke-width:2px,color:#fff
style ATTN fill:#16213e,stroke:#e94560,stroke-width:2px,color:#fff
style FFN fill:#16213e,stroke:#533483,stroke-width:2px,color:#fff
The key insight: the Q/K/V projections, the attention output projection, and the three FFN projections (every box labeled MatMul) are all matrix multiplications. This single operation accounts for roughly 80-90% of compute time. That is why optimizing matrix multiply is so impactful.
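As a baseline for everything that follows, here is the naive row-major GEMM (an illustrative sketch, not the project's actual kernel). This triple loop is what profiling will show dominating runtime; SIMD, tiling, and quantization all target it:

```cpp
// Naive row-major GEMM baseline: C[M x N] = A[M x K] * B[K x N].
// Each output element is a dot product of a row of A with a column of B.
#include <cassert>

void matmul_naive(const float* A, const float* B, float* C,
                  int M, int N, int K) {
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += A[m * K + k] * B[k * N + n];  // dot(row m, col n)
            C[m * N + n] = acc;
        }
}
```

Note the access pattern problem already visible here: `B[k * N + n]` strides through memory column-wise, which is exactly the cache behavior that tiling later fixes.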
Where Time Is Spent
| Operation | % of Time | Bottleneck Type | Optimization Target |
|---|---|---|---|
| Matrix Multiplications | ~80-90% | Compute + Memory BW | SIMD, tiling, quantization |
| Attention (QK^T, softmax) | ~5-10% | Memory bandwidth | KV cache optimization |
| Token Sampling | ~1-2% | Latency | Efficient top-k/top-p |
| Tokenization | <1% | CPU | Usually negligible |
| Memory Allocation | ~2-5% | System | Memory pooling |
Profiling Strategy: Measure, Don’t Guess
We use the Visual Studio Performance Profiler to identify where time is actually spent. No guessing — data drives every decision.
flowchart TB
PROFILE["Run Profiler"]
CPU_PROF["CPU Sampling<br/>Where is time spent?"]
MEM_PROF["Memory Usage<br/>Where are allocations?"]
THREAD_PROF["Concurrency<br/>Are threads idle?"]
PROFILE --> CPU_PROF
PROFILE --> MEM_PROF
PROFILE --> THREAD_PROF
CPU_PROF --> HOTSPOT["Hotspot Analysis"]
MEM_PROF --> ALLOC["Allocation Analysis"]
THREAD_PROF --> CONTENTION["Lock Contention<br/>Analysis"]
HOTSPOT --> Q1{"MatMul > 80%\nof time?"}
ALLOC --> Q2{"KV Cache\ndominates?"}
CONTENTION --> Q3{"Threads\nstarving?"}
Q1 -->|Yes| OPT1["→ SIMD Optimization"]
Q1 -->|No| OPT1B["→ Investigate other hotspots"]
Q2 -->|Yes| OPT2["→ KV Cache Optimization"]
Q2 -->|No| OPT2B["→ Memory is fine"]
Q3 -->|Yes| OPT3["→ Threading Optimization"]
Q3 -->|No| OPT3B["→ Threading is fine"]
style PROFILE fill:#533483,stroke:#fff,color:#fff
style Q1 fill:#e94560,stroke:#fff,color:#fff
style Q2 fill:#e94560,stroke:#fff,color:#fff
style Q3 fill:#e94560,stroke:#fff,color:#fff
The expected CPU time distribution for a 7B parameter model looks like this:
pie title Expected CPU Time Distribution (7B Model)
"Matrix Multiply (GEMM)" : 85
"Attention Computation" : 5
"Memory Operations" : 5
"Token Sampling" : 2
"Tokenization" : 1
"Other" : 2
This confirms that matrix multiply is the single biggest optimization lever.
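Alongside the profiler, a scoped wall-clock timer is a cheap way to sanity-check these percentages around individual suspected hotspots. A minimal sketch (the `ScopedTimer` name and usage are illustrative, not part of the project):

```cpp
// Minimal RAII timer: prints the wall-clock duration of the enclosing
// scope on destruction, useful for corroborating profiler numbers.
#include <cassert>
#include <chrono>
#include <cstdio>

struct ScopedTimer {
    const char* label;
    std::chrono::steady_clock::time_point start;
    explicit ScopedTimer(const char* l)
        : label(l), start(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
            std::chrono::steady_clock::now() - start).count();
        std::printf("%s: %lld us\n", label, static_cast<long long>(us));
    }
};

// Usage sketch: { ScopedTimer t("matmul"); run_matmul(); }
```

Wrapping the matmul call, the attention block, and the sampler separately should reproduce roughly the pie-chart split above; if it doesn't, the profiler run is worth revisiting.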
Quantization: Trading Precision for Speed
Quantization reduces the precision of model weights from 16-bit floats to fewer bits. Less precision means less memory and faster compute, but potentially less accuracy.
flowchart LR
subgraph FP16["FP16 (16 bits)"]
direction TB
FP16_BITS["████████████████<br/>16 bits per weight<br/>~14 GB for 7B"]
end
subgraph Q8["Q8 (8 bits)"]
direction TB
Q8_BITS["████████<br/>8 bits per weight<br/>~7 GB for 7B"]
end
subgraph Q4["Q4 (4 bits)"]
direction TB
Q4_BITS["████<br/>4 bits per weight<br/>~4 GB for 7B"]
end
subgraph Q2["Q2 (2 bits)"]
direction TB
Q2_BITS["██<br/>2 bits per weight<br/>~2.5 GB for 7B"]
end
FP16 -->|"Lose some precision"| Q8
Q8 -->|"Lose more precision"| Q4
Q4 -->|"Aggressive"| Q2
style FP16 fill:#0f3460,stroke:#00d4ff,color:#fff
style Q8 fill:#16213e,stroke:#00d4ff,color:#fff
style Q4 fill:#533483,stroke:#00d4ff,color:#fff
style Q2 fill:#e94560,stroke:#fff,color:#fff
Comparison Table
| Format | Bits | Model Size | Accuracy Score |
|---|---|---|---|
| FP16 | 16 | ~14 GB | 100% (baseline) |
| Q8_0 | 8 | ~7 GB | ~99% |
| Q4_K_M | 4 | ~4 GB | ~95-97% |
| Q2_K | 2 | ~2.5 GB | ~85-90% |
The sweet spot for most deployments is Q4_K_M: a ~3.5x reduction in model size with only a ~3-5% accuracy loss.
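The core mechanics of block quantization can be sketched in a few lines. This toy absmax quantizer is in the spirit of Q8_0 (one FP32 scale per block of 32 values plus 32 int8 values); the real GGUF K-quants such as Q4_K_M use considerably more elaborate schemes with sub-block scales:

```cpp
// Toy absmax block quantization: each block of 32 floats becomes one
// FP32 scale plus 32 int8 values, so 32*4=128 bytes shrink to 36 bytes.
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr int kBlock = 32;

struct Q8Block {
    float scale;       // absmax / 127: maps int8 range back to float range
    int8_t q[kBlock];
};

Q8Block quantize_block(const float* x) {
    float amax = 0.0f;
    for (int i = 0; i < kBlock; ++i) amax = std::max(amax, std::fabs(x[i]));
    Q8Block b;
    b.scale = amax / 127.0f;
    float inv = amax > 0.0f ? 127.0f / amax : 0.0f;
    for (int i = 0; i < kBlock; ++i)
        b.q[i] = static_cast<int8_t>(std::lround(x[i] * inv));
    return b;
}

void dequantize_block(const Q8Block& b, float* out) {
    for (int i = 0; i < kBlock; ++i) out[i] = b.q[i] * b.scale;
}
```

The quantization error per value is bounded by half the scale, which is why accuracy degrades gracefully at 8 bits and more sharply as the int range shrinks to 4 and 2 bits.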
Advanced Optimizations
Based on profiling results, we pick 2-3 of the following optimizations. This is where the real systems engineering happens.
KV Cache Optimization
The KV cache stores the Key and Value matrices from previous tokens so they don’t need to be recomputed. It grows linearly with sequence length and becomes a memory bottleneck.
flowchart TB
subgraph Problem["Problem: KV Cache Growth"]
direction LR
T1["Token 1<br/>K₁, V₁"] --> T2["Token 2<br/>K₁₋₂, V₁₋₂"] --> T3["Token 3<br/>K₁₋₃, V₁₋₃"] --> TN["Token N<br/>K₁₋ₙ, V₁₋ₙ"]
MEM["Memory grows<br/>O(n × d × layers)"]
end
subgraph Solutions["Solutions"]
direction TB
POOL["Memory Pooling<br/>Pre-allocate fixed buffer<br/>Avoid malloc/free per token"]
COMPRESS["Cache Compression<br/>Store K,V in lower precision<br/>(FP16 → INT8)"]
EVICT["Sliding Window<br/>Keep only last N tokens<br/>Evict oldest entries"]
end
Problem --> Solutions
style Problem fill:#2d142c,stroke:#e94560,color:#fff
style Solutions fill:#0f3460,stroke:#00d4ff,color:#fff
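The pooling and sliding-window ideas above combine naturally in a pre-allocated ring buffer. Here is a sketch for a single layer and head (an illustrative class, not llama.cpp's actual cache):

```cpp
// Sketch: fixed-capacity sliding-window KV cache. All memory is allocated
// once up front, so the decode loop never calls malloc/free; once the
// window is full, the oldest token's K/V entries are overwritten in place.
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

class KVRing {
public:
    KVRing(size_t max_tokens, size_t head_dim)
        : dim_(head_dim), cap_(max_tokens),
          k_(max_tokens * head_dim),   // one-time allocation (memory pooling)
          v_(max_tokens * head_dim) {}

    void append(const float* k, const float* v) {
        size_t slot = (count_ % cap_) * dim_;  // ring index: evicts oldest
        std::copy(k, k + dim_, k_.begin() + slot);
        std::copy(v, v + dim_, v_.begin() + slot);
        ++count_;
    }

    // Number of tokens currently visible to attention (capped at the window).
    size_t size() const { return count_ < cap_ ? count_ : cap_; }

private:
    size_t dim_, cap_, count_ = 0;
    std::vector<float> k_, v_;
};
```

The third idea from the diagram, cache compression, would change `std::vector<float>` to an int8 or FP16 buffer plus per-block scales, along the same lines as the quantization sketch earlier.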
Threading Optimization
Default threading in llama.cpp often leaves cores idle or waiting on locks. By pinning threads to cores and using work-stealing, we can fully saturate the available compute.
flowchart TB
subgraph Current["Current: Default Threading"]
direction LR
C0["Core 0<br/>Working"] --- C1["Core 1<br/>Working"]
C2["Core 2<br/>Idle"] --- C3["Core 3<br/>Idle"]
C4["Core 4<br/>Waiting<br/>on lock"] --- C5["Core 5<br/>Idle"]
end
subgraph Optimized["Optimized: Pinned Thread Pool"]
direction LR
O0["Core 0<br/>MatMul Block 0"] --- O1["Core 1<br/>MatMul Block 1"]
O2["Core 2<br/>MatMul Block 2"] --- O3["Core 3<br/>MatMul Block 3"]
O4["Core 4<br/>MatMul Block 4"] --- O5["Core 5<br/>MatMul Block 5"]
end
Current -->|"Thread pinning +<br/>work stealing"| Optimized
style Current fill:#2d142c,stroke:#e94560,color:#fff
style Optimized fill:#0f3460,stroke:#00d4ff,color:#fff
Performance typically plateaus once memory bandwidth is saturated: beyond that point, adding more threads does not yield more speed. Finding that inflection point is the goal.
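A minimal version of the row-partitioning idea, using plain std::thread (actual core pinning is OS-specific, via SetThreadAffinityMask on Windows or pthread_setaffinity_np on Linux, and is omitted here for portability):

```cpp
// Sketch: split a GEMM's output rows statically across a fixed set of
// threads. Each worker computes a contiguous block of rows of C, so the
// workers write to disjoint memory and need no locks.
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

void matmul_rows(const float* A, const float* B, float* C,
                 int N, int K, int row_begin, int row_end) {
    for (int m = row_begin; m < row_end; ++m)
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) acc += A[m * K + k] * B[k * N + n];
            C[m * N + n] = acc;
        }
}

void matmul_parallel(const float* A, const float* B, float* C,
                     int M, int N, int K, int num_threads) {
    std::vector<std::thread> pool;
    int chunk = (M + num_threads - 1) / num_threads;  // static row partition
    for (int t = 0; t < num_threads; ++t) {
        int lo = t * chunk, hi = std::min(M, lo + chunk);
        if (lo >= hi) break;
        pool.emplace_back(matmul_rows, A, B, C, N, K, lo, hi);
    }
    for (auto& th : pool) th.join();
}
```

Static partitioning like this is where work stealing earns its keep: if one core is slower (thermal throttling, OS noise), a stealing scheduler rebalances, while this static split waits on the straggler at the join.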
Speculative Decoding
This is the most research-grade optimization. Use a small, fast “draft” model to predict multiple tokens, then verify them in bulk with the large model.
sequenceDiagram
participant Draft as Draft Model (1B, fast)
participant Target as Target Model (7B, slow)
participant Output as Accepted Tokens
Note over Draft,Target: Standard decoding: 1 token per 7B forward pass
rect rgb(15, 52, 96)
Note over Draft: Speculative: Draft generates K candidates
Draft->>Draft: Token 1 → "The" (5ms)
Draft->>Draft: Token 2 → "capital" (5ms)
Draft->>Draft: Token 3 → "of" (5ms)
Draft->>Draft: Token 4 → "France" (5ms)
Draft->>Draft: Token 5 → "is" (5ms)
Note over Draft: Total: 25ms for 5 tokens
end
rect rgb(83, 52, 131)
Note over Target: Verify ALL 5 in ONE forward pass
Draft->>Target: ["The", "capital", "of", "France", "is"]
Target->>Target: Batch verify (50ms)
Target->>Output: "The" accepted
Target->>Output: "capital" accepted
Target->>Output: "of" accepted
Target->>Output: "France" accepted
Target->>Output: "is" rejected -> resample
end
Note over Output: Result: 4 tokens in 75ms<br/>vs 4 x 50ms = 200ms normally<br/>2.7x speedup!
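The accept/reject logic can be sketched with toy models as callables. This shows the greedy-agreement variant, where we keep the longest prefix on which draft and target agree; real speculative sampling uses a probabilistic acceptance rule over both models' distributions:

```cpp
// Sketch of one speculative decoding step (greedy-agreement variant):
// the draft model proposes k tokens cheaply, then the target model scores
// them; we keep tokens until the first disagreement, at which point the
// target's own token is kept and the rest of the draft is discarded.
#include <cassert>
#include <functional>
#include <vector>

using Token = int;
// A "model" here is a stand-in: context -> greedy next token.
using Model = std::function<Token(const std::vector<Token>&)>;

std::vector<Token> speculative_step(std::vector<Token> ctx,
                                    const Model& draft, const Model& target,
                                    int k) {
    std::vector<Token> candidates;
    for (int i = 0; i < k; ++i) {        // k cheap draft forward passes
        candidates.push_back(draft(ctx));
        ctx.push_back(candidates.back());
    }
    ctx.resize(ctx.size() - k);          // rewind; now verify with the target
    std::vector<Token> accepted;
    for (Token c : candidates) {         // conceptually ONE batched target pass
        Token t = target(ctx);
        ctx.push_back(t);
        accepted.push_back(t);           // target's token is always correct
        if (t != c) break;               // first disagreement: stop accepting
    }
    return accepted;
}
```

The payoff depends entirely on the acceptance rate: if the draft model rarely agrees with the target, you pay for the draft passes and still advance one token at a time.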
SIMD Matrix Multiplication
SIMD (Single Instruction, Multiple Data) processes multiple values per CPU instruction. With AVX2, a single instruction can multiply eight pairs of single-precision floats at once.
flowchart TB
subgraph Scalar["Scalar: 1 operation at a time"]
direction LR
S1["a₁ × b₁"] --> S2["a₂ × b₂"] --> S3["a₃ × b₃"] --> S4["a₄ × b₄"]
S5["a₅ × b₅"] --> S6["a₆ × b₆"] --> S7["a₇ × b₇"] --> S8["a₈ × b₈"]
ST["8 cycles"]
end
subgraph AVX2["AVX2: 8 operations simultaneously"]
direction LR
V1["a₁×b₁ a₂×b₂ a₃×b₃ a₄×b₄ a₅×b₅ a₆×b₆ a₇×b₇ a₈×b₈"]
VT["1 cycle"]
end
Scalar -->|"8× theoretical<br/>speedup"| AVX2
style Scalar fill:#2d142c,stroke:#e94560,color:#fff
style AVX2 fill:#0f3460,stroke:#00d4ff,color:#fff
Combined with cache-aware tiling (breaking large matrices into 64x64 blocks that fit in L1 cache), this can yield a 3-5x real-world speedup on top of the SIMD gains.
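Here is a hedged sketch of the inner kernel: an AVX2 dot product with a scalar fallback so it compiles on any target. Eight floats are processed per vector instruction; in a full GEMM this loop would sit inside the 64x64 tiles described above:

```cpp
// Sketch: vectorized dot product. The AVX2 path multiplies and accumulates
// 8 floats per step in a 256-bit register, then reduces the 8 lanes at the
// end; a scalar path keeps the code portable when AVX2 is unavailable.
#include <cassert>
#include <cstddef>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

float dot(const float* a, const float* b, size_t n) {
#if defined(__AVX2__)
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)  // 8 lanes per iteration
        acc = _mm256_add_ps(acc, _mm256_mul_ps(_mm256_loadu_ps(a + i),
                                               _mm256_loadu_ps(b + i)));
    float buf[8];
    _mm256_storeu_ps(buf, acc);  // horizontal reduction of the 8 lanes
    float sum = buf[0] + buf[1] + buf[2] + buf[3]
              + buf[4] + buf[5] + buf[6] + buf[7];
    for (; i < n; ++i) sum += a[i] * b[i];  // scalar tail for n % 8 leftovers
    return sum;
#else
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
#endif
}
```

On hardware with FMA, the multiply-add pair would typically be fused into `_mm256_fmadd_ps`, halving the instruction count of the inner loop; it is kept as separate mul/add here so the sketch only assumes AVX.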
System Architecture
The project wraps all of this into a clean, layered architecture:
flowchart TB
subgraph CLI["Command-Line Interface"]
ARGS[CLI Arguments Parser]
STREAM[Streaming Output]
end
subgraph CORE["Core Engine"]
LOADER[Model Loader<br/>GGUF Format]
QUANT[Quantization Manager<br/>FP16 / Q8 / Q4 / Q2]
INFER[Inference Engine<br/>llama.cpp backend]
KV[KV Cache Manager]
THREAD[Thread Pool]
end
subgraph BENCH["Benchmarking Layer"]
METRICS[Metrics Collector<br/>Latency, Throughput, RAM]
PROFILER[Profiler Integration<br/>VS Profiler hooks]
SUITE[Benchmark Suite<br/>Short/Long/Reasoning]
end
subgraph RESULTS["Results & Analysis"]
JSON[JSON Output]
CSV[CSV Tables]
GRAPHS[Performance Graphs]
end
CLI --> CORE
CORE --> BENCH
BENCH --> RESULTS
LOADER --> QUANT
QUANT --> INFER
INFER --> KV
INFER --> THREAD
style CLI fill:#0f3460,stroke:#00d4ff,color:#fff
style CORE fill:#1a1a2e,stroke:#e94560,stroke-width:2px,color:#fff
style BENCH fill:#16213e,stroke:#533483,color:#fff
style RESULTS fill:#0f3460,stroke:#00d4ff,color:#fff
Building and Running
Prerequisites
- CMake 3.20+
- Visual Studio 2022 (or MSVC Build Tools)
- Git
Build
cmake -B build -G "Visual Studio 17 2022" -A x64
cmake --build build --config Release
Run
# Download a GGUF model from Hugging Face first
.\build\Release\ll_llm.exe `
  --model models\tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf `
  --prompt "What is the capital of France?" `
  --threads 4
(PowerShell syntax; in cmd.exe, use ^ instead of the backtick for line continuation.)
What’s Next
This project is structured in 6 phases. Phase 1 establishes the baseline with llama.cpp and a 7B model. Each subsequent phase layers on profiling insights, quantization experiments, and targeted optimizations — all driven by the data collected in the previous phase.
The goal is not just “make it faster” but to understand why each optimization works (or doesn’t) at the hardware level. That’s what separates engineering from cargo-culting.
This is an active research project. Follow along on GitHub as results come in.