ChebbiOS

Low-Latency LLM Inference on CPU: A Systems Engineering Study

#ai #llm #cpp #performance #systems

Most conversations about running LLMs assume you have a GPU. But what if you don’t? What if your deployment target is a CPU-only laptop, an edge device, or a cost-constrained server?

This is the question behind LL_LLM — a systematic, profiling-driven study of LLM inference optimization in pure C++. No blind tricks. Every optimization is backed by measurement data.

You can find the full source and documentation on GitHub: ch09/Low_Latency_Inference_optimization_LLM-LL_LLM-.

What is an LLM? A Large Language Model is a neural network with billions of parameters, trained on massive text datasets to predict the next word in a sequence. Models like LLaMA, GPT, and Mistral learn statistical patterns of language during training. At runtime, they generate text one token at a time by repeatedly predicting “given everything so far, what comes next?” The “large” part matters: a 7-billion-parameter model stores 7 billion floating-point numbers as weights — that is ~14 GB in FP16, and it all needs to fit in memory just to run.

What is GGUF? GGUF (GPT-Generated Unified Format) is a binary file format designed for fast, efficient loading of quantized LLM weights. Created by the llama.cpp project, it stores model architecture metadata, tokenizer data, and tensor weights in a single file. Unlike older formats (GGML, PyTorch .bin), GGUF supports memory-mapping (mmap), meaning the OS can load model weights directly from disk into RAM without copying — critical for fast cold starts on CPU. A file like tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf tells you: model name, parameter count (1.1B), variant (chat), and quantization level (Q4_K_M = 4-bit).

The Pipeline at a Glance

What is LLM Inference? Inference is the process of using a trained model to generate output — in this case, text. Unlike training (which adjusts billions of parameters over weeks on GPU clusters), inference is what happens every time you send a prompt to ChatGPT, Copilot, or any local model. It is the hot path. The runtime cost you pay per query. And on a CPU, every millisecond counts.

At its core, LLM inference is a pipeline: take a user prompt, tokenize it, load a model, run the inference engine, sample tokens, and produce text.

flowchart LR
    subgraph Input
        A[User Prompt]
    end
    subgraph Engine["Our Optimized Engine"]
        B[Tokenizer]
        C[Model Loader]
        D[Inference Engine]
        E[Token Sampler]
    end
    subgraph Output
        F[Generated Text]
    end

    A --> B --> C --> D --> E --> F

    style Engine fill:#1a1a2e,stroke:#00d4ff,stroke-width:2px,color:#fff
    style A fill:#0f3460,stroke:#00d4ff,color:#fff
    style F fill:#0f3460,stroke:#00d4ff,color:#fff

Simple enough. The complexity — and the performance opportunities — lie inside each of these boxes.

How LLM Inference Actually Works

Before optimizing anything, you must understand what happens when a model generates text. There are two distinct phases:

sequenceDiagram
    participant User
    participant Tokenizer
    participant Prefill as Prefill Phase
    participant Decode as Decode Phase
    participant Output

    User->>Tokenizer: "What is the capital of France?"
    Tokenizer->>Prefill: [Token IDs: 1734, 338, 278, ...]

    Note over Prefill: PREFILL PHASE<br/>Process ALL input tokens<br/>in ONE forward pass<br/>(compute-bound)

    Prefill->>Decode: KV Cache populated

    loop For each output token
        Note over Decode: DECODE PHASE<br/>Generate ONE token<br/>per forward pass<br/>(memory-bound)
        Decode->>Decode: Token → Attention → FFN → Logits → Sample
    end

    Decode->>Output: "The capital of France is Paris."

The prefill phase processes all input tokens at once — it’s compute-bound and benefits from parallelism. The decode phase generates one token at a time — it’s memory-bandwidth-bound and is where most of the wall-clock time goes.

Inside a Transformer Layer

Each layer of the model performs these operations:

flowchart TB
    subgraph Layer["Transformer Layer (repeated N times)"]
        direction TB
        IN[Input Hidden State] --> NORM1[RMSNorm]
        NORM1 --> ATTN

        subgraph ATTN["Multi-Head Self-Attention"]
            direction LR
            QKV["Q, K, V\nProjections\n(MatMul)"] --> ROPE["RoPE\nPositional\nEncoding"]
            ROPE --> DOT["Q × Kᵀ\n(Dot Product)"]
            DOT --> SOFT["Softmax"]
            SOFT --> VMUL["× V\n(MatMul)"]
            VMUL --> PROJ["Output\nProjection\n(MatMul)"]
        end

        ATTN --> ADD1["+  Residual"]
        IN --> ADD1

        ADD1 --> NORM2[RMSNorm]
        NORM2 --> FFN

        subgraph FFN["Feed-Forward Network"]
            direction LR
            GATE["Gate Proj\n(MatMul)"] --> SILU["SiLU\nActivation"]
            UP["Up Proj\n(MatMul)"] --> MUL["Element-wise\nMultiply"]
            SILU --> MUL
            MUL --> DOWN["Down Proj\n(MatMul)"]
        end

        FFN --> ADD2["+  Residual"]
        ADD1 --> ADD2
    end

    ADD2 --> OUT[Output Hidden State]

    style Layer fill:#1a1a2e,stroke:#00d4ff,stroke-width:2px,color:#fff
    style ATTN fill:#16213e,stroke:#e94560,stroke-width:2px,color:#fff
    style FFN fill:#16213e,stroke:#533483,stroke-width:2px,color:#fff

The key insight: nearly every expensive box in this diagram is a matrix multiplication (MatMul). This single operation accounts for roughly 80-90% of compute time, which is why optimizing matrix multiply is so impactful.

Where Time Is Spent

| Operation | % of Time | Bottleneck Type | Optimization Target |
|---|---|---|---|
| Matrix Multiplications | ~80-90% | Compute + Memory BW | SIMD, tiling, quantization |
| Attention (QK^T, softmax) | ~5-10% | Memory bandwidth | KV cache optimization |
| Token Sampling | ~1-2% | Latency | Efficient top-k/top-p |
| Tokenization | <1% | CPU | Usually negligible |
| Memory Allocation | ~2-5% | System | Memory pooling |

Profiling Strategy: Measure, Don’t Guess

We use the Visual Studio Performance Profiler to identify where time is actually spent. No guessing — data drives every decision.

flowchart TB
    PROFILE["Run Profiler"]
    CPU_PROF["CPU Sampling<br/>Where is time spent?"]
    MEM_PROF["Memory Usage<br/>Where are allocations?"]
    THREAD_PROF["Concurrency<br/>Are threads idle?"]

    PROFILE --> CPU_PROF
    PROFILE --> MEM_PROF
    PROFILE --> THREAD_PROF

    CPU_PROF --> HOTSPOT["Hotspot Analysis"]
    MEM_PROF --> ALLOC["Allocation Analysis"]
    THREAD_PROF --> CONTENTION["Lock Contention<br/>Analysis"]

    HOTSPOT --> Q1{"MatMul > 80%\nof time?"}
    ALLOC --> Q2{"KV Cache\ndominates?"}
    CONTENTION --> Q3{"Threads\nstarving?"}

    Q1 -->|Yes| OPT1["→ SIMD Optimization"]
    Q1 -->|No| OPT1B["→ Investigate other hotspots"]
    Q2 -->|Yes| OPT2["→ KV Cache Optimization"]
    Q2 -->|No| OPT2B["→ Memory is fine"]
    Q3 -->|Yes| OPT3["→ Threading Optimization"]
    Q3 -->|No| OPT3B["→ Threading is fine"]

    style PROFILE fill:#533483,stroke:#fff,color:#fff
    style Q1 fill:#e94560,stroke:#fff,color:#fff
    style Q2 fill:#e94560,stroke:#fff,color:#fff
    style Q3 fill:#e94560,stroke:#fff,color:#fff

The expected CPU time distribution for a 7B parameter model looks like this:

pie title Expected CPU Time Distribution (7B Model)
    "Matrix Multiply (GEMM)" : 85
    "Attention Computation" : 5
    "Memory Operations" : 5
    "Token Sampling" : 2
    "Tokenization" : 1
    "Other" : 2

This confirms that matrix multiply is the single biggest optimization lever.

Quantization: Trading Precision for Speed

Quantization reduces the precision of model weights from 16-bit floats to fewer bits. Less precision means less memory and faster compute, but potentially less accuracy.

flowchart LR
    subgraph FP16["FP16 (16 bits)"]
        direction TB
        FP16_BITS["████████████████<br/>16 bits per weight<br/>~14 GB for 7B"]
    end

    subgraph Q8["Q8 (8 bits)"]
        direction TB
        Q8_BITS["████████<br/>8 bits per weight<br/>~7 GB for 7B"]
    end

    subgraph Q4["Q4 (4 bits)"]
        direction TB
        Q4_BITS["████<br/>4 bits per weight<br/>~4 GB for 7B"]
    end

    subgraph Q2["Q2 (2 bits)"]
        direction TB
        Q2_BITS["██<br/>2 bits per weight<br/>~2.5 GB for 7B"]
    end

    FP16 -->|"Lose some precision"| Q8
    Q8 -->|"Lose more precision"| Q4
    Q4 -->|"Aggressive"| Q2

    style FP16 fill:#0f3460,stroke:#00d4ff,color:#fff
    style Q8 fill:#16213e,stroke:#00d4ff,color:#fff
    style Q4 fill:#533483,stroke:#00d4ff,color:#fff
    style Q2 fill:#e94560,stroke:#fff,color:#fff

Comparison Table

| Format | Bits | Model Size | Accuracy Score |
|---|---|---|---|
| FP16 | 16 | ~14 GB | 100% (baseline) |
| Q8_0 | 8 | ~7 GB | ~99% |
| Q4_K_M | 4 | ~4 GB | ~95-97% |
| Q2_K | 2 | ~2.5 GB | ~85-90% |

The sweet spot for most deployments is Q4_K_M: a ~3.5x reduction in model size with only a ~3-5% accuracy loss.

Advanced Optimizations

Based on profiling results, we pick 2-3 of the following optimizations. This is where the real systems engineering happens.

KV Cache Optimization

The KV cache stores the Key and Value matrices from previous tokens so they don’t need to be recomputed. It grows linearly with sequence length and becomes a memory bottleneck.

flowchart TB
    subgraph Problem["Problem: KV Cache Growth"]
        direction LR
        T1["Token 1<br/>K₁, V₁"] --> T2["Token 2<br/>K₁₋₂, V₁₋₂"] --> T3["Token 3<br/>K₁₋₃, V₁₋₃"] --> TN["Token N<br/>K₁₋ₙ, V₁₋ₙ"]
        MEM["Memory grows<br/>O(n × d × layers)"]
    end

    subgraph Solutions["Solutions"]
        direction TB
        POOL["Memory Pooling<br/>Pre-allocate fixed buffer<br/>Avoid malloc/free per token"]
        COMPRESS["Cache Compression<br/>Store K,V in lower precision<br/>(FP16 → INT8)"]
        EVICT["Sliding Window<br/>Keep only last N tokens<br/>Evict oldest entries"]
    end

    Problem --> Solutions

    style Problem fill:#2d142c,stroke:#e94560,color:#fff
    style Solutions fill:#0f3460,stroke:#00d4ff,color:#fff

Threading Optimization

Default threading in llama.cpp often leaves cores idle or waiting on locks. By pinning threads to cores and using work-stealing, we can fully saturate the available compute.

flowchart TB
    subgraph Current["Current: Default Threading"]
        direction LR
        C0["Core 0<br/>Working"] --- C1["Core 1<br/>Working"]
        C2["Core 2<br/>Idle"] --- C3["Core 3<br/>Idle"]
        C4["Core 4<br/>Waiting<br/>on lock"] --- C5["Core 5<br/>Idle"]
    end

    subgraph Optimized["Optimized: Pinned Thread Pool"]
        direction LR
        O0["Core 0<br/>MatMul Block 0"] --- O1["Core 1<br/>MatMul Block 1"]
        O2["Core 2<br/>MatMul Block 2"] --- O3["Core 3<br/>MatMul Block 3"]
        O4["Core 4<br/>MatMul Block 4"] --- O5["Core 5<br/>MatMul Block 5"]
    end

    Current -->|"Thread pinning +<br/>work stealing"| Optimized

    style Current fill:#2d142c,stroke:#e94560,color:#fff
    style Optimized fill:#0f3460,stroke:#00d4ff,color:#fff

Performance typically plateaus once memory bandwidth is saturated: past that point, more threads do not mean more speed. Finding that inflection point is the goal.

Speculative Decoding

This is the most research-grade optimization. Use a small, fast “draft” model to predict multiple tokens, then verify them in bulk with the large model.

sequenceDiagram
    participant Draft as Draft Model (1B, fast)
    participant Target as Target Model (7B, slow)
    participant Output as Accepted Tokens

    Note over Draft,Target: Standard decoding: 1 token per 7B forward pass

    rect rgb(15, 52, 96)
        Note over Draft: Speculative: Draft generates K candidates
        Draft->>Draft: Token 1 → "The" (5ms)
        Draft->>Draft: Token 2 → "capital" (5ms)
        Draft->>Draft: Token 3 → "of" (5ms)
        Draft->>Draft: Token 4 → "France" (5ms)
        Draft->>Draft: Token 5 → "is" (5ms)
        Note over Draft: Total: 25ms for 5 tokens
    end

    rect rgb(83, 52, 131)
        Note over Target: Verify ALL 5 in ONE forward pass
        Draft->>Target: ["The", "capital", "of", "France", "is"]
        Target->>Target: Batch verify (50ms)
        Target->>Output: "The" accepted
        Target->>Output: "capital" accepted
        Target->>Output: "of" accepted
        Target->>Output: "France" accepted
        Target->>Output: "is" rejected -> resample
    end

    Note over Output: Result: 4 tokens in 75ms<br/>vs 4 x 50ms = 200ms normally<br/>2.7x speedup!

SIMD Matrix Multiplication

SIMD (Single Instruction, Multiple Data) processes multiple values per CPU instruction. With AVX2, a single 256-bit instruction operates on 8 single-precision floats at once — and FMA fuses the multiply and add into one instruction.

flowchart TB
    subgraph Scalar["Scalar: 1 operation at a time"]
        direction LR
        S1["a₁ × b₁"] --> S2["a₂ × b₂"] --> S3["a₃ × b₃"] --> S4["a₄ × b₄"]
        S5["a₅ × b₅"] --> S6["a₆ × b₆"] --> S7["a₇ × b₇"] --> S8["a₈ × b₈"]
        ST["8 cycles"]
    end

    subgraph AVX2["AVX2: 8 operations simultaneously"]
        direction LR
        V1["a₁×b₁  a₂×b₂  a₃×b₃  a₄×b₄  a₅×b₅  a₆×b₆  a₇×b₇  a₈×b₈"]
        VT["1 cycle"]
    end

    Scalar -->|"8× theoretical<br/>speedup"| AVX2

    style Scalar fill:#2d142c,stroke:#e94560,color:#fff
    style AVX2 fill:#0f3460,stroke:#00d4ff,color:#fff

Combined with cache-aware tiling (breaking large matrices into 64x64 blocks whose working set fits in L1/L2 cache), this can yield a 3-5x real-world speedup on top of the SIMD gains.

System Architecture

The project wraps all of this into a clean, layered architecture:

flowchart TB
    subgraph CLI["Command-Line Interface"]
        ARGS[CLI Arguments Parser]
        STREAM[Streaming Output]
    end

    subgraph CORE["Core Engine"]
        LOADER[Model Loader<br/>GGUF Format]
        QUANT[Quantization Manager<br/>FP16 / Q8 / Q4 / Q2]
        INFER[Inference Engine<br/>llama.cpp backend]
        KV[KV Cache Manager]
        THREAD[Thread Pool]
    end

    subgraph BENCH["Benchmarking Layer"]
        METRICS[Metrics Collector<br/>Latency, Throughput, RAM]
        PROFILER[Profiler Integration<br/>VS Profiler hooks]
        SUITE[Benchmark Suite<br/>Short/Long/Reasoning]
    end

    subgraph RESULTS["Results & Analysis"]
        JSON[JSON Output]
        CSV[CSV Tables]
        GRAPHS[Performance Graphs]
    end

    CLI --> CORE
    CORE --> BENCH
    BENCH --> RESULTS
    LOADER --> QUANT
    QUANT --> INFER
    INFER --> KV
    INFER --> THREAD

    style CLI fill:#0f3460,stroke:#00d4ff,color:#fff
    style CORE fill:#1a1a2e,stroke:#e94560,stroke-width:2px,color:#fff
    style BENCH fill:#16213e,stroke:#533483,color:#fff
    style RESULTS fill:#0f3460,stroke:#00d4ff,color:#fff

Building and Running

Prerequisites

  • CMake 3.20+
  • Visual Studio 2022 (or MSVC Build Tools)
  • Git

Build

cmake -B build -G "Visual Studio 17 2022" -A x64
cmake --build build --config Release

Run

# Download a GGUF model from Hugging Face first (PowerShell; backticks continue lines)
.\build\Release\ll_llm.exe `
    --model models\tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf `
    --prompt "What is the capital of France?" `
    --threads 4

What’s Next

This project is structured in 6 phases. Phase 1 establishes the baseline with llama.cpp and a 7B model. Each subsequent phase layers on profiling insights, quantization experiments, and targeted optimizations — all driven by the data collected in the previous phase.

The goal is not just “make it faster” but to understand why each optimization works (or doesn’t) at the hardware level. That’s what separates engineering from cargo-culting.


This is an active research project. Follow along on GitHub as results come in.