ChebbiOS

LLM Inference: Establishing a Baseline

#ai #llm #cpp #performance #benchmark

In the previous post, I outlined the plan to build a high-performance LLM inference engine from scratch in C++. Today, I’m sharing the results of Phase 1: The Baseline.

The goal of this phase was not to optimize yet, but to build a stable, measurable system using llama.cpp as the backend and establish hard data points for where we stand.

The Setup

  • Hardware: Intel Core Ultra 9 275HX (24 Cores, 64GB RAM)
  • Model: TinyLlama 1.1B Chat (Q4_K_M)
  • Engine: LL_LLM (Custom C++ harness)
  • Backend: llama.cpp (Low-level C API)

We ran three distinct benchmark scenarios to capture different performance characteristics:

  1. Short Prompt (“Hello”): Measures pure system overhead and dispatch latency.
  2. Long Prompt (~500 tokens): Measures prefill performance (compute-bound).
  3. Reasoning Prompt: Measures sustained throughput and stability over a mixed workload.
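Each scenario is timed the same way: record when the prompt is submitted, when the first token arrives, and when the last token arrives, then derive the headline metrics from those deltas. A minimal sketch of that computation (hypothetical names, not the actual LL_LLM code):

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>

// Hypothetical metric computation, not the actual LL_LLM code.
// Three timestamps per run: prompt submitted, first token out, last token out.
struct GenMetrics {
    double first_token_ms;  // prefill + first decode step
    double gen_tok_per_s;   // steady-state decode throughput
};

GenMetrics compute_metrics(std::chrono::steady_clock::time_point prompt_sent,
                           std::chrono::steady_clock::time_point first_token,
                           std::chrono::steady_clock::time_point last_token,
                           std::size_t generated_tokens) {
    using ms  = std::chrono::duration<double, std::milli>;
    using sec = std::chrono::duration<double>;
    GenMetrics m{};
    m.first_token_ms = ms(first_token - prompt_sent).count();
    // Throughput is measured over the decode phase only, so the (much
    // slower) prefill does not skew the tokens/second figure.
    const double decode_s = sec(last_token - first_token).count();
    m.gen_tok_per_s = (generated_tokens > 1 && decode_s > 0.0)
                          ? (generated_tokens - 1) / decode_s
                          : 0.0;
    return m;
}
```

Measuring throughput from the first token onward (rather than from prompt submission) is what keeps the tok/s numbers comparable across the three scenarios despite their very different prefill costs.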

Key Metrics

Here is what the raw data looks like:

Metric              | Short Prompt | Long Prompt  | Reasoning
--------------------|--------------|--------------|------------
First Token Latency | 112.16 ms    | 2,983.71 ms  | 896.90 ms
Generation Speed    | 40.91 tok/s  | 35.31 tok/s  | 38.39 tok/s
Peak RAM            | 1,118 MB     | 1,168 MB     | 1,146 MB
Model Load Time     | 722 ms       | 436 ms       | 574 ms

1. The Prefill Wall

The most glaring number here is the 2.98-second first-token latency on the long prompt. This is the prefill phase: the model must process every input token before it can generate a single new one.

Processing ~500 tokens takes ~3 seconds, which confirms our hypothesis: prefill is compute-bound. It is dominated by large matrix multiplications, with the attention scores alone scaling as $O(N^2)$ in prompt length.
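To put a number on the quadratic term, here is a back-of-envelope sketch using assumed TinyLlama-like shapes (22 layers, 32 heads, head dim 64). It counts only the attention score/value matmuls; the MLP and projection matmuls, which actually dominate at short prompts, grow only linearly in N and are omitted:

```cpp
#include <cassert>

// Back-of-envelope only: multiply-adds for the O(N^2) attention matmuls.
// QK^T is N x N x head_dim per head per layer; weighting V doubles it.
// Linear-in-N costs (MLP, projections) are deliberately excluded.
long long attention_madds(long long n_tokens, long long head_dim,
                          long long n_heads, long long n_layers) {
    return 2 * n_tokens * n_tokens * head_dim * n_heads * n_layers;
}
```

Going from a handful of prompt tokens to 500 multiplies this term by the square of the length ratio, which is why the long-prompt latency blows up while the short prompt barely registers.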

2. Throughput is Healthy

Once the first token is out, the engine settles into a comfortable 35-40 tokens/second. This is faster than human reading speed. It suggests that even without advanced optimizations, the llama.cpp backend is quite efficient at sequential decoding on this hardware.
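As a quick sanity check on the “faster than human reading speed” claim, assume the common rule of thumb of ~0.75 English words per token (an assumption, not a measured value for this model):

```cpp
#include <cassert>

// Assumed rule of thumb: ~0.75 English words per LLM token.
double words_per_minute(double tok_per_s, double words_per_token = 0.75) {
    return tok_per_s * words_per_token * 60.0;
}
```

At 40 tok/s that works out to roughly 1,800 words per minute, several times a typical 200–300 wpm reading pace.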

3. RAM Usage

The 1.1B model takes up just over 1.1 GB of RAM. This is extremely lightweight. When we move to the 7B Mistral model in Phase 2, we expect this to jump to ~4.5 GB, which will put more pressure on memory bandwidth.
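The ~4.5 GB expectation is consistent with a quick estimate. A sketch, assuming Q4_K_M lands around 4.85 effective bits per weight (it mixes 4- and 6-bit blocks plus per-block scales); this counts weights only, while the peak-RAM numbers above also include KV cache and compute buffers:

```cpp
#include <cassert>

// Assumed ~4.85 effective bits per weight for Q4_K_M quantization.
// Weights only; KV cache and compute buffers come on top.
double weights_gib(double n_params_billion, double bits_per_weight = 4.85) {
    const double bytes = n_params_billion * 1e9 * bits_per_weight / 8.0;
    return bytes / (1024.0 * 1024.0 * 1024.0);
}
```

For the 1.1B model this lands near 0.6 GiB of weights, matching the ~1.1 GB peak once buffers are added; for a ~7.2B-parameter model it predicts roughly 4.1 GiB of weights alone.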

Visualizing the Latency

xychart-beta
    title "First Token Latency by Prompt Type (Lower is Better)"
    x-axis ["Short (Hello)", "Reasoning (Logic)", "Long (Article)"]
    y-axis "Time (ms)" 0 --> 3500
    bar [112, 897, 2984]

What’s Next?

Now that we have a baseline, we can start profiling. We are not just going to guess what makes it slow; we are going to measure it.

Phase 2 Goals:

  1. Profile with Visual Studio Performance Profiler: See exactly which functions burn those 3 seconds during prefill.
  2. Analyze Memory Bandwidth: Is the decode phase limited by compute or RAM speed?
  3. Optimize Threading: We used 4 threads for this test. Is that optimal? What happens at 8, 12, or 24 threads?
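A hypothetical shape for that thread sweep (the buffer reduction here is a stand-in workload; the real harness would re-run decode with llama.cpp’s thread count varied):

```cpp
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical sweep harness: time the same workload at a given thread
// count. The partial-sum reduction is only a stand-in for real decode work.
double time_ms(int n_threads, const std::vector<float>& data) {
    const auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    std::vector<double> partial(static_cast<std::size_t>(n_threads), 0.0);
    const std::size_t chunk = data.size() / n_threads;
    for (int t = 0; t < n_threads; ++t) {
        pool.emplace_back([&, t] {
            const std::size_t begin = static_cast<std::size_t>(t) * chunk;
            const std::size_t end =
                (t == n_threads - 1) ? data.size() : begin + chunk;
            for (std::size_t i = begin; i < end; ++i) partial[t] += data[i];
        });
    }
    for (auto& th : pool) th.join();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Running this for 1, 4, 8, 12, and 24 threads and plotting the timings is the same procedure Phase 2 will apply to the real decode loop, where memory bandwidth rather than core count may turn out to be the ceiling.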

Stay tuned for the profiling deep dive.