
llama.cpp on Edge,
the leanest LLM runtime

Self-host quantised LLMs on CPU-only or GPU Edge VMs. GGUF models from S3-compatible storage, OpenAI-compatible server, zero per-token bills. The smallest possible AI footprint.

# Build with CUDA
$ cmake -B build -DGGML_CUDA=ON
$ cmake --build build --config Release -j

# Serve a quantised model
$ ./build/bin/llama-server \
    -m /mnt/edge-storage/llama-3.1-8b-q4.gguf \
    --host 0.0.0.0 --port 8080 \
    -c 8192 -ngl 999

# OpenAI-compatible at /v1
$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{ "messages": [{"role":"user","content":"hi"}] }'
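Before pointing clients at the endpoint, you can poll llama-server's built-in `/health` route, which returns success once the model has finished loading. A minimal readiness loop, assuming the port 8080 setup above:

```shell
# Block until llama-server reports the model is loaded
until curl -sf http://localhost:8080/health > /dev/null; do
  sleep 2
done
echo "llama-server is ready"
```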

Why teams pick llama.cpp on Edge

The smallest AI footprint that still does the job — and the foundation Ollama is built on.

CPU-friendly inference

Run quantised models on regular Edge VMs without a GPU. Useful for tiny models, dev environments, and workloads where latency tolerance is high.
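A sketch of the CPU-only path: the default cmake build compiles no GPU backend at all, and `-t` pins inference threads to your core count (model filename here is illustrative):

```shell
# CPU-only build: no -DGGML_CUDA, no GPU backend compiled in
$ cmake -B build
$ cmake --build build --config Release -j

# Serve on CPU: -t matches core count, -ngl 0 keeps every layer on CPU
$ ./build/bin/llama-server \
    -m ./llama-3.2-3b-q4.gguf \
    -t "$(nproc)" -ngl 0 \
    --host 0.0.0.0 --port 8080
```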

GGUF: the standard for quantised models

GGUF is the de facto format for quantised open-weight models. Hugging Face, TheBloke, llama.cpp itself — all interoperate cleanly.

OpenAI-compatible server

`llama-server` exposes the OpenAI chat/completions API. Drop-in replacement for paid APIs at the SDK level.
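Because the API shape matches OpenAI's, existing SDKs need only a base-URL swap. Recent official OpenAI SDKs read these environment variables, so no code change is required (the key value is arbitrary, since llama-server doesn't check it):

```shell
# Point any OpenAI SDK at llama-server instead of api.openai.com
export OPENAI_BASE_URL="http://localhost:8080/v1"
# Some SDKs refuse to start without a key, even an unused one
export OPENAI_API_KEY="not-needed"
```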

Models cached in S3

Cache GGUF files in Edge Storage so any new VM can pull them quickly. No Hugging Face rate-limit drama on cold starts.

CUDA / Metal / ROCm builds

When you do have a GPU, llama.cpp uses it. CUDA on Nvidia, Vulkan/ROCm on AMD, Metal on Apple Silicon. Your binary, your acceleration.
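The acceleration backend is picked at build time with a single CMake flag. A sketch of the common variants; exact flag names vary slightly across llama.cpp versions, so check the build docs for yours:

```shell
$ cmake -B build -DGGML_CUDA=ON     # Nvidia
$ cmake -B build -DGGML_VULKAN=ON   # AMD / Intel via Vulkan
$ cmake -B build -DGGML_METAL=ON    # Apple Silicon (default on macOS)
$ cmake --build build --config Release -j
```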

Cheapest path to local LLMs

For low-volume workloads or dev, the smallest non-GPU Edge VM serves a 7B Q4 model at a flat monthly price, typically far below what a per-token API would cost for the same traffic.

Reference architecture

How llama.cpp maps to Edge

A CPU or GPU VM running llama-server, GGUF model files mounted from object storage. The least infrastructure to run a serious LLM.

Compute

CPU VM for small/dev workloads, or GPU VM for serious throughput

Storage

GGUF model cache in a bucket, mounted to the VM

CDN

Optional: caches non-streaming responses

DNS

Anycast DNS for `llm.example.com`
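One way to wire the compute and storage pieces together is a systemd unit that waits for the model mount before starting the server. A sketch with assumed paths (the mount name is the systemd-escaped form of `/mnt/edge-storage`):

```ini
[Unit]
Description=llama-server (OpenAI-compatible LLM endpoint)
After=network-online.target mnt-edge\x2dstorage.mount
Requires=mnt-edge\x2dstorage.mount

[Service]
ExecStart=/opt/llama.cpp/build/bin/llama-server \
  -m /mnt/edge-storage/llama-3.1-8b-q4.gguf \
  --host 0.0.0.0 --port 8080 -c 8192 -ngl 999
Restart=on-failure

[Install]
WantedBy=multi-user.target
```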

# Pull a quantised model from HF
$ huggingface-cli download \
    bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
    --include "*Q4_K_M*" --local-dir .

# Cache it in Edge Storage for next time
$ aws s3 cp ./Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    s3://llm-models/ \
    --endpoint-url https://storage.edge.run

Common questions

When should I pick llama.cpp over Ollama or vLLM?

Pick llama.cpp when you want full control of the binary, need CPU-only inference, or want the smallest possible footprint. Ollama wraps llama.cpp with conveniences; vLLM is for high-throughput GPU serving. All have their place.

CPU or GPU?

For 7B Q4_K_M and below, CPU on a beefy VM is workable for dev and low-volume use. For production throughput or anything 13B+, you want a GPU.

Which quantisation level?

Q4_K_M is the usual sweet spot: near-full-precision output quality at roughly a quarter of the FP16 file size. Q5_K_M for higher fidelity, Q3 for the smallest footprint. Benchmark on your own prompts before committing.
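If the exact quant you want isn't published, llama.cpp ships a `llama-quantize` tool that converts an FP16 GGUF down to any K-quant; filenames here are illustrative:

```shell
# Requantise an FP16 GGUF to two levels for side-by-side comparison
$ ./build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
$ ./build/bin/llama-quantize model-f16.gguf model-Q5_K_M.gguf Q5_K_M
```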

How do I serve to multiple users?

llama-server handles concurrent requests via parallel slots and supports continuous batching, but its throughput still trails vLLM's GPU-optimised serving. For a small team it's fine; for production multi-tenancy switch to vLLM.
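For a handful of concurrent users, the relevant knobs are the slot count and total context, since `-c` is shared across slots; a sketch with illustrative values:

```shell
# 4 parallel slots; each slot gets 8192 / 4 = 2048 tokens of context
$ ./build/bin/llama-server \
    -m ./llama-3.1-8b-q4.gguf \
    --host 0.0.0.0 --port 8080 \
    -c 8192 -np 4
```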


Run LLMs the lean way

30-day trial. CPU or GPU, your call.