For llama.cpp
llama.cpp on Edge,
the leanest LLM runtime
Self-host quantised LLMs on CPU-only or GPU Edge VMs. GGUF models from S3-compatible storage, OpenAI-compatible server, zero per-token bills. The smallest possible AI footprint.
# Build with CUDA
$ cmake -B build -DGGML_CUDA=ON
$ cmake --build build --config Release -j
# Serve a quantised model
$ ./build/bin/llama-server \
-m /mnt/edge-storage/llama-3.1-8b-q4.gguf \
--host 0.0.0.0 --port 8080 \
-c 8192 -ngl 999
# OpenAI-compatible at /v1
$ curl http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{ "messages": [{"role":"user","content":"hi"}] }'
Why teams pick llama.cpp on Edge
The smallest AI footprint that still does the job — and the foundation Ollama is built on.
CPU-friendly inference
Run quantised models on regular Edge VMs with no GPU at all. Useful for small models, dev environments, and latency-tolerant workloads.
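A minimal CPU-only sketch, assuming the default build and an 8-core VM (thread count and paths are illustrative):
# Default build: no GPU toolkit required
$ cmake -B build && cmake --build build --config Release -j
# Pin threads to the VM's physical cores
$ ./build/bin/llama-server -m /mnt/edge-storage/llama-3.1-8b-q4.gguf -t 8 -c 4096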
GGUF: the standard for quantised models
GGUF is the de facto format for quantised open-weight models. Quants published on Hugging Face by TheBloke, bartowski, and others load directly in llama.cpp.
OpenAI-compatible server
`llama-server` exposes the OpenAI chat/completions API. Drop-in replacement for paid APIs at the SDK level.
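A sketch of the swap, assuming the server above and an SDK that reads the standard OpenAI environment variables:
$ export OPENAI_BASE_URL=http://localhost:8080/v1
$ export OPENAI_API_KEY=unused  # llama-server ignores the key unless started with --api-key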
Models cached in S3
Cache GGUF files in Edge Storage so any new VM can pull them quickly. No Hugging Face rate-limit drama on cold starts.
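A cold-start sketch, reusing the bucket and endpoint from the snippet further down (example names, not fixed values):
# New VM: pull the cached GGUF from Edge Storage instead of Hugging Face
$ aws s3 cp s3://llm-models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
/mnt/edge-storage/ --endpoint-url https://storage.edge.run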
CUDA / Metal / ROCm builds
When you do have a GPU, llama.cpp uses it. CUDA on Nvidia, Vulkan/ROCm on AMD, Metal on Apple Silicon. Your binary, your acceleration.
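The backend is chosen at build time. Flag names below match recent llama.cpp releases; check the build docs for your checkout:
$ cmake -B build -DGGML_CUDA=ON     # NVIDIA
$ cmake -B build -DGGML_VULKAN=ON   # portable GPU path, works on AMD
$ cmake -B build -DGGML_HIP=ON      # AMD ROCm (flag has been renamed across versions)
$ cmake -B build -DGGML_METAL=ON    # Apple Silicon (on by default on macOS)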
Cheapest path to local LLMs
For dev and low-volume workloads, a small CPU-only Edge VM with around 8 GB of RAM serves a 7B Q4 model. A flat VM price instead of an unpredictable per-token bill.
Reference architecture
How llama.cpp maps to Edge
A CPU or GPU VM running llama-server, GGUF model files mounted from object storage. The least infrastructure to run a serious LLM.
CPU VM for small/dev workloads, or GPU VM for serious throughput
GGUF model cache in a bucket, mounted to the VM (mount sketch after this list)
Optional cache in front of the server for non-streaming responses
Anycast DNS for `llm.example.com`
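The bucket can be mounted rather than copied; one way with s3fs, assuming the bucket and endpoint used elsewhere on this page and credentials in ~/.passwd-s3fs:
$ sudo apt-get install -y s3fs
$ mkdir -p /mnt/edge-storage
$ s3fs llm-models /mnt/edge-storage -o url=https://storage.edge.run -o use_path_request_style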
# Pull a quantised model from HF
$ huggingface-cli download \
bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
--include "*Q4_K_M*" --local-dir .
# Cache it in Edge Storage for next time
$ aws s3 cp ./Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
s3://llm-models/ \
--endpoint-url https://storage.edge.run
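To keep the server up across reboots, a minimal systemd sketch (install path and model name are assumptions; flags match the snippet at the top of the page):
$ sudo tee /etc/systemd/system/llama-server.service >/dev/null <<'EOF'
[Unit]
Description=llama.cpp OpenAI-compatible server
After=network-online.target

[Service]
ExecStart=/opt/llama.cpp/build/bin/llama-server -m /mnt/edge-storage/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --host 0.0.0.0 --port 8080 -c 8192 -ngl 999
Restart=always

[Install]
WantedBy=multi-user.target
EOF
$ sudo systemctl daemon-reload
$ sudo systemctl enable --now llama-server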
Common questions
When should I pick llama.cpp over Ollama or vLLM?
Pick llama.cpp when you want full control of the binary, need CPU-only inference, or want the smallest possible footprint. Ollama wraps llama.cpp with conveniences; vLLM is for high-throughput GPU serving. All have their place.
CPU or GPU?
For 7B Q4_K_M and below, CPU on a beefy VM is workable for dev and low-volume use. For production throughput or anything 13B+, you want a GPU.
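When a model almost fits in VRAM, partial offload is a middle ground; a sketch with an illustrative layer count and file name:
# Offload 20 layers to the GPU, keep the rest on CPU; raise -ngl until VRAM is full
$ ./build/bin/llama-server -m 13b-q4.gguf -ngl 20 -c 4096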
Which quantisation level?
Q4_K_M is the usual sweet spot: roughly a third of the full-precision size with only a small quality drop on most tasks. Q5_K_M for higher fidelity, Q3_K for the smallest footprint. Experiment with your specific use case.
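If the quant you want isn't published, you can requantise locally with the llama-quantize tool that ships with the build (file names are illustrative):
$ ./build/bin/llama-quantize model-f16.gguf model-q5_k_m.gguf Q5_K_M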
How do I serve to multiple users?
llama-server handles concurrent requests with parallel slots, but its throughput under heavy load doesn't match vLLM's GPU-optimised batching. For a small team it's fine; for high-volume multi-tenancy switch to vLLM.
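A slot-sizing sketch for a handful of concurrent users; the -c value is split across slots, so four slots at 4096 tokens each needs -c 16384 (model name is illustrative):
$ ./build/bin/llama-server -m model.gguf -np 4 -c 16384 -ngl 999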
By Stack
Other stacks on Edge
Run LLMs the lean way
30-day trial. CPU or GPU, your call.