For vLLM
vLLM on Edge,
production LLM serving
Self-host the high-throughput LLM serving stack production AI teams use. PagedAttention, continuous batching, OpenAI-compatible API, models from S3-compatible storage.
# Provision a GPU VM
$ edge compute create \
--image ubuntu-24-04-cuda --plan gpu-a100 \
--script ./bootstrap-vllm.sh
# Serve a model
$ vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--quantization fp8
# From your app — drop-in OpenAI client
import OpenAI from 'openai'

const openai = new OpenAI({
  baseURL: 'https://inference.example.com/v1',
  apiKey: 'EMPTY',
})
Why production teams pick vLLM
When you've outgrown Ollama and need real throughput per GPU dollar.
Continuous batching for throughput
PagedAttention and continuous batching make vLLM ~10x faster than naive one-request-at-a-time serving for multi-tenant workloads. Same GPU, far more requests per second.
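To see the effect from the client side, point the OpenAI SDK at your instance and fire requests concurrently; vLLM batches them at the token level instead of queueing them. A minimal sketch, reusing the hostname and model from the snippets above:

import OpenAI from 'openai'

const openai = new OpenAI({
  baseURL: 'https://inference.example.com/v1',
  apiKey: 'EMPTY',
})

// 64 concurrent requests against a single vLLM instance; continuous batching
// schedules them together on the GPU rather than one after another
const prompts = Array.from({ length: 64 }, (_, i) => `Summarise ticket #${i} in one sentence.`)

const results = await Promise.all(
  prompts.map((content) =>
    openai.chat.completions.create({
      model: 'meta-llama/Llama-3.1-70B-Instruct',
      messages: [{ role: 'user', content }],
      max_tokens: 64,
    }),
  ),
)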
GPU VMs sized for serious models
A10, A100, H100 — pick the right card for your model size. Tensor parallelism across multiple GPUs on a single VM, or pipeline parallelism across VMs.
OpenAI-compatible server
`vllm serve` exposes the OpenAI chat completions, completions and embeddings endpoints. Existing SDKs work unchanged — just swap the base URL.
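Streaming works through the unmodified SDK too. A sketch, assuming the base URL and model from the snippets above:

import OpenAI from 'openai'

const openai = new OpenAI({ baseURL: 'https://inference.example.com/v1', apiKey: 'EMPTY' })

// Same streaming API as api.openai.com; only the base URL changed
const stream = await openai.chat.completions.create({
  model: 'meta-llama/Llama-3.1-70B-Instruct',
  messages: [{ role: 'user', content: 'Explain PagedAttention in two sentences.' }],
  stream: true,
})

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '')
}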
Model weights in object storage
Mount Edge Storage for model weights so spinning up a new GPU VM is fast and predictable. No 30-minute Hugging Face downloads on every restart.
LoRA / quantisation supported
AWQ, GPTQ, FP8, BNB quantisation; LoRA adapters loaded at request time. The full vLLM feature set on a real VM.
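Selecting a LoRA adapter at request time is just a different `model` value. A sketch: `sql-adapter` is a hypothetical adapter name that would need to be registered when the server starts (via `--enable-lora` / `--lora-modules`):

import OpenAI from 'openai'

const openai = new OpenAI({ baseURL: 'https://inference.example.com/v1', apiKey: 'EMPTY' })

// Requests naming the adapter hit base model + LoRA; other requests hit the base model
const res = await openai.chat.completions.create({
  model: 'sql-adapter', // hypothetical adapter name registered at serve time
  messages: [{ role: 'user', content: 'Write a query for monthly active users.' }],
})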
Per-GPU-hour, not per-token
Together and Replicate charge per million tokens. A vLLM instance on an Edge GPU VM is one fixed hourly rate — at any throughput.
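A quick way to sanity-check the economics for your workload, with placeholder numbers rather than anyone's published prices:

// Above this throughput, a fixed hourly GPU rate beats per-token pricing
const breakEvenTokensPerHour = (gpuDollarsPerHour: number, dollarsPerMillionTokens: number) =>
  (gpuDollarsPerHour / dollarsPerMillionTokens) * 1_000_000

// e.g. a hypothetical $3/hr GPU vs $0.90 per million tokens: ~3.3M tokens/hour to break even
breakEvenTokensPerHour(3, 0.9)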
Reference architecture
How vLLM maps to Edge
One GPU VM per model, weights cached in S3 for fast spin-up, load balancer in front when you need horizontal scale.
vLLM serving on a GPU VM (or multi-GPU for tensor parallelism)
Model weights cached in a bucket; mounted at boot
Optional caching layer: caches non-streaming responses
Anycast DNS for `inference.example.com`
# systemd unit for vLLM
[Service]
# Model name matches the serve example above; adjust for your deployment
Environment=MODEL=meta-llama/Llama-3.1-70B-Instruct
Environment=HF_HOME=/mnt/edge-storage/hf-cache
ExecStart=/usr/local/bin/vllm serve $MODEL \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 2 \
  --max-model-len 8192
Restart=always
Common questions
When should I pick vLLM over Ollama?
Pick vLLM when you need throughput, multi-tenancy, or specific quantisation/LoRA features. Pick Ollama for ease of use and small-team / single-user workloads.
Which GPU is right?
For 7B–13B at high throughput: A10 / L4. For 70B-class: A100 80GB or H100 (quantised on a single card, or two cards at full precision). Multi-GPU tensor parallelism on a single VM unlocks larger models; our Compute team can help you size it.
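As a back-of-the-envelope check (weights only; KV cache and activations need headroom on top):

// Weight memory ≈ parameters × bytes per parameter
const weightMemoryGB = (paramsBillions: number, bitsPerParam: number) =>
  (paramsBillions * bitsPerParam) / 8

weightMemoryGB(8, 16)  // ≈ 16 GB  -> comfortable on a 24 GB A10 / L4
weightMemoryGB(70, 8)  // ≈ 70 GB  -> one 80 GB card, with limited KV-cache headroom
weightMemoryGB(70, 16) // ≈ 140 GB -> two A100 80GB / H100 with tensor parallelism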
Can I run multiple models on one GPU?
Yes — vLLM supports `--lora-modules` for serving multiple LoRA adapters on a single base model. For totally separate models, run multiple vLLM processes (or VMs) and route at the load balancer.
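To check what a given instance is serving, the standard models endpoint works through the same client; recent vLLM versions should list registered LoRA adapters alongside the base model:

import OpenAI from 'openai'

const openai = new OpenAI({ baseURL: 'https://inference.example.com/v1', apiKey: 'EMPTY' })

// Lists the served base model (and, with --enable-lora, registered adapters)
const models = await openai.models.list()
for (const model of models.data) console.log(model.id)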
How do I autoscale?
GPU autoscaling is coarse-grained: cold starts are slow because a new VM has to boot and load model weights. Most production teams provision a static fleet sized for peak, plus manual burst capacity for special events.
By Stack
Other stacks on Edge
Serve LLMs at scale
30-day trial. Our Compute team can help you size the GPU and architecture.