
vLLM on Edge, production LLM serving

Self-host the high-throughput LLM serving stack production AI teams use. PagedAttention, continuous batching, OpenAI-compatible API, models from S3-compatible storage.

# Provision a GPU VM
$ edge compute create \
    --image ubuntu-24-04-cuda --plan gpu-a100 \
    --script ./bootstrap-vllm.sh
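
The provisioning command above hands the VM a bootstrap script. What goes in it depends on your image; a minimal sketch, assuming the `ubuntu-24-04-cuda` image already ships NVIDIA drivers and that paths are placeholders you'd adapt:

#!/usr/bin/env bash
# bootstrap-vllm.sh: illustrative sketch, not a tested script
set -euo pipefail

# Install vLLM into its own virtualenv
apt-get update && apt-get install -y python3-venv
python3 -m venv /opt/vllm
/opt/vllm/bin/pip install --upgrade pip vllm
ln -sf /opt/vllm/bin/vllm /usr/local/bin/vllm

# Point the Hugging Face cache at mounted object storage (example path)
echo 'export HF_HOME=/mnt/edge-storage/hf-cache' > /etc/profile.d/vllm.sh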

# Serve a model
$ vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2 \
    --quantization fp8

// From your app — drop-in OpenAI client
import OpenAI from 'openai'

const openai = new OpenAI({
  baseURL: 'https://inference.example.com/v1',
  apiKey: 'EMPTY',
})

Why production teams pick vLLM

When you've outgrown Ollama and need real throughput per GPU dollar.

Continuous batching for throughput

PagedAttention and continuous batching make vLLM ~10x faster than naive serving for multi-tenant workloads. Same GPU, far more requests per second.
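
A simple way to see this from the outside is to fire a batch of concurrent requests at one instance and let the scheduler interleave them. The sketch below reuses the hostname and model from the example above; both are placeholders for your own deployment.

# Illustrative: 32 concurrent chat completions against a single vLLM instance
for i in $(seq 1 32); do
  curl -s https://inference.example.com/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "meta-llama/Llama-3.1-70B-Instruct", "messages": [{"role": "user", "content": "Say hello."}]}' \
    > /dev/null &
done
wait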

GPU VMs sized for serious models

A10, A100, H100 — pick the right card for your model size. Tensor parallelism across multiple GPUs on a single VM, or pipeline parallelism across VMs.

OpenAI-compatible server

`vllm serve` exposes the OpenAI chat completions, completions and embeddings endpoints. Existing SDKs work unchanged — just swap the base URL.
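
The same server also answers raw HTTP. Listing the served model and hitting the completions route directly (hostname as above, purely illustrative) looks like:

$ curl -s https://inference.example.com/v1/models
$ curl -s https://inference.example.com/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "meta-llama/Llama-3.1-70B-Instruct", "prompt": "Hello", "max_tokens": 16}'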

Model weights in object storage

Mount Edge Storage for model weights so spinning up a new GPU VM is fast and predictable. No 30-minute Hugging Face downloads on every restart.
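
How you mount the bucket is up to you. One hedged sketch using s3fs against an S3-compatible endpoint, with bucket name, endpoint and paths as placeholders:

# Illustrative: mount the weights bucket, then reuse the cache across restarts
$ s3fs model-weights /mnt/edge-storage \
    -o url=https://storage.example.com -o use_path_request_style \
    -o passwd_file=/etc/passwd-s3fs
$ HF_HOME=/mnt/edge-storage/hf-cache vllm serve meta-llama/Llama-3.1-70B-Instruct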

LoRA / quantisation supported

AWQ, GPTQ, FP8 and bitsandbytes (BNB) quantisation; LoRA adapters selected per request. The full vLLM feature set on a real VM.
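
These map straight onto `vllm serve` flags. A hedged example serving a pre-quantised AWQ checkpoint, where the model name is a placeholder for whatever quantised weights you use:

# Illustrative: serve a pre-quantised AWQ checkpoint
# --quantization also accepts gptq, fp8 and bitsandbytes
$ vllm serve your-org/llama-3.1-8b-instruct-awq --quantization awq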

Per-GPU-hour, not per-token

Together and Replicate charge per million tokens. A vLLM instance on an Edge GPU VM is one fixed hourly rate — at any throughput.

Reference architecture

How vLLM maps to Edge

One GPU VM per model, weights cached in S3 for fast spin-up, load balancer in front when you need horizontal scale.

Compute (GPU): vLLM serving on a GPU VM (or multi-GPU for tensor parallelism)

Storage: model weights cached in a bucket; mounted at boot

CDN: optional; caches non-streaming responses

DNS: Anycast DNS for `inference.example.com`

# systemd unit for vLLM
[Unit]
Description=vLLM OpenAI-compatible server
After=network-online.target

[Service]
# MODEL is assumed to be set via Environment= or an EnvironmentFile
Environment=HF_HOME=/mnt/edge-storage/hf-cache
ExecStart=/usr/local/bin/vllm serve $MODEL \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 2 \
  --max-model-len 8192
Restart=always

[Install]
WantedBy=multi-user.target
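
Assuming you saved the unit as `vllm.service`, bringing it up and checking the server is a couple of commands; vLLM's OpenAI-compatible server exposes a plain `/health` route alongside the `/v1` API.

$ systemctl enable --now vllm.service
$ curl -s http://localhost:8000/health      # returns 200 once the engine is up
$ curl -s http://localhost:8000/v1/models   # confirms which model is loaded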

Common questions

When should I pick vLLM over Ollama?

Pick vLLM when you need throughput, multi-tenancy, or specific quantisation/LoRA features. Pick Ollama for ease of use and small-team / single-user workloads.

Which GPU is right?

For 7B–13B at high throughput: A10 / L4. For 70B-class: A100 80GB or H100; a 70B model is roughly 140 GB of weights at FP16 and about 70 GB at FP8, so a single 80 GB card wants a quantised checkpoint. Multi-GPU tensor parallelism on a single VM unlocks larger models — our Compute team can size it.

Can I run multiple models on one GPU?

Yes — vLLM supports `--lora-modules` for serving multiple LoRA adapters on a single base model. For totally separate models, run multiple vLLM processes (or VMs) and route at the load balancer.
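
A hedged sketch of the first case, with adapter names and paths as placeholders: register the adapters at startup, then pick one per request by passing its name as the `model` field.

# Illustrative: one base model, two LoRA adapters
$ vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --enable-lora \
    --lora-modules support-bot=/mnt/edge-storage/loras/support-bot sql-helper=/mnt/edge-storage/loras/sql-helper

$ curl -s https://inference.example.com/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "support-bot", "messages": [{"role": "user", "content": "Hi"}]}'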

How do I autoscale?

GPU autoscaling is generally a per-VM affair, and cold starts are slow: provisioning a GPU VM and loading tens of gigabytes of weights takes minutes, not seconds. Most production teams provision a static fleet sized for peak, plus manually added burst capacity for special events.


Serve LLMs at scale

30-day trial. Our Compute team can help you size the GPU and architecture.