For Ollama
Ollama on Edge,
open LLMs on your GPUs
Self-host Llama, Mistral, Qwen and friends on Edge GPU instances. OpenAI-compatible API, models cached in S3, no per-token bills and no data leaving your infrastructure.
# Provision a GPU VM
$ edge compute create \
--image ubuntu-24-04 --plan gpu-a10 \
--script ./bootstrap-ollama.sh
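A sketch of what `bootstrap-ollama.sh` could contain; the install URL is Ollama's official installer, and the systemd override (so the API listens beyond localhost) is an assumption about your network layout:
#!/usr/bin/env bash
set -euo pipefail

# Install Ollama; on Linux this also registers and starts a systemd service
curl -fsSL https://ollama.com/install.sh | sh

# Listen on all interfaces so a proxy or load balancer can reach the API
mkdir -p /etc/systemd/system/ollama.service.d
cat > /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
EOF
systemctl daemon-reload && systemctl restart ollama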
# Pull and run a model
$ ollama pull llama3.1:8b
$ ollama run llama3.1:8b    # interactive chat; the API is already live on :11434
# From your app — drop-in OpenAI client
import OpenAI from 'openai'

const openai = new OpenAI({
  baseURL: 'https://ai.example.com/v1',
  apiKey: 'ollama', // required by the SDK, ignored by Ollama
})
await openai.chat.completions.create({
  model: 'llama3.1:8b',
  messages: [{ role: 'user', content: 'Hello!' }],
})
Why teams pick Ollama on Edge
All the convenience of the OpenAI API, with your data and your costs under your control.
GPU instances, sized to fit
From a single consumer-grade GPU for 7B–13B models up to dedicated A100/H100 instances for 70B-class and larger open-weight models. You pick the GPU, you keep the model.
Single binary, dozens of models
Ollama is one binary that pulls and serves Llama, Mistral, Qwen, Gemma, Phi and 100+ others. `ollama run llama3` and you're away.
OpenAI-compatible API
Drop-in replacement for OpenAI API endpoints. Point your existing SDK at your Edge Ollama VM and your code runs unchanged.
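For instance, the same chat endpoint answers plain HTTP; here `ai.example.com` stands in for your VM's address:
$ curl https://ai.example.com/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "llama3.1:8b",
      "messages": [{"role": "user", "content": "Say hello"}]
    }'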
Models persisted in S3
Cache GGUF model files in Edge Storage so any new VM can pull them in seconds. Faster cold starts, no Hugging Face rate limits to dance around.
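A minimal sketch of the round trip, assuming a user-level install (models under `~/.ollama/models`), a bucket named `models-cache`, and the AWS CLI pointed at the Edge Storage endpoint (both names are illustrative):
# Push the local model cache after the first pull
$ aws s3 sync ~/.ollama/models s3://models-cache \
    --endpoint-url https://storage.example.com

# On a fresh VM, restore the cache before starting Ollama
$ aws s3 sync s3://models-cache ~/.ollama/models \
    --endpoint-url https://storage.example.com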
Your prompts, your data
Customer prompts and generated responses never leave your infrastructure. Critical for regulated industries and anyone with NDAs.
No per-token bills
OpenAI charges per million tokens. An Edge GPU VM running Ollama is a fixed monthly cost — generate as much as you like.
Reference architecture
How Ollama maps to Edge
A GPU VM (or several) running Ollama, model weights cached in object storage. Add a tiny load balancer for multi-GPU horizontal scale.
GPU VM: runs Ollama on a GPU instance, 1-N depending on traffic
Object storage: S3-compatible bucket caching GGUF models for fast cold starts
CDN (optional): caches streaming responses where applicable
DNS: anycast DNS for `ai.example.com`
Indicative cost
At ~10M tokens/day on an 8B model, a fixed-price GPU VM comes out well ahead of per-token API pricing, and the gap widens as token volume grows.
Common questions
Which GPU should I pick?
For 7B–13B models, a 24GB consumer card (or a single-slot A10) works well: at 4-bit quantization the weights take roughly 4–8GB, leaving headroom for the KV cache and longer contexts. For 70B+, look at A100 80GB or H100 instances. Our Compute team can help size an instance for your exact model.
How does this compare to OpenAI / Anthropic / Together?
Cheaper at scale (no per-token bills), private (your prompts stay on your infra), and works offline. Trade-off: open-weight models lag closed frontier models on the hardest tasks — pick based on your workload.
How do I scale beyond one GPU?
Spin up more GPU VMs and put a small load balancer in front (Caddy or Nginx; a sketch follows). Each request is self-contained, so throughput scales linearly with instance count.
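One way to do it with Caddy's one-liner reverse proxy, assuming two Ollama VMs on a private network (addresses are illustrative); Caddy balances requests across the upstreams and handles TLS for the public hostname:
$ caddy reverse-proxy --from ai.example.com \
    --to 10.0.0.1:11434 --to 10.0.0.2:11434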
Can I fine-tune?
Ollama runs inference only; for fine-tuning, use Axolotl or Unsloth on a separate GPU VM, then push the resulting model back to Ollama for serving (see the sketch below). For higher-throughput serving of fine-tuned models, see our vLLM stack page.
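A sketch of the hand-back step, assuming your trainer exported a GGUF file (the filename is illustrative):
# Wrap the fine-tuned GGUF in a Modelfile, then serve it like any other model
$ cat > Modelfile <<'EOF'
FROM ./my-finetune.Q4_K_M.gguf
EOF
$ ollama create my-finetune -f Modelfile
$ ollama run my-finetune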
By Stack
Other stacks on Edge
Run open LLMs on your terms
30-day trial. Stand up a GPU VM, pull a model, hit the API in minutes.