For Ollama
Ollama on Edge,
open LLMs on your GPUs
Self-host Llama, Mistral, Qwen and friends on Edge GPU instances. OpenAI-compatible API, models cached in S3, no per-token bills and no data leaving your infrastructure.
# Provision a GPU VM
$ edge compute create \
--image ubuntu-24-04 --plan gpu-a10 \
--script ./bootstrap-ollama.sh
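A sketch of what `bootstrap-ollama.sh` could contain; the install URL is Ollama's official installer, and the systemd override (so the API listens beyond localhost) is an assumption about your network layout:
#!/usr/bin/env bash
set -euo pipefail

# Install Ollama; on Linux this also registers and starts a systemd service
curl -fsSL https://ollama.com/install.sh | sh

# Listen on all interfaces so a proxy or load balancer can reach the API
mkdir -p /etc/systemd/system/ollama.service.d
cat > /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
EOF
systemctl daemon-reload && systemctl restart ollama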
# Pull and run a model
$ ollama pull llama3.1:8b
$ ollama run llama3.1:8b    # interactive chat; the API is already live on :11434
# From your app — drop-in OpenAI client
import OpenAI from 'openai'

const openai = new OpenAI({
  baseURL: 'https://ai.example.com/v1',
  apiKey: 'ollama', // required by the SDK, ignored by Ollama
})
await openai.chat.completions.create({
  model: 'llama3.1:8b',
  messages: [{ role: 'user', content: 'Hello!' }],
})
Why teams pick Ollama on Edge
All the convenience of the OpenAI API, with your data and your costs under your control.
GPU instances, sized to fit
From a single consumer-grade GPU for 7B–13B models up to dedicated A100/H100 instances for 70B-class and larger open-weight models. You pick the GPU, you keep the model.
Single binary, dozens of models
Ollama is one binary that pulls and serves Llama, Mistral, Qwen, Gemma, Phi and 100+ others. `ollama run llama3` and you're away.
OpenAI-compatible API
Drop-in replacement for OpenAI API endpoints. Point your existing SDK at your Edge Ollama VM and your code runs unchanged.
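For instance, the same chat endpoint answers plain HTTP; here `ai.example.com` stands in for your VM's address:
$ curl https://ai.example.com/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "llama3.1:8b",
      "messages": [{"role": "user", "content": "Say hello"}]
    }'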
Models persisted in S3
Cache GGUF model files in Edge Storage so any new VM can pull them in seconds. Faster cold starts, no Hugging Face rate limits to dance around.
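A minimal sketch of the round trip, assuming a user-level install (models under `~/.ollama/models`), a bucket named `models-cache`, and the AWS CLI pointed at the Edge Storage endpoint (both names are illustrative):
# Push the local model cache after the first pull
$ aws s3 sync ~/.ollama/models s3://models-cache \
    --endpoint-url https://storage.example.com

# On a fresh VM, restore the cache before starting Ollama
$ aws s3 sync s3://models-cache ~/.ollama/models \
    --endpoint-url https://storage.example.com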
Your prompts, your data
Customer prompts and generated responses never leave your infrastructure. Critical for regulated industries and anyone with NDAs.
No per-token bills
OpenAI charges per million tokens. An Edge GPU VM running Ollama is a fixed monthly cost — generate as much as you like.
Reference architecture
How Ollama maps to Edge
A GPU VM (or several) running Ollama, model weights cached in object storage. Add a tiny load balancer for multi-GPU horizontal scale.
GPU VM: runs Ollama on a GPU instance, 1-N depending on traffic
Object storage: S3-compatible bucket caching GGUF models for fast cold starts
CDN (optional): caches streaming responses where applicable
DNS: anycast DNS for `ai.example.com`
Indicative cost
At ~10M tokens/day on an 8B model, a fixed-price GPU VM comes out well ahead of per-token API pricing, and the gap widens as token volume grows.
Common questions
Which GPU should I pick?
For 7B–13B models, a 24GB consumer card (or a single-slot A10) works well: at 4-bit quantization the weights take roughly 4–8GB, leaving headroom for the KV cache and longer contexts. For 70B+, look at A100 80GB or H100 instances. Our Compute team can help size an instance for your exact model.
How does this compare to OpenAI / Anthropic / Together?
Cheaper at scale (no per-token bills), private (your prompts stay on your infra), and works offline. Trade-off: open-weight models lag closed frontier models on the hardest tasks — pick based on your workload.
How do I scale beyond one GPU?
Spin up more GPU VMs and put a small load balancer in front (Caddy or Nginx; a sketch follows). Each request is self-contained, so throughput scales linearly with instance count.
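One way to do it with Caddy's one-liner reverse proxy, assuming two Ollama VMs on a private network (addresses are illustrative); Caddy balances requests across the upstreams and handles TLS for the public hostname:
$ caddy reverse-proxy --from ai.example.com \
    --to 10.0.0.1:11434 --to 10.0.0.2:11434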
Can I fine-tune?
Ollama runs inference only; for fine-tuning, use Axolotl or Unsloth on a separate GPU VM, then push the resulting model back to Ollama for serving (see the sketch below). For higher-throughput serving of fine-tuned models, see our vLLM stack page.
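A sketch of the hand-back step, assuming your trainer exported a GGUF file (the filename is illustrative):
# Wrap the fine-tuned GGUF in a Modelfile, then serve it like any other model
$ cat > Modelfile <<'EOF'
FROM ./my-finetune.Q4_K_M.gguf
EOF
$ ollama create my-finetune -f Modelfile
$ ollama run my-finetune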
By Stack
Other stacks on Edge
Run open LLMs on your terms
30-day trial. Stand up a GPU VM, pull a model, hit the API in minutes.