Self-Host the RFP Evaluator: Ollama on an Azure GPU VM (Part 5)

By Asif·June 7, 2026·4 min read·AI Use Cases·Updated June 15, 2026

Some RFPs are confidential enough that the bids must never leave your infrastructure. In this final part we self-host the model: Ollama on an Azure GPU VM. The RFP evaluator runs end to end with no external API calls and no per-token cost — just the same graph, pointed at a local GPU-accelerated model. As always, only configuration changes.

Why self-host with a GPU

Self-hosting gives you maximum confidentiality (zero data egress), a fixed cost instead of per-token billing, and full control. A GPU makes a local model fast enough to evaluate proposals at a practical pace — CPU-only inference is fine for testing but slow for real workloads.

Step 1 — Provision an Azure GPU VM

az vm create \
  --resource-group rg-rfpeval --name vm-rfpeval-gpu \
  --image Ubuntu2204 \
  --size Standard_NC4as_T4_v3 \
  --admin-username azureuser --generate-ssh-keys

az vm open-port --resource-group rg-rfpeval --name vm-rfpeval-gpu --port 8000

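The NC*_T4 family ships an NVIDIA T4 (16 GB), a good, affordable fit for 7–8B models. For larger models, choose an A10 or A100 size. Deallocate the VM when idle: GPU VMs accrue compute charges for as long as they are allocated, and shutting down from inside the guest OS stops the VM without deallocating it. Note also that az vm open-port allows traffic to port 8000 from any source address by default, so tighten the network security group rule if the evaluator should not be publicly reachable.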
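The deallocate advice above is easy to script. Here is a minimal sketch in Python, assuming the Azure CLI (az) is installed and logged in on the machine that runs it. The helper names az_vm_command and run_az are ours for illustration; the resource group and VM name are the ones from Step 1.

```python
import subprocess

RESOURCE_GROUP = "rg-rfpeval"    # from the az vm create command in Step 1
VM_NAME = "vm-rfpeval-gpu"


def az_vm_command(action: str, resource_group: str = RESOURCE_GROUP,
                  name: str = VM_NAME) -> list[str]:
    """Build the argv for `az vm deallocate` or `az vm start`."""
    if action not in {"deallocate", "start"}:
        raise ValueError(f"unsupported action: {action!r}")
    return ["az", "vm", action, "--resource-group", resource_group, "--name", name]


def run_az(action: str) -> None:
    """Run the command; raises CalledProcessError if the CLI reports a failure."""
    subprocess.run(az_vm_command(action), check=True)
```

Calling run_az("deallocate") at the end of a session releases the compute allocation, and run_az("start") brings the same VM back with its disk intact.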

Step 2 — Install the NVIDIA driver and container toolkit

# NVIDIA driver
sudo apt-get update && sudo apt-get install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
sudo reboot

# after reboot, verify the GPU is visible:
nvidia-smi

# Docker + NVIDIA Container Toolkit
curl -fsSL https://get.docker.com | sh
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
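# add the repo (signed by the key above), refresh the package index, then install:
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit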
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
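Before moving on, it can help to confirm that Step 2 left the expected tools on the PATH. The sketch below is a small, hypothetical preflight check (the function name missing_tools is ours). It only tests that the binaries exist, so running nvidia-smi remains the real proof that the driver loaded.

```python
import shutil
from typing import Callable, Optional

# Tools that Step 2 is expected to leave on the PATH.
REQUIRED_TOOLS = ("nvidia-smi", "docker", "nvidia-ctk")


def missing_tools(which: Callable[[str], Optional[str]] = shutil.which) -> list[str]:
    """Return the required tools that cannot be found on the PATH.

    `which` is injectable so the check can be exercised without a GPU host.
    """
    return [tool for tool in REQUIRED_TOOLS if which(tool) is None]
```

On a correctly prepared VM, missing_tools() returns an empty list; anything it does return maps to the first two rows of the troubleshooting table.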

Step 3 — Bring up the GPU stack

This is exactly why we shipped a GPU compose override. It adds a GPU reservation to the Ollama service:

# docker-compose.gpu.yml
services:
  ollama:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: ["gpu"]

docker compose -f docker-compose.yml -f docker-compose.gpu.yml up -d --build

docker compose exec ollama ollama pull llama3.1:8b
docker compose exec ollama ollama pull nomic-embed-text

# confirm the model is on the GPU:
docker compose exec ollama ollama ps   # PROCESSOR should read 100% GPU
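The same check can be made programmatically. To our understanding, Ollama's HTTP API exposes the running models at /api/ps, and each entry carries a size and a size_vram field; treat those field names and the default base URL below as assumptions to verify against your Ollama version. Inside the Compose network the URL would be the http://ollama:11434 value from the .env file, and port 11434 may not be published to the host at all.

```python
import json
import urllib.request


def gpu_share(model: dict) -> float:
    """Fraction of a loaded model's memory that sits in VRAM (0.0 to 1.0)."""
    size = model.get("size", 0)
    if size <= 0:
        return 0.0
    return min(model.get("size_vram", 0) / size, 1.0)


def models_not_fully_on_gpu(payload: dict) -> list[str]:
    """Names of loaded models with any part of their memory off the GPU."""
    return [m["name"] for m in payload.get("models", []) if gpu_share(m) < 1.0]


def fetch_running_models(base_url: str = "http://localhost:11434") -> dict:
    """GET /api/ps. The base URL is an assumption; use wherever Ollama is reachable."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=10) as resp:
        return json.load(resp)
```

An empty result from models_not_fully_on_gpu(fetch_running_models()) corresponds to every loaded model reading 100% GPU in ollama ps. A model only appears once it has been loaded, so send one request first.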

Step 4 — Run the evaluation

The .env already defaults to Ollama; on the GPU VM nothing else is needed:

LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_CHAT_MODEL=llama3.1:8b
OLLAMA_EMBEDDING_MODEL=nomic-embed-text
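To make the "only configuration changes" point concrete, here is one way an application could read those four variables. This is an illustrative sketch, not the project's actual settings code: the OllamaSettings class and load_settings function are hypothetical names, and the defaults simply mirror the values shown above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class OllamaSettings:
    provider: str
    base_url: str
    chat_model: str
    embedding_model: str


def load_settings(env: Mapping[str, str] = os.environ) -> OllamaSettings:
    """Read the four variables shown above, falling back to the same values."""
    return OllamaSettings(
        provider=env.get("LLM_PROVIDER", "ollama"),
        base_url=env.get("OLLAMA_BASE_URL", "http://ollama:11434"),
        chat_model=env.get("OLLAMA_CHAT_MODEL", "llama3.1:8b"),
        embedding_model=env.get("OLLAMA_EMBEDDING_MODEL", "nomic-embed-text"),
    )
```

Switching to a larger model then means changing OLLAMA_CHAT_MODEL in the environment and nothing else.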

docker compose exec app \
  python -m rfpeval.cli evaluate samples/sample_proposal.md --rfp sample_rfp

You now have the complete RFP evaluator — proposal ingestion, RFP requirement retrieval, weighted scoring, the human shortlist gate, cited report — running entirely on your own GPU, with no data leaving the VM.

Model choice & performance

Model	VRAM (approx)	Fits on	Notes
llama3.1:8b	~6–8 GB	T4 (16 GB)	Good default; solid structured output
llama3.1:70b	~40+ GB	A100 (80 GB)	Higher quality; needs a big GPU
qwen2.5:14b	~10–12 GB	A10 (24 GB)	Strong mid-size option
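The pairing in the table can be expressed as a simple rule of thumb. The sketch below picks the smallest listed GPU that covers a model's approximate VRAM need plus headroom. The headroom factor of 1.5 is an assumption chosen so the sketch reproduces the table's pairings (it stands in for context and runtime overhead); it is not a measured figure.

```python
from typing import Optional

# GPU memory in GB, taken from the "Fits on" column of the table above.
GPUS = [("T4", 16), ("A10", 24), ("A100", 80)]


def smallest_gpu_for(vram_gb: float, headroom_factor: float = 1.5) -> Optional[str]:
    """Smallest listed GPU whose memory covers the model plus headroom.

    vram_gb is the upper end of the model's approximate VRAM range.
    Returns None when no listed GPU is large enough.
    """
    needed = vram_gb * headroom_factor
    for name, memory_gb in GPUS:    # ordered smallest to largest
        if memory_gb >= needed:
            return name
    return None
```

With the upper estimates from the table, smallest_gpu_for(8) gives the T4, smallest_gpu_for(12) the A10, and smallest_gpu_for(40) the A100.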

Smaller models occasionally struggle with strict JSON schemas. The evaluator already runs at temperature 0.0, so if assessments still come back malformed, the remaining lever is to step up a model size.
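One way to handle malformed assessments is to validate each response and fall back to a larger model only when validation fails. The sketch below assumes a hypothetical schema (the keys in REQUIRED_KEYS are placeholders, not the evaluator's real fields), and each generator is any zero-argument callable that returns the model's raw text.

```python
import json
from typing import Callable, Iterable, Optional

# Hypothetical required keys; substitute the evaluator's real schema.
REQUIRED_KEYS = {"requirement_id", "score", "rationale"}


def parse_assessment(raw: str) -> Optional[dict]:
    """Return the parsed assessment, or None if it is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    return data


def first_valid_assessment(generators: Iterable[Callable[[], str]]) -> dict:
    """Try each generator in order (smaller model first) and return the
    first well-formed result."""
    for generate in generators:
        parsed = parse_assessment(generate())
        if parsed is not None:
            return parsed
    raise ValueError("no generator produced a well-formed assessment")
```

Ordering the generators from smallest to largest model keeps the cheaper model on the common path and pays for the larger one only when needed. At temperature 0.0 a plain retry of the same prompt on the same model is likely to return the same text, which is why the fallback changes the model.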

Cloud API vs self-hosted GPU

Dimension	Cloud API (Parts 3–4)	Self-hosted GPU (this part)
Data egress	Leaves your boundary	None
Cost	Per token	Fixed VM/GPU per hour
Ops burden	Minimal	You manage drivers, VM, models
Quality ceiling	Very high	Depends on model/GPU size
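The cost row is the one most worth working through for your own workload. The sketch below computes the token volume at which per-token billing would match the fixed cost of the VM. All inputs are parameters you supply from current price sheets; the function names are ours, and no real prices are implied.

```python
def fixed_cost(hourly_rate: float, hours: float) -> float:
    """Cost of keeping the GPU VM allocated for the given number of hours."""
    return hourly_rate * hours


def break_even_tokens(hourly_rate: float, hours: float,
                      price_per_million_tokens: float) -> float:
    """Token volume at which per-token billing would cost the same as the VM."""
    if price_per_million_tokens <= 0:
        raise ValueError("price_per_million_tokens must be positive")
    return fixed_cost(hourly_rate, hours) / price_per_million_tokens * 1_000_000
```

Below the break-even volume the cloud API is cheaper on cost alone; above it the VM is. The calculation ignores confidentiality requirements and ops effort, which the other rows of the table cover.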

Troubleshooting & common errors

Symptom	Cause	Fix
`nvidia-smi: command not found`	Driver not installed	`sudo ubuntu-drivers autoinstall` then reboot
Container can’t see the GPU	NVIDIA Container Toolkit not configured	`sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker`
`ollama ps` shows 100% CPU	GPU reservation missing	Bring it up with the `-f docker-compose.gpu.yml` override
CUDA out of memory	Model too large for the GPU	Use a smaller model or a bigger GPU size
Slow first response	Model load into VRAM	Expected on first call; subsequent calls are fast
Unexpected GPU bill	VM left running	`az vm deallocate` when idle

Series wrap-up

Over five parts we designed an RFP proposal evaluator from first principles, built it as an LLM-agnostic LangGraph application, and ran the same graph on Azure OpenAI, Google Vertex AI, and a self-hosted GPU — each a one-line provider switch. That is the lasting lesson: design for the abstract interface, push the provider to configuration, and you stay free to choose the right model for each deployment — cost, quality, or confidentiality — without rewriting a thing.

Frequently asked questions

Which Azure VM size should I use for Ollama?

An NC*_T4 (NVIDIA T4, 16 GB) is a good, affordable fit for 7–8B models. Step up to A10 or A100 sizes for larger models.

How do I confirm the model is actually on the GPU?

Run docker compose exec ollama ollama ps; the PROCESSOR column should show 100% GPU. If it shows CPU, you didn’t start with the GPU compose override.

Does self-hosting change the application code?

No. As with Azure and Vertex, only configuration changes; the LangGraph workflow is identical.

Conclusion

The RFP evaluator now runs fully self-hosted on an Azure GPU VM (confidential, fixed-cost, and free of external API calls), using the very same graph we built in Part 2. That completes the series: one design, three deployment paths, zero code changes between them.

Independent educational project; not affiliated with any employer; not procurement or legal advice.