Self-Host the RFP Evaluator: Ollama on an Azure GPU VM (Part 5)

Some RFPs are confidential enough that the bids must never leave your infrastructure. In this final part we self-host the model: Ollama on an Azure GPU VM. The RFP evaluator runs end to end with no external API calls and no per-token cost — just the same graph, pointed at a local GPU-accelerated model. As always, only configuration changes.

Why self-host with a GPU

Self-hosting gives you maximum confidentiality (zero data egress), a fixed cost instead of per-token billing, and full control. A GPU makes a local model fast enough to evaluate proposals at a practical pace — CPU-only inference is fine for testing but slow for real workloads.

Step 1 — Provision an Azure GPU VM

az vm create \
  --resource-group rg-rfpeval --name vm-rfpeval-gpu \
  --image Ubuntu2204 \
  --size Standard_NC4as_T4_v3 \
  --admin-username azureuser --generate-ssh-keys

az vm open-port --resource-group rg-rfpeval --name vm-rfpeval-gpu --port 8000

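The NCasT4_v3 family (for example Standard_NC4as_T4_v3, used above) ships an NVIDIA T4 (16 GB), a good, affordable fit for 7–8B models. For larger models, choose an A10 or A100 size. Deallocate the VM when idle: a GPU VM bills for compute as long as it stays allocated, and a VM that is shut down from inside the OS but not deallocated still bills.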

Step 2 — Install the NVIDIA driver and container toolkit

# NVIDIA driver
sudo apt-get update && sudo apt-get install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
sudo reboot

# after reboot, verify the GPU is visible:
nvidia-smi

# Docker + NVIDIA Container Toolkit
curl -fsSL https://get.docker.com | sh
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
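# add the NVIDIA Container Toolkit apt repository, signed by the key above:
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit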
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
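If you would rather script the post-reboot GPU check than read the nvidia-smi table by eye, a minimal sketch follows. It relies on nvidia-smi's CSV query mode (`--query-gpu=name,memory.total --format=csv,noheader,nounits`); the function names `parse_gpu_inventory` and `query_gpus` are illustrative and not part of the series' code.

```python
import subprocess


def parse_gpu_inventory(csv_text: str) -> list[tuple[str, int]]:
    """Parse lines like 'Tesla T4, 15360' into (name, memory_in_MiB) pairs."""
    gpus = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last comma so a GPU name containing commas stays intact.
        name, mem_mib = line.rsplit(",", 1)
        gpus.append((name.strip(), int(mem_mib)))
    return gpus


def query_gpus() -> list[tuple[str, int]]:
    """Run nvidia-smi in CSV query mode and return the parsed inventory."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return parse_gpu_inventory(result.stdout)


if __name__ == "__main__":
    for name, mem_mib in query_gpus():
        print(f"{name}: {mem_mib} MiB")
```

An empty list from `query_gpus()` (or a `FileNotFoundError` because nvidia-smi is missing) means the driver step needs another look before moving on to Docker.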

Step 3 — Bring up the GPU stack

This is exactly why we shipped a GPU compose override. It adds a GPU reservation to the Ollama service:

# docker-compose.gpu.yml
services:
  ollama:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: ["gpu"]

docker compose -f docker-compose.yml -f docker-compose.gpu.yml up -d --build

docker compose exec ollama ollama pull llama3.1:8b
docker compose exec ollama ollama pull nomic-embed-text

# confirm the model is on the GPU:
docker compose exec ollama ollama ps   # PROCESSOR should read 100% GPU
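The same check can be automated against Ollama's HTTP API. The sketch below assumes the `GET /api/ps` endpoint, which lists loaded models with their total `size` and the portion resident in GPU memory as `size_vram`. It also assumes the Ollama port is reachable from wherever the script runs; the default URL and the function names are illustrative.

```python
import json
from urllib.request import urlopen


def gpu_fraction(ps_payload: dict) -> dict[str, float]:
    """Map each loaded model name to the fraction of it held in VRAM (0.0 to 1.0)."""
    fractions = {}
    for model in ps_payload.get("models", []):
        size = model.get("size", 0)
        vram = model.get("size_vram", 0)
        fractions[model.get("name", "?")] = vram / size if size else 0.0
    return fractions


def fetch_ps(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of currently loaded models from a running Ollama server."""
    with urlopen(f"{base_url}/api/ps", timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    for name, frac in gpu_fraction(fetch_ps()).items():
        print(f"{name}: {frac:.0%} in VRAM")
```

A value below 1.0 suggests part of the model spilled to system RAM, which usually points at a model too large for the card or a missing GPU reservation.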

Step 4 — Run the evaluation

The .env already defaults to Ollama; on the GPU VM nothing else is needed:

LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_CHAT_MODEL=llama3.1:8b
OLLAMA_EMBEDDING_MODEL=nomic-embed-text

docker compose exec app \
  python -m rfpeval.cli evaluate samples/sample_proposal.md --rfp sample_rfp
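To make the role of those four variables concrete, here is a hedged sketch of how an application might read and validate them at startup. This is not the actual rfpeval code: `OllamaSettings` and `load_ollama_settings` are hypothetical names, and only the variable names come from the .env shown above.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class OllamaSettings:
    base_url: str
    chat_model: str
    embedding_model: str


def load_ollama_settings(env: Mapping[str, str]) -> OllamaSettings:
    """Build settings from an env-like mapping, failing early on gaps."""
    if env.get("LLM_PROVIDER") != "ollama":
        raise ValueError("LLM_PROVIDER must be 'ollama' for the self-hosted path")
    required = ("OLLAMA_BASE_URL", "OLLAMA_CHAT_MODEL", "OLLAMA_EMBEDDING_MODEL")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise KeyError(f"missing settings: {', '.join(missing)}")
    return OllamaSettings(
        base_url=env["OLLAMA_BASE_URL"].rstrip("/"),
        chat_model=env["OLLAMA_CHAT_MODEL"],
        embedding_model=env["OLLAMA_EMBEDDING_MODEL"],
    )
```

In a real process you would pass `os.environ`; taking a plain mapping keeps the function easy to exercise without touching the environment.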

You now have the complete RFP evaluator — proposal ingestion, RFP requirement retrieval, weighted scoring, the human shortlist gate, cited report — running entirely on your own GPU, with no data leaving the VM.

Model choice & performance

Model | VRAM (approx) | Fits on | Notes
llama3.1:8b | ~6–8 GB | T4 (16 GB) | Good default; solid structured output
llama3.1:70b | ~40+ GB | A100 (80 GB) | Higher quality; needs a big GPU
qwen2.5:14b | ~10–12 GB | A10 (24 GB) | Strong mid-size option
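The VRAM figures above can be approximated with back-of-envelope arithmetic: weight memory is parameter count times bits per weight, plus some headroom for the KV cache and runtime buffers. The sketch below assumes roughly 4.5 bits per weight (a typical 4-bit quantization with metadata) and 2 GB of headroom. Both defaults are assumptions for illustration, not measurements, and real usage grows with context length.

```python
def estimate_vram_gb(params_billion: float,
                     bits_per_weight: float = 4.5,
                     overhead_gb: float = 2.0) -> float:
    """Rough VRAM need in GB: quantized weights plus a fixed headroom."""
    # 1e9 params * (bits / 8) bytes each = that many GB (using 1 GB = 1e9 bytes).
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


def fits_on_gpu(params_billion: float, gpu_vram_gb: float) -> bool:
    """True if the rough estimate fits within the card's memory."""
    return estimate_vram_gb(params_billion) <= gpu_vram_gb


if __name__ == "__main__":
    for name, size_b in [("llama3.1:8b", 8), ("qwen2.5:14b", 14), ("llama3.1:70b", 70)]:
        print(f"{name}: ~{estimate_vram_gb(size_b):.1f} GB, fits on 16 GB T4: "
              f"{fits_on_gpu(size_b, 16)}")
```

Under these assumptions an 8B model lands around 6.5 GB and a 70B model around 41 GB, in line with the table; treat the numbers as a sizing hint and confirm with ollama ps on the actual VM.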

Smaller models occasionally struggle with strict JSON schemas. Temperature is already set to 0.0, so there is no room to lower it; if assessments come back malformed, step up a model size.
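One way to catch malformed assessments before they reach scoring is to validate each reply and escalate on failure. The sketch below is an illustration only: the field names in `REQUIRED_FIELDS` are hypothetical, not the evaluator's real schema, and `call_model` stands in for whatever function sends the prompt. Because a plain retry at temperature 0.0 tends to repeat the same output, `call_model` receives the attempt index so the caller can switch to a larger model or a stricter prompt.

```python
import json
from typing import Callable

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {
    "requirement_id": str,
    "score": (int, float),
    "rationale": str,
}


def parse_assessment(raw: str) -> dict:
    """Parse a model reply and check it has the required fields and types."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) if malformed
    if not isinstance(data, dict):
        raise ValueError("assessment must be a JSON object")
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected):
            raise ValueError(f"field {field!r} is missing or has the wrong type")
    return data


def assess_with_retry(call_model: Callable[[int], str], max_attempts: int = 3) -> dict:
    """Call the model up to max_attempts times until a reply validates."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return parse_assessment(call_model(attempt))
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid assessment after {max_attempts} attempts") from last_error
```

A caller might map attempt 0 to llama3.1:8b and later attempts to a larger model, which keeps the cheap model as the default while still recovering from the occasional bad reply.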

Cloud API vs self-hosted GPU

Dimension | Cloud API (Parts 3–4) | Self-hosted GPU (this part)
Data egress | Leaves your boundary | None
Cost | Per token | Fixed VM/GPU per hour
Ops burden | Minimal | You manage drivers, VM, models
Quality ceiling | Very high | Depends on model/GPU size
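The cost row comes down to simple arithmetic: a fixed hourly VM price pays off once you process more tokens per hour than that price would buy from an API. The sketch below computes that break-even point. The prices in the usage example are placeholders chosen for round numbers, not real Azure or API rates, so substitute current figures from your own bills.

```python
def breakeven_tokens_per_hour(vm_cost_per_hour: float,
                              api_cost_per_million_tokens: float) -> float:
    """Tokens per hour at which a fixed-price VM costs the same as per-token billing."""
    if vm_cost_per_hour < 0 or api_cost_per_million_tokens <= 0:
        raise ValueError("costs must be positive")
    return vm_cost_per_hour / api_cost_per_million_tokens * 1_000_000


if __name__ == "__main__":
    # Placeholder prices for illustration only.
    tokens = breakeven_tokens_per_hour(vm_cost_per_hour=1.0,
                                       api_cost_per_million_tokens=2.0)
    print(f"Break-even at {tokens:,.0f} tokens per hour")
```

Below the break-even volume the API is cheaper on cost alone, which is why confidentiality rather than price is often the deciding factor for self-hosting.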

Troubleshooting & common errors

Symptom | Cause | Fix
nvidia-smi: command not found | Driver not installed | sudo ubuntu-drivers autoinstall then reboot
Container can’t see the GPU | NVIDIA Container Toolkit not configured | sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker
ollama ps shows 100% CPU | GPU reservation missing | Bring it up with the -f docker-compose.gpu.yml override
CUDA out of memory | Model too large for the GPU | Use a smaller model or a bigger GPU size
Slow first response | Model load into VRAM | Expected on first call; subsequent calls are fast
Unexpected GPU bill | VM left running | az vm deallocate when idle

Series wrap-up

Over five parts we designed an RFP proposal evaluator from first principles, built it as an LLM-agnostic LangGraph application, and ran the same graph on Azure OpenAI, Google Vertex AI, and a self-hosted GPU — each a one-line provider switch. That is the lasting lesson: design for the abstract interface, push the provider to configuration, and you stay free to choose the right model for each deployment — cost, quality, or confidentiality — without rewriting a thing.
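As a compact reminder of that pattern, here is a hedged sketch of a provider registry: the graph depends only on a small interface, and one configuration value picks the implementation. Everything here is illustrative rather than the series' actual code. `ChatModel`, `EchoModel`, `PROVIDERS`, `build_chat_model`, and the `azure_openai` key are hypothetical, and `EchoModel` is a stand-in for a real client.

```python
from typing import Callable, Mapping, Protocol


class ChatModel(Protocol):
    """The only surface the rest of the application is allowed to depend on."""
    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in implementation that tags the prompt with its provider label."""
    def __init__(self, label: str) -> None:
        self.label = label

    def complete(self, prompt: str) -> str:
        return f"[{self.label}] {prompt}"


# Provider name -> factory that builds a ChatModel from configuration.
PROVIDERS: dict[str, Callable[[Mapping[str, str]], ChatModel]] = {
    "ollama": lambda env: EchoModel(f"ollama:{env['OLLAMA_CHAT_MODEL']}"),
    "azure_openai": lambda env: EchoModel("azure_openai"),
}


def build_chat_model(env: Mapping[str, str]) -> ChatModel:
    """Select the provider from configuration; callers never branch on it."""
    provider = env.get("LLM_PROVIDER", "")
    if provider not in PROVIDERS:
        raise ValueError(f"unknown LLM_PROVIDER: {provider!r}")
    return PROVIDERS[provider](env)
```

Adding a provider then means registering one more factory, while the code that calls `complete` stays untouched.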

Frequently asked questions

Which Azure VM size should I use for Ollama?

An NCasT4_v3 size such as Standard_NC4as_T4_v3 (NVIDIA T4, 16 GB) is a good, affordable fit for 7–8B models. Step up to A10 or A100 sizes for larger models.

How do I confirm the model is actually on the GPU?

Run docker compose exec ollama ollama ps; the PROCESSOR column should show 100% GPU. If it shows CPU, the stack was most likely started without the GPU compose override, or the NVIDIA Container Toolkit is not configured for Docker.

Does self-hosting change the application code?

No. As with Azure and Vertex, only configuration changes; the LangGraph workflow is identical.

Conclusion

The RFP evaluator now runs fully self-hosted on an Azure GPU VM, confidential, fixed-cost, and free of external API calls, using the very same graph we built in Part 2. That completes the series: one design, three deployment paths, zero code changes between them.

Independent educational project; not affiliated with any employer; not procurement or legal advice.
