Some RFPs are confidential enough that the bids must never leave your infrastructure. In this final part we self-host the model: Ollama on an Azure GPU VM. The RFP evaluator runs end to end with no external API calls and no per-token cost — just the same graph, pointed at a local GPU-accelerated model. As always, only configuration changes.
Why self-host with a GPU
Self-hosting gives you maximum confidentiality (zero data egress), a fixed cost instead of per-token billing, and full control. A GPU makes a local model fast enough to evaluate proposals at a practical pace — CPU-only inference is fine for testing but slow for real workloads.
Step 1 — Provision an Azure GPU VM
az vm create \
  --resource-group rg-rfpeval --name vm-rfpeval-gpu \
  --image Ubuntu2204 \
  --size Standard_NC4as_T4_v3 \
  --admin-username azureuser --generate-ssh-keys
az vm open-port --resource-group rg-rfpeval --name vm-rfpeval-gpu --port 8000
The NC*_T4 family ships an NVIDIA T4 (16 GB) — a good, affordable fit for 7–8B models. For larger models, choose an A10 or A100 size. Deallocate the VM when idle — GPU VMs are billed by the hour.
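As a rough way to reason about which GPU size a model needs, the sketch below estimates VRAM from the parameter count. It assumes 4-bit quantized weights plus a flat 2 GB allowance for the KV cache and runtime; both figures are illustrative rules of thumb rather than measurements, and real usage varies with quantization and context length.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float = 4.0,
                     overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate in GB: quantized weights plus a flat allowance.

    bits_per_weight and overhead_gb are assumed defaults, not measured values.
    """
    # 1e9 parameters at (bits / 8) bytes each is (bits / 8) GB per billion.
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


def fits(params_billion: float, gpu_memory_gb: float) -> bool:
    """True if the rough estimate is within the GPU's memory."""
    return estimate_vram_gb(params_billion) <= gpu_memory_gb


if __name__ == "__main__":
    for size in (8, 70):
        print(f"{size}B: ~{estimate_vram_gb(size):.0f} GB, "
              f"fits on a 16 GB T4: {fits(size, 16)}")
```

Treat the result as a lower bound when planning: leave headroom on top of it rather than sizing the GPU to the estimate exactly.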
Step 2 — Install the NVIDIA driver and container toolkit
# NVIDIA driver
sudo apt-get update && sudo apt-get install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
sudo reboot
# after reboot, verify the GPU is visible:
nvidia-smi
# Docker + NVIDIA Container Toolkit
curl -fsSL https://get.docker.com | sh
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# add the repo, then:
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
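If you want to script the driver check instead of reading the nvidia-smi output by eye, something like the following could work. It is a minimal sketch that relies on nvidia-smi's CSV query output; the function names are illustrative and not part of the project.

```python
import subprocess

# Ask nvidia-smi for one CSV line per GPU: name and total memory in MiB.
QUERY = ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpus(csv_text: str) -> list[tuple[str, int]]:
    """Parse lines such as 'Tesla T4, 15360' into (name, memory_mib) pairs."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, memory_mib = line.rsplit(",", 1)
        gpus.append((name.strip(), int(memory_mib.strip())))
    return gpus


def detect_gpus() -> list[tuple[str, int]]:
    """Return visible GPUs, or an empty list if the driver is missing or failing."""
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpus(result.stdout)


if __name__ == "__main__":
    found = detect_gpus()
    if not found:
        print("No GPU visible: check the driver install and reboot.")
    for name, memory_mib in found:
        print(f"{name}: {memory_mib} MiB")
```

An empty result corresponds to the "command not found" case in the troubleshooting table: the driver is not installed, or the VM has not been rebooted since installing it.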
Step 3 — Bring up the GPU stack
This is exactly why we shipped a GPU compose override. It adds a GPU reservation to the Ollama service:
# docker-compose.gpu.yml
services:
  ollama:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: ["gpu"]
docker compose -f docker-compose.yml -f docker-compose.gpu.yml up -d --build
docker compose exec ollama ollama pull llama3.1:8b
docker compose exec ollama ollama pull nomic-embed-text
# confirm the model is on the GPU:
docker compose exec ollama ollama ps # PROCESSOR should read 100% GPU
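The same check can be made programmatically through Ollama's HTTP API, whose /api/ps endpoint reports each loaded model's total size and the portion resident in VRAM. The sketch below assumes it runs somewhere that can reach the Ollama service (for example inside the app container, where OLLAMA_BASE_URL points at http://ollama:11434); the helper names are illustrative.

```python
import json
import os
import urllib.request


def gpu_fraction(model: dict) -> float:
    """Share of a loaded model held in VRAM, from /api/ps 'size' and 'size_vram'."""
    size = model.get("size", 0)
    return model.get("size_vram", 0) / size if size else 0.0


def report(models: list[dict]) -> list[str]:
    """One human-readable line per loaded model."""
    return [f"{m.get('name', '?')}: {gpu_fraction(m):.0%} GPU" for m in models]


def loaded_models(base_url: str = "") -> list[dict]:
    """Fetch the currently loaded models from a reachable Ollama server."""
    # Assumption: OLLAMA_BASE_URL is set as in the project's .env file.
    base_url = base_url or os.environ.get("OLLAMA_BASE_URL", "http://ollama:11434")
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=10) as resp:
        return json.load(resp).get("models", [])
```

Calling `report(loaded_models())` after a first request has loaded the model should list it at or near 100% GPU; a value of 0% points to the same causes as a CPU reading in `ollama ps`.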
Step 4 — Run the evaluation
The .env already defaults to Ollama; on the GPU VM nothing else is needed:
LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_CHAT_MODEL=llama3.1:8b
OLLAMA_EMBEDDING_MODEL=nomic-embed-text
docker compose exec app \
  python -m rfpeval.cli evaluate samples/sample_proposal.md --rfp sample_rfp
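Before kicking off an evaluation, it can be worth confirming that both models named in .env have actually been pulled. The sketch below compares the configured names against the model list returned by Ollama's /api/tags endpoint; the helper names are illustrative, and it assumes Ollama lists an untagged pull such as nomic-embed-text under the `:latest` tag.

```python
import json
import urllib.request
from typing import Mapping


def normalize(name: str) -> str:
    """Ollama lists untagged models as '<name>:latest'."""
    return name if ":" in name else f"{name}:latest"


def required_models(env: Mapping[str, str]) -> list[str]:
    """Model names the evaluator needs, read from the .env variables."""
    keys = ("OLLAMA_CHAT_MODEL", "OLLAMA_EMBEDDING_MODEL")
    return [env[key] for key in keys if key in env]


def missing_models(tags_response: dict, required: list[str]) -> list[str]:
    """Required models that do not appear in an /api/tags response."""
    available = {m["name"] for m in tags_response.get("models", [])}
    return [name for name in required if normalize(name) not in available]


def fetch_tags(base_url: str) -> dict:
    """Fetch the list of locally available models from Ollama."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.load(resp)
```

From inside the app container, `missing_models(fetch_tags(os.environ["OLLAMA_BASE_URL"]), required_models(os.environ))` (after `import os`) should return an empty list once both `ollama pull` commands have completed.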
You now have the complete RFP evaluator — proposal ingestion, RFP requirement retrieval, weighted scoring, the human shortlist gate, cited report — running entirely on your own GPU, with no data leaving the VM.
Model choice & performance
| Model | VRAM (approx) | Fits on | Notes |
|---|---|---|---|
| llama3.1:8b | ~6–8 GB | T4 (16 GB) | Good default; solid structured output |
| llama3.1:70b | ~40+ GB | A100 (80 GB) | Higher quality; needs a big GPU |
| qwen2.5:14b | ~10–12 GB | A10 (24 GB) | Strong mid-size option |
Smaller models occasionally struggle with strict JSON schemas. If assessments come back malformed, step up a model size. Lowering the temperature is the other usual remedy, but the graph already runs at 0.0.
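One way to cope with malformed output without changing models is to validate each reply and retry with feedback. The following is a minimal sketch, not the project's actual parser: the required keys are a hypothetical schema, and `call_model` stands in for whatever function sends a prompt to the model and returns its text.

```python
import json
from typing import Callable

# Hypothetical assessment schema, for illustration only.
REQUIRED_KEYS = {"requirement_id", "score", "rationale"}


def parse_assessment(raw: str) -> dict:
    """Extract the outermost JSON object from a reply and check its keys."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    data = json.loads(raw[start:end + 1])  # JSONDecodeError is a ValueError
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


def assess_with_retry(call_model: Callable[[str], str], prompt: str,
                      attempts: int = 3) -> dict:
    """Call the model until a reply validates, feeding each error back in."""
    current_prompt, last_error = prompt, None
    for _ in range(attempts):
        try:
            return parse_assessment(call_model(current_prompt))
        except ValueError as exc:
            last_error = exc
            # At temperature 0.0 an unchanged prompt tends to repeat the same
            # reply, so the retry includes what went wrong.
            current_prompt = (f"{prompt}\n\nYour previous reply was invalid "
                              f"({exc}). Reply with one JSON object only.")
    raise RuntimeError(f"no valid assessment after {attempts} attempts") from last_error
```

If the retries are exhausted regularly, that is a reasonable signal to move up a model size.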
Cloud API vs self-hosted GPU
| Dimension | Cloud API (Parts 3–4) | Self-hosted GPU (this part) |
|---|---|---|
| Data egress | Leaves your boundary | None |
| Cost | Per token | Fixed VM/GPU per hour |
| Ops burden | Minimal | You manage drivers, VM, models |
| Quality ceiling | Very high | Depends on model/GPU size |
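The cost row can be made concrete with a small break-even calculation. The sketch below uses placeholder prices only (they are not real Azure or API rates); substitute your own VM hourly rate, running hours, and per-token price.

```python
def monthly_vm_cost(hourly_rate: float, hours_running: float) -> float:
    """Fixed cost of the GPU VM for the hours it is allocated."""
    return hourly_rate * hours_running


def monthly_api_cost(tokens: int, price_per_million: float) -> float:
    """Per-token cost of a cloud API for the same workload."""
    return tokens / 1_000_000 * price_per_million


def break_even_tokens(hourly_rate: float, hours_running: float,
                      price_per_million: float) -> float:
    """Token volume at which the VM and the API cost the same."""
    return monthly_vm_cost(hourly_rate, hours_running) / price_per_million * 1_000_000


if __name__ == "__main__":
    # Placeholder values, not real prices.
    tokens = break_even_tokens(hourly_rate=0.50, hours_running=100,
                               price_per_million=5.00)
    print(f"Break-even at about {tokens:,.0f} tokens per month")
```

Below the break-even volume the per-token API is cheaper on cost alone; above it the fixed-cost VM wins, provided it is deallocated outside the hours counted.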
Troubleshooting & common errors
| Symptom | Cause | Fix |
|---|---|---|
| nvidia-smi: command not found | Driver not installed | sudo ubuntu-drivers autoinstall then reboot |
| Container can’t see the GPU | NVIDIA Container Toolkit not configured | sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker |
| ollama ps shows 100% CPU | GPU reservation missing | Bring it up with the -f docker-compose.gpu.yml override |
| CUDA out of memory | Model too large for the GPU | Use a smaller model or a bigger GPU size |
| Slow first response | Model load into VRAM | Expected on first call; subsequent calls are fast |
| Unexpected GPU bill | VM left running | az vm deallocate when idle |
Series wrap-up
Over five parts we designed an RFP proposal evaluator from first principles, built it as an LLM-agnostic LangGraph application, and ran the same graph on Azure OpenAI, Google Vertex AI, and a self-hosted GPU — each a one-line provider switch. That is the lasting lesson: design for the abstract interface, push the provider to configuration, and you stay free to choose the right model for each deployment — cost, quality, or confidentiality — without rewriting a thing.
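To illustrate the pattern in miniature, here is a sketch of a provider switch driven purely by configuration. It is not the series' actual code: `ChatModel`, `EchoModel`, the `azure` provider key, and `AZURE_CHAT_MODEL` are hypothetical stand-ins, while `LLM_PROVIDER` and `OLLAMA_CHAT_MODEL` mirror the .env shown earlier.

```python
import os
from typing import Callable, Mapping, Protocol


class ChatModel(Protocol):
    """The abstract interface the graph depends on."""
    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in client so the sketch runs without any real provider."""
    def __init__(self, provider: str, model: str) -> None:
        self.provider, self.model = provider, model

    def complete(self, prompt: str) -> str:
        return f"[{self.provider}/{self.model}] {prompt}"


# One factory per provider; the names on the right are hypothetical.
PROVIDERS: dict[str, Callable[[Mapping[str, str]], ChatModel]] = {
    "ollama": lambda env: EchoModel("ollama", env.get("OLLAMA_CHAT_MODEL", "llama3.1:8b")),
    "azure": lambda env: EchoModel("azure", env.get("AZURE_CHAT_MODEL", "hypothetical-deployment")),
}


def build_chat_model(env: Mapping[str, str] = os.environ) -> ChatModel:
    """Pick the provider from configuration; callers never name one directly."""
    provider = env.get("LLM_PROVIDER", "ollama")
    try:
        factory = PROVIDERS[provider]
    except KeyError:
        raise ValueError(f"unknown LLM_PROVIDER: {provider!r}") from None
    return factory(env)
```

Everything downstream calls `build_chat_model().complete(...)`, so adding a deployment path means adding one entry to the registry and leaving the workflow untouched.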
Frequently asked questions
Which Azure VM size should I use for Ollama?
An NC*_T4 (NVIDIA T4, 16 GB) is a good, affordable fit for 7–8B models. Step up to A10 or A100 sizes for larger models.
How do I confirm the model is actually on the GPU?
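Run docker compose exec ollama ollama ps; the PROCESSOR column should show 100% GPU. If it shows CPU, the most likely cause is that you didn’t start with the GPU compose override; if you did, check that the NVIDIA Container Toolkit is configured for Docker.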
Does self-hosting change the application code?
No. As with Azure and Vertex, only configuration changes; the LangGraph workflow is identical.
Conclusion
The RFP evaluator now runs fully self-hosted on an Azure GPU VM: confidential, fixed-cost, and free of external API calls at inference time, using the very same graph we built in Part 2. That completes the series: one design, three deployment paths, zero code changes between them.
Independent educational project; not affiliated with any employer; not procurement or legal advice.
