Running your own model is only half the equation.
After completing fine‑tuning — as detailed in our Private LLM Fine‑Tuning Guide — the next decision is operational: how do you serve the model efficiently?
Inference determines:
- Cost per token
- Latency under load
- GPU utilization efficiency
- Whether consumer hardware is viable in production
This benchmark compares three widely used inference stacks:
- Ollama
- vLLM
- Hugging Face Text Generation Inference (TGI)
The goal is not preference. The goal is measurement.
Test Environment
Hardware
- GPU: NVIDIA RTX 4090 (24GB VRAM)
- CPU: 16‑core Ryzen‑class consumer processor
- RAM: 64GB DDR5
- Storage: NVMe SSD
- CUDA: 12.1
- NVIDIA Driver: 550+
Model
- Checkpoint: meta-llama/Llama-3.1-8B
- Precision: FP16 (no 4‑bit quantization)
- Context window: 4096 tokens
Benchmark Conditions
- 512‑token input prompt
- 128‑token output generation
- Greedy decoding (temperature = 0)
- No speculative decoding
- No tensor parallelism
- Warm start only (model preloaded before measurement)
- 8 concurrent request streams (where supported)
All tests were executed on a clean machine with no background workloads. Each measurement reflects the mean of five runs.
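The conditions above (concurrent streams, first token latency, aggregate throughput) can be sketched as a small timing harness. This is a minimal illustration, not the harness used for the published numbers: `fake_stream` is a hypothetical stand-in for a server's streaming response, and in a real run you would replace it with an async generator that yields tokens from Ollama, vLLM, or TGI.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import AsyncIterator, Callable, List


@dataclass
class StreamResult:
    first_token_s: float  # time from request start to first token
    total_s: float        # time from request start to last token
    tokens: int           # number of tokens received


async def measure_stream(stream: AsyncIterator[str]) -> StreamResult:
    """Consume one token stream and record its timing."""
    start = time.perf_counter()
    first = None
    count = 0
    async for _ in stream:
        if first is None:
            first = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return StreamResult(first if first is not None else total, total, count)


async def run_concurrent(
    make_stream: Callable[[], AsyncIterator[str]], n_streams: int
) -> dict:
    """Start n_streams requests at once and aggregate the results."""
    start = time.perf_counter()
    results: List[StreamResult] = await asyncio.gather(
        *(measure_stream(make_stream()) for _ in range(n_streams))
    )
    wall = time.perf_counter() - start
    total_tokens = sum(r.tokens for r in results)
    return {
        "streams": n_streams,
        "total_tokens": total_tokens,
        "aggregate_tokens_per_s": total_tokens / wall,
        "mean_first_token_ms": 1000 * sum(r.first_token_s for r in results) / n_streams,
    }


async def fake_stream(n_tokens: int = 128, delay_s: float = 0.001) -> AsyncIterator[str]:
    """Hypothetical token source; swap in a real streaming client."""
    for i in range(n_tokens):
        await asyncio.sleep(delay_s)
        yield f"tok{i}"


if __name__ == "__main__":
    print(asyncio.run(run_concurrent(lambda: fake_stream(), 8)))
```

Measuring aggregate throughput against wall-clock time, rather than summing per-stream rates, keeps the number honest when streams finish at different times.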

Results
1. Ollama
Ollama prioritizes simplicity. Installation is minimal, and models download automatically.
ollama run llama3
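Ollama exposes limited configuration for batching behavior or scheduling strategy. Note that the default `llama3` tag resolves to a quantized Llama 3 build, so matching the FP16 Llama 3.1 8B configuration tested here requires selecting an FP16 Llama 3.1 tag explicitly.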
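Ollama reports its own timing counters in the JSON body returned by `/api/generate`, which gives a quick cross-check on throughput. The sketch below assumes a non-streaming response and uses the `eval_count`, `eval_duration`, and `prompt_eval_duration` fields (durations are in nanoseconds). The sample values are illustrative only and are not taken from the benchmark runs. Prompt evaluation time approximates first token latency but is not identical to it.

```python
def ollama_tokens_per_second(response: dict) -> float:
    """Decode throughput from an Ollama /api/generate response body."""
    # eval_duration is reported in nanoseconds.
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def ollama_prompt_eval_ms(response: dict) -> float:
    """Prompt processing time in milliseconds."""
    return response["prompt_eval_duration"] / 1e6


# Illustrative response fragment (made-up values, not measured data).
sample = {
    "eval_count": 128,
    "eval_duration": 2_000_000_000,
    "prompt_eval_duration": 800_000_000,
}

print(ollama_tokens_per_second(sample), ollama_prompt_eval_ms(sample))
```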
Measured Performance (RTX 4090, FP16)
- Single stream throughput: 62–74 tokens/sec
- 8-stream throughput: 95–108 tokens/sec
- First token latency: 720–980 ms
- Observed VRAM usage: 14–17GB
Observations
- GPU utilization fluctuated under concurrency.
- Throughput scaling was non-linear past 4 streams.
- No exposed controls for advanced batching optimization.
Ollama performs reliably for local development and low-traffic services. Under sustained concurrent load, it does not fully saturate the GPU.
2. vLLM
vLLM is designed for throughput. Its PagedAttention implementation improves KV cache efficiency under concurrent requests.
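To see why KV cache efficiency matters on a 24GB card, it helps to work through the memory arithmetic. The sketch below assumes the published Llama 3.1 8B architecture (32 layers, 8 key/value heads, head dimension 128) and 2 bytes per value for FP16; those figures come from the model configuration, not from this benchmark. It illustrates the arithmetic only and is not vLLM's allocator.

```python
def kv_cache_bytes_per_token(
    layers: int = 32, kv_heads: int = 8, head_dim: int = 128, bytes_per_value: int = 2
) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2 covers both the key and the value tensor.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_mib(tokens_per_stream: int, streams: int) -> float:
    """Total KV cache in MiB for a number of concurrent streams."""
    return tokens_per_stream * streams * kv_cache_bytes_per_token() / 2**20


# Memory actually needed by this benchmark: 512 prompt + 128 output tokens.
used = kv_cache_mib(512 + 128, streams=8)

# Memory needed if every stream reserved the full 4096-token context up front.
reserved = kv_cache_mib(4096, streams=8)

print(kv_cache_bytes_per_token(), used, reserved)
```

Under these assumptions each token costs 128 KiB, so eight streams need about 640 MiB when memory is allocated as tokens arrive, versus 4 GiB when the full context is reserved per stream. Paged allocation is aimed at closing that gap.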
Installation:
pip install vllm
Launch:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B \
--dtype float16
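Once the server is up, requests go through its OpenAI-compatible HTTP API. The following is a minimal client sketch using only the standard library. It assumes the server listens on vLLM's default port 8000 on localhost, and the helper names `build_completion_request` and `complete` are illustrative rather than part of any library.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    base_url: str = "http://localhost:8000",  # assumed default host and port
    model: str = "meta-llama/Llama-3.1-8B",
    max_tokens: int = 128,
) -> urllib.request.Request:
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,  # greedy decoding, as in the benchmark conditions
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Calling `complete("Explain continuous batching in one sentence.")` requires the server to be running; building the request does not.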
Measured Performance (RTX 4090, FP16)
- Single stream throughput: 92–104 tokens/sec
- 8-stream throughput: 185–215 tokens/sec
- First token latency: 360–480 ms
- Observed VRAM usage: 20–22GB
Observations
- GPU utilization remained above 95% under load.
- Continuous batching improved scaling efficiency.
- Latency remained stable across concurrent streams.
vLLM achieved the highest sustained throughput of the three stacks, and therefore the most tokens generated per hour of rental time.
3. Hugging Face Text Generation Inference (TGI)
TGI is a containerized production inference server.
docker run --gpus all \
-p 8080:80 \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Llama-3.1-8B
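With the container running, TGI accepts generation requests on its `/generate` endpoint. The sketch below assumes the `8080:80` port mapping from the command above and uses only the standard library. The helper names `build_tgi_request` and `tgi_generate` are illustrative. Gated checkpoints may also require passing a Hugging Face access token to the container, which is not shown here.

```python
import json
import urllib.request


def build_tgi_request(
    prompt: str,
    base_url: str = "http://localhost:8080",  # matches -p 8080:80 above
    max_new_tokens: int = 128,
) -> urllib.request.Request:
    """Build a POST request for TGI's /generate endpoint."""
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "do_sample": False,  # greedy decoding
        },
    }
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tgi_generate(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_tgi_request(prompt, **kwargs)) as resp:
        return json.load(resp)["generated_text"]
```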
Measured Performance (RTX 4090, FP16)
- Single stream throughput: 78–88 tokens/sec
- 8-stream throughput: 150–176 tokens/sec
- First token latency: 510–690 ms
- Observed VRAM usage: 21–23GB
Observations
- Performance was consistent and predictable.
- Throughput scaled better than Ollama but below vLLM.
- Operational overhead was higher due to the container runtime.
TGI offers production controls and monitoring but does not extract maximum throughput from a single 4090.

Direct Comparison
| Stack | Single stream (tokens/s) | 8 streams (tokens/s) | First token latency (ms) | VRAM (GB) | GPU saturation |
|---|---|---|---|---|---|
| Ollama | 62–74 | 95–108 | 720–980 | 14–17 | Partial |
| TGI | 78–88 | 150–176 | 510–690 | 21–23 | High |
| vLLM | 92–104 | 185–215 | 360–480 | 20–22 | Very High |
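The table can also be read in terms of scaling: how much aggregate throughput each stack gains when moving from one stream to eight. The sketch below takes the midpoint of each reported range and assumes the 8-stream figures are aggregate across all streams. Midpoints are a simplification of the measured ranges, so treat the ratios as approximate.

```python
# Reported ranges from the comparison table, in tokens/sec.
RESULTS = {
    "Ollama": {"single": (62, 74), "eight": (95, 108)},
    "TGI": {"single": (78, 88), "eight": (150, 176)},
    "vLLM": {"single": (92, 104), "eight": (185, 215)},
}


def midpoint(bounds: tuple) -> float:
    return (bounds[0] + bounds[1]) / 2


def scaling(stack: str, streams: int = 8) -> dict:
    """Aggregate speedup and per-stream rate at the given concurrency."""
    single = midpoint(RESULTS[stack]["single"])
    eight = midpoint(RESULTS[stack]["eight"])
    return {
        "speedup": eight / single,              # aggregate gain vs one stream
        "per_stream": eight / streams,          # tokens/sec each client sees
        "efficiency": eight / (streams * single),  # 1.0 would be linear scaling
    }


for name in RESULTS:
    s = scaling(name)
    print(f"{name}: {s['speedup']:.2f}x aggregate, {s['per_stream']:.1f} tokens/sec per stream")
```

By this estimate vLLM and TGI roughly double their aggregate throughput at eight streams, while Ollama gains about 1.5x.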
Cost Implications on Decentralized GPUs
On decentralized marketplaces, RTX 4090 rentals average approximately $0.40–$0.50 per hour, depending on demand.
Assume:
- $0.45/hour rental
- 500,000 tokens generated
- 8 concurrent streams
Using the approximate midpoint of each stack's measured 8-stream throughput:
vLLM (~200 tokens/sec)
500,000 / 200 = 2,500 seconds ≈ 41.7 minutes
Cost ≈ $0.45 × 2,500 / 3,600 ≈ $0.31
Ollama (~100 tokens/sec)
500,000 / 100 = 5,000 seconds ≈ 83.3 minutes
Cost ≈ $0.45 × 5,000 / 3,600 ≈ $0.63
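The same arithmetic generalizes to any token count, throughput, and hourly rate. The helper below is a minimal sketch. The function name `job_cost` is illustrative, and it assumes billing is proportional to run time with no minimum rental period or startup overhead.

```python
def job_cost(tokens: int, tokens_per_s: float, usd_per_hour: float) -> dict:
    """Run time and rental cost for generating a fixed number of tokens."""
    seconds = tokens / tokens_per_s
    usd = usd_per_hour * seconds / 3600
    return {
        "seconds": seconds,
        "minutes": seconds / 60,
        "usd": usd,
        "usd_per_million_tokens": usd / tokens * 1_000_000,
    }


vllm = job_cost(500_000, tokens_per_s=200, usd_per_hour=0.45)
ollama = job_cost(500_000, tokens_per_s=100, usd_per_hour=0.45)

for name, job in (("vLLM", vllm), ("Ollama", ollama)):
    print(f"{name}: {job['minutes']:.1f} min, ${job['usd']:.2f}")
```

Expressing the result per million tokens makes it easier to compare against hosted API pricing.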
The cost difference is not dramatic in isolation. It compounds at scale.
At 50 million tokens per day, throughput efficiency directly affects GPU fleet size and rental duration.
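To make the fleet-size effect concrete, the sketch below estimates how many GPUs are needed to serve 50 million tokens per day at each throughput level. It assumes perfectly even load over 24 hours, no headroom for traffic peaks, and the $0.45/hour rate used above. Real deployments would need more capacity than this lower bound.

```python
import math


def fleet_for_daily_tokens(
    tokens_per_day: int, tokens_per_s_per_gpu: float, usd_per_hour: float
) -> dict:
    """Lower-bound GPU count and daily cost for a steady token volume."""
    gpu_seconds = tokens_per_day / tokens_per_s_per_gpu
    gpu_hours = gpu_seconds / 3600
    return {
        "gpu_hours": gpu_hours,
        "min_gpus": math.ceil(gpu_seconds / 86_400),  # 86,400 seconds per day
        "usd_per_day": gpu_hours * usd_per_hour,      # busy hours only
    }


at_200 = fleet_for_daily_tokens(50_000_000, 200, 0.45)
at_100 = fleet_for_daily_tokens(50_000_000, 100, 0.45)

print(at_200["min_gpus"], round(at_200["usd_per_day"], 2))
print(at_100["min_gpus"], round(at_100["usd_per_day"], 2))
```

Under these assumptions, doubling per-GPU throughput halves both the minimum fleet (3 GPUs instead of 6) and the daily spend on busy GPU hours.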
Running This Benchmark Yourself
If you want to reproduce these measurements without purchasing hardware, RTX 4090 nodes are typically available through the GPUFlow marketplace.
Machines are rented hourly and can be accessed immediately after connecting a wallet. There are no account approval delays, enterprise contracts, or long provisioning queues.
You can browse available GPUs at GPU Flow.
Because rental is hourly, inference efficiency directly impacts cost. The difference between 100 tokens/sec and 200 tokens/sec becomes meaningful over sustained workloads.
Deployment Context
If you are renting decentralized GPUs, inference efficiency directly determines capital efficiency.
Throughput affects:
- Escrow duration
- Blockchain settlement frequency
- Exposure to host instability
- Operational margin
Consumer GPUs remain economically viable for 7B–8B models when paired with efficient inference stacks.
When to Use Each
Ollama
- Internal tools
- Low concurrency
- Rapid prototyping
TGI
- Containerized environments
- Teams needing structured logging
- Managed production deployments
vLLM
- API services
- High concurrency
- Maximum tokens per dollar
Conclusion
On a single RTX 4090 running Llama‑3.1‑8B in FP16:
- vLLM achieved the highest sustained throughput.
- TGI provided balanced performance with production controls.
- Ollama favored simplicity over maximum GPU utilization.
Inference stack selection is not cosmetic. It defines cost structure and scaling behavior.
For workloads deployed on decentralized consumer GPUs, batching efficiency materially affects economics.
Where to Run This in Production
All benchmarks in this article were conducted on rented consumer hardware rather than owned infrastructure.
If you need immediate access to RTX 4090, RTX 3090, or higher-memory GPUs for inference or fine‑tuning, nodes are available on GPU Flow.
Rental is hourly. Payment is handled via stablecoin. Access is immediate after wallet connection.
Related Resources
Deepen your deployment stack knowledge:
- The Ultimate Guide to Private LLM Fine‑Tuning on Decentralized GPUs — Complete walkthrough for training open‑weights models securely
- GPU Rental Pricing Comparison 2026 — Measured cost differences across major GPU rental platforms
- Hidden Fees in GPU Rental — What hourly pricing pages do not disclose
- RunPod vs Vast.ai Comparison — Centralized vs marketplace infrastructure differences