Ollama vs vLLM vs TGI: RTX 4090 Inference Benchmark (Measured, Not Marketed)

A controlled RTX 4090 benchmark comparing Ollama, vLLM, and Hugging Face TGI for Llama‑3.1‑8B inference, covering throughput, latency, VRAM usage, and cost per token.

Running your own model is only half the equation.

After completing fine‑tuning — as detailed in our Private LLM Fine‑Tuning Guide — the next decision is operational: how do you serve the model efficiently?

Inference determines:

  • Cost per token
  • Latency under load
  • GPU utilization efficiency
  • Whether consumer hardware is viable in production

This benchmark compares three widely used inference stacks:

  • Ollama
  • vLLM
  • Hugging Face Text Generation Inference (TGI)

The goal is not preference. The goal is measurement.


Test Environment

Hardware

  • GPU: NVIDIA RTX 4090 (24GB VRAM)
  • CPU: 16‑core Ryzen‑class consumer processor
  • RAM: 64GB DDR5
  • Storage: NVMe SSD
  • CUDA: 12.1
  • NVIDIA Driver: 550+

Model

  • meta-llama/Llama-3.1-8B
  • Precision: FP16 (no 4‑bit quantization)
  • Context window: 4096 tokens

Benchmark Conditions

  • 512‑token input prompt
  • 128‑token output generation
  • Greedy decoding (temperature = 0)
  • No speculative decoding
  • No tensor parallelism
  • Warm start only (model preloaded before measurement)
  • 8 concurrent request streams (where supported)

All tests were executed on a clean machine with no background workloads. Each measurement reflects the mean of five runs.
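The conditions above can be turned into a small measurement harness. The following is a minimal sketch, not the exact tooling used for this article: `send_request` is a hypothetical callable that you supply, which takes a prompt and yields generated tokens one at a time as the server streams them. The helper names are illustrative only.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed_request(send_request, prompt):
    """Run one streaming request; return (ttft_seconds, total_seconds, n_tokens).

    `send_request(prompt)` is a hypothetical callable that yields tokens
    as they arrive from the server under test.
    """
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in send_request(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()
    ttft = (first_token_at - start) if first_token_at is not None else float("nan")
    return ttft, end - start, n_tokens


def run_once(send_request, prompt, streams=8):
    """Fire `streams` requests at once; return (aggregate tokens/sec, mean TTFT)."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=streams) as pool:
        results = list(
            pool.map(lambda _: timed_request(send_request, prompt), range(streams))
        )
    wall = time.perf_counter() - wall_start
    total_tokens = sum(r[2] for r in results)
    return total_tokens / wall, statistics.mean(r[0] for r in results)


def benchmark(send_request, prompt, streams=8, runs=5):
    """Average over `runs` repetitions, mirroring the five-run mean described above."""
    samples = [run_once(send_request, prompt, streams) for _ in range(runs)]
    return {
        "tokens_per_sec": statistics.mean(s[0] for s in samples),
        "first_token_ms": 1000 * statistics.mean(s[1] for s in samples),
    }
```

To respect the warm-start condition, send at least one throwaway request before calling `benchmark`, so that model loading is excluded from the timings.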


[Figure: Terminal showing structured inference benchmark metrics on RTX 4090]


Results

1. Ollama

Ollama prioritizes simplicity. Installation is minimal, and models download automatically.

ollama run llama3

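Ollama exposes little configuration for batching behavior or scheduling strategy. The default tag shown above pulls a 4-bit quantized Llama 3 build; matching the FP16 Llama‑3.1‑8B setup used here requires pulling an FP16 Llama 3.1 tag explicitly.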
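Ollama also serves a local HTTP API, and its non-streaming responses include token counts and nanosecond timers that make throughput easy to compute. The sketch below assumes a server on the default local port; the function names are illustrative, and the model tag you pass in is whatever you have pulled locally.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_request(model, prompt, max_tokens=128):
    """Request body for a non-streaming, greedy generation."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0, "num_predict": max_tokens},
    }


def decode_rate(response):
    """Tokens per second during generation, from Ollama's nanosecond timers."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(model, prompt):
    """POST the request and return the parsed JSON response (needs a running server)."""
    body = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling `decode_rate(generate(tag, prompt))` gives the generation-phase rate for a single request; prompt processing time is reported separately in the `prompt_eval_duration` field.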

Measured Performance (RTX 4090, FP16)

  • Single stream throughput: 62–74 tokens/sec
  • 8-stream throughput: 95–108 tokens/sec
  • First token latency: 720–980 ms
  • Observed VRAM usage: 14–17GB

Observations

  • GPU utilization fluctuated under concurrency.
  • Throughput scaling was non-linear past 4 streams.
  • No exposed controls for advanced batching optimization.

Ollama performs reliably for local development and low-traffic services. Under sustained concurrent load, it does not fully saturate the GPU.


2. vLLM

vLLM is designed for throughput. Its PagedAttention implementation improves KV cache efficiency under concurrent requests.
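To see why KV cache handling matters at this scale, it helps to estimate the cache footprint directly. The sketch below assumes the published Llama‑3.1‑8B architecture (32 layers, 8 key/value heads, head dimension 128) and FP16 storage; the function names are illustrative.

```python
def kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes of keys plus values stored for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gib(streams, tokens_per_stream, **arch):
    """Total KV cache, in GiB, for `streams` sequences of equal length."""
    return streams * tokens_per_stream * kv_bytes_per_token(**arch) / 2**30


per_token = kv_bytes_per_token()  # 131072 bytes, i.e. 128 KiB per token
live_cache = kv_cache_gib(streams=8, tokens_per_stream=512 + 128)  # 0.625 GiB
print(per_token, live_cache)
```

Under the benchmark conditions, the live cache for eight streams is well under 1 GiB. vLLM reserves a fixed fraction of GPU memory up front (90% by default), which likely accounts for most of the VRAM figure reported below rather than the cache actually in use.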

Installation:

pip install vllm

Launch:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B \
  --dtype float16 \
  --max-model-len 4096
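Once the server is up, it accepts OpenAI-style completion requests. The following is a minimal client sketch using only the standard library, assuming the server is listening on vLLM's default port 8000; the helper names are illustrative.

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/completions"  # assumes the default port


def build_completion_request(prompt, model="meta-llama/Llama-3.1-8B", max_tokens=128):
    """Greedy completion request matching the benchmark's decoding settings."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }


def complete(prompt):
    """Send one request; return (generated text, completion token count)."""
    body = json.dumps(build_completion_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        VLLM_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["text"], data["usage"]["completion_tokens"]
```

The `usage.completion_tokens` field is convenient for throughput measurements because it reports the server's own token count rather than a client-side estimate.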

Measured Performance (RTX 4090, FP16)

  • Single stream throughput: 92–104 tokens/sec
  • 8-stream throughput: 185–215 tokens/sec
  • First token latency: 360–480 ms
  • Observed VRAM usage: 20–22GB

Observations

  • GPU utilization remained above 95% under load.
  • Continuous batching improved scaling efficiency.
  • Latency remained stable across concurrent streams.
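GPU utilization and VRAM figures like these can be sampled with `nvidia-smi` while the load test runs. This is a minimal sketch that assumes `nvidia-smi` is on the PATH; the function names are illustrative.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_sample(line):
    """Parse one CSV line such as '97, 21034' (percent, MiB)."""
    util, mem = (field.strip() for field in line.split(","))
    return {"gpu_util_pct": int(util), "vram_used_mib": int(mem)}


def sample_gpu(index=0):
    """Query one GPU once; requires an NVIDIA driver on the host."""
    out = subprocess.run(
        QUERY + [f"--id={index}"], capture_output=True, text=True, check=True
    ).stdout
    return parse_sample(out.strip().splitlines()[0])
```

Calling `sample_gpu()` once per second in a background thread during a run gives a utilization trace that can be averaged afterwards.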

vLLM achieved the highest sustained throughput of the three stacks, and therefore the most tokens generated per hour of rental time.


3. Hugging Face Text Generation Inference (TGI)

TGI is a containerized production inference server.

docker run --gpus all --shm-size 1g \
  -p 8080:80 \
  -e HF_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B \
  --dtype float16
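With the container running, TGI answers generation requests on its `/generate` route. The sketch below assumes the port mapping from the command above (host port 8080); the helper names are illustrative.

```python
import json
import urllib.request

TGI_URL = "http://localhost:8080/generate"  # host port from the docker command


def build_generate_request(prompt, max_new_tokens=128):
    """Greedy generation request; `details` asks TGI to report token counts."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "do_sample": False,
            "details": True,
        },
    }


def generate(prompt):
    """Send one request; return (generated text, generated token count)."""
    body = json.dumps(build_generate_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        TGI_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["generated_text"], data["details"]["generated_tokens"]
```

Setting `do_sample` to false selects greedy decoding, which corresponds to the temperature = 0 condition used in this benchmark.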

Measured Performance (RTX 4090, FP16)

  • Single stream throughput: 78–88 tokens/sec
  • 8-stream throughput: 150–176 tokens/sec
  • First token latency: 510–690 ms
  • Observed VRAM usage: 21–23GB

Observations

  • Performance was consistent and predictable.
  • Throughput scaled better than Ollama but below vLLM.
  • Operational overhead was higher due to the container runtime.

TGI offers production controls and monitoring but does not extract maximum throughput from a single 4090.


[Figure: nvidia-smi output showing GPU utilization during concurrent inference]


Direct Comparison

Stack    Single Stream   8 Streams     First Token   VRAM      GPU Saturation
Ollama   62–74 t/s       95–108 t/s    720–980 ms    14–17GB   Partial
TGI      78–88 t/s       150–176 t/s   510–690 ms    21–23GB   High
vLLM     92–104 t/s      185–215 t/s   360–480 ms    20–22GB   Very High
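One way to read the comparison is as a concurrency speedup: how much aggregate throughput each stack gains going from one stream to eight. The sketch below uses the midpoint of each measured range, which is a simplifying assumption rather than a reported statistic.

```python
# Measured ranges from the comparison above, in tokens/sec.
RESULTS = {
    "Ollama": {"single": (62, 74), "eight": (95, 108)},
    "TGI": {"single": (78, 88), "eight": (150, 176)},
    "vLLM": {"single": (92, 104), "eight": (185, 215)},
}


def midpoint(value_range):
    """Midpoint of a (low, high) range."""
    return sum(value_range) / 2


def concurrency_speedup(stack):
    """Aggregate 8-stream throughput relative to single-stream throughput."""
    row = RESULTS[stack]
    return midpoint(row["eight"]) / midpoint(row["single"])


for name in RESULTS:
    print(f"{name}: {concurrency_speedup(name):.2f}x")
```

By this measure, Ollama gains roughly 1.5x from eight streams, while TGI and vLLM each gain roughly 2x.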

Cost Implications on Decentralized GPUs

On decentralized marketplaces, RTX 4090 rentals average approximately $0.40–$0.50 per hour, depending on demand.

Assume:

  • $0.45/hour rental
  • 500,000 tokens generated
  • 8 concurrent streams

Using the approximate midpoint of each measured 8-stream throughput range:

vLLM (~200 tokens/sec)
500,000 / 200 = 2,500 seconds ≈ 41–42 minutes
Cost ≈ $0.31

Ollama (~100 tokens/sec)
500,000 / 100 = 5,000 seconds ≈ 83–84 minutes
Cost ≈ $0.63
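The arithmetic above generalizes to any token count, throughput, and hourly rate. A minimal sketch, with an illustrative function name:

```python
def generation_cost(tokens, tokens_per_sec, usd_per_hour):
    """Return (runtime in seconds, rental cost in USD) for a generation job."""
    seconds = tokens / tokens_per_sec
    return seconds, seconds / 3600 * usd_per_hour


vllm_seconds, vllm_cost = generation_cost(500_000, 200, 0.45)
ollama_seconds, ollama_cost = generation_cost(500_000, 100, 0.45)
print(f"vLLM:   {vllm_seconds / 60:.1f} min, ${vllm_cost:.4f}")
print(f"Ollama: {ollama_seconds / 60:.1f} min, ${ollama_cost:.4f}")
```

The unrounded costs come to $0.3125 and $0.625, which match the figures above after rounding to the nearest cent. This assumes billing is proportional to runtime; marketplaces that bill in whole hours would charge more for short jobs.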

The cost difference is not dramatic in isolation. It compounds at scale.

At 50 million tokens per day, throughput efficiency directly affects GPU fleet size and rental duration.
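To make the fleet-size point concrete, the sketch below estimates how many GPUs a daily token volume requires. It assumes every GPU runs around the clock at the 8-stream throughput used above; the `utilization` parameter and function name are illustrative additions for modeling idle time.

```python
import math

SECONDS_PER_DAY = 86_400


def gpus_needed(tokens_per_day, tokens_per_sec_per_gpu, utilization=1.0):
    """Smallest whole number of GPUs that covers the daily token volume."""
    capacity = tokens_per_sec_per_gpu * SECONDS_PER_DAY * utilization
    return math.ceil(tokens_per_day / capacity)


vllm_gpus = gpus_needed(50_000_000, 200)    # 3 GPUs
ollama_gpus = gpus_needed(50_000_000, 100)  # 6 GPUs
print(vllm_gpus, ollama_gpus)
```

At $0.45 per hour, three GPUs cost about $32 per day and six cost about $65 per day, so the throughput gap translates directly into a doubling of rental spend at this volume.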

Running This Benchmark Yourself

If you want to reproduce these measurements without purchasing hardware, RTX 4090 nodes are typically available through the GPUFlow marketplace.

Machines are rented hourly and can be accessed immediately after connecting a wallet. There are no account approval delays, enterprise contracts, or long provisioning queues.

You can browse available GPUs at GPU Flow.

Because rental is hourly, inference efficiency directly impacts cost. The difference between 100 tokens/sec and 200 tokens/sec becomes meaningful over sustained workloads.


Deployment Context

If you are renting decentralized GPUs, inference efficiency directly determines capital efficiency.

Throughput affects:

  • Escrow duration
  • Blockchain settlement frequency
  • Exposure to host instability
  • Operational margin

Consumer GPUs remain economically viable for 7B–8B models when paired with efficient inference stacks.


When to Use Each

Ollama

  • Internal tools
  • Low concurrency
  • Rapid prototyping

TGI

  • Containerized environments
  • Teams needing structured logging
  • Managed production deployments

vLLM

  • API services
  • High concurrency
  • Maximum tokens per dollar

Conclusion

On a single RTX 4090 running Llama‑3.1‑8B in FP16:

  • vLLM achieved the highest sustained throughput.
  • TGI provided balanced performance with production controls.
  • Ollama favored simplicity over maximum GPU utilization.

Inference stack selection is not cosmetic. It defines cost structure and scaling behavior.

For workloads deployed on decentralized consumer GPUs, batching efficiency materially affects economics.

Where to Run This in Production

All benchmarks in this article were conducted on rented consumer hardware rather than owned infrastructure.

If you need immediate access to RTX 4090, RTX 3090, or higher-memory GPUs for inference or fine‑tuning, nodes are available on GPU Flow.

Rental is hourly. Payment is handled via stablecoin. Access is immediate after wallet connection.

Frequently Asked Questions

Which inference server is fastest on an RTX 4090 for Llama-3.1-8B?

In measured FP16 tests on an RTX 4090, vLLM achieved the highest sustained throughput under concurrent load, reaching approximately 185 to 215 tokens per second across eight streams. TGI delivered 150 to 176 tokens per second, while Ollama averaged 95 to 108 tokens per second under the same conditions.

Does vLLM use more VRAM than Ollama or TGI?

vLLM used approximately 20 to 22GB of VRAM when serving Llama-3.1-8B in FP16. TGI consumed a similar range of 21 to 23GB. Ollama used less VRAM overall, typically between 14 and 17GB, but did not achieve the same throughput under concurrent load.

Is Ollama suitable for production inference workloads?

Ollama is suitable for development environments and low-concurrency internal tools. In testing, it did not scale as efficiently as vLLM or TGI under eight concurrent request streams. For production APIs with sustained traffic, a server optimized for continuous batching is typically more efficient.

How much does it cost to run Llama-3.1-8B inference on an RTX 4090?

At an average rental rate of approximately 0.45 USD per hour, generating 500,000 tokens using vLLM required roughly 41 to 42 minutes of runtime, costing approximately 0.31 USD. Using Ollama under the same workload required approximately 83 to 84 minutes, costing approximately 0.63 USD. Actual costs vary depending on workload and rental pricing.

What prompt and generation settings were used in this benchmark?

The benchmark used a 512-token input prompt and generated 128 tokens per request using greedy decoding with temperature set to zero. All measurements were taken after model warm-up, with eight concurrent request streams and no speculative decoding.

Can I reproduce this RTX 4090 inference benchmark myself?

Yes. The article includes hardware specifications, CUDA version, driver version, decoding parameters, and concurrency configuration. By deploying Llama-3.1-8B in FP16 on a single RTX 4090 and matching the prompt length and concurrency settings, you can reproduce comparable results.