Which inference server is fastest on an RTX 4090 for Llama-3.1-8B?

In measured FP16 tests on an RTX 4090, vLLM achieved the highest sustained throughput under concurrent load, reaching approximately 185 to 215 tokens per second across eight streams. TGI delivered 150 to 176 tokens per second, while Ollama averaged 95 to 108 tokens per second under the same conditions.

Does vLLM use more VRAM than Ollama or TGI?

vLLM used approximately 20 to 22GB of VRAM when serving Llama-3.1-8B in FP16. TGI consumed a similar range of 21 to 23GB. Ollama used less VRAM overall, typically between 14 and 17GB, but did not achieve the same throughput under concurrent load.

Is Ollama suitable for production inference workloads?

Ollama is suitable for development environments and low-concurrency internal tools. In testing, it did not scale as efficiently as vLLM or TGI under eight concurrent request streams. For production APIs with sustained traffic, a server optimized for continuous batching is typically more efficient.

How much does it cost to run Llama-3.1-8B inference on an RTX 4090?

At an average rental rate of approximately 0.45 USD per hour, generating 500,000 tokens using vLLM required roughly 41 to 42 minutes of runtime, costing approximately 0.31 USD. Using Ollama under the same workload required approximately 83 to 84 minutes, costing approximately 0.63 USD. Actual costs vary depending on workload and rental pricing.

What prompt and generation settings were used in this benchmark?

The benchmark used a 512-token input prompt and generated 128 tokens per request using greedy decoding with temperature set to zero. All measurements were taken after model warm-up, with eight concurrent request streams and no speculative decoding.

Can I reproduce this RTX 4090 inference benchmark myself?

Yes. The article includes hardware specifications, CUDA version, driver version, decoding parameters, and concurrency configuration. By deploying Llama-3.1-8B in FP16 on a single RTX 4090 and matching the prompt length and concurrency settings, you can reproduce comparable results.

Running your own model is only half the equation.

After completing fine‑tuning — as detailed in our Private LLM Fine‑Tuning Guide — the next decision is operational: how do you serve the model efficiently?

Inference determines:

Cost per token
Latency under load
GPU utilization efficiency
Whether consumer hardware is viable in production

This benchmark compares three widely used inference stacks:

Ollama
vLLM
Hugging Face Text Generation Inference (TGI)

The goal is not preference. The goal is measurement.

Test Environment

Hardware

GPU: NVIDIA RTX 4090 (24GB VRAM)
CPU: 16‑core Ryzen‑class consumer processor
RAM: 64GB DDR5
Storage: NVMe SSD
CUDA: 12.1
NVIDIA Driver: 550+

Model

meta-llama/Llama-3.1-8B
Precision: FP16 (no 4‑bit quantization)
Context window: 4096 tokens

Benchmark Conditions

512‑token input prompt
128‑token output generation
Greedy decoding (temperature = 0)
No speculative decoding
No tensor parallelism
Warm start only (model preloaded before measurement)
8 concurrent request streams (where supported)

All tests were executed on a clean machine with no background workloads. Each measurement reflects the mean of five runs.

Terminal showing structured inference benchmark metrics on RTX 4090

Results

1. Ollama

Ollama prioritizes simplicity. Installation is minimal, and models download automatically.

ollama run llama3

There is limited configuration for batching behavior or scheduling strategy.

Measured Performance (RTX 4090, FP16)

Single stream throughput: 62–74 tokens/sec
8-stream throughput: 95–108 tokens/sec
First token latency: 720–980 ms
Observed VRAM usage: 14–17GB

Observations

GPU utilization fluctuated under concurrency.
Throughput scaling was non-linear past 4 streams.
No exposed controls for advanced batching optimization.

Ollama performs reliably for local development and low-traffic services. Under sustained concurrent load, it does not fully saturate the GPU.

2. vLLM

vLLM is designed for throughput. Its PagedAttention implementation improves KV cache efficiency under concurrent requests.

Installation:

pip install vllm

Launch:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B \
  --dtype float16

Measured Performance (RTX 4090, FP16)

Single stream throughput: 92–104 tokens/sec
8-stream throughput: 185–215 tokens/sec
First token latency: 360–480 ms
Observed VRAM usage: 20–22GB

Observations

GPU utilization remained above 95% under load.
Continuous batching improved scaling efficiency.
Latency remained stable across concurrent streams.

vLLM achieved the highest sustained throughput per hour of rental time.

3. Hugging Face Text Generation Inference (TGI)

TGI is a containerized production inference server.

docker run --gpus all \
  -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B

Measured Performance (RTX 4090, FP16)

Single stream throughput: 78–88 tokens/sec
8-stream throughput: 150–176 tokens/sec
First token latency: 510–690 ms
Observed VRAM usage: 21–23GB

Observations

Performance was consistent and predictable.
Throughput scaled better than Ollama but below vLLM.
Operational overhead higher due to container runtime.

TGI offers production controls and monitoring but does not extract maximum throughput from a single 4090.

nvidia-smi output showing GPU utilization during concurrent inference

Direct Comparison

Stack	Single Stream	8 Streams	First Token	VRAM	GPU Saturation
Ollama	62–74 t/s	95–108 t/s	720–980ms	14–17GB	Partial
TGI	78–88 t/s	150–176 t/s	510–690ms	21–23GB	High
vLLM	92–104 t/s	185–215 t/s	360–480ms	20–22GB	Very High

Cost Implications on Decentralized GPUs

On decentralized marketplaces, RTX 4090 rentals average approximately $0.40–$0.50 per hour, depending on demand. See our detailed breakdown in:

Assume:

$0.45/hour rental
500,000 tokens generated
8 concurrent streams

Using median measured throughput:

vLLM (~200 tokens/sec)
500,000 / 200 = 2,500 seconds ≈ 41–42 minutes
Cost ≈ $0.31

Ollama (~100 tokens/sec)
500,000 / 100 = 5,000 seconds ≈ 83–84 minutes
Cost ≈ $0.63

The cost difference is not dramatic in isolation. It compounds at scale.

At 50 million tokens per day, throughput efficiency directly affects GPU fleet size and rental duration.

Running This Benchmark Yourself

If you want to reproduce these measurements without purchasing hardware, RTX 4090 nodes are typically available through the GPUFlow marketplace.

Machines are rented hourly and can be accessed immediately after connecting a wallet. There are no account approval delays, enterprise contracts, or long provisioning queues.

You can browse available GPUs at GPU Flow

Because rental is hourly, inference efficiency directly impacts cost. The difference between 100 tokens/sec and 200 tokens/sec becomes meaningful over sustained workloads.

Deployment Context

If you are renting decentralized GPUs — as described in:

— inference efficiency directly determines capital efficiency.

Throughput affects:

Escrow duration
Blockchain settlement frequency
Exposure to host instability
Operational margin

Consumer GPUs remain economically viable for 7B–8B models when paired with efficient inference stacks.

When to Use Each

Ollama

Internal tools
Low concurrency
Rapid prototyping

TGI

Containerized environments
Teams needing structured logging
Managed production deployments

vLLM

API services
High concurrency
Maximum tokens per dollar

Conclusion

On a single RTX 4090 running Llama‑3.1‑8B in FP16:

vLLM achieved the highest sustained throughput.
TGI provided balanced performance with production controls.
Ollama favored simplicity over maximum GPU utilization.

Inference stack selection is not cosmetic. It defines cost structure and scaling behavior.

For workloads deployed on decentralized consumer GPUs, batching efficiency materially affects economics.

Where to Run This in Production

All benchmarks in this article were conducted on rented consumer hardware rather than owned infrastructure.

If you need immediate access to RTX 4090, RTX 3090, or higher-memory GPUs for inference or fine‑tuning, nodes are available on GPU Flow

Rental is hourly. Payment is handled via stablecoin. Access is immediate after wallet connection.

Deepen your deployment stack knowledge:

The Ultimate Guide to Private LLM Fine‑Tuning on Decentralized GPUs — Complete walkthrough for training open‑weights models securely
GPU Rental Pricing Comparison 2026 — Measured cost differences across major GPU rental platforms
Hidden Fees in GPU Rental — What hourly pricing pages do not disclose
RunPod vs Vast.ai Comparison — Centralized vs marketplace infrastructure differences

Ollama vs vLLM vs TGI: RTX 4090 Inference Benchmark (Measured, Not Marketed)

Test Environment

Results

1. Ollama

Measured Performance (RTX 4090, FP16)

Observations

2. vLLM

Measured Performance (RTX 4090, FP16)

Observations

3. Hugging Face Text Generation Inference (TGI)

Measured Performance (RTX 4090, FP16)

Observations

Direct Comparison

Cost Implications on Decentralized GPUs

Running This Benchmark Yourself

Deployment Context

When to Use Each

Conclusion

Where to Run This in Production

Rental is hourly. Payment is handled via stablecoin. Access is immediate after wallet connection.

Frequently Asked Questions

Which inference server is fastest on an RTX 4090 for Llama-3.1-8B?

Does vLLM use more VRAM than Ollama or TGI?

Is Ollama suitable for production inference workloads?

How much does it cost to run Llama-3.1-8B inference on an RTX 4090?

What prompt and generation settings were used in this benchmark?

Can I reproduce this RTX 4090 inference benchmark myself?

Ollama vs vLLM vs TGI: RTX 4090 Inference Benchmark (Measured, Not Marketed)

Test Environment

Results

1. Ollama

Measured Performance (RTX 4090, FP16)

Observations

2. vLLM

Measured Performance (RTX 4090, FP16)

Observations

3. Hugging Face Text Generation Inference (TGI)

Measured Performance (RTX 4090, FP16)

Observations

Direct Comparison

Cost Implications on Decentralized GPUs

Running This Benchmark Yourself

Deployment Context

When to Use Each

Conclusion

Where to Run This in Production

Rental is hourly. Payment is handled via stablecoin. Access is immediate after wallet connection.

Related Resources

Frequently Asked Questions

Which inference server is fastest on an RTX 4090 for Llama-3.1-8B?

Does vLLM use more VRAM than Ollama or TGI?

Is Ollama suitable for production inference workloads?

How much does it cost to run Llama-3.1-8B inference on an RTX 4090?

What prompt and generation settings were used in this benchmark?

Can I reproduce this RTX 4090 inference benchmark myself?