Monitoring vLLM in Production with Grafana & Prometheus

Running a large language model in production is one thing. Knowing what's happening inside it is another. When you deploy Qwen3.6-27B on an A100 server with vLLM, you need visibility into GPU utilization, request throughput, token generation rates, queue depth, and more — in real time.

This guide documents our experience deploying vLLM monitoring dashboards on two A100 servers using Prometheus and Grafana, covering the architecture, deployment steps, common pitfalls, and the key metrics that matter.

vLLM Monitoring Dashboard

Why Monitor vLLM?

vLLM is a widely used inference engine for large language models, powering everything from local deployments to cloud-scale serving. But without monitoring, you're flying blind:

GPU memory leaks can silently fill up your 80GB A100, causing OOM crashes hours into a long-running session
Request queue buildup means users are waiting, but you won't know until they complain
Token generation rate drops can indicate hardware issues, thermal throttling, or model problems
Prefix cache hit rates tell you if your optimization strategies are actually working

The good news? vLLM ships with built-in Prometheus metrics, exposed at the /metrics endpoint. Our deployment reported 386 of them; the exact count depends on the vLLM version and the features you enable. All you need is Prometheus to collect them and Grafana to visualize them.

Architecture Overview

The monitoring stack is simple and lightweight:

vLLM (port 8391) → /metrics endpoint (Prometheus format, 386 metrics)
    ↓
Prometheus (port 9090) → scrapes /metrics every 15s
    ↓
Grafana (port 3000) → visualizes with vLLM Monitoring V2 dashboard (ID: 24756)

All three components run on the same server. vLLM serves the model, Prometheus collects metrics, and Grafana renders the dashboard. The entire stack fits in a few Docker containers.

Step 1: Discover vLLM's Built-in Metrics

vLLM exposes metrics at http://<server>:8391/metrics in Prometheus text format. A quick curl reveals 386 distinct metrics covering:

GPU utilization: memory usage, KV cache allocation, GPU utilization percentage
Request metrics: total requests, running/queued/swapped requests, request latency
Token metrics: generated tokens per second, prompt tokens per second, iteration tokens
Cache metrics: prefix cache hit rate, block eviction count
Throughput: requests per second, tokens per second
MTP (Multi-Token Prediction): speculative token acceptance rate

The metrics are organized with proper Prometheus labels, making them easy to filter and aggregate in Grafana.
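The inspection described above can also be scripted. The following is a minimal sketch in Python, assuming the endpoint is reachable at the address used in this guide; the function names are ours, not part of vLLM, and the counts you get will vary with your vLLM version.

```python
import urllib.request


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """Download the raw Prometheus text exposition from a /metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def metric_families(text: str) -> dict:
    """Map each metric family name to its declared type, using '# TYPE' lines."""
    families = {}
    for line in text.splitlines():
        parts = line.split()
        # A type declaration looks like: "# TYPE <name> <counter|gauge|histogram|...>"
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            families[parts[2]] = parts[3]
    return families


def count_samples(text: str) -> int:
    """Count sample lines, i.e. non-empty lines that are not comments."""
    return sum(
        1 for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    )
```

For example, metric_families(fetch_metrics("http://localhost:8391/metrics")) lists the families and their types. A histogram family expands into many sample lines (one per bucket, plus a sum and a count), so the family count and the sample count usually differ.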

Step 2: Deploy Prometheus

We run Prometheus as a Docker container in host network mode, so that it can reach vLLM's /metrics endpoint on the host at localhost:8391.

docker run -d \
  --name prometheus \
  --network host \
  -v ~/monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
  -v ~/monitoring/prometheus/data:/prometheus \
  prom/prometheus:latest

The configuration file prometheus.yml targets the vLLM metrics endpoint:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:8391']
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

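Key insight: With --network host, the container shares the host's network stack, so localhost:8391 reaches vLLM. On Docker's default bridge network, localhost inside the container refers to the container itself, and the scrape target above would fail. If you prefer bridge networking, the target must instead be a host address that vLLM is listening on.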
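Once Prometheus is running, it is worth confirming that the vllm job is actually being scraped. The sketch below reads Prometheus' /api/v1/targets endpoint and reports the health of each target for a job; the helper names are ours, and the URL assumes the host-network setup above.

```python
import json
import urllib.request


def fetch_targets(prometheus_url: str = "http://localhost:9090") -> dict:
    """Query the Prometheus targets API and decode the JSON body."""
    with urllib.request.urlopen(f"{prometheus_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def target_health(targets_response: dict, job: str) -> list:
    """Return (scrape URL, health, last error) for each active target of a job."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        (t.get("scrapeUrl", ""), t.get("health", "unknown"), t.get("lastError", ""))
        for t in active
        if t.get("labels", {}).get("job") == job
    ]
```

Calling target_health(fetch_targets(), "vllm") should return one entry whose health is "up"; a health of "down" together with a connection-refused error is the typical symptom of the bridge-network problem described above.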

Step 3: Deploy Grafana

Grafana provides the visualization layer. We use the official Docker image with persistent storage:

docker run -d \
  --name grafana \
  --network host \
  -v ~/monitoring/grafana:/var/lib/grafana \
  -e GF_SECURITY_ADMIN_PASSWORD=admin \
  grafana/grafana:latest

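Common pitfall: The Grafana data directory must be writable by the user the container runs as. If Grafana fails to start with a permission error, the quick fix we used was:

chmod 777 ~/monitoring/grafana

This makes the directory world-writable. A tighter alternative is to hand ownership to the UID the official image runs as (472 in current images), for example with sudo chown -R 472 ~/monitoring/grafana.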

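Port conflict: On our office A100 server, port 3000 was already occupied by a Node.js process, so we moved Grafana to port 3001. Docker ignores -p port mappings for containers started with --network host, so with the command above the port has to be changed through Grafana's own configuration:

-e GF_SERVER_HTTP_PORT=3001

A mapping such as -p 3001:3000 only takes effect on the default bridge network, where Grafana would in turn no longer reach Prometheus at localhost:9090.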

Step 4: Configure the Data Source

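After Grafana starts, add Prometheus as a data source via the HTTP API. The example authenticates with a service account token; basic auth (-u admin:<password>) works as well: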

curl -s -X POST http://localhost:3000/api/datasources \
  -H "Authorization: Bearer <admin-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://localhost:9090",
    "access": "proxy",
    "isDefault": true
  }'

Verify the connection at http://<server>:3000 with credentials admin/admin.
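The connection can also be verified from a script. The sketch below lists the configured data sources over the Grafana API with basic auth and picks out the Prometheus UID, which is useful later when a dashboard needs to reference the data source explicitly. The helper names are ours; the credentials and URL are the ones used in this guide.

```python
import base64
import json
import urllib.request


def list_datasources(base_url: str, user: str, password: str) -> list:
    """Fetch all data sources from Grafana using HTTP basic auth."""
    req = urllib.request.Request(f"{base_url}/api/datasources")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)


def find_datasource_uid(datasources: list, ds_type: str = "prometheus"):
    """Return the UID of the default data source of a type, else the first match, else None."""
    matches = [d for d in datasources if d.get("type") == ds_type]
    if not matches:
        return None
    default = [d for d in matches if d.get("isDefault")]
    return (default or matches)[0].get("uid")
```

For example, find_datasource_uid(list_datasources("http://localhost:3000", "admin", "admin")) returns the UID of the data source created above, or None if the POST did not succeed.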

Step 5: Import the vLLM Dashboard

The Grafana community has an excellent vLLM dashboard — ID 24756 (vLLM Monitoring V2) — which provides comprehensive visualization out of the box.

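Important: Import the dashboard through the /api/dashboards/db endpoint, not /api/dashboards/import; in our deployment the latter failed silently. The /api/dashboards/db endpoint expects the dashboard JSON wrapped in a top-level "dashboard" field (optionally with "overwrite": true), so vllm-dashboard.json must be in that form: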

curl -s -X POST http://localhost:3000/api/dashboards/db \
  -H "Authorization: Bearer <admin-token>" \
  -H "Content-Type: application/json" \
  -d @vllm-dashboard.json

The dashboard includes panels for:

GPU memory usage over time
Request throughput (requests/sec)
Token generation rate (tokens/sec)
Queue depth and latency percentiles
KV cache utilization
Prefix cache hit rate
MTP speculative token acceptance
Active/running/queued request counts
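A dashboard JSON downloaded from grafana.com is not yet in the shape that /api/dashboards/db accepts. The sketch below shows one way to prepare it, assuming the download references its data source through the placeholder ${DS_PROMETHEUS}, as shared dashboards commonly do. The function name and default placeholder are our own choices.

```python
import json


def build_import_payload(dashboard: dict, datasource_uid: str,
                         placeholder: str = "${DS_PROMETHEUS}") -> dict:
    """Wrap a raw dashboard export in the body expected by POST /api/dashboards/db."""
    # Substitute the data source placeholder everywhere it appears in the JSON.
    text = json.dumps(dashboard).replace(placeholder, datasource_uid)
    resolved = json.loads(text)
    # Import-time metadata is not needed once the placeholder is resolved.
    resolved.pop("__inputs", None)
    # A null id tells Grafana to create a new dashboard instead of updating by id.
    resolved["id"] = None
    return {"dashboard": resolved, "overwrite": True}
```

Writing the result with json.dump to vllm-dashboard.json gives a file that the curl command above can post directly. The input dictionary is left unmodified because the substitution works on a serialized copy.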

Deployment on Two Servers

We deployed this monitoring stack on two A100 servers with different starting conditions:

Server 1: Dual A100 Server (172.24.217.58)

OS: Ubuntu with Docker pre-installed
vLLM: Qwen3.6-27B on GPU 1, port 8391, FP8 quantization
Grafana: Port 3000 (default)
Dashboard: http://172.24.217.58:3000
Notes: Smooth deployment, Docker Hub accessible via proxy (127.0.0.1:7897)

Server 2: Office A100 Server (172.24.168.225)

OS: Ubuntu 22.04, no Docker pre-installed
Docker: Installed via apt install docker.io (version 29.1.3)
Docker Hub: Direct connection timed out; configured mirror accelerator (docker.1ms.run)
Port conflict: Port 3000 occupied by Node.js → Grafana on port 3001
vLLM: Qwen3.6-27B FP8, max_model_len=200000
Dashboard: http://172.24.168.225:3001
Notes: Required Docker installation, mirror configuration, and port adjustment
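The mirror configuration on Server 2 lives in /etc/docker/daemon.json under the registry-mirrors key. As a rough sketch, the helper below adds a mirror to an existing configuration without duplicating it or dropping other settings. The function name is ours, and the https:// scheme on the mirror host is an assumption.

```python
import json


def add_registry_mirror(config: dict, mirror: str) -> dict:
    """Return a copy of a daemon.json config with the mirror added once."""
    updated = dict(config)
    mirrors = list(updated.get("registry-mirrors", []))
    if mirror not in mirrors:
        mirrors.append(mirror)
    updated["registry-mirrors"] = mirrors
    return updated


# Example: start from an empty config and print the JSON to place in daemon.json.
print(json.dumps(add_registry_mirror({}, "https://docker.1ms.run"), indent=2))
```

After editing daemon.json, the Docker daemon has to be restarted (for example with sudo systemctl restart docker) before the mirror is used.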

Key Metrics That Matter

After running the dashboard for several days, these are the metrics we monitor most closely:

Metric	Why It Matters	Alert Threshold
GPU Memory Usage	Detect memory leaks early	>85% of total
Request Queue Length	User experience indicator	>10 pending
Tokens/Second	Throughput health check	<50% of baseline
Prefix Cache Hit Rate	Optimization effectiveness	Track trend
Request Latency P99	Tail latency monitoring	>2x median
KV Cache Blocks	Memory pressure signal	>90% allocated
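To make the thresholds in the table concrete, here is a small sketch that evaluates them against one snapshot of values. The Snapshot fields and alert names are hypothetical; in practice each value would come from a Prometheus query, and the baseline throughput is something you measure for your own workload.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    gpu_memory_fraction: float  # used / total, between 0.0 and 1.0
    queue_length: int           # pending requests
    tokens_per_second: float    # current generation throughput
    latency_p99_s: float        # P99 request latency in seconds
    latency_median_s: float     # median request latency in seconds
    kv_cache_fraction: float    # allocated / total blocks, between 0.0 and 1.0


def evaluate(snapshot: Snapshot, baseline_tokens_per_second: float) -> list:
    """Return the names of the thresholds from the table that are exceeded."""
    alerts = []
    if snapshot.gpu_memory_fraction > 0.85:
        alerts.append("gpu_memory")
    if snapshot.queue_length > 10:
        alerts.append("queue_length")
    if snapshot.tokens_per_second < 0.5 * baseline_tokens_per_second:
        alerts.append("tokens_per_second")
    if snapshot.latency_p99_s > 2 * snapshot.latency_median_s:
        alerts.append("latency_p99")
    if snapshot.kv_cache_fraction > 0.90:
        alerts.append("kv_cache")
    return alerts
```

The prefix cache hit rate is left out on purpose: the table tracks it as a trend rather than against a fixed threshold.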

Common Pitfalls and Solutions

1. Docker Network Mode

Problem: Container cannot reach vLLM on localhost.
Solution: Use --network host for both Prometheus and Grafana containers.

2. Grafana Data Directory Permissions

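Problem: Grafana crashes on startup with a permission-denied error.
Solution: Make the mounted data directory writable by the container user before starting the container. We used chmod 777; changing the directory owner to the container's UID is the tighter option.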

3. Dashboard Import API

Problem: Dashboard import fails silently.
Solution: Use the /api/dashboards/db endpoint rather than /api/dashboards/import, and wrap the dashboard JSON in a "dashboard" field.

4. Docker Hub Timeout

Problem: docker pull hangs on servers behind the Great Firewall (GFW).
Solution: Configure Docker registry mirrors in /etc/docker/daemon.json and restart the Docker daemon.

5. Port Conflicts

Problem: Default ports (3000, 9090) are already in use.
Solution: Check with ss -tlnp | grep <port> and use alternative ports.
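The port check can be automated before starting the containers. The following sketch tests whether something accepts TCP connections on a port and picks the first free candidate; the function names are ours. It only probes one address (127.0.0.1 by default), so a service bound exclusively to another interface would not be detected.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


def first_free_port(candidates) -> int:
    """Return the first candidate port that nothing is listening on."""
    for port in candidates:
        if not port_in_use(port):
            return port
    raise RuntimeError("no free port among candidates")
```

On the office server described earlier, first_free_port([3000, 3001]) would have skipped the port held by the Node.js process and suggested 3001.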

Results

The monitoring dashboard provides real-time visibility into vLLM's operation. Key observations from our deployment:

FP8 quantization reduces GPU memory usage from ~69GB (BF16) to ~20GB, leaving ample headroom
Prefix caching shows high hit rates for repeated prompts, significantly reducing latency
MTP speculative decoding achieves ~60% token acceptance rate, boosting throughput
Request latency remains stable under moderate load, with P99 typically under 2 seconds

Conclusion

Monitoring vLLM in production doesn't have to be complex. With the built-in /metrics endpoint, Prometheus, and Grafana's community dashboard, you can deploy a comprehensive monitoring stack in under 30 minutes.

The key takeaways:

Use Docker host network mode for container-to-host communication
Import the dashboard via the /api/dashboards/db API
Set proper permissions on Grafana data directories
Configure Docker mirrors when behind firewalls
Monitor the right metrics — GPU memory, queue depth, and token throughput are your north stars

With this setup, you'll catch issues before they become outages and have the data to optimize your LLM serving infrastructure.

Troubleshooting: `Datasource ${DS_PROMETHEUS} was not found`

If panels show this error after you import the dashboard (ID 24756):

PanelQueryRunner Error {message: 'Datasource ${DS_PROMETHEUS} was not found'}

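Cause: The dashboard JSON refers to its data source through the placeholder ${DS_PROMETHEUS}. Grafana's import dialog in the UI asks for a data source and fills this placeholder in, but JSON posted to /api/dashboards/db is stored as-is. With no template variable of that name defined, the placeholder never resolves to an actual datasource UID.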

Solution: Replace ${DS_PROMETHEUS} in the Dashboard JSON with the actual Prometheus datasource UID.

# 1. Get Prometheus datasource UID
curl -s -u admin:your_password http://localhost:3000/api/datasources | python3 -m json.tool
# Find the "uid": "xxxxxxxxx" value

# 2. Export Dashboard JSON
curl -s -u admin:your_password http://localhost:3000/api/dashboards/uid/vllm-master-v2 > dashboard.json

# 3. Replace the variable (using python3 for JSON handling)
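python3 -c "
import json
with open('dashboard.json') as f:
    data = json.load(f)
# The export wraps the dashboard as {'meta': ..., 'dashboard': ...}; keep only the dashboard.
text = json.dumps(data['dashboard']).replace('\${DS_PROMETHEUS}', 'YOUR_PROMETHEUS_UID')
# Re-wrap it in the body expected by /api/dashboards/db and allow overwriting.
payload = {'dashboard': json.loads(text), 'overwrite': True}
with open('dashboard_fixed.json', 'w') as f:
    json.dump(payload, f)
print('Replaced successfully')
"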

# 4. Re-import the dashboard
curl -s -u admin:your_password -X POST http://localhost:3000/api/dashboards/db \
  -H 'Content-Type: application/json' \
  -d @dashboard_fixed.json

Prevention: Before importing a dashboard, make sure the Prometheus data source is configured in Grafana and that every variable reference in the dashboard JSON can be resolved. For community dashboards (like ID 24756), verify the datasource references after import.
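That check can be done before the import. The sketch below scans a dashboard JSON for ${DS_...} placeholders and reports those that are not backed by a template variable of the same name. The function name and the DS_ naming pattern are assumptions based on the placeholder seen in this dashboard.

```python
import json
import re

# Matches template references such as ${DS_PROMETHEUS}.
_DS_PLACEHOLDER = re.compile(r"\$\{(DS_[A-Za-z0-9_]+)\}")


def unresolved_datasource_refs(dashboard: dict) -> set:
    """Return DS_* placeholder names that no template variable defines."""
    found = set(_DS_PLACEHOLDER.findall(json.dumps(dashboard)))
    defined = {
        var.get("name")
        for var in dashboard.get("templating", {}).get("list", [])
    }
    return found - defined
```

An empty result means every data source reference is either a concrete UID or backed by a dashboard variable; anything else needs the replacement described above before the panels will load.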
