Monitoring vLLM in Production with Grafana & Prometheus
Running a large language model in production is one thing. Knowing what's happening inside it is another. When you deploy Qwen3.6-27B on an A100 server with vLLM, you need visibility into GPU utilization, request throughput, token generation rates, queue depth, and more — in real time.
This guide documents our experience deploying vLLM monitoring dashboards on two A100 servers using Prometheus and Grafana, covering the architecture, deployment steps, common pitfalls, and the key metrics that matter.

Why Monitor vLLM?
vLLM is a widely used inference engine for large language models, running in everything from local deployments to cloud-scale serving. But without monitoring, you're flying blind:
- GPU memory leaks can silently fill up your 80GB A100, causing OOM crashes hours into a long-running session
- Request queue buildup means users are waiting, but you won't know until they complain
- Token generation rate drops can indicate hardware issues, thermal throttling, or model problems
- Prefix cache hit rates tell you if your optimization strategies are actually working
The good news? vLLM ships with built-in Prometheus metrics, exposed at the /metrics endpoint. Our deployment reported 386 of them; the exact count depends on the vLLM version and the features you enable. All you need is Prometheus to collect them and Grafana to visualize them.
Architecture Overview
The monitoring stack is simple and lightweight:
vLLM (port 8391) → /metrics endpoint (Prometheus format, 386 metrics)
↓
Prometheus (port 9090) → scrapes /metrics every 15s
↓
Grafana (port 3000) → visualizes with vLLM Monitoring V2 dashboard (ID: 24756)
All three components run on the same server. vLLM serves the model, Prometheus collects metrics, and Grafana renders the dashboard. The entire stack fits in a few Docker containers.
Step 1: Discover vLLM's Built-in Metrics
vLLM exposes metrics at http://<server>:8391/metrics in Prometheus text format. A quick curl reveals 386 distinct metrics covering:
- GPU utilization: memory usage, KV cache allocation, GPU utilization percentage
- Request metrics: total requests, running/queued/swapped requests, request latency
- Token metrics: generated tokens per second, prompt tokens per second, iteration tokens
- Cache metrics: prefix cache hit rate, block eviction count
- Throughput: requests per second, tokens per second
- MTP (Multi-Token Prediction): speculative token acceptance rate
The metrics are organized with proper Prometheus labels, making them easy to filter and aggregate in Grafana.
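To see what an endpoint actually exposes, it can help to group the payload by metric name. The following is a minimal sketch, not part of vLLM: the helper names and the three-metric sample are illustrative assumptions, and a real payload from your server will be much larger.

```python
import re
import urllib.request
from collections import defaultdict

# Illustrative sample in Prometheus text format (assumed, not real server output).
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 2.0
# HELP vllm:num_requests_waiting Number of requests waiting.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 5.0
# HELP vllm:generation_tokens_total Generated tokens.
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model_name="demo"} 1234.0
"""

METRIC_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:]*")


def summarize_metrics(text):
    """Return {metric_name: number_of_series} for a Prometheus text payload."""
    series = defaultdict(int)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = METRIC_NAME.match(line)
        if match:
            series[match.group(0)] += 1
    return dict(series)


def fetch_metrics(url="http://localhost:8391/metrics"):
    """Download the raw metrics text (port 8391 matches this guide's setup)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


summary = summarize_metrics(SAMPLE)
print(f"{len(summary)} metric names, {sum(summary.values())} series")
```

Against a live server, pass `fetch_metrics()` to `summarize_metrics` instead of `SAMPLE`.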
Step 2: Deploy Prometheus
Prometheus is deployed as a Docker container with host network mode — this is critical because the container needs to access vLLM's /metrics endpoint on the host machine.
docker run -d \
--name prometheus \
--network host \
-v ~/monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
-v ~/monitoring/prometheus/data:/prometheus \
prom/prometheus:latest
The configuration file prometheus.yml targets the vLLM metrics endpoint:
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:8391']
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
Key insight: Using --network host is the simplest way to make this work. On Docker's default bridge network, localhost inside the container refers to the container's own loopback interface, not the host's, so a scrape of localhost:8391 never reaches vLLM.
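Once Prometheus is running, its /api/v1/targets endpoint reports whether each scrape target is healthy. A minimal sketch of checking that response follows; the helper names and the sample response are assumptions for illustration, with a failing vLLM target of the kind you would see if the container could not reach the host.

```python
import json
import urllib.request


def unhealthy_targets(targets_response):
    """Return (job, scrape_url, last_error) for active targets that are not 'up'."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        (
            target.get("labels", {}).get("job", "?"),
            target.get("scrapeUrl", "?"),
            target.get("lastError", ""),
        )
        for target in active
        if target.get("health") != "up"
    ]


def fetch_targets(base_url="http://localhost:9090"):
    """Query a running Prometheus for its target list."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


# Illustrative response shape (assumed values, trimmed to the fields used above).
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "prometheus"},
                "scrapeUrl": "http://localhost:9090/metrics",
                "health": "up",
                "lastError": "",
            },
            {
                "labels": {"job": "vllm"},
                "scrapeUrl": "http://localhost:8391/metrics",
                "health": "down",
                "lastError": "connection refused",
            },
        ]
    },
}

for job, url, error in unhealthy_targets(SAMPLE_RESPONSE):
    print(f"target {job} ({url}) is down: {error}")
```

On a live server, call `unhealthy_targets(fetch_targets())`; an empty list means every configured target is being scraped.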
Step 3: Deploy Grafana
Grafana provides the visualization layer. We use the official Docker image with persistent storage:
docker run -d \
--name grafana \
--network host \
-v ~/monitoring/grafana:/var/lib/grafana \
-e GF_SECURITY_ADMIN_PASSWORD=admin \
grafana/grafana:latest
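Common pitfall: The Grafana data directory must be writable by the user the container runs as. If you get permission errors, the quick fix is:
chmod 777 ~/monitoring/grafana
A tighter alternative is to hand the directory to the UID the official image runs as (472 by default):
sudo chown -R 472:472 ~/monitoring/grafana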
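Port conflict: On our office A100 server, port 3000 was already occupied by a Node.js process, so we moved Grafana to port 3001. Docker discards -p mappings under --network host, so with the command above the port is changed through Grafana's own listen setting rather than with -p 3001:3000:
-e GF_SERVER_HTTP_PORT=3001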
Step 4: Configure the Data Source
After Grafana starts, add Prometheus as a data source via the API:
curl -s -X POST http://localhost:3000/api/datasources \
-H "Authorization: Bearer <admin-token>" \
-H "Content-Type: application/json" \
-d '{
"name": "Prometheus",
"type": "prometheus",
"url": "http://localhost:9090",
"access": "proxy",
"isDefault": true
}'
Verify the connection at http://<server>:3000 with credentials admin/admin.
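The same call can be scripted. Below is a minimal sketch using only the standard library; it assumes basic auth with the admin account instead of a bearer token, and the function name is hypothetical.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, user, password, prometheus_url):
    """Build (but do not send) the POST that registers Prometheus in Grafana."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",
        "isDefault": True,
    }
    credentials = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {credentials}",
        },
        method="POST",
    )


req = build_datasource_request(
    "http://localhost:3000", "admin", "admin", "http://localhost:9090"
)
print(req.get_method(), req.full_url)
# To send it against a running Grafana:
#   with urllib.request.urlopen(req, timeout=5) as resp:
#       print(json.load(resp))
```

Building the request separately from sending it makes the payload easy to inspect before it touches a live instance.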
Step 5: Import the vLLM Dashboard
The Grafana community has an excellent vLLM dashboard — ID 24756 (vLLM Monitoring V2) — which provides comprehensive visualization out of the box.
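Important: The dashboard must be imported via the API endpoint /api/dashboards/db, not /api/dashboards/import; in our experience the latter fails silently. The /api/dashboards/db endpoint expects the dashboard JSON wrapped in a top-level "dashboard" field, for example {"dashboard": {...}, "overwrite": true}, so vllm-dashboard.json must already have that shape.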
curl -s -X POST http://localhost:3000/api/dashboards/db \
-H "Authorization: Bearer <admin-token>" \
-H "Content-Type: application/json" \
-d @vllm-dashboard.json
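A dashboard file downloaded from grafana.com is the bare dashboard object, while /api/dashboards/db expects it under a "dashboard" key. A minimal sketch of that wrapping step follows; the function name and the tiny sample dashboard are illustrative assumptions.

```python
import json


def wrap_for_dashboards_db(dashboard, overwrite=True):
    """Wrap a dashboard object in the body shape /api/dashboards/db expects."""
    body = dict(dashboard)
    # An export from /api/dashboards/uid/<uid> is already nested under "dashboard".
    if "dashboard" in body and "meta" in body:
        body = dict(body["dashboard"])
    # A null id tells Grafana to create the dashboard; the uid is kept as-is.
    body["id"] = None
    return {"dashboard": body, "overwrite": overwrite}


# Illustrative input, far smaller than a real dashboard.
raw = {"id": 7, "uid": "abc", "title": "vLLM", "panels": []}
wrapped = wrap_for_dashboards_db(raw)
print(json.dumps(wrapped, indent=2))
```

Write the result to vllm-dashboard.json and the curl command above can post it unchanged.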
The dashboard includes panels for:
- GPU memory usage over time
- Request throughput (requests/sec)
- Token generation rate (tokens/sec)
- Queue depth and latency percentiles
- KV cache utilization
- Prefix cache hit rate
- MTP speculative token acceptance
- Active/running/queued request counts
Deployment on Two Servers
We deployed this monitoring stack on two different A100 servers, each with unique challenges:
Server 1: Dual A100 Server (172.24.217.58)
- OS: Ubuntu with Docker pre-installed
- vLLM: Qwen3.6-27B on GPU 1, port 8391, FP8 quantization
- Grafana: Port 3000 (default)
- Dashboard: http://172.24.217.58:3000
- Notes: Smooth deployment, Docker Hub accessible via proxy (127.0.0.1:7897)
Server 2: Office A100 Server (172.24.168.225)
- OS: Ubuntu 22.04, no Docker pre-installed
- Docker: Installed via apt install docker.io (version 29.1.3)
- Docker Hub: Direct connection timed out; configured mirror accelerator (docker.1ms.run)
- Port conflict: Port 3000 occupied by Node.js → Grafana on port 3001
- vLLM: Qwen3.6-27B FP8, max_model_len=200000
- Dashboard: http://172.24.168.225:3001
- Notes: Required Docker installation, mirror configuration, and port adjustment
Key Metrics That Matter
After running the dashboard for several days, these are the metrics we monitor most closely:
| Metric | Why It Matters | Alert Threshold |
|---|---|---|
| GPU Memory Usage | Detect memory leaks early | >85% of total |
| Request Queue Length | User experience indicator | >10 pending |
| Tokens/Second | Throughput health check | <50% of baseline |
| Prefix Cache Hit Rate | Optimization effectiveness | Track trend |
| Request Latency P99 | Tail latency monitoring | >2x median |
| KV Cache Blocks | Memory pressure signal | >90% allocated |
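The thresholds in the table can be written down as a small check. This is a sketch under assumptions: the `Snapshot` fields are hypothetical names for values you would read from Prometheus, and the baseline throughput is something you measure yourself on a healthy server.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One reading of the metrics in the table (field names are hypothetical)."""
    gpu_mem_used_frac: float   # 0.0 to 1.0 of total GPU memory
    queue_length: int          # pending requests
    tokens_per_sec: float
    latency_p99_s: float
    latency_median_s: float
    kv_cache_used_frac: float  # 0.0 to 1.0 of allocated KV cache blocks


def evaluate(snapshot, baseline_tokens_per_sec):
    """Return the names of the table's thresholds that this snapshot crosses."""
    alerts = []
    if snapshot.gpu_mem_used_frac > 0.85:
        alerts.append("gpu_memory")
    if snapshot.queue_length > 10:
        alerts.append("queue_length")
    if snapshot.tokens_per_sec < 0.5 * baseline_tokens_per_sec:
        alerts.append("tokens_per_sec")
    if snapshot.latency_p99_s > 2 * snapshot.latency_median_s:
        alerts.append("latency_p99")
    if snapshot.kv_cache_used_frac > 0.90:
        alerts.append("kv_cache")
    return alerts


degraded = Snapshot(0.90, 3, 40.0, 1.0, 0.6, 0.5)
print(evaluate(degraded, baseline_tokens_per_sec=100.0))
```

Prefix cache hit rate is left out on purpose, since the table tracks its trend rather than a fixed threshold.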
Common Pitfalls and Solutions
1. Docker Network Mode
Problem: Container cannot reach vLLM on localhost.
Solution: Use --network host for both Prometheus and Grafana containers.
2. Grafana Data Directory Permissions
Problem: Grafana crashes on startup with permission denied.
Solution: chmod 777 on the mounted data directory before starting the container.
3. Dashboard Import API
Problem: Dashboard import fails silently.
Solution: Use /api/dashboards/db endpoint, not the deprecated /api/dashboards/import.
4. Docker Hub Timeout
Problem: docker pull hangs on servers behind the Great Firewall (GFW).
Solution: Configure Docker mirror accelerators in /etc/docker/daemon.json.
5. Port Conflicts
Problem: Default ports (3000, 9090) already in use.
Solution: Check with ss -tlnp | grep <port> and use alternative ports.
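The port check can also run from a deployment script before any container starts. A minimal sketch follows, with a hypothetical helper name; it only detects TCP listeners reachable on the given address, so a service bound to a different interface would be missed.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return probe.connect_ex((host, port)) == 0


# Ports used in this guide: vLLM, Prometheus, Grafana.
for port in (8391, 9090, 3000):
    state = "in use" if port_in_use(port) else "free"
    print(f"port {port}: {state}")
```

Before deployment, 9090 and 3000 should be free while 8391 is in use; after deployment all three should be in use.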
Results
The monitoring dashboard provides real-time visibility into vLLM's operation. Key observations from our deployment:
- FP8 quantization reduces GPU memory usage from ~69GB (BF16) to ~20GB, leaving ample headroom
- Prefix caching shows high hit rates for repeated prompts, significantly reducing latency
- MTP speculative decoding achieves ~60% token acceptance rate, boosting throughput
- Request latency remains stable under moderate load, with P99 typically under 2 seconds

Conclusion
Monitoring vLLM in production doesn't have to be complex. With the built-in /metrics endpoint, Prometheus, and Grafana's community dashboard, you can deploy a comprehensive monitoring stack in under 30 minutes.
The key takeaways:
- Use Docker host network mode for container-to-host communication
- Import dashboard via the /api/dashboards/db API
- Set proper permissions on Grafana data directories
- Configure Docker mirrors when behind firewalls
- Monitor the right metrics — GPU memory, queue depth, and token throughput are your north stars
With this setup, you'll catch issues before they become outages and have the data to optimize your LLM serving infrastructure.
Troubleshooting: Datasource ${DS_PROMETHEUS} was not found
After importing the Grafana Dashboard (ID 24756), if panels show this error:
PanelQueryRunner Error {message: 'Datasource ${DS_PROMETHEUS} was not found'}
Cause: The dashboard JSON references the datasource through the placeholder ${DS_PROMETHEUS}. Grafana's import dialog fills that placeholder in from the dashboard's declared inputs, but JSON posted directly to /api/dashboards/db is stored as-is, so the placeholder never resolves to a real datasource UID.
Solution: Replace ${DS_PROMETHEUS} in the Dashboard JSON with the actual Prometheus datasource UID.
# 1. Get Prometheus datasource UID
curl -s -u admin:your_password http://localhost:3000/api/datasources | python3 -m json.tool
# Find the "uid": "xxxxxxxxx" value
# 2. Export Dashboard JSON
curl -s -u admin:your_password http://localhost:3000/api/dashboards/uid/vllm-master-v2 > dashboard.json
# 3. Replace the variable (using python3 for JSON handling)
python3 -c "
import json
with open('dashboard.json', 'r') as f:
    data = json.load(f)
json_str = json.dumps(data)
json_str = json_str.replace('\${DS_PROMETHEUS}', 'YOUR_PROMETHEUS_UID')
with open('dashboard_fixed.json', 'w') as f:
    f.write(json_str)
print('Replaced successfully')
"
# 4. Re-import the dashboard
curl -s -u admin:your_password -X POST http://localhost:3000/api/dashboards/db \
-H 'Content-Type: application/json' \
-d @dashboard_fixed.json
Prevention: Before importing a dashboard, make sure the Prometheus datasource is configured in Grafana and that every datasource placeholder in the dashboard JSON can be resolved. For community dashboards (like ID 24756), verify the datasource references after import.
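That pre-import check can be automated by scanning the dashboard JSON for leftover placeholders. The sketch below is a minimal version; the function name and the sample dashboard are assumptions, and it only looks for placeholders of the form ${DS_...}.

```python
import json
import re

# Matches datasource placeholders such as ${DS_PROMETHEUS}.
PLACEHOLDER = re.compile(r"\$\{(DS_[A-Za-z0-9_]+)\}")


def unresolved_datasources(dashboard):
    """Return the sorted, de-duplicated placeholder names found in a dashboard."""
    return sorted(set(PLACEHOLDER.findall(json.dumps(dashboard))))


# Illustrative dashboard fragment with one unresolved and one resolved panel.
sample = {
    "panels": [
        {"datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"}},
        {"datasource": {"type": "prometheus", "uid": "abc123"}},
    ]
}

leftover = unresolved_datasources(sample)
print(leftover or "all datasource references resolved")
```

An empty result means the file is safe to post to /api/dashboards/db; anything else names the placeholders that still need a real datasource UID.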
Have questions or suggestions? Feel free to reach out on GitHub or join our Discord community.