
Monitoring vLLM in Production with Grafana & Prometheus

Running a large language model in production is one thing. Knowing what's happening inside it is another. When you deploy Qwen3.6-27B on an A100 server with vLLM, you need visibility into GPU utilization, request throughput, token generation rates, queue depth, and more — in real time.

This guide documents our experience deploying vLLM monitoring dashboards on two A100 servers using Prometheus and Grafana, covering the architecture, deployment steps, common pitfalls, and the key metrics that matter.

vLLM Monitoring Dashboard

Why Monitor vLLM?

vLLM is a widely used open-source inference engine for large language models, powering everything from local deployments to cloud-scale serving. But without monitoring, you're flying blind.

The good news? vLLM ships with built-in Prometheus metrics — 386 of them, exposed at the /metrics endpoint. All you need is Prometheus to collect them and Grafana to visualize them.

Architecture Overview

The monitoring stack is simple and lightweight:

vLLM (port 8391) → /metrics endpoint (Prometheus format, 386 metrics)
    ↓
Prometheus (port 9090) → scrapes /metrics every 15s
    ↓
Grafana (port 3000) → visualizes with vLLM Monitoring V2 dashboard (ID: 24756)

All three components run on the same server. vLLM serves the model, Prometheus collects metrics, and Grafana renders the dashboard. The entire stack fits in a few Docker containers.

Step 1: Discover vLLM's Built-in Metrics

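vLLM exposes metrics at http://<server>:8391/metrics in Prometheus text format. A quick curl against that endpoint returned 386 distinct metrics in our deployment; the exact count depends on the vLLM version and model configuration.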

The metrics are organized with proper Prometheus labels, making them easy to filter and aggregate in Grafana.
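
To see what the endpoint actually exposes, a small script can list the metric families and their types. This is a minimal sketch, assuming vLLM listens on port 8391 as in this guide; the function names are hypothetical, and the number it reports may differ from 386 depending on whether you count families or individual series.

```python
import re
import urllib.request

# Assumption: vLLM serves metrics on port 8391, as described in this guide.
METRICS_URL = "http://localhost:8391/metrics"


def fetch_metrics(url: str = METRICS_URL, timeout: float = 5.0) -> str:
    """Download the raw Prometheus text exposition from vLLM."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def metric_families(text: str) -> dict[str, str]:
    """Map each metric family name to its declared type, using the '# TYPE' lines."""
    families = {}
    for line in text.splitlines():
        match = re.match(r"#\s*TYPE\s+(\S+)\s+(\S+)", line)
        if match:
            families[match.group(1)] = match.group(2)
    return families
```

Calling metric_families(fetch_metrics()) on the server returns a dictionary such as {"<metric name>": "gauge", ...}, and its length is the number of metric families.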

Step 2: Deploy Prometheus

Prometheus is deployed as a Docker container in host network mode. This matters because the scrape target in prometheus.yml is localhost:8391, and localhost only points at the host machine when the container shares the host's network namespace.

docker run -d \
  --name prometheus \
  --network host \
  -v ~/monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
  -v ~/monitoring/prometheus/data:/prometheus \
  prom/prometheus:latest

The configuration file prometheus.yml targets the vLLM metrics endpoint:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:8391']
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

Key insight: --network host is what makes the localhost:8391 target work. On Docker's default bridge network, localhost inside the container refers to the container itself, not the host, so Prometheus would never reach vLLM's /metrics endpoint.
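
Once Prometheus is running, its HTTP API can confirm that the vLLM target is actually being scraped. The sketch below assumes Prometheus is reachable on localhost:9090 as configured above; the helper names are hypothetical, and the only API used is Prometheus' /api/v1/targets endpoint.

```python
import json
import urllib.request

# Assumption: Prometheus listens on port 9090, as in this guide.
PROMETHEUS_URL = "http://localhost:9090"


def fetch_targets(base_url: str = PROMETHEUS_URL, timeout: float = 5.0) -> dict:
    """Call Prometheus' /api/v1/targets endpoint and return the parsed JSON."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


def target_health(payload: dict) -> dict[str, str]:
    """Map each active target's scrape URL to its reported health."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return {t["scrapeUrl"]: t["health"] for t in targets}
```

Running target_health(fetch_targets()) on the server should list http://localhost:8391/metrics with the health value "up"; a value of "down" usually points at the networking issue described above.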

Step 3: Deploy Grafana

Grafana provides the visualization layer. We use the official Docker image with persistent storage:

docker run -d \
  --name grafana \
  --network host \
  -v ~/monitoring/grafana:/var/lib/grafana \
  -e GF_SECURITY_ADMIN_PASSWORD=admin \
  grafana/grafana:latest

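Common pitfall: The Grafana data directory must be writable by the container's user. The official image runs as UID 472 by default, so if Grafana exits with permission errors, hand the directory to that user:

sudo chown -R 472 ~/monitoring/grafana

Running chmod 777 ~/monitoring/grafana also works as a quick fix, but it leaves the directory world-writable.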

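Port conflict: On our office A100 server, port 3000 was already occupied by a Node.js process, so we moved Grafana to port 3001. Docker ignores -p mappings for containers started with --network host, so with the command above the port has to be changed inside Grafana itself:

-e GF_SERVER_HTTP_PORT=3001

A mapping such as -p 3001:3000 only takes effect on a bridge network, and in that mode the data source URL http://localhost:9090 would no longer reach Prometheus from inside the Grafana container.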

Step 4: Configure the Data Source

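After Grafana starts, add Prometheus as a data source via the API. The <admin-token> below is an API token with admin rights that you create in Grafana first (a service account token in current versions); basic auth with -u admin:<password> works as well: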

curl -s -X POST http://localhost:3000/api/datasources \
  -H "Authorization: Bearer <admin-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://localhost:9090",
    "access": "proxy",
    "isDefault": true
  }'

Verify the connection by logging in at http://<server>:3000 with the credentials admin/admin, then change the password, since admin/admin is Grafana's well-known default.
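
If no API token exists yet, the same request can be sent with basic auth. The following is a sketch rather than the method we used: the function name is hypothetical, and it assumes the admin user and the Prometheus URL from this guide. It only builds the request, so nothing is sent until you pass it to urllib.request.urlopen.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url: str, user: str, password: str) -> urllib.request.Request:
    """Build a POST /api/datasources request that registers Prometheus."""
    body = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": "http://localhost:9090",  # Prometheus address used in this guide
        "access": "proxy",
        "isDefault": True,
    }
    # HTTP basic auth: base64 of "user:password"
    credentials = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

On the server, urllib.request.urlopen(build_datasource_request("http://localhost:3000", "admin", "<password>")) performs the call.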

Step 5: Import the vLLM Dashboard

The Grafana community has an excellent vLLM dashboard — ID 24756 (vLLM Monitoring V2) — which provides comprehensive visualization out of the box.

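Important: Import the dashboard through the /api/dashboards/db endpoint, NOT /api/dashboards/import. In our deployment the latter failed silently. The request body for /api/dashboards/db must wrap the dashboard model in an envelope of the form {"dashboard": {...}, "overwrite": true}, so vllm-dashboard.json below is that wrapped file, not the raw JSON downloaded from grafana.com.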

curl -s -X POST http://localhost:3000/api/dashboards/db \
  -H "Authorization: Bearer <admin-token>" \
  -H "Content-Type: application/json" \
  -d @vllm-dashboard.json
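
The file posted above has to be in the shape /api/dashboards/db expects. As a hedged sketch, the helper below (a hypothetical name, not part of Grafana) wraps a raw dashboard model in that envelope, clears the numeric id so Grafana assigns a new one, and substitutes the ${DS_PROMETHEUS} placeholder that community dashboards often carry with your datasource UID.

```python
import json


def _substitute(node, placeholder: str, value: str):
    """Return a copy of a JSON-like structure with the placeholder replaced in all strings."""
    if isinstance(node, str):
        return node.replace(placeholder, value)
    if isinstance(node, list):
        return [_substitute(item, placeholder, value) for item in node]
    if isinstance(node, dict):
        return {key: _substitute(val, placeholder, value) for key, val in node.items()}
    return node


def build_import_payload(dashboard: dict, datasource_uid: str) -> dict:
    """Wrap a dashboard model in the body accepted by POST /api/dashboards/db."""
    model = _substitute(dashboard, "${DS_PROMETHEUS}", datasource_uid)
    model["id"] = None  # a null id tells Grafana to create a new dashboard
    return {"dashboard": model, "overwrite": True}
```

Writing the result with json.dump(build_import_payload(raw, "<your-uid>"), open("vllm-dashboard.json", "w")) produces a file that the curl command can post directly.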

The dashboard includes panels for:

Deployment on Two Servers

We deployed this monitoring stack on two different A100 servers, each with unique challenges:

Server 1: Dual A100 Server (172.24.217.58)

Server 2: Office A100 Server (172.24.168.225)

Key Metrics That Matter

After running the dashboard for several days, these are the metrics we monitor most closely:

| Metric | Why It Matters | Alert Threshold |
| --- | --- | --- |
| GPU Memory Usage | Detect memory leaks early | >85% of total |
| Request Queue Length | User experience indicator | >10 pending |
| Tokens/Second | Throughput health check | <50% of baseline |
| Prefix Cache Hit Rate | Optimization effectiveness | Track trend |
| Request Latency P99 | Tail latency monitoring | >2x median |
| KV Cache Blocks | Memory pressure signal | >90% allocated |
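
The thresholds in the table translate directly into a small check. This is an illustrative sketch only: the Snapshot fields and function name are hypothetical, the values would come from your own Prometheus queries, and Prefix Cache Hit Rate is left out because it is tracked as a trend rather than against a fixed limit.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One reading of the metrics in the table (field names are illustrative)."""
    gpu_memory_used_frac: float  # GPU memory used / total, 0.0 to 1.0
    queue_length: int            # pending requests
    tokens_per_second: float
    latency_p99_s: float
    latency_median_s: float
    kv_cache_used_frac: float    # KV cache blocks allocated / total, 0.0 to 1.0


def breached_thresholds(s: Snapshot, baseline_tps: float) -> list[str]:
    """Return the table rows whose alert threshold is exceeded."""
    checks = [
        ("GPU Memory Usage", s.gpu_memory_used_frac > 0.85),
        ("Request Queue Length", s.queue_length > 10),
        ("Tokens/Second", s.tokens_per_second < 0.5 * baseline_tps),
        ("Request Latency P99", s.latency_p99_s > 2 * s.latency_median_s),
        ("KV Cache Blocks", s.kv_cache_used_frac > 0.90),
    ]
    return [name for name, breached in checks if breached]
```

The baseline for Tokens/Second is passed in explicitly because it depends on the model and hardware; measure it under normal load before alerting on it.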

Common Pitfalls and Solutions

1. Docker Network Mode

Problem: Container cannot reach vLLM on localhost. Solution: Use --network host for both Prometheus and Grafana containers.

2. Grafana Data Directory Permissions

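Problem: Grafana crashes on startup with permission denied. Solution: Make the mounted data directory writable by the container's user before starting the container, either with chown to UID 472 (the official image's default user) or, as a blunt fallback, chmod 777.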

3. Dashboard Import API

Problem: Dashboard import fails silently. Solution: Use the /api/dashboards/db endpoint instead of /api/dashboards/import, and wrap the dashboard JSON as {"dashboard": {...}, "overwrite": true}.

4. Docker Hub Timeout

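Problem: docker pull hangs on servers behind the Great Firewall (GFW) or on other networks where Docker Hub is slow or unreachable. Solution: Configure registry mirrors under the registry-mirrors key in /etc/docker/daemon.json, then restart the Docker daemon.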
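
Editing daemon.json by hand risks dropping settings that are already there. A minimal sketch of a safer merge is shown below; the function name is hypothetical and the mirror URL in the usage note is a placeholder, not a recommendation. Only the registry-mirrors key is Docker's own.

```python
import json


def with_registry_mirrors(config: dict, mirrors: list[str]) -> dict:
    """Return a copy of a daemon.json config with the mirrors appended, without duplicates."""
    merged = dict(config)
    existing = list(merged.get("registry-mirrors", []))
    for mirror in mirrors:
        if mirror not in existing:
            existing.append(mirror)
    merged["registry-mirrors"] = existing
    return merged
```

For example, load /etc/docker/daemon.json with json.load (or start from an empty dict if the file does not exist), call with_registry_mirrors(config, ["https://mirror.example.com"]), and write the result back with json.dump(..., indent=2) before restarting Docker.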

5. Port Conflicts

Problem: Default ports (3000, 9090) already in use. Solution: Check with ss -tlnp | grep <port> and use alternative ports.
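
The same check can be scripted before starting the containers. This sketch uses a hypothetical helper that tries to connect to the port on the loopback interface; it assumes the conflicting service listens on 127.0.0.1 or on all interfaces, so a service bound to a single other address would go unnoticed.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0
```

Checking port_in_use(3000) and port_in_use(9090) before the docker run commands shows whether Grafana or Prometheus need an alternative port.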

Results

The monitoring dashboard provides real-time visibility into vLLM's operation.

Conclusion

Monitoring vLLM in production doesn't have to be complex. With the built-in /metrics endpoint, Prometheus, and Grafana's community dashboard, you can deploy a comprehensive monitoring stack in under 30 minutes.

The key takeaways:

  1. Use Docker host network mode for container-to-host communication
  2. Import dashboard via /api/dashboards/db API
  3. Set proper permissions on Grafana data directories
  4. Configure Docker mirrors when behind firewalls
  5. Monitor the right metrics — GPU memory, queue depth, and token throughput are your north stars

With this setup, you'll catch issues before they become outages and have the data to optimize your LLM serving infrastructure.


Troubleshooting: Datasource ${DS_PROMETHEUS} was not found

After importing the Grafana Dashboard (ID 24756), if panels show this error:

PanelQueryRunner Error {message: 'Datasource ${DS_PROMETHEUS} was not found'}

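Cause: The Dashboard JSON uses ${DS_PROMETHEUS} as an import-time placeholder for the datasource UID. Grafana's import dialog in the UI asks which datasource to use and substitutes it, but posting the JSON directly to /api/dashboards/db stores the placeholder as-is, so the panels cannot resolve it to an actual datasource UID.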

Solution: Replace ${DS_PROMETHEUS} in the Dashboard JSON with the actual Prometheus datasource UID.

# 1. Get the Prometheus datasource UID
curl -s -u admin:your_password http://localhost:3000/api/datasources | python3 -m json.tool
# Find the "uid": "xxxxxxxxx" value of the Prometheus entry

# 2. Export the dashboard JSON (the response has the form {"meta": ..., "dashboard": ...})
curl -s -u admin:your_password http://localhost:3000/api/dashboards/uid/vllm-master-v2 > dashboard.json

# 3. Replace the placeholder and build the body expected by /api/dashboards/db
python3 - <<'EOF'
import json

with open('dashboard.json') as f:
    exported = json.load(f)

# Replace every occurrence of the placeholder with the real datasource UID
text = json.dumps(exported['dashboard'])
text = text.replace('${DS_PROMETHEUS}', 'YOUR_PROMETHEUS_UID')

# Keep only the dashboard model; overwrite=True updates the existing dashboard in place
payload = {'dashboard': json.loads(text), 'overwrite': True}
with open('dashboard_fixed.json', 'w') as f:
    json.dump(payload, f)
print('Replaced successfully')
EOF

# 4. Re-import the dashboard
curl -s -u admin:your_password -X POST http://localhost:3000/api/dashboards/db \
  -H 'Content-Type: application/json' \
  -d @dashboard_fixed.json

Prevention: Before importing a Dashboard, make sure the Prometheus datasource is configured in Grafana and that every variable reference in the Dashboard JSON can be resolved. For community Dashboards from grafana.com (like ID 24756), verify the datasource references after import.
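
That verification can be automated by scanning the dashboard JSON for leftover placeholders. The sketch below uses a hypothetical helper and assumes the placeholders follow the ${DS_...} naming that exported dashboards commonly use; ordinary template variables with other names are deliberately ignored.

```python
import json
import re

# Assumption: import-time datasource placeholders look like ${DS_PROMETHEUS}.
PLACEHOLDER = re.compile(r"\$\{DS_[A-Za-z0-9_]+\}")


def unresolved_datasource_placeholders(dashboard: dict) -> list[str]:
    """Return the distinct ${DS_...} placeholders still present in a dashboard model."""
    return sorted(set(PLACEHOLDER.findall(json.dumps(dashboard))))
```

An empty list means no ${DS_...} placeholder is left; anything else names the references that still need a real datasource UID.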

