
Deploy Qwen3.6-27B Locally with vLLM and Power QevosAgent

[Figure: vLLM architecture]

In this tutorial, we'll walk through deploying the powerful Qwen3.6-27B language model locally using vLLM, and then connecting it to QevosAgent to create a fully autonomous AI assistant that runs entirely on your own hardware.

Why Local Deployment?

Running large language models locally offers several advantages:

  1. Privacy: Your prompts and data never leave your own hardware
  2. Cost: No per-token API fees or rate limits
  3. Availability: No dependence on a provider's uptime or your network connection
  4. Control: You choose the model version, context length, and serving configuration

Prerequisites

Before we begin, ensure you have the following (a quick way to verify is shown after the list):

  1. GPU: NVIDIA GPU with at least 24GB VRAM (RTX 3090/4090, A100, etc.)
  2. CUDA: CUDA 11.8 or later installed
  3. Python: Python 3.10 or 3.11 recommended
  4. Storage: At least 50GB free space for model weights
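
You can sanity-check these prerequisites from a terminal. This is just a convenience snippet; the exact commands and output will vary by system:

# Check GPU model and total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv

# Check the installed CUDA toolkit version
nvcc --version | grep release

# Check the Python version (3.10 or 3.11 recommended)
python3 --version

# Check free disk space in the current directory
df -h .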

Step 1: Install vLLM

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.

# Create a virtual environment (recommended)
python -m venv vllm-env
source vllm-env/bin/activate  # On Linux/Mac
# vllm-env\Scripts\activate  # On Windows

# Install vLLM
pip install "vllm>=0.8.5"

Note for users in China: If you experience slow downloads from Hugging Face, you can use ModelScope instead:

export VLLM_USE_MODELSCOPE=true

Step 2: Choose Your Model Version

Qwen3.6-27B comes in two main variants:

| Version | Precision | VRAM Required | Best For |
|---|---|---|---|
| Qwen/Qwen3.6-27B-FP8 | FP8 quantized | ~24GB (single GPU) | Most users, best performance/VRAM ratio |
| Qwen/Qwen3.6-27B | BF16 full precision | ~56GB (2×24GB GPUs) | Maximum accuracy |

Recommendation: For most users, the FP8 version provides excellent quality while fitting on a single RTX 3090/4090. We'll use FP8 in this tutorial.

# Install huggingface-cli for manual download (optional)
pip install huggingface_hub

# Download FP8 version (recommended)
huggingface-cli download Qwen/Qwen3.6-27B-FP8 --local-dir ./models/Qwen3.6-27B-FP8

# Or download full precision (requires more storage)
huggingface-cli download Qwen/Qwen3.6-27B --local-dir ./models/Qwen3.6-27B
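
Once the download finishes, you can confirm the weights are in place before starting the server:

# Check the size and contents of the downloaded model directory
du -sh ./models/Qwen3.6-27B-FP8
ls ./models/Qwen3.6-27B-FP8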

Step 3: Start the vLLM Server

Now let's start the vLLM server with Qwen3.6-27B:

# For single GPU with FP8 (recommended for 24GB VRAM)
vllm serve Qwen/Qwen3.6-27B-FP8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3

# For dual GPU with full precision
vllm serve Qwen/Qwen3.6-27B \
  --tensor-parallel-size 2 \
  --max-model-len 262144 \
  --reasoning-parser qwen3

Key Parameters Explained

| Parameter | Description | Recommended Value |
|---|---|---|
| --max-model-len | Maximum context length (prompt + output) | 262144 (256K, native support) |
| --reasoning-parser | Required to parse Qwen3 reasoning output | qwen3 |
| --gpu-memory-utilization | Fraction of GPU memory to use | 0.9 (90%, default) |
| --tensor-parallel-size | Number of GPUs for tensor parallelism | 1 for FP8, 2+ for BF16 |
| --enforce-eager | Disable CUDA graphs for compatibility | Use if encountering errors |

Advanced: Enable Speculative Decoding

For even faster inference, you can enable MTP (Multi-Token Prediction):

vllm serve Qwen/Qwen3.6-27B-FP8 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --reasoning-parser qwen3

Step 4: Verify the Server

Once the server starts, you should see output similar to:

INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000

Test the API with curl:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-27B-FP8",
    "messages": [
      {"role": "user", "content": "Hello, who are you?"}
    ]
  }'

You should receive a JSON response containing the model's reply. Note that the model field must match the name you passed to vllm serve (here, Qwen/Qwen3.6-27B-FP8).
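
If you're unsure of the exact model name, you can ask the server; the id field returned by the endpoint below is the value to use in the model field of your requests:

# List the models the server is currently serving
curl http://localhost:8000/v1/models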

Step 5: Connect QevosAgent to vLLM

QevosAgent uses the OpenAI-compatible API, which makes it easy to connect to your local vLLM server.

Set Environment Variables

# Point QevosAgent to your local vLLM server
export OPENAI_API_BASE="http://localhost:8000/v1"

# Set the context window to match vLLM configuration
export LLM_CONTEXT_WINDOW=262144

# Optional: Set API key (can be any string for local deployment)
export OPENAI_API_KEY="local"
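
Before launching the agent, you can confirm that these settings point at a reachable server:

# Confirm the configured endpoint answers (reuses the variables set above)
curl "$OPENAI_API_BASE/models" -H "Authorization: Bearer $OPENAI_API_KEY"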

Start QevosAgent

# Navigate to QevosAgent directory
cd /path/to/QevosAgent

# Start the agent
python run_goal.py

QevosAgent will now use your local Qwen3.6-27B model for all reasoning and tool execution!

Step 6: Advanced Configuration

Using tmux for Persistent Sessions

For server deployments, it's recommended to run vLLM in a tmux session:

# Create a new tmux session
tmux new-session -d -s vllm

# Start vLLM in the session
tmux send-keys -t vllm 'vllm serve Qwen/Qwen3.6-27B --max-model-len 262144 --reasoning-parser qwen3 --gpu-memory-utilization 0.9' Enter

# Attach to the session to view logs
tmux attach -t vllm

Create a Startup Script

Create start_vllm.sh:

#!/bin/bash

# Kill existing vLLM processes
pkill -f "vllm serve"

# Wait for port to be freed
sleep 2

# Start vLLM with optimized settings
vllm serve Qwen/Qwen3.6-27B \
  --max-model-len 262144 \
  --reasoning-parser qwen3 \
  --gpu-memory-utilization 0.9 \
  --host 0.0.0.0 \
  --port 8000

Make it executable:

chmod +x start_vllm.sh

Performance Tips

  1. Use FP8 Quantization: If available, use the FP8-quantized weights for a better performance/VRAM trade-off
  2. Enable Prefix Caching: vLLM's prefix caching (enabled by default in recent versions, or via --enable-prefix-caching) speeds up repeated queries that share a common prefix
  3. Monitor GPU Memory: Use nvidia-smi to watch GPU memory and utilization
  4. Adjust Batch Size: For higher throughput, consider raising --max-num-batched-tokens (see the example after this list)
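
As a rough illustration, a throughput-oriented launch that combines these tips might look like the following. The values are placeholders to tune for your own GPU, not measured recommendations:

# Illustrative throughput-oriented launch; adjust values for your hardware
vllm serve Qwen/Qwen3.6-27B-FP8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3 \
  --gpu-memory-utilization 0.9 \
  --max-num-batched-tokens 32768 \
  --enable-prefix-caching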

Troubleshooting

Out of Memory Errors

If you encounter OOM errors:

# Reduce GPU memory utilization
vllm serve Qwen/Qwen3.6-27B --gpu-memory-utilization 0.7

# Or reduce context length
vllm serve Qwen/Qwen3.6-27B --max-model-len 131072

CUDA Errors

If you encounter CUDA-related errors:

# Force eager mode
vllm serve Qwen/Qwen3.6-27B --enforce-eager

Slow First Request

The first request after startup may be slow while the engine warms up. Subsequent requests will be faster thanks to vLLM's caching and warmed-up kernels.

Conclusion

You now have a fully local deployment of Qwen3.6-27B powered by vLLM, connected to QevosAgent for autonomous task execution. This setup gives you complete data privacy, a 256K-token context window, freedom from per-token API costs and rate limits, and an assistant that runs end-to-end on your own hardware.

Feel free to experiment with different models, parameters, and QevosAgent configurations to build your perfect AI assistant!


Have questions or suggestions? Feel free to reach out on GitHub or join our Discord community.