# Deploy Qwen3.6-27B Locally with vLLM and Power QevosAgent

In this tutorial, we'll walk through deploying the powerful Qwen3.6-27B language model locally using vLLM, and then connecting it to QevosAgent to create a fully autonomous AI assistant that runs entirely on your own hardware.
## Why Local Deployment?

Running large language models locally offers several advantages:

- Privacy: Your data never leaves your machine
- Cost: No API fees per token
- Control: Full customization of model behavior
- Offline: Works without an internet connection
- Performance: vLLM provides state-of-the-art throughput with PagedAttention
## Prerequisites

Before we begin, ensure you have:

- GPU: NVIDIA GPU with at least 24GB VRAM (RTX 3090/4090, A100, etc.)
- CUDA: CUDA 11.8 or later installed
- Python: Python 3.10 or 3.11 recommended
- Storage: At least 50GB free space for model weights
## Step 1: Install vLLM

vLLM is a high-throughput, memory-efficient inference and serving engine for LLMs.

```bash
# Create a virtual environment (recommended)
python -m venv vllm-env
source vllm-env/bin/activate  # On Linux/Mac
# vllm-env\Scripts\activate   # On Windows

# Install vLLM
pip install "vllm>=0.8.5"
```

Note for users in China: if you experience slow downloads from Hugging Face, you can use ModelScope instead:

```bash
export VLLM_USE_MODELSCOPE=true
```
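To confirm the install satisfies the `vllm>=0.8.5` pin, a quick standard-library check can help; `meets_min_version` is a simplified comparator for this sketch (not part of vLLM), and it ignores pre-release tags:

```python
from importlib.metadata import version, PackageNotFoundError

def meets_min_version(installed: str, required: str) -> bool:
    """Compare dotted version strings numerically (simplified: ignores pre-release tags)."""
    to_tuple = lambda v: tuple(int(p) for p in v.split(".")[:3])
    return to_tuple(installed) >= to_tuple(required)

try:
    installed = version("vllm")
    print(f"vLLM {installed} installed, meets >=0.8.5: {meets_min_version(installed, '0.8.5')}")
except PackageNotFoundError:
    print("vLLM is not installed in this environment")
```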
## Step 2: Choose Your Model Version

Qwen3.6-27B comes in two main variants:

| Version | Precision | VRAM Required | Best For |
|---|---|---|---|
| `Qwen/Qwen3.6-27B-FP8` | FP8 quantized | ~24GB (single GPU) | Most users, best performance/VRAM ratio |
| `Qwen/Qwen3.6-27B` | BF16 full precision | ~56GB (2×24GB GPUs) | Maximum accuracy |

Recommendation: For most users, the FP8 version provides excellent quality while fitting on a single RTX 3090/4090. We'll use FP8 in this tutorial.

```bash
# Install huggingface-cli for manual download (optional)
pip install huggingface_hub

# Download FP8 version (recommended)
huggingface-cli download Qwen/Qwen3.6-27B-FP8 --local-dir ./models/Qwen3.6-27B-FP8

# Or download full precision (requires more storage)
huggingface-cli download Qwen/Qwen3.6-27B --local-dir ./models/Qwen3.6-27B
```
## Step 3: Start the vLLM Server

Now let's start the vLLM server with Qwen3.6-27B:

```bash
# For single GPU with FP8 (recommended for 24GB VRAM)
vllm serve Qwen/Qwen3.6-27B-FP8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3

# For dual GPU with full precision
vllm serve Qwen/Qwen3.6-27B \
  --tensor-parallel-size 2 \
  --max-model-len 262144 \
  --reasoning-parser qwen3
```
### Key Parameters Explained

| Parameter | Description | Recommended Value |
|---|---|---|
| `--max-model-len` | Maximum context length (prompt + output) | 262144 (256K, native support) |
| `--reasoning-parser` | Required for Qwen3 reasoning output parsing | `qwen3` |
| `--gpu-memory-utilization` | Fraction of GPU memory to use | 0.9 (90%, default) |
| `--tensor-parallel-size` | Number of GPUs for tensor parallelism | 1 for FP8, 2+ for BF16 |
| `--enforce-eager` | Disable CUDA graphs for compatibility | Use if encountering errors |
### Advanced: Enable Speculative Decoding

For even faster inference, you can enable MTP (Multi-Token Prediction):

```bash
vllm serve Qwen/Qwen3.6-27B-FP8 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --reasoning-parser qwen3
```
## Step 4: Verify the Server

Once the server starts, you should see output similar to:

```
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000
```

Test the API with curl (the `model` field must match the model ID you served, `Qwen/Qwen3.6-27B-FP8` in this tutorial):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-27B-FP8",
    "messages": [
      {"role": "user", "content": "Hello, who are you?"}
    ]
  }'
```

You should receive a JSON response with the model's generation.
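The same request can be sent from Python using only the standard library. `build_chat_request` is a hypothetical helper name for this sketch; the commented-out `urlopen` call assumes the vLLM server above is running on port 8000:

```python
import json
from urllib import request

API_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(model: str, content: str) -> request.Request:
    """Build an OpenAI-compatible chat completion request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }).encode("utf-8")
    return request.Request(API_URL, data=body,
                           headers={"Content-Type": "application/json"})

req = build_chat_request("Qwen/Qwen3.6-27B-FP8", "Hello, who are you?")
print(req.data.decode())  # inspect the JSON body before sending
# With the server running:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```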
## Step 5: Connect QevosAgent to vLLM

QevosAgent uses the OpenAI-compatible API interface, making it easy to connect to your local vLLM server.

### Set Environment Variables

```bash
# Point QevosAgent to your local vLLM server
export OPENAI_API_BASE="http://localhost:8000/v1"

# Set the context window to match the vLLM configuration
export LLM_CONTEXT_WINDOW=262144

# Optional: set an API key (can be any string for local deployment)
export OPENAI_API_KEY="local"
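For reference, this is roughly how a client could consume those variables. The variable names come from the export lines above; QevosAgent's actual configuration code may differ, so treat this as an illustrative sketch:

```python
import os

def load_llm_config() -> dict:
    """Read LLM connection settings from the environment, with local-server defaults."""
    return {
        "base_url": os.environ.get("OPENAI_API_BASE", "http://localhost:8000/v1"),
        "api_key": os.environ.get("OPENAI_API_KEY", "local"),
        "context_window": int(os.environ.get("LLM_CONTEXT_WINDOW", "262144")),
    }

print(load_llm_config())
```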
### Start QevosAgent

```bash
# Navigate to the QevosAgent directory
cd /path/to/QevosAgent

# Start the agent
python run_goal.py
```

QevosAgent will now use your local Qwen3.6-27B model for all reasoning and tool execution!
## Step 6: Advanced Configuration

### Using tmux for Persistent Sessions

For server deployments, it's recommended to run vLLM in a tmux session:

```bash
# Create a new tmux session
tmux new-session -d -s vllm

# Start vLLM in the session
tmux send-keys -t vllm 'vllm serve Qwen/Qwen3.6-27B --max-model-len 262144 --gpu-memory-utilization 0.9' Enter

# Attach to the session to view logs
tmux attach -t vllm
```
### Create a Startup Script

Create `start_vllm.sh`:

```bash
#!/bin/bash
# Kill existing vLLM processes
pkill -f "vllm serve"

# Wait for the port to be freed
sleep 2

# Start vLLM with optimized settings
vllm serve Qwen/Qwen3.6-27B \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.9 \
  --host 0.0.0.0 \
  --port 8000
```

Make it executable:

```bash
chmod +x start_vllm.sh
```
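Before pointing clients at the server, you can check that the port is accepting connections. A minimal standard-library sketch (`server_ready` is a hypothetical helper name):

```python
import socket

def server_ready(host: str = "localhost", port: int = 8000, timeout: float = 1.0) -> bool:
    """Return True if something is accepting TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # connection refused or timed out
        return False

print(server_ready())
```

Note that a TCP check only confirms the port is open; the model may still be loading, so a full readiness probe should hit the HTTP API as in Step 4.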
## Performance Tips

- Use FP8 Quantization: If available, use FP8 quantized versions for better performance
- Enable Prefix Caching: vLLM automatically caches common prefixes for faster repeated queries
- Monitor GPU Memory: Use `nvidia-smi` to monitor GPU utilization
- Adjust Batch Size: For higher throughput, consider adjusting `--max-num-batched-tokens`
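To measure whether a tuning change actually helped, compute decode throughput from the `usage.completion_tokens` field of a response and the wall-clock time you observed. A minimal sketch:

```python
def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Decode throughput: generated tokens divided by wall-clock request time."""
    return completion_tokens / elapsed_s if elapsed_s > 0 else 0.0

# completion_tokens comes from resp["usage"]["completion_tokens"];
# elapsed_s from timing the request with time.perf_counter()
print(f"{tokens_per_second(512, 10.0):.1f} tok/s")  # 512 tokens in 10 s -> 51.2
```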
## Troubleshooting

### Out of Memory Errors

If you encounter OOM errors:

```bash
# Reduce GPU memory utilization
vllm serve Qwen/Qwen3.6-27B --gpu-memory-utilization 0.7

# Or reduce the context length
vllm serve Qwen/Qwen3.6-27B --max-model-len 131072
```
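To see why lowering `--max-model-len` helps, note that per-sequence KV-cache memory grows linearly with context length. A rough sketch with illustrative GQA architecture numbers (these are assumptions for the example, not Qwen3.6-27B's published config):

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    """Per-sequence KV-cache size: keys + values (factor 2) across all layers."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem

# Illustrative shape: 48 layers, 8 KV heads, head_dim 128, FP16 KV cache (2 bytes)
full = kv_cache_bytes(262_144, 48, 8, 128, 2)
half = kv_cache_bytes(131_072, 48, 8, 128, 2)
print(f"256K context: {full / 2**30:.1f} GiB, 128K context: {half / 2**30:.1f} GiB")
# -> 256K context: 48.0 GiB, 128K context: 24.0 GiB
```

vLLM pages the KV cache, so this worst case is only reached by sequences that actually use the full context, but the cap on `--max-model-len` bounds how much KV memory a single request can demand.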
### CUDA Errors

If you encounter CUDA-related errors:

```bash
# Force eager mode
vllm serve Qwen/Qwen3.6-27B --enforce-eager
```
### Slow First Request

The first request may be slow due to model loading. Subsequent requests will be faster thanks to vLLM's caching mechanisms.
## Conclusion

You now have a fully local deployment of Qwen3.6-27B powered by vLLM, connected to QevosAgent for autonomous task execution. This setup gives you:

- ✅ Complete privacy and data control
- ✅ No API costs
- ✅ State-of-the-art inference performance
- ✅ Full customization capabilities

Experiment with different models, parameters, and QevosAgent configurations to build your perfect AI assistant!

Have questions or suggestions? Feel free to reach out on GitHub or join our Discord community.