Running Qwen3.5 27B on RTX 4090/3090 (24GB) via Ollama and Connecting to QevosAgent
Background
Qwen3.5 27B is the latest open-source model from Alibaba's Tongyi Qianwen series, delivering outstanding performance across multiple benchmarks. For developers with an RTX 4090 or RTX 3090 (24GB VRAM), running this 27-billion-parameter model locally and connecting it to autonomous agent frameworks like QevosAgent is a highly practical goal.
This article shares our real-world experience, covering the complete workflow from Ollama installation to QevosAgent integration.
1. Environment Setup
Hardware Requirements
- GPU: NVIDIA RTX 4090 or RTX 3090 (24GB VRAM)
- OS: Windows 10/11 or Linux
- Driver: Latest NVIDIA GPU driver
Install Ollama
Ollama is a simple and powerful local LLM inference server that lets you download and run a wide range of open-source models with a single command.
```bash
# Windows: Download installer from https://ollama.com
# Linux:
curl -fsSL https://ollama.com/install.sh | sh
```
After installation, Ollama runs automatically in the background, listening on localhost:11434 by default.
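A quick way to confirm the server is up before pulling anything:

```bash
# Returns a JSON version string when the server is running
curl http://localhost:11434/api/version
```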
2. Download and Run Qwen3.5 27B
Pull the Model
```bash
ollama pull qwen3.5:27b
```
This downloads the Q4_K_M quantized build of Qwen3.5 27B, approximately 17GB on disk. Q4_K_M is a roughly 4-bit quantization that preserves most of the model's accuracy while keeping the weights small enough that they, plus the KV cache, fit within 24GB of VRAM.
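Once the pull completes, you can confirm the download and its size:

```bash
# Lists downloaded models with their sizes
ollama list
```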
Verify GPU Loading
After the model is downloaded, send a test request:
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3.5:27b",
  "prompt": "Hi",
  "stream": false
}'
```
Meanwhile, check nvidia-smi to observe VRAM usage. You should see VRAM increase from a few hundred MB to approximately 24GB (24104MB), confirming the model is fully loaded onto the GPU.
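Ollama can also report this directly; when nothing has spilled to the CPU, the PROCESSOR column of ollama ps should read 100% GPU:

```bash
# Shows loaded models and where they are running
ollama ps

# Just the VRAM figure, if you prefer
nvidia-smi --query-gpu=memory.used --format=csv
```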
Key Runtime Parameters
From Ollama's logs, we can confirm the following critical settings:
| Parameter | Value | Description |
|---|---|---|
| GPU Layers | 65/65 layers offloaded | All layers run on GPU, no CPU offload |
| Context Length | 32K (KvSize:32768) | Auto-calculated based on 24GB VRAM |
| FlashAttention | Enabled | Accelerates attention computation |
| BatchSize | 512 | Batch processing size |
| NumThreads | 6 | Number of threads |
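To check these values on your own machine, look at the server log around model load time. Default locations, assuming a standard install:

```bash
# Linux (systemd install): follow the server log live
journalctl -u ollama -f

# Windows: open %LOCALAPPDATA%\Ollama\server.log in a text editor
```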
Important Note: With 24GB of VRAM, Ollama automatically caps the context length at 32K. Forcing it to 128K would make the KV cache alone consume many extra gigabytes, pushing layers off the GPU and slowing inference dramatically. A 32K context is sufficient for most use cases; if you do need to deviate, see the per-request override below.
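If one specific request genuinely needs a different window, Ollama's native API accepts a per-request override (8192 below is just an illustrative value):

```bash
# Per-request context override via options.num_ctx
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3.5:27b",
  "prompt": "Hi",
  "stream": false,
  "options": { "num_ctx": 8192 }
}'
```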
3. Configure Ollama Port
Local Access (Default)
Ollama listens on localhost:11434 by default. Applications on the same machine can access it directly:
http://localhost:11434/v1/chat/completions
External Access (Optional)
To allow other devices to access the Ollama service, set the environment variable:
```bash
# Windows (current shell only; use setx or System Properties to persist)
set OLLAMA_HOST=0.0.0.0:11434

# Linux/macOS
export OLLAMA_HOST=0.0.0.0:11434
```
Then restart the Ollama service so the new binding takes effect.
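On Linux, where the installer registers Ollama as a systemd service, the usual way to persist this is a service override (a minimal sketch following the default unit name):

```bash
# Open an override file for the ollama unit
sudo systemctl edit ollama

# Add these lines in the editor that opens, then save:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0:11434"

sudo systemctl restart ollama
```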
4. Connect to QevosAgent
Modify .env Configuration
In QevosAgent's .env file, add the following configuration to connect to local Ollama:
```
# Using local Ollama (qwen3.5:27b on RTX 4090)
OPENAI_PROFILE=qwen3527ollama
OPENAI_PROFILE_QWEN3527OLLAMA_BASE_URL=http://localhost:11434/v1
```
Configuration Explanation:
- OPENAI_PROFILE: specifies which configuration profile to use (the name can be customized)
- OPENAI_PROFILE_{NAME}_BASE_URL: the API endpoint for that profile, formatted as http://localhost:11434/v1
Ollama's /v1 endpoint is compatible with the OpenAI API format, so QevosAgent can connect to Ollama directly using OpenAI-compatible mode without additional adaptation.
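You can verify this compatibility without QevosAgent by calling the OpenAI-style route directly:

```bash
# Same server, OpenAI-compatible route; note the /v1 prefix
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5:27b",
    "messages": [{"role": "user", "content": "Hi"}]
  }'
```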
Verify the Connection
After starting QevosAgent, send a simple task to verify that the model responds correctly. If everything works, QevosAgent will use the local Qwen3.5 27B model for inference.
5. Performance
In our real-world tests, Qwen3.5 27B on RTX 4090 performed as follows:
- First Load: ~30 seconds (loading model from disk to VRAM)
- Inference Speed: responsive for interactive use, even at the full 32K context
- VRAM Usage: ~24GB (near the limit)
- Accuracy: Q4_K_M quantization performs close to full precision for most tasks
6. FAQ
Q: Why is model loading slow?
The first load requires reading the 17GB model from disk to VRAM, taking about 30 seconds. Subsequent requests reuse the loaded model without reloading.
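One caveat: Ollama unloads an idle model after a timeout (the documented default is 5 minutes), so a long-idle agent pays the load cost again. The keep_alive setting extends this; a sketch:

```bash
# Server-wide: keep models resident for an hour after last use
export OLLAMA_KEEP_ALIVE=1h

# Or per request, via the keep_alive field
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3.5:27b",
  "prompt": "Hi",
  "keep_alive": "1h"
}'
```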
Q: Can I adjust the context length?
With 24GB VRAM, Ollama automatically calculates the optimal context length (32K). You can adjust it via the OLLAMA_CONTEXT_LENGTH environment variable, but setting it too high may cause VRAM exhaustion.
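For example, to trade context for VRAM headroom (16384 is an illustrative value, not a tuned recommendation):

```bash
# Lower the server-wide default context window
export OLLAMA_CONTEXT_LENGTH=16384
# Restart the Ollama server for the change to take effect
```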
Q: Can I run multiple models simultaneously?
24GB VRAM is only enough for one Qwen3.5 27B model. For multi-model switching, use the ollama run command to load models on demand.
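When you do switch, you can free the VRAM immediately instead of waiting for the idle timeout:

```bash
# Unload the model and release its VRAM
ollama stop qwen3.5:27b
```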
Summary
With Ollama, running Qwen3.5 27B on a consumer-grade RTX 4090 or 3090 is remarkably simple: a single ollama pull command downloads the model, and a two-line .env configuration connects it to QevosAgent. The result is capable local inference with no internet dependency and full data privacy, an ideal setup for individual developers and researchers.