Running Qwen3.5 27B on RTX 4090/3090 (24GB) via Ollama and Connecting to QevosAgent
Background
Qwen3.5 27B is the latest open-source model from Alibaba's Tongyi Qianwen series, delivering outstanding performance across multiple benchmarks. For developers with an RTX 4090 or RTX 3090 (24GB VRAM), running this 27-billion-parameter model locally and connecting it to autonomous agent frameworks like QevosAgent is a highly practical goal.
This article shares our real-world experience, covering the complete workflow from Ollama installation to QevosAgent integration.
1. Environment Setup
Hardware Requirements
- GPU: NVIDIA RTX 4090 or RTX 3090 (24GB VRAM)
- OS: Windows 10/11 or Linux
- Driver: Latest NVIDIA GPU driver
Install Ollama
Ollama is a simple and powerful local LLM inference server that lets you download and run a wide range of open-source models with a single command.
```bash
# Windows: Download installer from https://ollama.com
# Linux:
curl -fsSL https://ollama.com/install.sh | sh
```
After installation, Ollama runs automatically in the background, listening on localhost:11434 by default.
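A quick way to confirm the server is up before pulling anything:

```bash
# Returns a JSON version string when the server is running
curl http://localhost:11434/api/version
```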
2. Download and Run Qwen3.5 27B
Pull the Model
```bash
ollama pull qwen3.5:27b
```
This downloads the Q4_K_M quantized build of Qwen3.5 27B, approximately 17GB on disk. Q4_K_M is a roughly 4-bit quantization that preserves most of the model's accuracy while keeping the weights small enough that they, plus the KV cache, fit within 24GB of VRAM.
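Once the pull completes, you can confirm the download and its size:

```bash
# Lists downloaded models with their sizes
ollama list
```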
Verify GPU Loading
After the model is downloaded, send a test request:
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3.5:27b",
  "prompt": "Hi",
  "stream": false
}'
```
Meanwhile, check nvidia-smi to observe VRAM usage. You should see VRAM increase from a few hundred MB to approximately 24GB (24104MB), confirming the model is fully loaded onto the GPU.
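Ollama can also report this directly; when nothing has spilled to the CPU, the PROCESSOR column of ollama ps should read 100% GPU:

```bash
# Shows loaded models and where they are running
ollama ps

# Just the VRAM figure, if you prefer
nvidia-smi --query-gpu=memory.used --format=csv
```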
Key Runtime Parameters
From Ollama's logs, we can confirm the following critical settings:
| Parameter | Value | Description |
|---|---|---|
| GPU Layers | 65/65 layers offloaded | All layers run on GPU, no CPU offload |
| Context Length | 32K (KvSize:32768) | Auto-calculated based on 24GB VRAM |
| FlashAttention | Enabled | Accelerates attention computation |
| BatchSize | 512 | Batch processing size |
| NumThreads | 6 | Number of threads |
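To check these values on your own machine, look at the server log around model load time. Default locations, assuming a standard install:

```bash
# Linux (systemd install): follow the server log live
journalctl -u ollama -f

# Windows: open %LOCALAPPDATA%\Ollama\server.log in a text editor
```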
Important Note: With 24GB of VRAM, Ollama automatically caps the context length at 32K. Forcing it to 128K would make the KV cache alone consume many extra gigabytes, pushing layers off the GPU and slowing inference dramatically. A 32K context is sufficient for most use cases; if you do need to deviate, see the per-request override below.
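If one specific request genuinely needs a different window, Ollama's native API accepts a per-request override (8192 below is just an illustrative value):

```bash
# Per-request context override via options.num_ctx
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3.5:27b",
  "prompt": "Hi",
  "stream": false,
  "options": { "num_ctx": 8192 }
}'
```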
3. Configure Ollama Port
Local Access (Default)
Ollama listens on localhost:11434 by default. Applications on the same machine can access it directly:
http://localhost:11434/v1/chat/completions
External Access (Optional)
To allow other devices to access the Ollama service, set the environment variable:
```bash
# Windows (current shell only; use setx or System Properties to persist)
set OLLAMA_HOST=0.0.0.0:11434

# Linux/macOS
export OLLAMA_HOST=0.0.0.0:11434
```
Then restart the Ollama service so the new binding takes effect.
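On Linux, where the installer registers Ollama as a systemd service, the usual way to persist this is a service override (a minimal sketch following the default unit name):

```bash
# Open an override file for the ollama unit
sudo systemctl edit ollama

# Add these lines in the editor that opens, then save:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0:11434"

sudo systemctl restart ollama
```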
4. Connect to QevosAgent
Modify .env Configuration
In QevosAgent's .env file, add the following configuration to connect to local Ollama:
```
# Using local Ollama (qwen3.5:27b on RTX 4090)
OPENAI_PROFILE=qwen3527ollama
OPENAI_PROFILE_QWEN3527OLLAMA_BASE_URL=http://localhost:11434/v1
```
Configuration Explanation:
- OPENAI_PROFILE: specifies which configuration profile to use (the name can be customized)
- OPENAI_PROFILE_{NAME}_BASE_URL: the API endpoint for that profile, formatted as http://localhost:11434/v1
Ollama's /v1 endpoint is compatible with the OpenAI API format, so QevosAgent can connect to Ollama directly using OpenAI-compatible mode without additional adaptation.
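You can verify this compatibility without QevosAgent by calling the OpenAI-style route directly:

```bash
# Same server, OpenAI-compatible route; note the /v1 prefix
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5:27b",
    "messages": [{"role": "user", "content": "Hi"}]
  }'
```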
Verify the Connection
After starting QevosAgent, send a simple task to verify that the model responds correctly. If everything works, QevosAgent will use the local Qwen3.5 27B model for inference.
5. Performance
In our real-world tests, Qwen3.5 27B on RTX 4090 performed as follows:
- First Load: ~30 seconds (loading model from disk to VRAM)
- Inference Speed: responsive for interactive use, even at the full 32K context
- VRAM Usage: ~24GB (near the limit)
- Accuracy: Q4_K_M quantization performs close to full precision for most tasks
6. FAQ
Q: Why is model loading slow?
The first load requires reading the 17GB model from disk to VRAM, taking about 30 seconds. Subsequent requests reuse the loaded model without reloading.
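One caveat: Ollama unloads an idle model after a timeout (the documented default is 5 minutes), so a long-idle agent pays the load cost again. The keep_alive setting extends this; a sketch:

```bash
# Server-wide: keep models resident for an hour after last use
export OLLAMA_KEEP_ALIVE=1h

# Or per request, via the keep_alive field
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3.5:27b",
  "prompt": "Hi",
  "keep_alive": "1h"
}'
```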
Q: Can I adjust the context length?
With 24GB VRAM, Ollama automatically calculates the optimal context length (32K). You can adjust it via the OLLAMA_CONTEXT_LENGTH environment variable, but setting it too high may cause VRAM exhaustion.
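For example, to trade context for VRAM headroom (16384 is an illustrative value, not a tuned recommendation):

```bash
# Lower the server-wide default context window
export OLLAMA_CONTEXT_LENGTH=16384
# Restart the Ollama server for the change to take effect
```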
Q: Can I run multiple models simultaneously?
24GB VRAM is only enough for one Qwen3.5 27B model. For multi-model switching, use the ollama run command to load models on demand.
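When you do switch, you can free the VRAM immediately instead of waiting for the idle timeout:

```bash
# Unload the model and release its VRAM
ollama stop qwen3.5:27b
```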
Summary
With Ollama, running Qwen3.5 27B on a consumer-grade RTX 4090 or 3090 is remarkably simple: a single ollama pull command downloads the model, and a two-line .env configuration connects it to QevosAgent. The result is capable local inference with no internet dependency and full data privacy, an ideal setup for individual developers and researchers.