Pushing the Limits: Qwen3.6-27B-FP8 on RTX 4090 48GB — 256K Context Window Experiment & Deployment Guide

Can a single RTX 4090 48GB run a 27B-parameter model with a 256K context window? Yes: with FP8 quantization and the right configuration, the full 256K context window fits on one card. In this article, we share our complete experiment results and a deployment guide for running Qwen3.6-27B-FP8 on consumer-grade hardware.

Experimental Setup

Our test environment consists of a single NVIDIA GeForce RTX 4090 with 48GB of VRAM, running vLLM 0.21.0 with CUDA 13.2. The model used is Qwen3.6-27B-FP8, an FP8 quantized version of the Qwen3.6 27B parameter model.

Model Architecture:

Hardware:

Context Window Experiment Results

We systematically tested 10 different context window sizes from 131K to 262K to find the practical limits of the 48GB GPU.

Test Configuration

All tests used the same base configuration:

--gpu-memory-utilization 0.85
--kv-cache-dtype fp8
--max-num-batched-tokens 16384
--max-num-seqs 4
--enable-chunked-prefill
--enable-prefix-caching
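
As a rough illustration of how such a sweep could be scripted, the sketch below prints one launch command per context size. It is a dry run rather than a record of our exact harness: the size list is copied from the results table, the model path and port are borrowed from the deployment guide later in this article, and launch_cmd is a hypothetical helper name.

```shell
#!/bin/bash
# Context sizes taken from the results table in this article.
SIZES="131072 163840 170000 180000 190000 200000 220000 250000 261856 262144"

# launch_cmd <max_model_len>
# Prints (does not run) the vLLM launch command for one context size.
launch_cmd() {
  local max_len=$1
  echo "python -m vllm.entrypoints.openai.api_server" \
       "--model /path/to/models/Qwen3.6-27B-FP8" \
       "--port 8391" \
       "--max-model-len ${max_len}" \
       "--gpu-memory-utilization 0.85" \
       "--kv-cache-dtype fp8" \
       "--max-num-batched-tokens 16384" \
       "--max-num-seqs 4" \
       "--enable-chunked-prefill" \
       "--enable-prefix-caching"
}

# Dry run: one command line per size.
for size in $SIZES; do
  launch_cmd "$size"
done
```

Running each printed command, waiting for the server to come up, and sending one long prompt would turn the dry run into an actual sweep.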

Results: No-MTP Mode (Maximum Context)

max-model-len Server Start Inference Test Notes
131,072 Baseline
163,840 163,831 input tokens
170,000 169,991 input tokens
180,000 172,679 input tokens
190,000
200,000 172,679 input tokens
220,000 172,679 input tokens
250,000 172,679 input tokens
261,856 GPU: 43,707/49,140 MiB (89%)
262,144 KV cache OOM (need 8.33 GiB, only 8.29 GiB)

Maximum stable context at gpu_memory_utilization=0.85: 261,856 tokens (~256K)

The model's theoretical maximum is 262,144 tokens. At gpu_memory_utilization=0.85, we reached 261,856 — just 288 tokens short of the absolute limit. The failure at 262,144 was due to KV cache memory being only 0.04 GiB short.

Follow-up Test: Reaching the Full 262,144

The result above raised an obvious question: if we're only 0.04 GiB short, can we reach the full 262,144 by increasing gpu_memory_utilization? We stopped the server, restarted with gpu_memory_utilization=0.90 and max_model_len=262144, and tested:
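
A back-of-the-envelope check suggests 0.90 should be more than enough. The sketch below computes the nominal memory budget at each setting from the 49,140 MiB total reported for this card. It assumes the budget scales linearly with gpu_memory_utilization and ignores how vLLM divides it among weights, activations, and KV cache; budget_mib is a hypothetical helper.

```shell
#!/bin/bash
# budget_mib <total_mib> <utilization_percent>
# Nominal memory budget in MiB, using integer arithmetic only.
budget_mib() {
  local total_mib=$1 util_pct=$2
  echo $(( total_mib * util_pct / 100 ))
}

total=49140                       # total GPU memory in MiB, as reported in this article
low=$(budget_mib "$total" 85)     # budget at gpu_memory_utilization=0.85
high=$(budget_mib "$total" 90)    # budget at gpu_memory_utilization=0.90

echo "budget at 0.85: ${low} MiB"
echo "budget at 0.90: ${high} MiB"
echo "extra budget:   $(( high - low )) MiB"
```

The nominal increase of 2,457 MiB (about 2.4 GiB) is far larger than the 0.04 GiB shortfall, so the extra 5% leaves ample room for the remaining 288 tokens.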

Prompt Tokens Result Time Notes
260,010 ✅ Success 64.27s First request (no cache)
261,010 ✅ Success 6.17s Prefix cached
262,010 ✅ Success 5.26s Prefix cached

GPU memory usage: 45,195 / 49,140 MiB (92%)
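
To put the timings in perspective, the sketch below derives two rough figures from the table: an effective throughput for the uncached request and the speedup of the first cached request. It assumes the reported times are end-to-end request latencies, so the throughput figure folds in generation time as well as prefill. The helper names are ours, for illustration only.

```shell
#!/bin/bash
# tokens_per_second <prompt_tokens> <seconds>
# Effective throughput, rounded to the nearest whole token per second.
tokens_per_second() {
  awk -v t="$1" -v s="$2" 'BEGIN { printf "%.0f\n", t / s }'
}

# speedup <baseline_seconds> <new_seconds>
# Ratio of the two latencies, to one decimal place.
speedup() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

echo "uncached throughput:  $(tokens_per_second 260010 64.27) tokens/s"
echo "prefix-cache speedup: $(speedup 64.27 6.17)x"
```

On these numbers the first request processes roughly 4,000 tokens per second, and a request that shares the cached prefix returns about ten times faster.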

Conclusion: With gpu_memory_utilization=0.90, the full 262,144 context window is achievable on RTX 4090 48GB. The model's entire theoretical context range can be utilized on consumer hardware.

Key Findings

1. FP8 KV Cache is the Game Changer

The most critical configuration parameter is --kv-cache-dtype fp8. Using FP8 for the KV cache reduces its memory consumption by approximately 50% compared to FP16, making it possible to fit 256K context on a 48GB GPU. Without the FP8 KV cache, the same memory budget would hold only about half as many tokens, putting 256K well out of reach on this card.
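
The effect can be estimated from the numbers vLLM reported in our runs. The sketch below scales the 8.33 GiB that the full 262,144-token window needs at FP8 (1 byte per element) to FP16 (2 bytes per element). It assumes KV cache size grows linearly with element width and that nothing else changes; kv_gib_for_dtype is a hypothetical helper.

```shell
#!/bin/bash
# kv_gib_for_dtype <gib_at_fp8> <bytes_per_element>
# Scales an FP8 KV cache size (1 byte per element) to another element width.
kv_gib_for_dtype() {
  awk -v gib="$1" -v bytes="$2" 'BEGIN { printf "%.2f\n", gib * bytes }'
}

fp8_gib=8.33   # KV cache needed for max_model_len=262144, as reported in this article

echo "FP8  KV cache: $(kv_gib_for_dtype "$fp8_gib" 1) GiB"
echo "FP16 KV cache: $(kv_gib_for_dtype "$fp8_gib" 2) GiB"
```

Under that assumption FP16 would need about 16.66 GiB, roughly twice the 8.29 GiB that was available for the KV cache at gpu_memory_utilization=0.85.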

2. Hardware Limit Approaches Model Theoretical Maximum

At gpu_memory_utilization=0.85, the hardware limit (261,856) is within 288 tokens of the model's theoretical maximum (262,144), and raising the setting to 0.90 closes the gap. The RTX 4090 48GB can therefore use the model's entire context window, a notable result for consumer-grade hardware.

4. MTP Trade-off: Speed vs. Context vs. Concurrency

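MTP with n=3 can accelerate token generation by 55-143%. The context impact depends on concurrency: with max_num_seqs=1, MTP supports ~256K context (99% of the no-MTP limit); with max_num_seqs=4, it supports ~199K (76%).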

For applications needing both speed and long context, set max_num_seqs=1 with MTP enabled. For high-concurrency scenarios, expect a proportional reduction in per-request context length.

Deployment Guide

Prerequisites

Step 1: Install vLLM

conda create -n vllm2 python=3.10 -y
conda activate vllm2
pip install vllm==0.21.0

Step 2: Download the Model

Download Qwen3.6-27B-FP8 from HuggingFace or ModelScope:

# Using huggingface-cli
huggingface-cli download Qwen/Qwen3.6-27B-FP8 --local-dir /path/to/models/Qwen3.6-27B-FP8

# Or using modelscope (recommended for China)
modelscope download --model Qwen/Qwen3.6-27B-FP8 --local_dir /path/to/models/Qwen3.6-27B-FP8

Step 3: Start the Server

Option A: Maximum Context (256K, No MTP)

For the longest context window (up to 261,856 tokens):

#!/bin/bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

python -m vllm.entrypoints.openai.api_server \
  --model /path/to/models/Qwen3.6-27B-FP8 \
  --port 8391 \
  --max-model-len 200000 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 4 \
  --enable-chunked-prefill \
  --enable-prefix-caching

Note: We recommend --max-model-len 200000 for production to leave headroom. The absolute maximum at gpu_memory_utilization=0.85 is 261,856 (the full 262,144 requires 0.90), but 200K provides a comfortable safety margin.

Option B: Faster Inference (180K, With MTP)

For faster generation while retaining a long context:

#!/bin/bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

python -m vllm.entrypoints.openai.api_server \
  --model /path/to/models/Qwen3.6-27B-FP8 \
  --port 8393 \
  --max-model-len 180000 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 4 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'

Step 4: Verify the Deployment

# Check model info
curl http://localhost:8391/v1/models | python3 -m json.tool

# Test inference
curl http://localhost:8391/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/path/to/models/Qwen3.6-27B-FP8",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 100
  }'
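
Loading a 27B model takes a while, so the checks above can fail simply because the server is not ready yet. A minimal sketch of a readiness poll follows. It assumes the server exposes vLLM's GET /health endpoint and that curl is installed; wait_for_server is a hypothetical helper, and the 5-second polling interval is an arbitrary choice.

```shell
#!/bin/bash
# wait_for_server <base_url> <timeout_seconds>
# Polls <base_url>/health until it answers with a success status
# or the timeout expires. Returns 0 when ready, 1 on timeout.
wait_for_server() {
  local base_url=$1 timeout=$2
  local deadline=$(( $(date +%s) + timeout ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if curl --silent --fail --output /dev/null "${base_url}/health"; then
      return 0
    fi
    sleep 5
  done
  return 1
}

# Example (not executed here): wait up to 15 minutes, then run the checks above.
# wait_for_server http://localhost:8391 900 && echo "server is ready"
```

A timeout of 0 returns immediately with a failure status, which makes the function safe to call in scripts that should not block.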

Configuration Parameter Reference

Parameter Value Description
--max-model-len 200000 Maximum context window (model supports up to 262,144; the limit is 261,856 at gpu_memory_utilization=0.85)
--gpu-memory-utilization 0.85 Fraction of GPU memory vLLM may use for weights, activations, and KV cache (higher = larger KV cache; lower = more memory left free outside vLLM)
--kv-cache-dtype fp8 KV cache data type (FP8 saves ~50% memory vs FP16; essential for 256K context)
--max-num-batched-tokens 16384 Max tokens processed per batch in chunked prefill
--max-num-seqs 4 Maximum concurrent sequences
--enable-chunked-prefill true Process long prompts in chunks to avoid memory spikes
--enable-prefix-caching true Cache common prefix tokens for repeated requests
--speculative-config MTP n=3 Multi-Token Prediction for faster generation; use with max_num_seqs=1 for maximum context

Conclusion

Running Qwen3.6-27B-FP8 with a 256K context window on a single RTX 4090 48GB is feasible, and the full 262,144-token window is achievable by raising gpu_memory_utilization from 0.85 to 0.90. The key ingredients are:

  1. FP8 KV cache: the single most important optimization, enabling 256K context on 48GB
  2. Chunked prefill: essential for handling long prompts without OOM
  3. MTP concurrency trade-off: MTP accelerates generation by 55-143%; with max_num_seqs=1 it supports ~256K context (99% of no-MTP), with max_num_seqs=4 it supports ~199K (76%)
  4. Conservative gpu_memory_utilization: 0.85 reaches 261,856 tokens while leaving more of the card's memory free; 0.90 is needed for the full 262,144

This setup makes it possible to run a powerful 27B parameter model with massive context on consumer-grade hardware, opening up new possibilities for local AI development and deployment. Whether you need 256K context for document analysis or MTP-accelerated generation with up to 256K context (single request), the RTX 4090 48GB delivers.