In Practice: Fine-tuning Qwen3.6-27B for Verilog Code Generation with Unsloth
In the field of hardware design, Verilog HDL is the core language for describing digital circuits. However, general-purpose large language models often struggle to generate high-quality Verilog code: syntax errors, missing timing constraints, and mismatched module interfaces are common issues. We set out to turn Qwen3.6-27B into a Verilog code-generation specialist through LoRA fine-tuning.
Why Qwen3.6-27B?
Qwen3.6-27B is the latest open-source model from Alibaba, with the following advantages:
- 27B parameter scale: Balances performance and deployment cost
- Strong code capabilities: Excellent performance on HumanEval, MBPP and other programming benchmarks
- Open source: Apache 2.0 license, freely fine-tunable for commercial use
- Single-GPU deployable: Only 24GB VRAM needed for inference after quantization
Tech Stack
| Component | Choice |
|---|---|
| Base Model | Qwen3.6-27B |
| Fine-tuning Framework | Unsloth (memory optimized) |
| Training Method | LoRA (Low-Rank Adaptation) |
| Hardware | NVIDIA A100 80GB |
| Dataset | Resyn27k (27,000 Verilog instruction-response pairs) |
Advantages of Unsloth Framework
Unsloth is a fine-tuning framework focused on memory optimization, with core features including:
- 2-5x memory savings: Through kernel fusion and gradient checkpointing
- 2-5x training speedup: Optimized attention computation and position encoding
- Seamless integration: Compatible with HuggingFace Transformers and TRL
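To give a feel for the API, here is a minimal loading sketch (the local model path is the one used later in this article; the exact dtype/quantization flags are illustrative):

```python
# Minimal Unsloth loading sketch: FastLanguageModel wraps the HF model
# with Unsloth's fused kernels and optimized attention.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="/home/bjtc/models/Qwen3.6-27B",  # local path from this run
    max_seq_length=4096,
    dtype=None,          # auto-detect; resolves to bfloat16 on A100
    load_in_4bit=False,  # standard LoRA here; set True for QLoRA
)
```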
Dataset Preparation
We used the Resyn27k dataset, containing 27,000 high-quality Verilog instruction-response pairs. The data format is JSONL:
{
  "Instruction": "Design a 4-bit adder module",
  "Response": ["module adder_4bit(...)..."]
}
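A minimal loading sketch for this format (field names as shown above; the `### Instruction / ### Response` prompt template is an assumption, not necessarily the exact one used in training):

```python
# Convert Resyn27k JSONL records into the single-text format that
# TRL's SFTTrainer consumes. The template below is an assumption.
import json

PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

def load_resyn27k(path):
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            response = record["Response"]
            if isinstance(response, list):  # "Response" is stored as a list
                response = "\n".join(response)
            examples.append({"text": PROMPT_TEMPLATE.format(
                instruction=record["Instruction"], response=response)})
    return examples
```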
The data covers various Verilog task types:
- Combinational logic design (adders, multiplexers, encoders)
- Sequential logic design (counters, state machines, FIFOs)
- Interface protocols (AXI, SPI, I2C, UART)
- Testbench writing
Training Configuration
Core Hyperparameters
# Model parameters
MODEL_NAME = "/home/bjtc/models/Qwen3.6-27B"
MAX_SEQ_LENGTH = 4096
# LoRA parameters
LORA_R = 32 # Low-rank dimension
LORA_ALPHA = 64 # Scaling factor
LORA_DROPOUT = 0.05 # Dropout rate
# Training parameters
EPOCHS = 3
LEARNING_RATE = 2e-5
BATCH_SIZE = 2
GRADIENT_ACCUMULATION = 4 # Effective batch size = 2 × 4 = 8
WARMUP_RATIO = 0.05
BF16 = True # Mixed precision training
LoRA Target Modules
We selected the following modules for fine-tuning, covering attention mechanisms and feed-forward networks:
target_modules = [
"q_proj", "k_proj", "v_proj", "o_proj", # Attention layers
"gate_proj", "up_proj", "down_proj", # Feed-forward network
]
Key Configuration Notes
- `pack_examples=False`: Unsloth is incompatible with TRL's `pack_examples` parameter, so it must be set to `False`
- `gradient_checkpointing='unsloth'`: Enables Unsloth-optimized gradient checkpointing, which significantly saves memory
- `bf16=True`: The A100 natively supports BF16, recommended for better numerical stability
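Putting the pieces together, here is a wiring sketch consistent with the parameters above (argument layout varies across TRL versions; `dataset` is assumed to be the formatted Resyn27k data from earlier):

```python
# Wiring sketch: attach LoRA adapters and build the trainer.
# Assumes `model` and `tokenizer` from FastLanguageModel.from_pretrained
# and a `dataset` whose "text" column holds the formatted pairs.
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # Unsloth-optimized checkpointing
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=4096,
    args=TrainingArguments(
        num_train_epochs=3,
        learning_rate=2e-5,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_ratio=0.05,
        bf16=True,
        logging_steps=10,
        output_dir="outputs",
    ),
)
trainer.train()
```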
Multi-GPU Training Pitfalls
We initially attempted distributed training with dual A100s, but encountered compatibility issues between Unsloth and DDP (Distributed Data Parallel):
- `device_map='auto'` conflicts with DDP: Unsloth's automatic device allocation clashes with DDP's device management
- `accelerate launch` failed: Inter-process communication anomalies
- `torchrun` failed: Similar device-mapping issues
Final solution: fall back to single-GPU training. Although slower, it is the most stable option, and for a 27B-parameter model a single A100 80GB is fully sufficient.
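When multiple GPUs are visible, pinning the process to one card before any CUDA initialization sidesteps the device-mapping conflicts above. A sketch:

```python
# Pin training to a single GPU *before* importing torch/unsloth,
# so no multi-device mapping is ever attempted.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch  # imported after pinning on purpose

assert torch.cuda.device_count() == 1
```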
Training Process
Training runs in the background in a tmux session, with key metrics as follows:
| Metric | Value |
|---|---|
| Total steps | 9,453 |
| Time per step | ~20 seconds |
| GPU utilization | 99% |
| Memory usage | 65GB / 80GB |
| Estimated total time | ~55 hours |
Training logs are written in real time to `/home/bjtc/workspace/train_final.log` and can be monitored with `tail -f`.
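Beyond `tail -f`, a small sketch for pulling loss values out of the log, assuming HuggingFace Trainer's default `{'loss': ...}` log lines:

```python
# Extract logged loss values for a quick convergence check.
# Assumes the default HF Trainer log format: {'loss': 1.234, ...}
import re

def read_losses(path="/home/bjtc/workspace/train_final.log"):
    pattern = re.compile(r"'loss': ([0-9.]+)")
    with open(path) as f:
        return [float(m.group(1)) for m in pattern.finditer(f.read())]

losses = read_losses()
if losses:
    print(f"{len(losses)} loss points, latest: {losses[-1]:.4f}")
```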
Memory Optimization Tips
Training a 27B model on A100 80GB requires critical memory optimization:
- Gradient checkpointing: `gradient_checkpointing=True` trades computation for memory
- Gradient accumulation: Small batch size + gradient accumulation simulates a large effective batch
- BF16 mixed precision: Halves memory usage relative to FP32
- Unsloth optimization: Kernel fusion reduces the memory usage of intermediate activations
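To verify these optimizations are paying off, peak memory can be checked directly from PyTorch's allocator statistics (a sketch; note that `nvidia-smi` reports reserved memory, which runs somewhat higher):

```python
# Quick check of peak GPU memory after a few training steps.
import torch

peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated: {peak_gib:.1f} GiB")
```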
Expected Results
After fine-tuning, we expect significant improvements in the following areas:
- Syntax correctness: Generated Verilog code passes `iverilog` compilation (see the check sketched after this list)
- Structural integrity: Correct module interfaces, port declarations, and instantiations
- Timing constraints: Proper use of sequential logic and clock domains
- Code style: Industry-standard naming and indentation
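Syntax correctness in particular is easy to check automatically. A sketch that compiles a generated module with `iverilog`:

```python
# Sketch: check whether generated Verilog passes iverilog compilation.
import os
import subprocess
import tempfile

def compiles(verilog_code: str) -> bool:
    """Return True if iverilog accepts the code without errors."""
    with tempfile.NamedTemporaryFile("w", suffix=".v", delete=False) as f:
        f.write(verilog_code)
        path = f.name
    try:
        result = subprocess.run(
            ["iverilog", "-o", os.devnull, path],
            capture_output=True, text=True,
        )
        return result.returncode == 0
    finally:
        os.unlink(path)
```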
Next Steps
- Complete training: Monitor Loss curve to ensure convergence
- Model evaluation: Use automated scripts to assess syntax and functional correctness
- Inference testing: Compare code generation quality before and after fine-tuning (a loading sketch follows this list)
- Catastrophic forgetting detection: Verify whether general capabilities are impaired
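For the inference comparison, Unsloth can load a saved LoRA adapter directly. A sketch (the adapter path is hypothetical):

```python
# Sketch: generate Verilog from the fine-tuned model.
# "outputs/checkpoint-final" is a hypothetical adapter path.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="outputs/checkpoint-final",  # LoRA adapter saved by the trainer
    max_seq_length=4096,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's fast generation path

prompt = "### Instruction:\nDesign a 4-bit adder module\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```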
Summary
Through the Unsloth framework, we successfully launched Verilog domain fine-tuning of Qwen3.6-27B on a single A100 80GB. The key points of the entire process are:
- Choosing appropriate LoRA hyperparameters (r=32, alpha=64)
- Avoiding DDP compatibility traps by using single-GPU training
- Fully leveraging Unsloth's memory optimization features
After training completion, we will share detailed evaluation results and inference examples.
This article is based on actual QevosAgent execution cases, with training scripts and configuration parameters from real execution records.