In Practice: Fine-tuning Qwen3.6-27B for Verilog Code Generation with Unsloth
In the field of hardware design, Verilog HDL is the core language for describing digital circuits. However, general-purpose large language models often struggle to generate high-quality Verilog code: syntax errors, missing timing constraints, and mismatched module interfaces are common issues. We set out to turn Qwen3.6-27B into a Verilog code-generation specialist through LoRA fine-tuning.
Why Qwen3.6-27B?
Qwen3.6-27B is the latest open-source model from Alibaba, with the following advantages:
- 27B parameter scale: Balances performance and deployment cost
- Strong code capabilities: Excellent performance on HumanEval, MBPP and other programming benchmarks
- Open source: Apache 2.0 license, freely fine-tunable for commercial use
- Single-GPU deployable: Only 24GB VRAM needed for inference after quantization
Tech Stack
| Component | Choice |
|---|---|
| Base Model | Qwen3.6-27B |
| Fine-tuning Framework | Unsloth (memory optimized) |
| Training Method | LoRA (Low-Rank Adaptation) |
| Hardware | NVIDIA A100 80GB |
| Dataset | Resyn27k (27,000 Verilog instruction-response pairs) |
Advantages of Unsloth Framework
Unsloth is a fine-tuning framework focused on memory optimization, with core features including:
- 2-5x memory savings: Through kernel fusion and gradient checkpointing
- 2-5x training speedup: Optimized attention computation and position encoding
- Seamless integration: Compatible with HuggingFace Transformers and TRL
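To give a feel for the API, here is a minimal loading sketch (the local model path is the one used later in this article; the exact dtype/quantization flags are illustrative):

```python
# Minimal Unsloth loading sketch: FastLanguageModel wraps the HF model
# with Unsloth's fused kernels and optimized attention.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="/home/bjtc/models/Qwen3.6-27B",  # local path from this run
    max_seq_length=4096,
    dtype=None,          # auto-detect; resolves to bfloat16 on A100
    load_in_4bit=False,  # standard LoRA here; set True for QLoRA
)
```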
Dataset Preparation
We used the Resyn27k dataset, containing 27,000 high-quality Verilog instruction-response pairs. The data format is JSONL:
{
  "Instruction": "Design a 4-bit adder module",
  "Response": ["module adder_4bit(...)..."]
}
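A minimal loading sketch for this format (field names as shown above; the `### Instruction / ### Response` prompt template is an assumption, not necessarily the exact one used in training):

```python
# Convert Resyn27k JSONL records into the single-text format that
# TRL's SFTTrainer consumes. The template below is an assumption.
import json

PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

def load_resyn27k(path):
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            response = record["Response"]
            if isinstance(response, list):  # "Response" is stored as a list
                response = "\n".join(response)
            examples.append({"text": PROMPT_TEMPLATE.format(
                instruction=record["Instruction"], response=response)})
    return examples
```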
The data covers various Verilog task types:
- Combinational logic design (adders, multiplexers, encoders)
- Sequential logic design (counters, state machines, FIFOs)
- Interface protocols (AXI, SPI, I2C, UART)
- Testbench writing
Training Configuration
Core Hyperparameters
# Model parameters
MODEL_NAME = "/home/bjtc/models/Qwen3.6-27B"
MAX_SEQ_LENGTH = 4096
# LoRA parameters
LORA_R = 32 # Low-rank dimension
LORA_ALPHA = 64 # Scaling factor
LORA_DROPOUT = 0.05 # Dropout rate
# Training parameters
EPOCHS = 3
LEARNING_RATE = 2e-5
BATCH_SIZE = 2
GRADIENT_ACCUMULATION = 4 # Effective batch size = 2 × 4 = 8
WARMUP_RATIO = 0.05
BF16 = True # Mixed precision training
LoRA Target Modules
We selected the following modules for fine-tuning, covering attention mechanisms and feed-forward networks:
target_modules = [
"q_proj", "k_proj", "v_proj", "o_proj", # Attention layers
"gate_proj", "up_proj", "down_proj", # Feed-forward network
]
Key Configuration Notes
- `pack_examples=False`: Unsloth is incompatible with TRL's `pack_examples` parameter, so it must be set to `False`
- `gradient_checkpointing='unsloth'`: Enables Unsloth-optimized gradient checkpointing, which significantly saves memory
- `bf16=True`: The A100 natively supports BF16, recommended for better numerical stability
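Putting the pieces together, here is a wiring sketch consistent with the parameters above (argument layout varies across TRL versions; `dataset` is assumed to be the formatted Resyn27k data from earlier):

```python
# Wiring sketch: attach LoRA adapters and build the trainer.
# Assumes `model` and `tokenizer` from FastLanguageModel.from_pretrained
# and a `dataset` whose "text" column holds the formatted pairs.
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # Unsloth-optimized checkpointing
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=4096,
    args=TrainingArguments(
        num_train_epochs=3,
        learning_rate=2e-5,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_ratio=0.05,
        bf16=True,
        logging_steps=10,
        output_dir="outputs",
    ),
)
trainer.train()
```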
Multi-GPU Training Pitfalls
We initially attempted distributed training with dual A100s, but encountered compatibility issues between Unsloth and DDP (Distributed Data Parallel):
- `device_map='auto'` conflicts with DDP: Unsloth's automatic device allocation clashes with DDP's device management
- `accelerate launch` failed: Inter-process communication anomalies
- `torchrun` failed: Similar device-mapping issues
Final solution: fall back to single-GPU training. Although slower, it is the most stable option, and for a 27B-parameter model a single A100 80GB is fully sufficient.
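When multiple GPUs are visible, pinning the process to one card before any CUDA initialization sidesteps the device-mapping conflicts above. A sketch:

```python
# Pin training to a single GPU *before* importing torch/unsloth,
# so no multi-device mapping is ever attempted.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch  # imported after pinning on purpose

assert torch.cuda.device_count() == 1
```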
Training Process
Training runs in the background in a tmux session, with key metrics as follows:
| Metric | Value |
|---|---|
| Total steps | 9,453 |
| Time per step | ~20 seconds |
| GPU utilization | 99% |
| Memory usage | 65GB / 80GB |
| Estimated total time | ~55 hours |
Training logs are written in real time to `/home/bjtc/workspace/train_final.log` and can be monitored with `tail -f`.
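Beyond `tail -f`, a small sketch for pulling loss values out of the log, assuming HuggingFace Trainer's default `{'loss': ...}` log lines:

```python
# Extract logged loss values for a quick convergence check.
# Assumes the default HF Trainer log format: {'loss': 1.234, ...}
import re

def read_losses(path="/home/bjtc/workspace/train_final.log"):
    pattern = re.compile(r"'loss': ([0-9.]+)")
    with open(path) as f:
        return [float(m.group(1)) for m in pattern.finditer(f.read())]

losses = read_losses()
if losses:
    print(f"{len(losses)} loss points, latest: {losses[-1]:.4f}")
```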
Memory Optimization Tips
Training a 27B model on A100 80GB requires critical memory optimization:
- Gradient checkpointing: `gradient_checkpointing=True` trades computation for memory
- Gradient accumulation: Small batch size + gradient accumulation simulates a large effective batch
- BF16 mixed precision: Halves memory usage relative to FP32
- Unsloth optimization: Kernel fusion reduces the memory usage of intermediate activations
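To verify these optimizations are paying off, peak memory can be checked directly from PyTorch's allocator statistics (a sketch; note that `nvidia-smi` reports reserved memory, which runs somewhat higher):

```python
# Quick check of peak GPU memory after a few training steps.
import torch

peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated: {peak_gib:.1f} GiB")
```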
Expected Results
After fine-tuning, we expect significant improvements in the following areas:
- Syntax correctness: Generated Verilog code passes `iverilog` compilation (see the check sketched after this list)
- Structural integrity: Correct module interfaces, port declarations, and instantiations
- Timing constraints: Proper use of sequential logic and clock domains
- Code style: Industry-standard naming and indentation
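Syntax correctness in particular is easy to check automatically. A sketch that compiles a generated module with `iverilog`:

```python
# Sketch: check whether generated Verilog passes iverilog compilation.
import os
import subprocess
import tempfile

def compiles(verilog_code: str) -> bool:
    """Return True if iverilog accepts the code without errors."""
    with tempfile.NamedTemporaryFile("w", suffix=".v", delete=False) as f:
        f.write(verilog_code)
        path = f.name
    try:
        result = subprocess.run(
            ["iverilog", "-o", os.devnull, path],
            capture_output=True, text=True,
        )
        return result.returncode == 0
    finally:
        os.unlink(path)
```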
Next Steps
- Complete training: Monitor Loss curve to ensure convergence
- Model evaluation: Use automated scripts to assess syntax and functional correctness
- Inference testing: Compare code generation quality before and after fine-tuning (a loading sketch follows this list)
- Catastrophic forgetting detection: Verify whether general capabilities are impaired
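For the inference comparison, Unsloth can load a saved LoRA adapter directly. A sketch (the adapter path is hypothetical):

```python
# Sketch: generate Verilog from the fine-tuned model.
# "outputs/checkpoint-final" is a hypothetical adapter path.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="outputs/checkpoint-final",  # LoRA adapter saved by the trainer
    max_seq_length=4096,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's fast generation path

prompt = "### Instruction:\nDesign a 4-bit adder module\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```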
Summary
Through the Unsloth framework, we successfully launched Verilog domain fine-tuning of Qwen3.6-27B on a single A100 80GB. The key points of the entire process are:
- Choosing appropriate LoRA hyperparameters (r=32, alpha=64)
- Avoiding DDP compatibility traps by using single-GPU training
- Fully leveraging Unsloth's memory optimization features
After training completion, we will share detailed evaluation results and inference examples.
This article is based on actual QevosAgent execution cases, with training scripts and configuration parameters from real execution records.