
In Practice: Fine-tuning Qwen3.6-27B for Verilog Code Generation with Unsloth

In the field of hardware design, Verilog HDL is the core language for describing digital circuits. However, general-purpose large language models often struggle to generate high-quality Verilog code: syntax errors, missing timing constraints, and mismatched module interfaces are common failure modes. In this post, we use LoRA fine-tuning to turn Qwen3.6-27B into a Verilog code-generation specialist.

Why Qwen3.6-27B?

Qwen3.6-27B is the latest open-source model from Alibaba and serves as the base model for this experiment.

Tech Stack

Component              Choice
---------------------  ----------------------------------------------------
Base Model             Qwen3.6-27B
Fine-tuning Framework  Unsloth (memory optimized)
Training Method        LoRA (Low-Rank Adaptation)
Hardware               NVIDIA A100 80GB
Dataset                Resyn27k (27,000 Verilog instruction-response pairs)

Advantages of Unsloth Framework

Unsloth is a fine-tuning framework focused on memory optimization, with core features including:

  1. 2-5x memory savings: Through kernel fusion and gradient checkpointing
  2. 2-5x training speedup: Optimized attention computation and position encoding
  3. Seamless integration: Compatible with HuggingFace Transformers and TRL

Dataset Preparation

We used the Resyn27k dataset, containing 27,000 high-quality Verilog instruction-response pairs. The data is stored as JSONL, one JSON object per line; a single record, pretty-printed here for readability:

{
  "Instruction": "Design a 4-bit adder module",
  "Response": ["module adder_4bit(...)..."]
}
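Before training, each record has to be rendered into a single prompt string. Below is a minimal sketch assuming a simple Alpaca-style instruction template; the exact template used in our training script is not shown here, so treat the formatting as illustrative:

```python
def format_example(record: dict) -> str:
    """Render one Resyn27k record into a training prompt (hypothetical template)."""
    instruction = record["Instruction"]
    # "Response" is stored as a list in the dataset; join its parts into one string
    response = "\n".join(record["Response"])
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"

record = {
    "Instruction": "Design a 4-bit adder module",
    "Response": ["module adder_4bit(...)..."],
}
print(format_example(record))
```

Whatever template you choose, the same one must be used at inference time, otherwise the model sees a distribution it was never trained on.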

The data covers a variety of Verilog task types.

Training Configuration

Core Hyperparameters

# Model parameters
MODEL_NAME = "/home/bjtc/models/Qwen3.6-27B"
MAX_SEQ_LENGTH = 4096

# LoRA parameters
LORA_R = 32          # Low-rank dimension
LORA_ALPHA = 64      # Scaling factor
LORA_DROPOUT = 0.05  # Dropout rate

# Training parameters
EPOCHS = 3
LEARNING_RATE = 2e-5
BATCH_SIZE = 2
GRADIENT_ACCUMULATION = 4  # Effective batch size = 2 * 4 = 8 on a single GPU
WARMUP_RATIO = 0.05
BF16 = True           # Mixed precision training
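As a sanity check on the schedule, the optimizer-step count follows from the dataset size and the effective batch size (per-device batch × gradient accumulation). A quick back-of-the-envelope calculation with the numbers above:

```python
import math

DATASET_SIZE = 27_000        # Resyn27k examples
BATCH_SIZE = 2               # per-device micro-batch
GRADIENT_ACCUMULATION = 4
EPOCHS = 3

effective_batch = BATCH_SIZE * GRADIENT_ACCUMULATION   # 8 on a single GPU
steps_per_epoch = math.ceil(DATASET_SIZE / effective_batch)
total_steps = steps_per_epoch * EPOCHS
print(effective_batch, steps_per_epoch, total_steps)
```

This upper bound lands near the step count reported by the trainer; the exact logged number is somewhat lower, which typically comes from sequence packing or dropped incomplete batches.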

LoRA Target Modules

We selected the following modules for fine-tuning, covering attention mechanisms and feed-forward networks:

target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",     # Attention layers
    "gate_proj", "up_proj", "down_proj",         # Feed-forward network
]
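For reference, the equivalent adapter configuration expressed directly in HuggingFace PEFT looks roughly like the fragment below; Unsloth exposes the same options through its own `FastLanguageModel.get_peft_model` wrapper, so this is a sketch of the configuration, not a drop-in replacement for the training script:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                 # LORA_R: low-rank dimension
    lora_alpha=64,        # LORA_ALPHA: scaling factor
    lora_dropout=0.05,    # LORA_DROPOUT
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention layers
        "gate_proj", "up_proj", "down_proj",      # feed-forward network
    ],
    task_type="CAUSAL_LM",
)
```

Covering both the attention projections and the MLP projections is a common default; restricting LoRA to attention only would shrink the adapter further at some cost in adaptation capacity.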

Key Configuration Notes

Multi-GPU Training Pitfalls

We initially attempted distributed training with dual A100s, but encountered compatibility issues between Unsloth and DDP (Distributed Data Parallel):

  1. device_map='auto' conflicts with DDP: Unsloth's automatic device allocation conflicts with DDP's device management
  2. accelerate launch failed: Inter-process communication anomalies
  3. torchrun failed: Similar device mapping issues

Final solution: fall back to single-GPU training. Although slower, it is by far the most stable option, and for LoRA fine-tuning of a 27B model a single A100 80GB is fully sufficient.
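In practice the fallback just means pinning the process to one device instead of going through a distributed launcher. A sketch of the launch command, where `train_lora.py` is a placeholder for the actual training script name:

```shell
# Inside a tmux session: pin to a single GPU, no torchrun / accelerate launch.
# train_lora.py is a hypothetical name for the training script.
CUDA_VISIBLE_DEVICES=0 python train_lora.py 2>&1 | tee train_final.log
```

Setting CUDA_VISIBLE_DEVICES also prevents device_map='auto' from spreading the model across cards it should not touch.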

Training Process

Training runs in the background in a tmux session, with key metrics as follows:

Metric                Value
--------------------  --------------
Total steps           9,453
Time per step         ~20 seconds
GPU utilization       99%
Memory usage          65 GB / 80 GB
Estimated total time  ~55 hours
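The time estimate is simple arithmetic on the two measured numbers above:

```python
TOTAL_STEPS = 9_453
SECONDS_PER_STEP = 20

total_hours = TOTAL_STEPS * SECONDS_PER_STEP / 3600
print(f"{total_hours:.1f} hours")
```

That works out to roughly 52.5 hours of pure step time, in line with the ~55 hour estimate once warmup, evaluation, and checkpointing overhead are added.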

Training logs are output in real-time to /home/bjtc/workspace/train_final.log, monitorable via tail -f.

Memory Optimization Tips

Training a 27B model on A100 80GB requires critical memory optimization:

  1. Gradient checkpointing: gradient_checkpointing=True, trade computation for memory
  2. Gradient accumulation: Small batch size + gradient accumulation to simulate large batch effects
  3. BF16 mixed precision: Halves memory usage
  4. Unsloth optimization: Kernel fusion reduces memory usage of intermediate activations
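To see why these tricks are non-negotiable, consider the weights alone: 27B parameters in BF16 (2 bytes each) already consume most of the card before optimizer state or activations are counted. A back-of-the-envelope sketch:

```python
PARAMS = 27e9       # 27B parameters
BYTES_BF16 = 2      # bytes per BF16 value

weights_gib = PARAMS * BYTES_BF16 / 2**30
print(f"{weights_gib:.1f} GiB")
```

That is roughly 50 GiB of the 80 GiB card for weights alone, which is why LoRA (optimizer state only for the small adapter) plus checkpointed activations is what makes the run fit.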

Expected Results

After fine-tuning, we expect significant improvements in exactly the areas where general-purpose models struggle: syntax correctness, timing constraints, and module interface consistency.

Next Steps

  1. Complete training: Monitor Loss curve to ensure convergence
  2. Model evaluation: Use automated scripts to assess syntax and functional correctness
  3. Inference testing: Compare code generation quality before and after fine-tuning
  4. Catastrophic forgetting detection: Verify whether general capabilities are impaired
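For the syntax-correctness part of the evaluation, the authoritative check is to run each generated file through a real Verilog compiler (e.g. Icarus Verilog). As a cheap pre-filter before invoking a compiler, a structural sanity check can catch obviously truncated generations. A minimal sketch; the `module`/`endmodule` balance heuristic is our own illustration, not part of any standard tool:

```python
import re

def looks_complete(verilog_src: str) -> bool:
    """Cheap pre-filter: every `module` should have a matching `endmodule`."""
    # \bmodule\b does not match inside "endmodule" (no word boundary after "end")
    modules = len(re.findall(r"\bmodule\b", verilog_src))
    endmodules = len(re.findall(r"\bendmodule\b", verilog_src))
    return modules > 0 and modules == endmodules

good = "module adder(input a, b, output s); assign s = a ^ b; endmodule"
truncated = "module adder(input a, b, output s); assign s ="
print(looks_complete(good), looks_complete(truncated))
```

Generations that fail even this check can be scored as incorrect without spending compiler time on them.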

Summary

Through the Unsloth framework, we successfully launched Verilog domain fine-tuning of Qwen3.6-27B on a single A100 80GB. The key points of the process: Unsloth's kernel fusion and memory optimizations make 27B-scale LoRA training feasible on one card; the DDP incompatibilities are best sidestepped by falling back to stable single-GPU training; and gradient checkpointing, gradient accumulation, and BF16 keep memory usage within the 80GB budget.

After training completion, we will share detailed evaluation results and inference examples.


This article is based on actual QevosAgent execution cases, with training scripts and configuration parameters from real execution records.