Fine-tuning lets you adapt a large language model to a specific task or domain; a well-tuned smaller model can often replace a much larger general-purpose one at a fraction of the inference cost. This guide covers practical techniques with working code.
Why Fine-Tune LLMs?
- Customize model behavior for specific domains
- Improve performance on specialized tasks
- Reduce inference costs with smaller, focused models
- Add company-specific knowledge and terminology
1. Setting Up the Environment
# Install required packages
pip install torch transformers datasets peft accelerate bitsandbytes
# For QLoRA (4-bit quantization)
pip install -U bitsandbytes
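Before loading anything, it is worth confirming that PyTorch actually sees a GPU, since fine-tuning a 7B model on CPU is not practical. A quick sanity check, assuming a CUDA machine:

import torch
import transformers
# Report library versions and available GPU memory
print(f"PyTorch {torch.__version__}, Transformers {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1e9:.1f} GB VRAM")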
2. Preparing Your Dataset
from datasets import Dataset
# Prepare instruction-following dataset
training_data = [
    {
        "instruction": "Summarize the following text:",
        "input": "The quick brown fox jumps over the lazy dog...",
        "output": "A fox jumps over a dog."
    },
    {
        "instruction": "Translate to French:",
        "input": "Hello, how are you?",
        "output": "Bonjour, comment allez-vous?"
    }
]
def format_prompt(example):
    return {
        "text": f"""### Instruction:
{example['instruction']}
### Input:
{example['input']}
### Response:
{example['output']}"""
    }
dataset = Dataset.from_list(training_data)
dataset = dataset.map(format_prompt)
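The two examples above are only a placeholder; real fine-tuning needs hundreds to thousands of examples and a held-out split for evaluation. A minimal sketch (the 10% split ratio and the variable name are just illustrative):

# Optional: hold out a small evaluation split and sanity-check the formatting
split = dataset.train_test_split(test_size=0.1, seed=42)
print(split["train"][0]["text"])  # confirm the prompt template renders as expected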
3. LoRA Fine-Tuning (Efficient)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
# Load base model
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 has no pad token by default; reuse EOS so batches can be padded
# LoRA configuration
lora_config = LoraConfig(
    r=16,              # Rank
    lora_alpha=32,     # Alpha for scaling
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Example output (r=16, 4 target modules): trainable params: ~16.8M || all params: ~6.7B || trainable%: ~0.25%
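# Where that number comes from: each adapted weight matrix W (d_out x d_in) gets two
# low-rank factors A (r x d_in) and B (d_out x r), adding r * (d_in + d_out) parameters.
# For Llama-2-7B (hidden size 4096, 32 layers), the four 4096x4096 attention projections
# above give 16 * (4096 + 4096) * 4 * 32 ≈ 16.8M trainable parameters, ~0.25% of the model.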
# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    warmup_steps=100,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    optim="adamw_torch"
)
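# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps
# = 4 * 4 = 16 sequences per optimizer update (per GPU)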
# Train (kwargs below match older trl releases; newer ones move
# dataset_text_field and max_seq_length into SFTConfig)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=512
)
trainer.train()
model.save_pretrained("./fine-tuned-model")
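The directory above holds only the small LoRA adapter weights. Saving the tokenizer next to it keeps the checkpoint self-contained (same path as used above):

# Save the tokenizer alongside the adapter so inference doesn't depend on the hub copy
tokenizer.save_pretrained("./fine-tuned-model")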
4. QLoRA (4-bit Quantized Fine-Tuning)
from transformers import BitsAndBytesConfig
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)
# Load 4-bit quantized model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)
# Prepare for k-bit training (upcasts norm layers to fp32 and enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
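# Optional: check how much memory the 4-bit weights occupy (activations, gradients,
# and LoRA optimizer state come on top of this figure)
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.2f} GB")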
# Train as before...
5. Inference with Fine-Tuned Model
from peft import PeftModel
# Load fine-tuned model
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./fine-tuned-model")
model = model.merge_and_unload() # Merge LoRA weights
# Generate
def generate_response(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.7,
        do_sample=True
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
response = generate_response("Summarize the benefits of exercise:")
print(response)
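If you would rather deploy without a peft dependency, the merged model can be written out as a regular Hugging Face checkpoint (the "./merged-model" path is just an example):

# Persist the merged weights as a standalone checkpoint
model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")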
Fine-Tuning Comparison
| Method | VRAM Required (7B model) | Training Time | Quality |
|---|---|---|---|
| Full Fine-Tuning | 80GB+ | Long | Best |
| LoRA | 16-24GB | Medium | Very Good |
| QLoRA (4-bit) | 8-12GB | Medium | Good |
Start with QLoRA for experimentation, then scale up as needed for production!