Full fine-tuning of a large model updates every weight and needs many high-memory GPUs. Parameter-efficient fine-tuning (PEFT) freezes the pretrained weights and trains only a tiny set of new parameters. QLoRA pushes this further by also quantising the frozen model to 4 bits, so a 65B-parameter model can be adapted on a single 48 GB GPU.
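To make the memory claim concrete, here is a back-of-the-envelope sketch of how much space the weights alone occupy at different precisions. The helper name `weight_memory_gb` is hypothetical, and the estimate deliberately ignores activations, adapter weights, optimiser state and quantisation constants, so real usage will be higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate storage for the weights only, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 65e9  # the 65B-parameter model mentioned above

print(f"16-bit: {weight_memory_gb(n_params, 16):.1f} GB")  # 16-bit: 130.0 GB
print(f" 4-bit: {weight_memory_gb(n_params, 4):.1f} GB")   #  4-bit: 32.5 GB
```

Under these assumptions the 4-bit weights take roughly 32.5 GB, which leaves headroom on a 48 GB GPU, whereas the 16-bit weights alone would not fit.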
Working principle
LoRA builds on the hypothesis that the weight change during fine-tuning has low intrinsic rank, so it represents the update ΔW as a product of two small matrices, B·A, whose rank r is much smaller than the dimensions of W. Only A and B are trained; the original W stays frozen. QLoRA adds three ideas: store W in a new 4-bit NormalFloat (NF4) format, apply double quantisation to compress the quantisation constants, and use paged optimisers to avoid memory spikes. During training the 4-bit weights are dequantised to 16-bit for computation, and gradients flow through the frozen weights into the 16-bit adapters.
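The mechanism above can be sketched in a few lines of NumPy. This is a minimal illustration, not the QLoRA implementation: the quantiser is a plain uniform absmax scheme per block, used as a simplified stand-in for NF4. Double quantisation and paged optimisers are omitted, and the codes are held in `int8` rather than packed into 4 bits. The layer sizes, rank `r`, scaling `alpha`, block size and all function names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 128, 4, 8  # hypothetical layer shape, LoRA rank and scaling

# Frozen pretrained weight (random here, purely for illustration).
W = rng.standard_normal((d_out, d_in)).astype(np.float32)


def quantize_blockwise(w, block_size=64):
    """Uniform absmax quantisation per block (simplified stand-in for NF4)."""
    blocks = w.reshape(-1, block_size)
    scale = np.abs(blocks).max(axis=1, keepdims=True)  # one constant per block
    scale[scale == 0] = 1.0                            # avoid division by zero
    codes = np.round(blocks / scale * 7).astype(np.int8)  # integer codes in [-7, 7]
    return codes, scale


def dequantize_blockwise(codes, scale, shape):
    """Map the integer codes back to floating point for computation."""
    return (codes.astype(np.float32) / 7 * scale).reshape(shape)


codes, scale = quantize_blockwise(W)  # only these are stored for the base model

# Trainable LoRA adapters: A gets a small random init, B starts at zero,
# so the update B @ A is exactly zero before training.
A = (0.01 * rng.standard_normal((r, d_in))).astype(np.float32)
B = np.zeros((d_out, r), dtype=np.float32)


def forward(x):
    """y = x W^T + (alpha / r) * x A^T B^T, with W dequantised on the fly."""
    W_deq = dequantize_blockwise(codes, scale, W.shape)
    return x @ W_deq.T + (alpha / r) * (x @ A.T) @ B.T


x = rng.standard_normal((2, d_in)).astype(np.float32)
y = forward(x)
print(y.shape)  # (2, 64)
```

In a real training loop only `A` and `B` would receive gradient updates; `codes` and `scale` stay fixed throughout.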
| Method | Trainable params | Memory | Quality |
|---|---|---|---|
| Full fine-tune | 100% | Very high | Reference |
| LoRA | < 1% | Moderate | ≈ full |
| QLoRA | < 1% (base frozen in 4-bit) | Low (single GPU) | ≈ full (16-bit) |
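The "< 1%" figure in the table follows from simple counting. As a rough sketch for a single weight matrix, a rank-r adapter adds r·(d_in + d_out) parameters against d_in·d_out in the full matrix. The 4096×4096 shape and rank 16 below are hypothetical values picked for illustration, and the function name is an assumption.

```python
def lora_trainable_fraction(d_out, d_in, rank):
    """Adapter parameters as a fraction of the full weight matrix."""
    full_params = d_out * d_in
    lora_params = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return lora_params / full_params


frac = lora_trainable_fraction(4096, 4096, 16)
print(f"{frac:.2%}")  # 0.78%
```

The model-wide fraction is usually lower still, since adapters are typically attached to only some of the weight matrices.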
Key result
QLoRA's headline result: 4-bit fine-tuning matched the quality of 16-bit full fine-tuning on instruction-following tasks, bringing LLM customisation within reach of commodity hardware.
Applications
- Domain adaptation (legal, medical, code) of open-weight LLMs
- On-device and private fine-tuning where data cannot leave the org
- Rapid, cheap experimentation with many task-specific adapters
References & further reading
- Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models,” ICLR 2022.
- Dettmers et al., “QLoRA: Efficient Finetuning of Quantized LLMs,” NeurIPS 2023.
- Houlsby et al., “Parameter-Efficient Transfer Learning for NLP,” ICML 2019.