Parameter-Efficient Fine-Tuning (QLoRA)

Full fine-tuning of a large model updates every weight and typically needs several high-memory GPUs, because gradients and optimiser state must be kept for all of them. Parameter-efficient fine-tuning (PEFT) freezes the pretrained weights and trains only a small set of new parameters. QLoRA pushes this further by also quantising the frozen base model to 4 bits, so a 65B-parameter model can be fine-tuned on a single 48 GB GPU.
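
The 48 GB figure is easier to appreciate with a rough weight-only estimate. The sketch below is an illustration, not a measurement: `weight_memory_gb` is a hypothetical helper, and it ignores activations, adapter weights, optimiser state and quantisation constants, all of which add to the real footprint.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory needed just to store the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 65_000_000_000  # the 65B-parameter model mentioned above

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # 16-bit base weights
int4_gb = weight_memory_gb(N_PARAMS, 4)   # 4-bit quantised base weights

print(f"16-bit weights: {fp16_gb:.1f} GB")  # 130.0 GB
print(f" 4-bit weights: {int4_gb:.1f} GB")  # 32.5 GB
```

At 16 bits the weights alone (about 130 GB) exceed a 48 GB card, whereas at 4 bits they take about 32.5 GB, leaving headroom for adapters and activations.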

Working principle

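LoRA hypothesises that the weight change during fine-tuning has low intrinsic rank, so it represents the update ΔW to a d×k weight matrix W as a product B·A, where B is d×r, A is r×k, and the rank r is much smaller than d and k. Only A and B are trained; the original W stays frozen. QLoRA adds three ideas: store W in a new 4-bit NormalFloat (NF4) format, apply double quantisation to compress the quantisation constants themselves, and use paged optimisers to avoid out-of-memory errors during memory spikes. During training the 4-bit weights are dequantised to 16 bits for each matrix multiplication, and gradients are backpropagated through them into the 16-bit adapters; the base weights themselves receive no updates.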
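
To make the storage step concrete, here is a minimal sketch of block-wise absmax quantisation to 16 levels, followed by the arithmetic behind double quantisation. It is a simplification under stated assumptions: `LEVELS` uses evenly spaced values as a stand-in for the real NF4 codebook (whose levels are derived from quantiles of a normal distribution), the function names are hypothetical, and the block sizes (64 weights per block, 256 constants per second-level block) and bit widths (32-bit constants, re-quantised to 8 bits) are the defaults reported in the QLoRA paper.

```python
# 16 evenly spaced levels in [-1, 1]. ASSUMPTION: a stand-in for the NF4
# codebook, which spaces its 16 levels according to a normal distribution.
LEVELS = [-1.0 + 2.0 * i / 15 for i in range(16)]


def quantise_block(block, levels=LEVELS):
    """Map each weight in a block to the index of its nearest level (4 bits)."""
    scale = max(abs(w) for w in block) or 1.0  # per-block quantisation constant
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / scale))
        for w in block
    ]
    return codes, scale


def dequantise_block(codes, scale, levels=LEVELS):
    """Recover approximate weights from 4-bit codes and the block constant."""
    return [levels[c] * scale for c in codes]


def constant_overhead_bits(block_size=64, const_bits=32,
                           dq_block_size=None, dq_const_bits=8):
    """Bits per weight spent on storing the quantisation constants."""
    if dq_block_size is None:
        return const_bits / block_size
    # Double quantisation: first-level constants are stored in dq_const_bits,
    # plus one const_bits second-level constant per dq_block_size of them.
    return dq_const_bits / block_size + const_bits / (block_size * dq_block_size)


block = [0.12, -0.5, 0.33, 0.9]
codes, scale = quantise_block(block)
restored = dequantise_block(codes, scale)

plain = constant_overhead_bits()                    # 32 / 64
double = constant_overhead_bits(dq_block_size=256)  # 8/64 + 32/(64*256)
print(codes, restored)
print(f"overhead: {plain:.3f} -> {double:.3f} bits per weight")
```

With these defaults the constants cost 0.5 bits per weight before double quantisation and roughly 0.127 bits after it.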

Figure 1. Forward path h = Wx + (B·A)x. Only the small adapter matrices A, B are updated; the base weights stay frozen in 4-bit.
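
The forward path in Figure 1 can be sketched in a few lines of plain Python. This is an illustrative toy, not the reference implementation: the helper names and the tiny dimensions are made up, W is kept in full precision rather than 4-bit, and the α/r scaling factor that LoRA applies to the update is omitted to match the formula in the figure. Following the LoRA paper, A starts as small random values and B as zeros, so the adapter initially leaves the model's output unchanged.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_forward(W, A, B, x):
    """h = W x + B (A x). Applying A first keeps the update cheap:
    the intermediate vector has only r entries."""
    base = matvec(W, x)               # frozen path
    update = matvec(B, matvec(A, x))  # trainable low-rank path
    return [b + u for b, u in zip(base, update)]


random.seed(0)
d, k, r = 6, 5, 2  # hypothetical toy sizes; real layers are far larger

W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable
B = [[0.0] * r for _ in range(d)]                                  # trainable, zero init
x = [random.gauss(0, 1) for _ in range(k)]

h = lora_forward(W, A, B, x)

# Because B is all zeros, the adapter contributes nothing yet.
print(h == matvec(W, x))
```

Training then updates only the entries of A and B; W is never modified.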

Table 1. Fine-tuning strategies compared
Method	Trainable params	Memory	Quality
Full fine-tune	100%	Very high	Reference
LoRA	< 1%	Moderate	≈ full
QLoRA	< 1% + 4-bit base	Low (single GPU)	≈ full (16-bit)
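
The "< 1%" entries follow from simple counting. For a single d×k weight matrix, a rank-r adapter trains r·(d + k) values instead of d·k. The sketch below works this out for illustrative sizes (d = k = 4096 and r = 8 are chosen here as an example, not taken from the source); the fraction for a whole model also depends on which layers receive adapters.

```python
def lora_trainable_fraction(d, k, r):
    """Share of trainable values when a d x k matrix gets a rank-r adapter."""
    full_params = d * k        # values updated by full fine-tuning
    lora_params = r * (d + k)  # B is d x r, A is r x k
    return lora_params / full_params


frac = lora_trainable_fraction(d=4096, k=4096, r=8)
print(f"{frac:.2%}")  # 0.39%
```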

Key result: 4-bit QLoRA fine-tuning matched the quality of 16-bit full fine-tuning on instruction-following tasks, bringing LLM customisation within reach of commodity hardware.

Applications

Domain adaptation (legal, medical, code) of open-weight LLMs
On-device and private fine-tuning where data cannot leave the organisation
Rapid, cheap experimentation with many task-specific adapters

References & further reading

Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models,” ICLR 2022.
Dettmers et al., “QLoRA: Efficient Finetuning of Quantized LLMs,” NeurIPS 2023.
Houlsby et al., “Parameter-Efficient Transfer Learning for NLP,” ICML 2019.