QLoRA

Background

Citation: QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023)

QLoRA addresses the issue of LoRA fine-tuning still requires substantial GPU memory, which becomes prohibitive for very large LLMs. QLoRA combines LoRA’s parameter efficiency with 4-bit quantization, enabling fine-tuning of 65B parameter models on a single 48GB GPU while preserving full 16-bit fine-tuning task performance. It can reduce the average memory requirements from >780GB to <48GB without degrading performance.

Quick Facts

  1. QLoRA is a memory-efficient fine-tuning method that uses 4-bit quantization.

  2. QLoRA introduces no decrease in performance compared to the full 16-bit LoRA fine-tuning.

  3. QLoRA works with any neural network containing dense layers.

Algorithmic Idea

The core idea behind QLoRA is that LoRA fine-tuning can be further optimized by quantizing the base model weights to 4 bits while keeping the adapter weights at full precision. This enables more people to fine-tune large models on their own hardware. QLoRA quantizes the pre-trained model weights to reduce memory usage while preserving the trainable LoRA parameters at 16-bit precision.

For a pre-trained weight matrix \(\mathbf{W}_0\), QLoRA quantizes it to 4-bit NormalFloat format. During computation, the quantized weights are dynamically dequantized back to 16 bits when performing operations with the input sequence \(\mathbf{x}\) and the adapter matrices \(\mathbf{A}\) and \(\mathbf{B}\), which remain in 16-bit precision.

During fine-tuning, the following hold true:

  1. \(\mathbf{W}_0\) is quantized to 4-bit precision and frozen and doesn’t receive any gradient updates.

  2. Only \(\mathbf{A}\) and \(\mathbf{B}\) contain trainable parameters in 16-bit precision.

  3. The forward pass dynamically dequantizes weights: \(\mathbf{h} = p_{16}(\mathbf{W}_0^{\text{NF4}}) \mathbf{x} + \gamma_r \mathbf{B}\mathbf{A} \mathbf{x}\).

  4. QLoRA uses NF4 quantization and paged optimizers to reduce memory usage dynamically.

Key Equations

For a pre-trained weight matrix \(\mathbf{W}_0\), the QLoRA forward pass is defined as:

\[\mathbf{y} = p_{16}(\mathbf{W}_0^{\text{NF4}}) \mathbf{x} + \gamma_r \mathbf{B}\mathbf{A} \mathbf{x}\]

Where:

  1. \(p_{16}(\mathbf{W}_0^{\text{NF4}})\) represents the dynamic dequantization of 4-bit NormalFloat weights back to 16-bit precision.

  2. \(\mathbf{W}_0^{\text{NF4}}\) are the pre-trained weights quantized to 4-bit NormalFloat format.

  3. \(\mathbf{A} \in \mathbb{R}^{r \times k}\) and \(\mathbf{B} \in \mathbb{R}^{d \times r}\) are the 16-bit LoRA adapter matrices.

  4. \(\gamma_r = \alpha/r\) is the scaling factor where \(\alpha\) is a hyperparameter and \(r\) is the adapter rank.

The double dequantization process is:

\[\text{doubleDequant}(c_1^{\text{FP32}}, c_2^{k\text{-bit}}, \mathbf{W}^{k\text{-bit}}) = \text{dequant}(\text{dequant}(c_1^{\text{FP32}}, c_2^{k\text{-bit}}), \mathbf{W}^{4\text{-bit}})\]

Where \(c_1\) and \(c_2\) are the first and second-level quantization constants.

Implementation in FinLoRA

To use QLoRA in FinLoRA, configure fine-tuning with 4-bit quantization:

python lora/finetune.py sentiment_llama_3_1_8b_4bits_r4

Configuration example from lora/finetune_configs.json:

"sentiment_llama_3_1_8b_4bits_r4": {
  "base_model": "meta-llama/Llama-3.1-8B-Instruct",
  "dataset_path": "../data/train/finlora_sentiment_train.jsonl",
  "lora_r": 4,
  "quant_bits": 4,
  "learning_rate": 0.0001,
  "num_epochs": 4,
  "batch_size": 8,
  "gradient_accumulation_steps": 2
}

Key parameters: - lora_r: The rank \(r\) of the LoRA adapter (typically 4-8 for QLoRA) - quant_bits: The quantization bits (4 for QLoRA, automatically enables NF4 and optimizations) - lora_alpha: The scaling parameter \(\alpha\) (default: 16, giving \(\gamma_r = \alpha/r\))

Usage Example

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model
base_model_name = "meta-llama/Llama-3.1-8B-Instruct"
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load QLoRA adapter
adapter_path = "./lora_adapters/4bits_r4/sentiment_llama_3_1_8b_4bits_r4"
model = PeftModel.from_pretrained(base_model, adapter_path)

# Generate text
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
prompt = "The financial markets showed positive sentiment today"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100, temperature=0)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)

References

Why This Method?

QLoRA is important for understanding memory-efficient fine-tuning of large language models that are too large to fit on a single GPU. It introduces key quantization techniques that enable LoRA fine-tuning to be done on consumer hardware without losing performance. QLoRA provides practical innovations for 4-bit fine-tuning that make fine-tuning accessible to a wider range of researchers at an affordable cost.