rsLoRA 

Background 

Citation: A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA (Kalajdzievski, 2023)

rsLoRA addresses a limitation of vanilla LoRA: the scaling factor \(\alpha/r\) can cause gradient magnitudes to collapse as rank increases, which makes fine-tuning at high ranks ineffective in practice. rsLoRA fixes this issue by using a rank-stabilized scaling factor \(\alpha/\sqrt{r}\) that keeps gradient magnitudes stable at higher ranks, enabling higher-rank adapters to be used for increased performance on complex tasks without additional inference cost.

Quick Facts 

rsLoRA only changes the scaling factor that’s used in LoRA, changing it from \(\alpha/r\) to \(\alpha/\sqrt{r}\).
rsLoRA enables stable fine-tuning at higher ranks, which can increase performance for complex tasks.
rsLoRA has the same inference cost as LoRA.

Algorithmic Idea 

The core insight behind rsLoRA is that vanilla LoRA’s scaling factor can cause the gradient magnitude to collapse as rank increases. This can prevent effective fine-tuning at higher ranks. rsLoRA’s rank-stabilized scaling factor enables higher-rank adapters to be used for increased performance on complex tasks without additional inference cost.

During fine-tuning, the \(\alpha/\sqrt{r}\) scaling factor in rsLoRA keeps gradient magnitudes consistent across ranks, so the extra capacity of a higher-rank adapter can be used to capture finer detail on nuanced and complex tasks.

Key Equations 

The rsLoRA adapter modifies the pre-trained weight matrix \(\mathbf{W}_0 \in \mathbb{R}^{d \times k}\) as:

\[\mathbf{W}' = \mathbf{W}_0 + \frac{\alpha}{\sqrt{r}} \mathbf{B} \mathbf{A}\]

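where \(\mathbf{B} \in \mathbb{R}^{d \times r}\) and \(\mathbf{A} \in \mathbb{R}^{r \times k}\), so that the product \(\mathbf{B}\mathbf{A}\) has the same \(d \times k\) shape as \(\mathbf{W}_0\), with \(r \ll \min(d,k)\).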
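The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration rather than FinLoRA's implementation: the function names `matmul` and `rslora_merge` are hypothetical, matrices are represented as lists of rows, and \(\mathbf{B}\) is taken as \(d \times r\) and \(\mathbf{A}\) as \(r \times k\) so that \(\mathbf{B}\mathbf{A}\) matches the shape of \(\mathbf{W}_0\).

```python
import math


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def rslora_merge(W0, B, A, alpha):
    """Return W0 + (alpha / sqrt(r)) * B @ A.

    Hypothetical helper for illustration only.
    W0 is d x k, B is d x r, A is r x k; r is read from A's row count.
    """
    r = len(A)
    if any(len(row) != r for row in B):
        raise ValueError("B must have r columns, matching A's r rows")
    if len(W0) != len(B) or len(W0[0]) != len(A[0]):
        raise ValueError("B @ A must have the same shape as W0")
    scale = alpha / math.sqrt(r)  # rank-stabilized scaling factor
    BA = matmul(B, A)
    return [[w + scale * delta for w, delta in zip(w_row, d_row)]
            for w_row, d_row in zip(W0, BA)]


# Tiny example with d = 2, k = 3, r = 4 and alpha = 2, so alpha / sqrt(r) = 1.
W0 = [[1, 0, 0], [0, 1, 0]]
B = [[1, 0, 0, 0], [0, 1, 0, 0]]
A = [[1, 2, 3], [4, 5, 6], [0, 0, 0], [0, 0, 0]]
merged = rslora_merge(W0, B, A, alpha=2)
print(merged)
```

The merged matrix has the same shape as \(\mathbf{W}_0\), which is why the adapter adds no inference cost once it is folded into the base weights.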

The key theoretical result is that adapters are rank-stabilized only when the rank-dependent scaling factor \(\gamma_r\) satisfies:

\[\gamma_r \in \Theta\left(\frac{1}{\sqrt{r}}\right)\]

This ensures that both forward activations and backward gradients maintain \(\Theta(1)\) magnitude regardless of rank \(r\).

Implementation in FinLoRA 

FinLoRA automatically uses rsLoRA when the use_rsLoRA parameter is enabled in the configuration.

Key Parameters:

\(r\) (lora_rank): Adapter rank; can be set higher than is practical with vanilla LoRA for better performance
\(\alpha\) (lora_alpha): Scaling parameter, typically set to 16 or 32
use_rsLoRA: Boolean flag to enable rank-stabilized scaling

Usage Example 

Enable rsLoRA in your FinLoRA configuration:

# Enable rsLoRA with higher rank for complex tasks
use_rsLoRA: true
lora_rank: 64        # Higher ranks work effectively with rsLoRA
lora_alpha: 16
quant_bits: 8

# rsLoRA particularly beneficial for complex financial tasks
dataset: "xbrl_extract_train.jsonl"
model_name: "meta-llama/Llama-3.1-8B-Instruct"

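With these settings the scaling factor is \(\gamma_r = \alpha/\sqrt{r} = 16/\sqrt{64} = 2.0\), compared with \(\alpha/r = 16/64 = 0.25\) under vanilla LoRA. The larger, rank-stabilized factor is what keeps gradients from shrinking at higher ranks.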
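The arithmetic for this configuration can be checked with a small helper. This is an illustrative sketch, not FinLoRA's or Axolotl's actual code: the function name `adapter_scale` is hypothetical, and only the key names `use_rsLoRA`, `lora_rank`, and `lora_alpha` are taken from the configuration shown here.

```python
import math


def adapter_scale(config):
    """Return the adapter scaling factor implied by a config dictionary.

    Hypothetical helper for illustration; the key names mirror the
    configuration in this section, but this is not FinLoRA's code.
    """
    r = config["lora_rank"]
    alpha = config["lora_alpha"]
    if r <= 0:
        raise ValueError("lora_rank must be positive")
    if config.get("use_rsLoRA", False):
        return alpha / math.sqrt(r)  # rank-stabilized: alpha / sqrt(r)
    return alpha / r                 # vanilla LoRA: alpha / r


base = {"lora_rank": 64, "lora_alpha": 16}
rs_scale = adapter_scale({**base, "use_rsLoRA": True})
vanilla_scale = adapter_scale({**base, "use_rsLoRA": False})
print(rs_scale, vanilla_scale)  # prints: 2.0 0.25
```

At rank 64 the two conventions differ by a factor of \(\sqrt{64} = 8\), and the gap widens as the rank grows.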

References 

Why This Method?

rsLoRA keeps gradients stable at higher ranks, so increasing the adapter rank can improve performance on complex tasks without additional inference cost. It is particularly valuable for financial NLP tasks where a higher-capacity adapter can capture nuanced domain-specific patterns.

Useful Links 

rsLoRA Technical Blog - Written by the rsLoRA paper’s author
Axolotl - Training framework with rsLoRA support used in FinLoRA