Definition
LoRA, short for Low-Rank Adaptation, is a parameter-efficient fine-tuning technique introduced by Hu et al. in 2021. Instead of updating the billions of weights in a large model, LoRA freezes the original weights and trains a tiny pair of low-rank matrices for each layer. The insight behind it is that the change a model needs during adaptation can be approximated by a much smaller, lower-rank update, so you do not have to touch the full weight matrices at all.
How LoRA Works
For a frozen weight matrix W, LoRA learns two small matrices, A and B, and represents the adaptation as their product (W becomes W plus B times A). Because the rank of these matrices is far smaller than the original dimensions, the number of trainable parameters can drop by orders of magnitude, sometimes by a factor of ten thousand, while retaining accuracy comparable to full fine-tuning. The memory footprint shrinks accordingly.
Why It Matters for Local Operators
LoRA makes customization practical on modest, self-owned hardware. You can fine-tune a large model on a single consumer GPU, and because the learned matrices can be merged back into the original weights after training, LoRA adds no inference latency once deployed. The resulting adapter file is also small enough to share and swap freely, letting you maintain a library of task-specific personalities for one base model.
LoRA is the leading form of efficient Fine-Tuning, and its small adapters are ideal for customizing a privately hosted Local LLM.
In Simple Terms
LoRA, short for Low-Rank Adaptation, is a parameter-efficient fine-tuning technique introduced by Hu et al. in 2021. Instead of updating the billions of weights in…
