If you want to build an application with a language model today, there are many powerful open-source models to choose from. There’s really no need to build one from scratch – you can start with a capable pre-trained model and fine-tune it to your specific task. That’s not to say this is easy.
The most straightforward approach is full fine-tuning, where a pre-trained model is tuned on task-specific data, with all the model weights being updated. This works well, but it’s costly. Many of today’s models have tens of billions of parameters, and updating all of them requires a significant amount of compute and energy, making full fine-tuning impractical in many cases.
Another option is known as low-rank adaptation, or LoRA, a parameter-efficient fine-tuning method. Instead of updating every weight, LoRA freezes the original model and injects a small set of trainable matrices alongside it. These matrices can approximate the adaptations full fine-tuning would make, but at a fraction of the cost. For good reason, LoRA has become one of the most widely used fine-tuning methods.
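The idea can be sketched in a few lines of Python. This is a minimal illustrative example with made-up dimensions, not the researchers' code: the pretrained weight W stays frozen, and only two small factors, A and B, are trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 4

W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable, rank x d_in
B = np.zeros((d_out, rank))                   # trainable, zero-initialized so B @ A = 0

def forward(x):
    # Output = frozen path plus the low-rank correction B @ (A @ x)
    return W @ x + B @ (A @ x)

x = rng.standard_normal(d_in)
# At initialization the adapter contributes nothing: forward(x) == W @ x
assert np.allclose(forward(x), W @ x)
```

Because B starts at zero, fine-tuning begins exactly at the pretrained model, and only the small factors (far fewer numbers than W itself) ever receive gradient updates.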
And yet, LoRA isn’t quite as good as full fine-tuning when it comes to performance. This is particularly true when LoRA is performed at low ranks where few parameters are trained. Raising the rank improves performance, but it also raises the cost.
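The rank-versus-cost trade-off is simple arithmetic. For a single weight matrix of size d_out by d_in, full fine-tuning updates every entry, while LoRA trains only rank * (d_in + d_out) parameters. The layer dimension below is a hypothetical example:

```python
d_in = d_out = 4096                  # a hypothetical transformer layer dimension

full = d_in * d_out                  # entries updated by full fine-tuning
lora = lambda r: r * (d_in + d_out)  # LoRA trains A (r x d_in) and B (d_out x r)

print(full)      # 16,777,216 parameters per matrix under full fine-tuning
print(lora(4))   # 32,768 trainable parameters at rank 4
print(lora(16))  # 131,072 at rank 16 -- quadrupling the rank quadruples the adapter
```

This is why raising the rank improves quality but erodes LoRA's efficiency advantage.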
A new study from researchers at MBZUAI and Amazon Science, UK, proposes a method to close this gap. It's called LoFT (low-rank adaptation that behaves like fine-tuning), and the researchers will present their findings at the 14th International Conference on Learning Representations (ICLR) in Rio de Janeiro.
Most past efforts to improve LoRA have focused on what's known as gradient approximation: because LoRA works in a compressed subspace, it loses some information when computing the gradients that determine weight updates. The assumption was that better gradient estimates would translate into better performance.
Nurbek Tastan, a third-year doctoral student in machine learning at MBZUAI and first author of the study presented at ICLR, says that he and his co-authors suspected that gradients were only part of the story. The co-authors of the study are Stefanos Laskaridis, Martin Takáč, Karthik Nandakumar, and Samuel Horváth.
The researchers’ insight relates to how the optimizer, the algorithm that executes weight updates during fine-tuning, actually works. For example, an optimizer known as AdamW that’s often used in fine-tuning, maintains running statistics about past gradients, which helps stabilize and accelerate learning. These statistics, called the first and second moments, accumulate over many steps and give the optimizer a sense of which direction and how far to move.
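The moment statistics described above can be sketched as follows. This is the standard AdamW update for a single parameter, simplified for illustration; it is not LoFT's calibrated variant:

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * g        # first moment: running average of gradients
    v = b2 * v + (1 - b2) * g * g    # second moment: running average of squared gradients
    m_hat = m / (1 - b1 ** t)        # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    # Decoupled weight decay: the "W" in AdamW
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):                # three steps with a constant positive gradient
    w, m, v = adamw_step(w, g=0.5, m=m, v=v, t=t)
```

Because m and v accumulate over many steps, they encode a history of the optimization trajectory, which is exactly the state that LoRA's compressed parameterization can distort.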
The researchers found that LoRA’s compressed structure creates a mismatch between these statistics and what full fine-tuning would otherwise compute. Past work trying to improve LoRA hadn’t accounted for this, Tastan says: “Previous works have investigated gradient scaling. We go one step further and calibrate the momentums.”
The result is LoFT, which is composed of five components: gradient scaling, alternating updates, optimizer state calibration, construction of a projected full fine-tuning update followed by low-rank projection, and projected full fine-tuning-aware clipping.
In addition to better gradient handling, LoFT introduces calibration matrices that correct the first and second moment estimates as the process evolves, keeping the optimizer’s internal dynamics aligned with what full fine-tuning would compute. The researchers show that when the rank is set high enough, LoFT exactly recovers the behavior of AdamW on the full model, which hadn’t been done before.
LoFT also eliminates the need to tune the scaling hyperparameter alpha, a practical challenge of working with LoRA. Tastan explains that getting alpha right typically requires running multiple training jobs, and the wrong value can cause unstable or degraded training. LoFT's design removes the need for these adjustments. "The moment you claim that you don't need to tune for a particular hyperparameter," Tastan says, "you are more efficient."
LoFT's gains were most pronounced in low-rank settings. On commonsense reasoning benchmarks using a relatively small model, LLaMA 7B, LoFT at rank 4 outperformed standard LoRA at rank 16. In fact, LoFT performed better than both LoRA and DoRA, another fine-tuning method, across all scales and rank configurations.
The researchers also experimented on much larger models. During the review process for the conference, the reviewers encouraged the team to demonstrate LoFT on a very large model, which typically isn’t done for low-rank adaptation methods. So the researchers tested it on LLaMA 70B. “Nobody has done that in the literature,” Tastan says. “We are the first, to the best of my knowledge.” LoFT’s advantages extended to this much larger model, even at rank one.
The next step for LoFT is to bring it into federated learning scenarios, where computational constraints can be even more significant than in standard fine-tuning. The ability to tune models at very low ranks, Tastan says, will be critical in those environments.