Feature/add frod#3270
Conversation
BenjaminBossan
left a comment
There was a problem hiding this comment.
Thanks for the PR to add FRoD. This looks already quite good. In this review, I focused on the integration, so I haven't checked the docs, examples, and experiments yet.
|
|
||
|
|
||
| @dataclass | ||
| class FRODConfig(PeftConfig): |
There was a problem hiding this comment.
Let's go with FrodConfig etc. It's easier to type and more consistent overall (e.g. we use LoraConfig, not LoRAConfig).
There was a problem hiding this comment.
Updated to FrodConfig, FrodModel, and FrodLayer to match PEFT naming conventions. I kept PeftType.FROD uppercase because it follows the enum naming style.
| self.merged_adapters = [] | ||
|
|
||
| base_layer = self.get_base_layer() | ||
| if isinstance(base_layer, nn.Linear): |
There was a problem hiding this comment.
Let's use peft.tuners.tuners_utils._get_in_out_features.
There was a problem hiding this comment.
Updated FrodLayer to use peft.tuners.tuners_utils._get_in_out_features() instead of manually handling nn.Linear and Conv1D.
| self.kwargs = kwargs | ||
|
|
||
| @property | ||
| def merged(self) -> bool: |
There was a problem hiding this comment.
Removed the duplicate merged property.
| init_weights, | ||
| ): | ||
| base_layer = self.get_base_layer() | ||
| weight = base_layer.weight.T if isinstance(base_layer, Conv1D) else base_layer.weight |
There was a problem hiding this comment.
Let's use peft.utils.other.transpose.
There was a problem hiding this comment.
Updated this to use transpose(base_layer.weight, self.fan_in_fan_out).
| import warnings | ||
| from collections import defaultdict | ||
|
|
||
| import numpy as np |
There was a problem hiding this comment.
Again, I would prefer to use torch throughout instead of numpy.
There was a problem hiding this comment.
Updated the FRoD initialization path to use torch.linalg throughout and removed the NumPy dependency.
| ) | ||
| if config.save_projection and not has_projection: | ||
| raise ValueError( | ||
| "Specified to load FRoD projection tensors from state dictionary however they were not present!" |
There was a problem hiding this comment.
It would be good to add instructions on how to deal with that situation.
There was a problem hiding this comment.
Added instructions to the error message: either load with save_projection=False to regenerate the projections from the base model weights, or re-save the adapter with save_projection=True.
| == mlp_same_prng.base_model.model.lin2.frod_s_indices["other"].data_ptr() | ||
| ) | ||
|
|
||
| def test_multiple_adapters_different_prng_raises(self): |
There was a problem hiding this comment.
As mentioned above, let's put this into test_initializaton.py.
There was a problem hiding this comment.
Added this coverage to tests/test_initialization.py using the same structure as TestVeraInitialization.
| return peft_model | ||
|
|
||
| @staticmethod | ||
| def _make_second_adapter_different(peft_model): |
There was a problem hiding this comment.
Shouldn't this be unnecessary with init_weigths=False?
There was a problem hiding this comment.
It's really helpful. Removed the helper and its call; init_weights=False already makes the two adapters different for this test.
| ("Vanilla MLP 2 FRoD", "MLP", FRODConfig, {"target_modules": ["lin0"]}), | ||
| ("Vanilla MLP 3 FRoD", "MLP", FRODConfig, {"target_modules": ["lin1"]}), | ||
| ("Vanilla MLP 4 FRoD", "MLP", FRODConfig, {"target_modules": ["lin0", "lin1"]}), | ||
| ("Vanilla MLP 5 FRoD", "MLP", FRODConfig, {"target_modules": ["lin0"], "modules_to_save": ["lin1"]}), |
There was a problem hiding this comment.
Could you please also add FRoD to MULTIPLE_ACTIVE_ADAPTERS_TEST_CASES?
There was a problem hiding this comment.
Added FRoD to MULTIPLE_ACTIVE_ADAPTERS_TEST_CASES with the same-target adapter case.
|
Updated the PR in 0e0d816 to address the review feedback:
|
|
The two new commits fix the FRoD sparse forward semantics and align the image-classification example with the CLIP-ViT setup used in our experiments. I also re-tested the examples with two examples: the Stanford Cars image-classification example and the SST-2 text-classification example. The CLIP-ViT Stanford Cars run converged as expected in the local 3-epoch test. Thank you for the review and suggestions. Looking forward to your feedback. |
BenjaminBossan
left a comment
There was a problem hiding this comment.
Thanks for the updates. I did another review focusing on the integration and still found a handful of issues, please check.
| return result | ||
|
|
||
| def __repr__(self) -> str: | ||
| # Match PEFT tuner convention so printed models show FRoD-wrapped layers as `frod.*`. |
There was a problem hiding this comment.
This comment can be removed.
There was a problem hiding this comment.
Removed the unnecessary comment.
| weight = self.get_base_layer().weight | ||
| return self._get_delta_weight(adapter) | ||
|
|
||
| def _get_delta_weight(self, adapter, base_weight: Optional[torch.Tensor] = None) -> torch.Tensor: |
There was a problem hiding this comment.
I don't see why we need _get_delta_weight with the option of passing base_weight. AFAICT, it's always self.get_base_layer().weight, so there is never a need to pass the argument explicitly.
There was a problem hiding this comment.
Simplified get_delta_weight() to always use the base layer weight. merge() now precomputes adapter deltas before mutating the base weight, so the explicit base_weight argument is no longer needed.
| values = self.frod_lambda_s_values[active_adapter].to(device=x.device, dtype=target_dtype) | ||
| lambda_l = self.frod_lambda_l[active_adapter].to(device=x.device, dtype=target_dtype) | ||
|
|
||
| x = x.to(target_dtype) |
There was a problem hiding this comment.
target_dtype is defined as x.dtype, so this call is useless.
There was a problem hiding this comment.
Removed the redundant cast.
| # F.linear(h, U @ (S + diag(lambda_l)) @ V.T). | ||
| # CUDA sparse fp16/bf16 kernels are less reliable, so use fp32 here and cast the update back below. | ||
| matmul_dtype = z_flat.dtype | ||
| if z_flat.is_cuda and matmul_dtype in (torch.float16, torch.bfloat16): |
There was a problem hiding this comment.
Just wondering: Why is this limited to is_cuda? Wouldn't we expect the same issue on other devices?
There was a problem hiding this comment.
Changed the fp16/bf16 sparse matmul fallback to use fp32 regardless of device.
| base_weight = transpose(self.get_base_layer().weight, self.fan_in_fan_out).to( | ||
| device=x.device, dtype=target_dtype | ||
| ) | ||
| base_out = F.linear(x, base_weight) | ||
|
|
||
| result = result - base_out + out_add |
There was a problem hiding this comment.
This part I don't understand: For each FRoD adapter, we need to remove the base result from result? So my understanding is that the FRoD result already includes the result from the base model, so we need to subtract it to prevent it being included multiple times.
However, this seems very wasteful: First we calculate result = self.base_layer(x, *args, **kwargs), then we calculate base_out = F.linear(x, base_weight), which is basically the same thing (just without bias) and remove it from the result. And then repeat it for each active adapter.
What I would like to see instead is for the FRoD result to only contain the "delta" from FRoD without the base result, so that we can simply accumulate it. This would be the cleanest implementation IMO. It would also help to simplify get_delta_weight, where there is a similar logic to remove the base weight from the FRoD weight.
If, for some reason, it's not possible, we should instead count how many active adapters we have. Then we calculate the base result only once and remove it x times, once for each active adapter. This should be a lot cheaper than the current implementation.
There was a problem hiding this comment.
Updated the forward pass so the FRoD branch explicitly computes and accumulates only the adapter delta. As described in the paper, FRoD directly reconstructs the adapted weight through the joint matrix factorization, so the base-weight contribution must be subtracted; otherwise the base model contribution would be included twice. The code now computes out_add - adapter_base_out and reuses the base output when dropout is identity.
BenjaminBossan
left a comment
There was a problem hiding this comment.
Thanks for the new updates. The integration is getting there, but there were still some unclear parts of the code that I commented. Please take a look.
| self.frod_V[adapter_name] = frod_V[adapter_name] | ||
| self.frod_s_indices[adapter_name] = frod_s_indices[adapter_name] | ||
| self.frod_s_size[adapter_name] = frod_s_size[adapter_name] |
There was a problem hiding this comment.
I don't think this can be deleted. This follows the same pattern as VeRA: the shared projection buffers are created at the tuner level and then referenced by each wrapped layer. In VeRA, update_layer() assigns the shared vera_A / vera_B buffers to the layer; here FRoD attaches the shared category-level frod_V, frod_s_indices, and frod_s_size entries to the layer-local non-persistent buffers. These are shared references used by forward() and get_delta_weight(), not independent persistent copies.
There was a problem hiding this comment.
What I mean is that here, we add the three parameters, and then just below, we add them again and cast them. Why not do this in a single step?
| if active_adapter not in self.frod_lambda_s_values: | ||
| continue | ||
|
|
||
| V = self.frod_V[active_adapter].to(device=x.device, dtype=target_dtype) |
There was a problem hiding this comment.
The code here is partly exactly the same as in get_delta_weight. How about factoring out the common parts into a dedicated method?
There was a problem hiding this comment.
There is some overlap in fetching/casting the FRoD tensors, but the two paths intentionally do different computations. get_delta_weight() materializes the dense delta weight for merge/unmerge, while forward() keeps the sparse activation-side computation and avoids constructing the full dense matrix.
I could factor out only the tensor-fetching part, but that helper would mostly return a long tuple of tensors and may make the forward path harder to read. I would prefer to keep the dense merge path and sparse forward path separate unless you think the small helper would still be clearer.
There was a problem hiding this comment.
Yes, I think it would be better to have a small helper function that takes the device and dtype and returns the relevant tensors, up to the sparse tensor (before making it dense). Not only would it help to keep the two code paths in sync, IMO it also makes it easier to see what is shared between the merging and the forward path.
There was a problem hiding this comment.
Factored the shared tensor fetching/casting and sparse tensor construction into _get_frod_tensors(), while keeping the dense merge path and sparse forward path separate.
|
@BenjaminBossan Thanks for the review! I have addressed the requested changes and re-requested your review. Could you please take another look when you have time? |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
BenjaminBossan
left a comment
There was a problem hiding this comment.
Thanks for the new updates. For this review, I focused on these changes and the comments that were still open. Generally, my questions have been resolved by I still have a few comments.
Interestingly, one single FRoD test failed in one of the Windows CI runs. I restarted that run, in case it's just a flaky result, but if it fails again, we should investigate.
One more thing: We now added an image generation benchmark to PEFT: https://github.com/huggingface/peft/tree/main/method_comparison/image-gen. The idea is exactly the same as for the MetaMath benchmark. It would be nice if you could add an experiment for FRoD to this benchmark. Check the existing experimental settings for guidance.
| self.frod_V[adapter_name] = frod_V[adapter_name] | ||
| self.frod_s_indices[adapter_name] = frod_s_indices[adapter_name] | ||
| self.frod_s_size[adapter_name] = frod_s_size[adapter_name] |
There was a problem hiding this comment.
What I mean is that here, we add the three parameters, and then just below, we add them again and cast them. Why not do this in a single step?
| V = self.frod_V[adapter].to(device=device, dtype=dtype) | ||
| indices = self.frod_s_indices[adapter].to(device=U.device, dtype=torch.long) | ||
| size = tuple(int(dim) for dim in self.frod_s_size[adapter].tolist()) | ||
| values = self.frod_lambda_s_values[adapter].to(U.device, U.dtype).clone() |
There was a problem hiding this comment.
I'm wondering if we really need the .clone here.
There was a problem hiding this comment.
Removed the unnecessary .clone() since the sparse values are not modified in place.
| if active_adapter not in self.frod_lambda_s_values: | ||
| continue | ||
|
|
||
| V = self.frod_V[active_adapter].to(device=x.device, dtype=target_dtype) |
There was a problem hiding this comment.
Yes, I think it would be better to have a small helper function that takes the device and dtype and returns the relevant tensors, up to the sparse tensor (before making it dense). Not only would it help to keep the two code paths in sync, IMO it also makes it easier to see what is shared between the merging and the forward path.
|
Updated in 740f432. This addresses the remaining FrodLayer feedback by simplifying the shared buffer assignment, adding the merge-path comment, removing the unnecessary clone, and factoring the shared FRoD tensor preparation into I checked the previous PR head before pushing this update, and all 10 checks were successful, including the Windows jobs. Could you please let me know if the Windows CI issue still appears on your side after this new run starts? I also added the FRoD experiment settings for the new image generation benchmark under |
BenjaminBossan
left a comment
There was a problem hiding this comment.
Thanks for the recent changes, they look good. However, ruff is complaining, so could you please run make style?
I have another suggestion: In PR #3117, we recently added the foundation for generic quantization support. What that means is that with only a few changes, it should be possible for most PEFT methods to support multiple quantization backends. In other words, FRoD could support bitsandbytes, torchao, and gptqmodel without the need to add extra FRoD layer classes for each of them.
To achieve that, use the changes in the linked PR as an example and apply the same idea to FRoD. If you run into any difficulties, please let me know. This feature is not a must have, if you don't want to support in FRoD, it's fine. But I would highly recommend to add it, as many users are interested in fine-tuning quantized models.
|
Applied |
|
Thanks for the suggestion. I agree that generic quantization support is an important direction, and PR #3117 provides a good shared path for methods that naturally operate on quantized base layers. For FRoD, I think the quantization design needs a separate and more careful treatment. Unlike LoRA-style methods, FRoD does not fundamentally need to keep using the original base weight after initialization. Instead, the target weight is reparameterized through To avoid making this PR larger, I would prefer to keep this PR focused on adding the core FRoD method. I can investigate the quantization design separately, and if it works well, open a follow-up PR specifically for FRoD quantization support. |
BenjaminBossan
left a comment
There was a problem hiding this comment.
Thanks for the new updates, the tests are passing (one flaky CI run can be ignored).
Regarding the forward path, I think I found one possible optimization, please check my comment. I also checked the examples and added some comments regarding those.
About the quantization: Thanks for explaining why it's not trivially supported here. Let's drop it for now. If you could revisit it later, that would be great.
| MODEL_NAME = os.environ.get("FROD_IMAGE_MODEL_NAME", "openai/clip-vit-base-patch32") | ||
| OUTPUT_DIR = os.environ.get("FROD_IMAGE_OUTPUT_DIR", "clip-vit-base-patch32-frod-stanford-cars") | ||
| DATA_DIR = os.environ.get("FROD_STANFORD_CARS_DATA_DIR") | ||
| NUM_TRAIN_EPOCHS = int(os.environ.get("FROD_IMAGE_NUM_TRAIN_EPOCHS", "3")) | ||
| TRAIN_BATCH_SIZE = int(os.environ.get("FROD_IMAGE_TRAIN_BATCH_SIZE", "64")) | ||
| EVAL_BATCH_SIZE = int(os.environ.get("FROD_IMAGE_EVAL_BATCH_SIZE", "64")) | ||
| SPARSE_RATE = float(os.environ.get("FROD_IMAGE_SPARSE_RATE", "0.01")) | ||
| FROD_LAMBDA_L_LR = float(os.environ.get("FROD_IMAGE_LAMBDA_L_LR", "5e-4")) | ||
| FROD_LAMBDA_S_LR = float(os.environ.get("FROD_IMAGE_LAMBDA_S_LR", "5e-5")) | ||
| CLASSIFIER_LR = float(os.environ.get("FROD_IMAGE_CLASSIFIER_LR", "1e-4")) | ||
| CLIP_TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"] |
There was a problem hiding this comment.
I'm not a fan of setting all of these as environment variables. Let's please parse them as proper command line arguments. You could use HfArgumentParser(TrainingArguments) for the Trainer arguments and parse the rest manually.
There was a problem hiding this comment.
Replaced the environment-variable based configuration with HfArgumentParser-backed CLI arguments.
| OUTPUT_DIR = "bert-base-uncased-frod-sst2" | ||
| FROD_LAMBDA_L_LR = 2e-2 | ||
| FROD_LAMBDA_S_LR = 2e-3 | ||
| CLASSIFIER_LR = 1e-2 |
There was a problem hiding this comment.
Similar argument, let's add a proper CLI.
There was a problem hiding this comment.
Applied the same CLI structure to the text example as well.
|
|
||
| generator = torch.Generator(device="cpu").manual_seed(config.projection_prng_key) | ||
| categories = {category for layer_dict in weights.values() for category in layer_dict} | ||
| for category in sorted(categories): |
There was a problem hiding this comment.
Running through this loop can be quite lengthy, without any indication to the user of what's happening. Usually, calling get_peft_model is quite fast, so this could come as a surprise. WDYT about adding some progress indication, e.g. using tqdm for the loop? You could add a progressbar: bool argument to enable/disable it. I would also document that this can be slow.
There was a problem hiding this comment.
Thank you very much for your suggestion. We need to perform Hierarchical Joint Decomposition on the matrices that require fine-tuning, which is very time-consuming. However, the good news is that the result of this Hierarchical Joint Decomposition is measured by the model, so fine-tuning a model for different tasks only requires one decomposition. We will use tqdm for the loop and document it.
There was a problem hiding this comment.
I tried to run this experiment on my machine with 24 GB and got OOM. The other tested PEFT methods usually stay below 12 GB. Do you think this is expected? For the final run, we'll use 48 GB cards, so the experiment may succeed there, but it could still be an indication that there is room for improvement. One suspicion I have is that float32 is used for the computation, not sure if that explains all of it.
There was a problem hiding this comment.
This is a very serious bug, not just an issue with fp32, but also a problem encountered when using sparse matrices. I will fix this bug as soon as possible.
Applied the suggestion so eval-mode dropout reuses the shared base output as well. Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Applied this by deriving `base_out` from `result` and lazily constructing `base_weight` only when dropout is active during training. Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
|
@BenjaminBossan After investigating the unexpectedly high memory usage, we found a few implementation issues in the current FRoD code. This OOM was not expected and was caused by a serious implementation bug, not only by fp32 computation. First, the shared FRoD projection buffers were not actually shared on GPU. Although Second, we re-tested the sparse forward path in both fp32 and fp16/bf16. In our supported setup, fp16/bf16 was not less reliable, so forcing the sparse path through fp32 was unnecessary and added avoidable memory overhead. Third, the forward path was still too close to a PEFT delta-style implementation. FRoD semantically reconstructs the adapted weight rather than adding a LoRA-style delta, so the single-adapter/no-dropout active path does not need the target base linear output. Keeping target base weights on GPU in that path added unnecessary memory pressure, although we still need to preserve PEFT behavior for disable/merge/unmerge. I updated the implementation accordingly. I also updated the sparse forward path to compute the COO activation update directly in the requested dtype when sparse addmm is unavailable, instead of forcing an unconditional fp32 path. For the base-weight memory issue, I added an opt-in Could you please retry |
BenjaminBossan
left a comment
There was a problem hiding this comment.
Thanks for implementing the feedback and for investigating the cause of the memory issue.
Unfortunately, when I tried to re-run the image generation benchmark, I still got OOM. Most other PEFT methods manage to stay below 12 GB. Can you replicate this issue? Note that for image generation, we have enabled gradient checkpointing, just mentioning it in case it matters. The MetaMath benchmark also OOMs with 24 GB, but that one skirts much closer to that number, so it doesn't necessarily mean something is wrong there.
| self.__dict__["frod_V"] = frod_V | ||
| self.__dict__["frod_s_indices"] = frod_s_indices | ||
| self.__dict__["frod_s_size"] = frod_s_size |
There was a problem hiding this comment.
Hmm, I wonder why we need __dict__ assignment here but not for other PEFT methods like VeRA?
| "projection_prng_key": 0, | ||
| "regularization_alpha": 0.001, | ||
| "revision": null, | ||
| "runtime_offload_base_weight": true, |
There was a problem hiding this comment.
It would make sense to add this option to the MetaMath experiments too.
| if merge: | ||
| self.merge(safe_merge=safe_merge, adapter_names=adapter_names) | ||
| self._move_base_weight_to_device_of_adapter(self.active_adapters[0] if self.active_adapters else None) | ||
| else: | ||
| self._move_base_weight_to_device_of_adapter(self.active_adapters[0] if self.active_adapters else None) |
There was a problem hiding this comment.
Should do the same, right?
| if merge: | |
| self.merge(safe_merge=safe_merge, adapter_names=adapter_names) | |
| self._move_base_weight_to_device_of_adapter(self.active_adapters[0] if self.active_adapters else None) | |
| else: | |
| self._move_base_weight_to_device_of_adapter(self.active_adapters[0] if self.active_adapters else None) | |
| if merge: | |
| self.merge(safe_merge=safe_merge, adapter_names=adapter_names) | |
| self._move_base_weight_to_device_of_adapter(self.active_adapters[0] if self.active_adapters else None) |
Also, please add a comment why _move_base_weight_to_device_of_adapter needs to be called.
| self._move_base_weight_to_device(x.device) | ||
| result = self.base_layer(x, *args, **kwargs) | ||
| elif self.merged: | ||
| self._move_base_weight_to_device(x.device) |
There was a problem hiding this comment.
Why are the _move_base_weight_to_device calls needed?
| ) | ||
| }, | ||
| ) | ||
| runtime_offload_base_weight: bool = field( |
There was a problem hiding this comment.
I like the idea of adding this as an option. However, it wouldn't make sense to have multiple adapters on the same base model where some have runtime_offload_base_weight=True and others have False, right? I think we should enforce the same value for all.
The best way to achieve this is to override _check_new_adapter_config on FrodModel. This check comes from the parent class, so don't forget to call super(). Then, if there are multiple configs with different runtime_offload_base_weight values, an error should be raised. Add a test to tests/test_initialization.py.
|
Thanks for re-running this and for pointing out the gradient checkpointing detail. I could reproduce the unexpectedly high memory behavior with the previous implementation, and it was not expected. I found and fixed a few FRoD-specific memory issues. First, the shared FRoD projection buffers ( I re-ran the image generation benchmark with the existing benchmark script/settings, without disabling gradient checkpointing. The FRoD config I tested was:
After the fix, I could not reproduce the training OOM. The benchmark-reported training memory was:
For comparison, I also ran LoRA One caveat: if I watch |
BenjaminBossan
left a comment
There was a problem hiding this comment.
Thank you for investigating further to fix the memory issue. I can confirm that the benchmarks now both run on my machine.
One issue I noticed, however, is that the benchmarks run extremely slowly. Did you observe the same behavior? E.g. for the image benchmark, it takes ~600 sec for 100 steps to train and ~100 sec for the eval step. As a point of comparison, LoRA requires ~70 sec and ~13 sec, respectively.
I checked if it could be related to offloading, but after disabling it, the picture was the same. Then I changed the chunk size, e.g. by increasing it 8x. This might have had a small effect but at the cost of higher memory. If you have time, it would be great if you could investigate once more if there is something obviously wrong in the PEFT integration compared to your original code. If not, it's also okay to leave it as is, saving memory is more important for PEFT than good runtimes.
| assert torch.allclose(z.grad, z_expected.grad, atol=1e-3, rtol=1e-3) | ||
|
|
||
| @pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA is not available") | ||
| def test_frod_runtime_offload_keeps_base_weight_on_cpu_after_cuda(self, mlp): |
There was a problem hiding this comment.
As this test requires a GPU to run, could you please move it to tests/test_gpu_examples.py?
| class FrodLayer(BaseTunerLayer): | ||
| adapter_layer_names = ("frod_lambda_l", "frod_lambda_s_values") | ||
| other_param_names = ("frod_V", "frod_U", "frod_s_indices", "frod_s_size", "runtime_offload_base_weight") | ||
| _frod_sparse_chunk_size = 16_384 |
There was a problem hiding this comment.
Would it make sense to make this a config parameter for users to tune? It should mention the tradeoffs when increasing/decreasing that value.
| args=training_args, | ||
| train_dataset=tokenized["train"], | ||
| eval_dataset=tokenized["validation"], | ||
| tokenizer=tokenizer, |
There was a problem hiding this comment.
| tokenizer=tokenizer, | |
| processing_class=tokenizer, |
|
I investigated this and found that the slowdown came from the custom chunked COO activation path, not from base-weight offloading. The chunked implementation reduced temporary memory, but it used Python-level I changed the fp16/bf16 sparse activation path to densify only the sparse |
|
I investigated the runtime issue and found that the slowdown came from the custom chunked COO activation path, not from base-weight offloading. I replaced that path with a faster implementation that temporarily densifies only the sparse I pushed the fix in I also re-ran the relevant checks: |
BenjaminBossan
left a comment
There was a problem hiding this comment.
@Bane-Elvin Thank you for your further experiments. I can confirm that FRoD now runs fast and in a memory efficient way. The fact that the code is even simplified is a nice bonus. From my side, there is nothing more to add and we can go ahead and merge.
Just one more thing that we try to test out: Would you like to provide a tweet-sized blurb for FRoD? Something we can put into the release notes and that allows the interested user to quickly understand when they should use FRoD. So basically: What it is, what it's especially good at, when not to use it.
|
@BenjaminBossan Thank you very much for the careful review, the additional benchmarking, and all the valuable suggestions. They helped uncover several important implementation issues and made the final FRoD integration both simpler and much more practical. Tweet-sized blurb: FRoD is a full-rank, replacement-style PEFT method from FRoD: Full-Rank Efficient Fine-Tuning with Rotational Degrees for Fast Convergence. Instead of adding low-rank deltas, it reconstructs selected weights with shared rotational subspaces and sparse trainable coefficients. It is especially useful when fast convergence and a higher full-rank capacity ceiling are important, and its large sparse rotational subspace may also be promising for model merging. The main tradeoffs are the costly joint-decomposition initialization and slightly slower forward/backward passes than LoRA due to the sparse structured factors, so it may be less attractive for a one-off single-task fine-tune. |
|
Thanks a lot @Bane-Elvin for this great work and your patience with the reviews. |
Summary
This PR implements the FRoD PEFT tuner proposed in #3244.
Examples
google-bert/bert-base-uncasedandnyu-mll/glue.google/vit-base-patch16-224andtanganke/stanford_cars.Tests
PYTHONPATH=src python -m ruff check ...PYTHONPATH=src python -m py_compile examples/frod_finetuning/*.pyPYTHONPATH=src python -m pytest -o addopts='' tests/test_decoder_models.py tests/test_encoder_decoder_models.py tests/test_feature_extraction_models.py tests/test_seq_classifier.py tests/test_custom_models.py tests/test_frod.py -q -k "FROD or frod"PYTHONPATH=src python -m pytest -o addopts='' tests/test_config.py -q -k "FROD or frod"