Feature/add frod by Bane-Elvin · Pull Request #3270 · huggingface/peft

Bane-Elvin · 2026-05-27T05:26:52Z

Summary

This PR implements the FRoD PEFT tuner proposed in #3244.

Add FRoD config, tuner, layer implementation, save/load integration, and PEFT exports.
Add Conv1D handling and ViT target-module mapping.
Add FRoD to PEFT test matrices.
Add package reference docs, runnable Trainer examples, and benchmark settings.

Examples

BERT on SST-2 with google-bert/bert-base-uncased and nyu-mll/glue.
ViT on Stanford Cars with google/vit-base-patch16-224 and tanganke/stanford_cars.

Tests

PYTHONPATH=src python -m ruff check ...
PYTHONPATH=src python -m py_compile examples/frod_finetuning/*.py
PYTHONPATH=src python -m pytest -o addopts='' tests/test_decoder_models.py tests/test_encoder_decoder_models.py tests/test_feature_extraction_models.py tests/test_seq_classifier.py tests/test_custom_models.py tests/test_frod.py -q -k "FROD or frod"
PYTHONPATH=src python -m pytest -o addopts='' tests/test_config.py -q -k "FROD or frod"

BenjaminBossan

Thanks for the PR to add FRoD. This looks already quite good. In this review, I focused on the integration, so I haven't checked the docs, examples, and experiments yet.

BenjaminBossan · 2026-05-27T09:59:04Z

+
+
+@dataclass
+class FRODConfig(PeftConfig):


Let's go with FrodConfig etc. It's easier to type and more consistent overall (e.g. we use LoraConfig, not LoRAConfig).

Updated to FrodConfig, FrodModel, and FrodLayer to match PEFT naming conventions. I kept PeftType.FROD uppercase because it follows the enum naming style.

BenjaminBossan · 2026-05-27T10:46:53Z

+        self.merged_adapters = []
+
+        base_layer = self.get_base_layer()
+        if isinstance(base_layer, nn.Linear):


Let's use peft.tuners.tuners_utils._get_in_out_features.

Updated FrodLayer to use peft.tuners.tuners_utils._get_in_out_features() instead of manually handling nn.Linear and Conv1D.

BenjaminBossan · 2026-05-27T10:47:28Z

+        self.kwargs = kwargs
+
+    @property
+    def merged(self) -> bool:


Can be removed.

Removed the duplicate merged property.

BenjaminBossan · 2026-05-27T10:48:42Z

+        init_weights,
+    ):
+        base_layer = self.get_base_layer()
+        weight = base_layer.weight.T if isinstance(base_layer, Conv1D) else base_layer.weight


Let's use peft.utils.other.transpose.

Updated this to use transpose(base_layer.weight, self.fan_in_fan_out).

BenjaminBossan · 2026-05-27T14:57:18Z

+import warnings
+from collections import defaultdict
+
+import numpy as np


Again, I would prefer to use torch throughout instead of numpy.

Updated the FRoD initialization path to use torch.linalg throughout and removed the NumPy dependency.

BenjaminBossan · 2026-05-27T15:00:43Z

+            )
+            if config.save_projection and not has_projection:
+                raise ValueError(
+                    "Specified to load FRoD projection tensors from state dictionary however they were not present!"


It would be good to add instructions on how to deal with that situation.

Added instructions to the error message: either load with save_projection=False to regenerate the projections from the base model weights, or re-save the adapter with save_projection=True.

BenjaminBossan · 2026-05-27T15:02:15Z

+            == mlp_same_prng.base_model.model.lin2.frod_s_indices["other"].data_ptr()
+        )
+
+    def test_multiple_adapters_different_prng_raises(self):


As mentioned above, let's put this into test_initializaton.py.

Added this coverage to tests/test_initialization.py using the same structure as TestVeraInitialization.

BenjaminBossan · 2026-05-27T15:03:22Z

+        return peft_model
+
+    @staticmethod
+    def _make_second_adapter_different(peft_model):


Shouldn't this be unnecessary with init_weigths=False?

It's really helpful. Removed the helper and its call; init_weights=False already makes the two adapters different for this test.

BenjaminBossan · 2026-05-27T15:05:28Z

+    ("Vanilla MLP 2 FRoD", "MLP", FRODConfig, {"target_modules": ["lin0"]}),
+    ("Vanilla MLP 3 FRoD", "MLP", FRODConfig, {"target_modules": ["lin1"]}),
+    ("Vanilla MLP 4 FRoD", "MLP", FRODConfig, {"target_modules": ["lin0", "lin1"]}),
+    ("Vanilla MLP 5 FRoD", "MLP", FRODConfig, {"target_modules": ["lin0"], "modules_to_save": ["lin1"]}),


Could you please also add FRoD to MULTIPLE_ACTIVE_ADAPTERS_TEST_CASES?

Added FRoD to MULTIPLE_ACTIVE_ADAPTERS_TEST_CASES with the same-target adapter case.

Bane-Elvin · 2026-05-28T10:08:35Z

Updated the PR in 0e0d816 to address the review feedback:

Renamed the public FRoD classes to follow PEFT naming conventions: FrodConfig, FrodModel, and FrodLayer.
Replaced custom shape/transpose handling with existing PEFT utilities where applicable.
Removed redundant code paths and simplified adapter initialization logic.
Switched FRoD initialization from NumPy to torch.linalg and removed the NumPy dependency.
Added comments for non-trivial implementation choices, including shared projection buffers, sparse CUDA matmul dtype handling, and pre-injection initialization.
Added/updated initialization tests and moved FRoD-specific initialization checks into tests/test_initialization.py.
Added FRoD coverage to multiple active adapter tests.
Improved the FRoD projection loading error message with recovery instructions.
Updated docs and runnable examples to use the new FrodConfig naming.

Bane-Elvin · 2026-05-29T04:53:04Z

The two new commits fix the FRoD sparse forward semantics and align the image-classification example with the CLIP-ViT setup used in our experiments.

I also re-tested the examples with two examples: the Stanford Cars image-classification example and the SST-2 text-classification example. The CLIP-ViT Stanford Cars run converged as expected in the local 3-epoch test.

Thank you for the review and suggestions. Looking forward to your feedback.

BenjaminBossan

Thanks for the updates. I did another review focusing on the integration and still found a handful of issues, please check.

BenjaminBossan · 2026-05-29T12:33:29Z

        return result

    def __repr__(self) -> str:
+        # Match PEFT tuner convention so printed models show FRoD-wrapped layers as `frod.*`.


This comment can be removed.

Removed the unnecessary comment.

BenjaminBossan · 2026-05-29T12:38:24Z

-        weight = self.get_base_layer().weight
+        return self._get_delta_weight(adapter)
+
+    def _get_delta_weight(self, adapter, base_weight: Optional[torch.Tensor] = None) -> torch.Tensor:


I don't see why we need _get_delta_weight with the option of passing base_weight. AFAICT, it's always self.get_base_layer().weight, so there is never a need to pass the argument explicitly.

Simplified get_delta_weight() to always use the base layer weight. merge() now precomputes adapter deltas before mutating the base weight, so the explicit base_weight argument is no longer needed.

BenjaminBossan · 2026-05-29T12:40:55Z

+                values = self.frod_lambda_s_values[active_adapter].to(device=x.device, dtype=target_dtype)
+                lambda_l = self.frod_lambda_l[active_adapter].to(device=x.device, dtype=target_dtype)
+
+                x = x.to(target_dtype)


target_dtype is defined as x.dtype, so this call is useless.

Removed the redundant cast.

BenjaminBossan · 2026-05-29T12:41:48Z

+                # F.linear(h, U @ (S + diag(lambda_l)) @ V.T).
+                # CUDA sparse fp16/bf16 kernels are less reliable, so use fp32 here and cast the update back below.
+                matmul_dtype = z_flat.dtype
+                if z_flat.is_cuda and matmul_dtype in (torch.float16, torch.bfloat16):


Just wondering: Why is this limited to is_cuda? Wouldn't we expect the same issue on other devices?

Changed the fp16/bf16 sparse matmul fallback to use fp32 regardless of device.

BenjaminBossan · 2026-05-29T12:48:14Z

+                base_weight = transpose(self.get_base_layer().weight, self.fan_in_fan_out).to(
+                    device=x.device, dtype=target_dtype
+                )
+                base_out = F.linear(x, base_weight)
+
+                result = result - base_out + out_add


This part I don't understand: For each FRoD adapter, we need to remove the base result from result? So my understanding is that the FRoD result already includes the result from the base model, so we need to subtract it to prevent it being included multiple times.

However, this seems very wasteful: First we calculate result = self.base_layer(x, *args, **kwargs), then we calculate base_out = F.linear(x, base_weight), which is basically the same thing (just without bias) and remove it from the result. And then repeat it for each active adapter.

What I would like to see instead is for the FRoD result to only contain the "delta" from FRoD without the base result, so that we can simply accumulate it. This would be the cleanest implementation IMO. It would also help to simplify get_delta_weight, where there is a similar logic to remove the base weight from the FRoD weight.

If, for some reason, it's not possible, we should instead count how many active adapters we have. Then we calculate the base result only once and remove it x times, once for each active adapter. This should be a lot cheaper than the current implementation.

Updated the forward pass so the FRoD branch explicitly computes and accumulates only the adapter delta. As described in the paper, FRoD directly reconstructs the adapted weight through the joint matrix factorization, so the base-weight contribution must be subtracted; otherwise the base model contribution would be included twice. The code now computes out_add - adapter_base_out and reuses the base output when dropout is identity.

BenjaminBossan

Thanks for the new updates. The integration is getting there, but there were still some unclear parts of the code that I commented. Please take a look.

BenjaminBossan · 2026-06-01T10:52:35Z

+        self.frod_V[adapter_name] = frod_V[adapter_name]
+        self.frod_s_indices[adapter_name] = frod_s_indices[adapter_name]
+        self.frod_s_size[adapter_name] = frod_s_size[adapter_name]


Can this be deleted?

I don't think this can be deleted. This follows the same pattern as VeRA: the shared projection buffers are created at the tuner level and then referenced by each wrapped layer. In VeRA, update_layer() assigns the shared vera_A / vera_B buffers to the layer; here FRoD attaches the shared category-level frod_V, frod_s_indices, and frod_s_size entries to the layer-local non-persistent buffers. These are shared references used by forward() and get_delta_weight(), not independent persistent copies.

What I mean is that here, we add the three parameters, and then just below, we add them again and cast them. Why not do this in a single step?

BenjaminBossan · 2026-06-01T10:57:46Z

+                if active_adapter not in self.frod_lambda_s_values:
+                    continue
+
+                V = self.frod_V[active_adapter].to(device=x.device, dtype=target_dtype)


The code here is partly exactly the same as in get_delta_weight. How about factoring out the common parts into a dedicated method?

There is some overlap in fetching/casting the FRoD tensors, but the two paths intentionally do different computations. get_delta_weight() materializes the dense delta weight for merge/unmerge, while forward() keeps the sparse activation-side computation and avoids constructing the full dense matrix.

I could factor out only the tensor-fetching part, but that helper would mostly return a long tuple of tensors and may make the forward path harder to read. I would prefer to keep the dense merge path and sparse forward path separate unless you think the small helper would still be clearer.

Yes, I think it would be better to have a small helper function that takes the device and dtype and returns the relevant tensors, up to the sparse tensor (before making it dense). Not only would it help to keep the two code paths in sync, IMO it also makes it easier to see what is shared between the merging and the forward path.

Factored the shared tensor fetching/casting and sparse tensor construction into _get_frod_tensors(), while keeping the dense merge path and sparse forward path separate.

Bane-Elvin · 2026-06-09T05:17:14Z

@BenjaminBossan Thanks for the review! I have addressed the requested changes and re-requested your review. Could you please take another look when you have time?

HuggingFaceDocBuilderDev · 2026-06-09T11:56:50Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

BenjaminBossan

Thanks for the new updates. For this review, I focused on these changes and the comments that were still open. Generally, my questions have been resolved by I still have a few comments.

Interestingly, one single FRoD test failed in one of the Windows CI runs. I restarted that run, in case it's just a flaky result, but if it fails again, we should investigate.

One more thing: We now added an image generation benchmark to PEFT: https://github.com/huggingface/peft/tree/main/method_comparison/image-gen. The idea is exactly the same as for the MetaMath benchmark. It would be nice if you could add an experiment for FRoD to this benchmark. Check the existing experimental settings for guidance.

BenjaminBossan · 2026-06-09T11:49:37Z

+        self.frod_V[adapter_name] = frod_V[adapter_name]
+        self.frod_s_indices[adapter_name] = frod_s_indices[adapter_name]
+        self.frod_s_size[adapter_name] = frod_s_size[adapter_name]


What I mean is that here, we add the three parameters, and then just below, we add them again and cast them. Why not do this in a single step?

BenjaminBossan · 2026-06-09T11:53:12Z

+        V = self.frod_V[adapter].to(device=device, dtype=dtype)
+        indices = self.frod_s_indices[adapter].to(device=U.device, dtype=torch.long)
+        size = tuple(int(dim) for dim in self.frod_s_size[adapter].tolist())
+        values = self.frod_lambda_s_values[adapter].to(U.device, U.dtype).clone()


I'm wondering if we really need the .clone here.

Removed the unnecessary .clone() since the sparse values are not modified in place.

BenjaminBossan · 2026-06-09T11:55:11Z

+                if active_adapter not in self.frod_lambda_s_values:
+                    continue
+
+                V = self.frod_V[active_adapter].to(device=x.device, dtype=target_dtype)


Yes, I think it would be better to have a small helper function that takes the device and dtype and returns the relevant tensors, up to the sparse tensor (before making it dense). Not only would it help to keep the two code paths in sync, IMO it also makes it easier to see what is shared between the merging and the forward path.

Bane-Elvin · 2026-06-10T08:57:18Z

Updated in 740f432.

This addresses the remaining FrodLayer feedback by simplifying the shared buffer assignment, adding the merge-path comment, removing the unnecessary clone, and factoring the shared FRoD tensor preparation into _get_frod_tensors() while keeping the dense merge path and sparse forward path separate.

I checked the previous PR head before pushing this update, and all 10 checks were successful, including the Windows jobs. Could you please let me know if the Windows CI issue still appears on your side after this new run starts?

I also added the FRoD experiment settings for the new image generation benchmark under method_comparison/image-gen/experiments/frod/flux2-klein-default.

BenjaminBossan

Thanks for the recent changes, they look good. However, ruff is complaining, so could you please run make style?

I have another suggestion: In PR #3117, we recently added the foundation for generic quantization support. What that means is that with only a few changes, it should be possible for most PEFT methods to support multiple quantization backends. In other words, FRoD could support bitsandbytes, torchao, and gptqmodel without the need to add extra FRoD layer classes for each of them.

To achieve that, use the changes in the linked PR as an example and apply the same idea to FRoD. If you run into any difficulties, please let me know. This feature is not a must have, if you don't want to support in FRoD, it's fine. But I would highly recommend to add it, as many users are interested in fine-tuning quantized models.

Bane-Elvin · 2026-06-11T12:20:06Z

Applied make style and pushed the formatting fix in c391d35. make quality passes locally now.

Bane-Elvin · 2026-06-11T12:40:58Z

Thanks for the suggestion. I agree that generic quantization support is an important direction, and PR #3117 provides a good shared path for methods that naturally operate on quantized base layers.

For FRoD, I think the quantization design needs a separate and more careful treatment. Unlike LoRA-style methods, FRoD does not fundamentally need to keep using the original base weight after initialization. Instead, the target weight is reparameterized through U, the shared projection V, and sparse/diagonal factors, and the effective adapter weight is reconstructed from these FRoD factors. Therefore, simply supporting a quantized base layer through the generic wrapper may not be the most meaningful compression path for this method. A more relevant direction may be to study whether the FRoD factors themselves, especially U and the shared V, can be quantized while preserving the intended weight reconstruction behavior.

To avoid making this PR larger, I would prefer to keep this PR focused on adding the core FRoD method. I can investigate the quantization design separately, and if it works well, open a follow-up PR specifically for FRoD quantization support.

BenjaminBossan

Thanks for the new updates, the tests are passing (one flaky CI run can be ignored).

Regarding the forward path, I think I found one possible optimization, please check my comment. I also checked the examples and added some comments regarding those.

About the quantization: Thanks for explaining why it's not trivially supported here. Let's drop it for now. If you could revisit it later, that would be great.

BenjaminBossan · 2026-06-12T10:05:53Z

+MODEL_NAME = os.environ.get("FROD_IMAGE_MODEL_NAME", "openai/clip-vit-base-patch32")
+OUTPUT_DIR = os.environ.get("FROD_IMAGE_OUTPUT_DIR", "clip-vit-base-patch32-frod-stanford-cars")
+DATA_DIR = os.environ.get("FROD_STANFORD_CARS_DATA_DIR")
+NUM_TRAIN_EPOCHS = int(os.environ.get("FROD_IMAGE_NUM_TRAIN_EPOCHS", "3"))
+TRAIN_BATCH_SIZE = int(os.environ.get("FROD_IMAGE_TRAIN_BATCH_SIZE", "64"))
+EVAL_BATCH_SIZE = int(os.environ.get("FROD_IMAGE_EVAL_BATCH_SIZE", "64"))
+SPARSE_RATE = float(os.environ.get("FROD_IMAGE_SPARSE_RATE", "0.01"))
+FROD_LAMBDA_L_LR = float(os.environ.get("FROD_IMAGE_LAMBDA_L_LR", "5e-4"))
+FROD_LAMBDA_S_LR = float(os.environ.get("FROD_IMAGE_LAMBDA_S_LR", "5e-5"))
+CLASSIFIER_LR = float(os.environ.get("FROD_IMAGE_CLASSIFIER_LR", "1e-4"))
+CLIP_TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"]


I'm not a fan of setting all of these as environment variables. Let's please parse them as proper command line arguments. You could use HfArgumentParser(TrainingArguments) for the Trainer arguments and parse the rest manually.

Replaced the environment-variable based configuration with HfArgumentParser-backed CLI arguments.

BenjaminBossan · 2026-06-12T10:07:06Z

+OUTPUT_DIR = "bert-base-uncased-frod-sst2"
+FROD_LAMBDA_L_LR = 2e-2
+FROD_LAMBDA_S_LR = 2e-3
+CLASSIFIER_LR = 1e-2


Similar argument, let's add a proper CLI.

Applied the same CLI structure to the text example as well.

BenjaminBossan · 2026-06-12T10:23:26Z

+
+        generator = torch.Generator(device="cpu").manual_seed(config.projection_prng_key)
+        categories = {category for layer_dict in weights.values() for category in layer_dict}
+        for category in sorted(categories):


Running through this loop can be quite lengthy, without any indication to the user of what's happening. Usually, calling get_peft_model is quite fast, so this could come as a surprise. WDYT about adding some progress indication, e.g. using tqdm for the loop? You could add a progressbar: bool argument to enable/disable it. I would also document that this can be slow.

Thank you very much for your suggestion. We need to perform Hierarchical Joint Decomposition on the matrices that require fine-tuning, which is very time-consuming. However, the good news is that the result of this Hierarchical Joint Decomposition is measured by the model, so fine-tuning a model for different tasks only requires one decomposition. We will use tqdm for the loop and document it.

BenjaminBossan · 2026-06-12T10:25:57Z

I tried to run this experiment on my machine with 24 GB and got OOM. The other tested PEFT methods usually stay below 12 GB. Do you think this is expected? For the final run, we'll use 48 GB cards, so the experiment may succeed there, but it could still be an indication that there is room for improvement. One suspicion I have is that float32 is used for the computation, not sure if that explains all of it.

This is a very serious bug, not just an issue with fp32, but also a problem encountered when using sparse matrices. I will fix this bug as soon as possible.

Applied the suggestion so eval-mode dropout reuses the shared base output as well. Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>

Applied this by deriving `base_out` from `result` and lazily constructing `base_weight` only when dropout is active during training. Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>

Bane-Elvin · 2026-06-14T16:03:17Z

@BenjaminBossan After investigating the unexpectedly high memory usage, we found a few implementation issues in the current FRoD code. This OOM was not expected and was caused by a serious implementation bug, not only by fp32 computation.

First, the shared FRoD projection buffers were not actually shared on GPU. Although V, s_indices, and s_size are initialized once per module category, the previous implementation registered them again inside each wrapped FrodLayer. After model.to(device), PyTorch materialized those registered buffers per layer, so the same category-level tensors were duplicated. This was a significant implementation bug because FRoD expects these tensors to be shared across layers in the same category.

Second, we re-tested the sparse forward path in both fp32 and fp16/bf16. In our supported setup, fp16/bf16 was not less reliable, so forcing the sparse path through fp32 was unnecessary and added avoidable memory overhead.

Third, the forward path was still too close to a PEFT delta-style implementation. FRoD semantically reconstructs the adapted weight rather than adding a LoRA-style delta, so the single-adapter/no-dropout active path does not need the target base linear output. Keeping target base weights on GPU in that path added unnecessary memory pressure, although we still need to preserve PEFT behavior for disable/merge/unmerge.

I updated the implementation accordingly. FrodLayer now keeps non-registered references to the tuner-level V, s_indices, and s_size buffers, so those tensors remain shared by category instead of being duplicated per layer.

I also updated the sparse forward path to compute the COO activation update directly in the requested dtype when sparse addmm is unavailable, instead of forcing an unconditional fp32 path. For the base-weight memory issue, I added an opt-in runtime_offload_base_weight config. Since FRoD reconstructs the adapted weight directly, this option can move target base weights to CPU during the single-adapter/no-dropout active path. It defaults to False to preserve PEFT’s usual device invariant and existing test expectations.

Could you please retry method_comparison/image-gen/experiments/frod/flux2-klein-default/adapter_config.json with the latest changes? This should be a much more representative memory test after the sharing and sparse-forward fixes.

BenjaminBossan

Thanks for implementing the feedback and for investigating the cause of the memory issue.

Unfortunately, when I tried to re-run the image generation benchmark, I still got OOM. Most other PEFT methods manage to stay below 12 GB. Can you replicate this issue? Note that for image generation, we have enabled gradient checkpointing, just mentioning it in case it matters. The MetaMath benchmark also OOMs with 24 GB, but that one skirts much closer to that number, so it doesn't necessarily mean something is wrong there.

BenjaminBossan · 2026-06-15T10:50:46Z

+        self.__dict__["frod_V"] = frod_V
+        self.__dict__["frod_s_indices"] = frod_s_indices
+        self.__dict__["frod_s_size"] = frod_s_size


Hmm, I wonder why we need __dict__ assignment here but not for other PEFT methods like VeRA?

BenjaminBossan · 2026-06-15T10:53:08Z

  "projection_prng_key": 0,
  "regularization_alpha": 0.001,
  "revision": null,
+  "runtime_offload_base_weight": true,


It would make sense to add this option to the MetaMath experiments too.

BenjaminBossan · 2026-06-15T10:54:55Z

+        if merge:
+            self.merge(safe_merge=safe_merge, adapter_names=adapter_names)
+            self._move_base_weight_to_device_of_adapter(self.active_adapters[0] if self.active_adapters else None)
+        else:
+            self._move_base_weight_to_device_of_adapter(self.active_adapters[0] if self.active_adapters else None)


Should do the same, right?

Suggested change

if merge:

self.merge(safe_merge=safe_merge, adapter_names=adapter_names)

self._move_base_weight_to_device_of_adapter(self.active_adapters[0] if self.active_adapters else None)

else:

self._move_base_weight_to_device_of_adapter(self.active_adapters[0] if self.active_adapters else None)

if merge:

self.merge(safe_merge=safe_merge, adapter_names=adapter_names)

self._move_base_weight_to_device_of_adapter(self.active_adapters[0] if self.active_adapters else None)

Also, please add a comment why _move_base_weight_to_device_of_adapter needs to be called.

BenjaminBossan · 2026-06-15T10:56:11Z

+            self._move_base_weight_to_device(x.device)
            result = self.base_layer(x, *args, **kwargs)
        elif self.merged:
+            self._move_base_weight_to_device(x.device)


Why are the _move_base_weight_to_device calls needed?

BenjaminBossan · 2026-06-15T11:03:30Z

+            )
+        },
+    )
+    runtime_offload_base_weight: bool = field(


I like the idea of adding this as an option. However, it wouldn't make sense to have multiple adapters on the same base model where some have runtime_offload_base_weight=True and others have False, right? I think we should enforce the same value for all.

The best way to achieve this is to override _check_new_adapter_config on FrodModel. This check comes from the parent class, so don't forget to call super(). Then, if there are multiple configs with different runtime_offload_base_weight values, an error should be raised. Add a test to tests/test_initialization.py.

Bane-Elvin · 2026-06-15T16:44:43Z

Thanks for re-running this and for pointing out the gradient checkpointing detail. I could reproduce the unexpectedly high memory behavior with the previous implementation, and it was not expected.

I found and fixed a few FRoD-specific memory issues. First, the shared FRoD projection buffers (V, s_indices, and s_size) were initialized once per module category, but were then registered again inside each wrapped FrodLayer, so they were duplicated on GPU after model.to(device). Second, the sparse forward path was unnecessarily forcing part of the computation through fp32; after re-testing fp16/bf16 in the supported setup, this was not needed. Third, since FRoD reconstructs the adapted weight directly, the active single-adapter/no-dropout path does not need the target base weight on GPU, so I added runtime_offload_base_weight for this case.

I re-ran the image generation benchmark with the existing benchmark script/settings, without disabling gradient checkpointing. The FRoD config I tested was:

model: black-forest-labs/FLUX.2-klein-base-4B
dataset: peft-internal-testing/cat-image-dataset
target modules: to_v, add_v_proj, to_qkv_mlp_proj, linear_out, to_k, to_add_out, add_q_proj, linear_in, add_k_proj, to_q, to_out.0
sparse_rate=0.01
frod_dropout=0.0
regularization_alpha=0.001
projection_prng_key=0
save_projection=true
runtime_offload_base_weight=true
optimizer lr: 1e-3
max_steps=20, eval_steps=20, valid_size=1, test_size=1

After the fix, I could not reproduce the training OOM. The benchmark-reported training memory was:

mem allocated: 9,128,246,938 bytes, about 8.50 GiB
mem reserved: 10,524,137,882 bytes, about 9.80 GiB
accelerator memory max: 10042 MB

For comparison, I also ran LoRA r=8, lora_alpha=16 with the same 20-step benchmark setup. It reported accelerator memory max: 9608 MB, so FRoD is now close to LoRA r=8 and below 12 GB under the benchmark training-memory metric.

One caveat: if I watch nvidia-smi during the final test/evaluation generation after Training finished after 20 steps, evaluation on test set follows, I can still see the process go to around 17 GB. That appears to be the final eval/generation phase rather than the benchmark's training memory measurement. If the OOM you saw happens specifically during that final generation/eval phase, I can investigate that separately.

BenjaminBossan

Thank you for investigating further to fix the memory issue. I can confirm that the benchmarks now both run on my machine.

One issue I noticed, however, is that the benchmarks run extremely slowly. Did you observe the same behavior? E.g. for the image benchmark, it takes ~600 sec for 100 steps to train and ~100 sec for the eval step. As a point of comparison, LoRA requires ~70 sec and ~13 sec, respectively.

I checked if it could be related to offloading, but after disabling it, the picture was the same. Then I changed the chunk size, e.g. by increasing it 8x. This might have had a small effect but at the cost of higher memory. If you have time, it would be great if you could investigate once more if there is something obviously wrong in the PEFT integration compared to your original code. If not, it's also okay to leave it as is, saving memory is more important for PEFT than good runtimes.

BenjaminBossan · 2026-06-16T09:59:12Z

+        assert torch.allclose(z.grad, z_expected.grad, atol=1e-3, rtol=1e-3)
+
+    @pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA is not available")
+    def test_frod_runtime_offload_keeps_base_weight_on_cpu_after_cuda(self, mlp):


As this test requires a GPU to run, could you please move it to tests/test_gpu_examples.py?

BenjaminBossan · 2026-06-16T10:13:43Z

 class FrodLayer(BaseTunerLayer):
    adapter_layer_names = ("frod_lambda_l", "frod_lambda_s_values")
    other_param_names = ("frod_V", "frod_U", "frod_s_indices", "frod_s_size", "runtime_offload_base_weight")
+    _frod_sparse_chunk_size = 16_384


Would it make sense to make this a config parameter for users to tune? It should mention the tradeoffs when increasing/decreasing that value.

BenjaminBossan · 2026-06-16T10:17:45Z

+        args=training_args,
+        train_dataset=tokenized["train"],
+        eval_dataset=tokenized["validation"],
+        tokenizer=tokenizer,


Suggested change

tokenizer=tokenizer,

processing_class=tokenizer,

Bane-Elvin · 2026-06-16T11:37:54Z

I investigated this and found that the slowdown came from the custom chunked COO activation path, not from base-weight offloading. The chunked implementation reduced temporary memory, but it used Python-level index_select/index_add_ loops and became very slow for the large image-gen projection categories, especially the 9216-dimensional linear_out category.

I changed the fp16/bf16 sparse activation path to densify only the sparse S factor temporarily and then use dense GEMM. This keeps the trainable COO parameterization and gradients to the sparse values, but avoids the slow chunked Python loop. In my local image benchmark check with bf16, resolution 512, batch size 2, and gradient checkpointing enabled, 20 training steps took 21.2s and peak reserved memory was 10200MB. This should bring the 100-step training time much closer to the LoRA comparison while staying below 12GB.

Bane-Elvin · 2026-06-16T13:17:02Z

I investigated the runtime issue and found that the slowdown came from the custom chunked COO activation path, not from base-weight offloading. I replaced that path with a faster implementation that temporarily densifies only the sparse S factor and then uses dense GEMM. This keeps the trainable COO parameterization and gradients to the sparse values, but avoids the Python-level index_select / index_add_ loop.

I pushed the fix in b102e705. I also re-ran the local image-generation benchmark with bf16, resolution 512, batch size 2, and gradient checkpointing enabled. For 100 training steps, I got 106.2s train time and 18.0s eval time, with peak reserved memory around 10.78GB and accelerator max memory 10284MB.

I also re-ran the relevant checks: tests/test_frod.py and the FRoD GPU test passed, and both make style and make quality passed.

BenjaminBossan

@Bane-Elvin Thank you for your further experiments. I can confirm that FRoD now runs fast and in a memory efficient way. The fact that the code is even simplified is a nice bonus. From my side, there is nothing more to add and we can go ahead and merge.

Just one more thing that we try to test out: Would you like to provide a tweet-sized blurb for FRoD? Something we can put into the release notes and that allows the interested user to quickly understand when they should use FRoD. So basically: What it is, what it's especially good at, when not to use it.

Bane-Elvin · 2026-06-19T08:58:38Z

@BenjaminBossan Thank you very much for the careful review, the additional benchmarking, and all the valuable suggestions. They helped uncover several important implementation issues and made the final FRoD integration both simpler and much more practical.

Tweet-sized blurb: FRoD is a full-rank, replacement-style PEFT method from FRoD: Full-Rank Efficient Fine-Tuning with Rotational Degrees for Fast Convergence. Instead of adding low-rank deltas, it reconstructs selected weights with shared rotational subspaces and sparse trainable coefficients. It is especially useful when fast convergence and a higher full-rank capacity ceiling are important, and its large sparse rotational subspace may also be promising for model merging. The main tradeoffs are the costly joint-decomposition initialization and slightly slower forward/backward passes than LoRA due to the sparse structured factors, so it may be less attractive for a one-off single-task fine-tune.

BenjaminBossan · 2026-06-19T09:57:16Z

Thanks a lot @Bane-Elvin for this great work and your patience with the reviews.

Bane-Elvin added 3 commits May 21, 2026 16:19

Add FRoD tuner

88ccfaa

Add FRoD-specific tests

2d3c730

Finalize FROD integration

b252c65

BenjaminBossan requested changes May 27, 2026

View reviewed changes

Address FROD PR review feedback

0e0d816

Bane-Elvin added 2 commits May 29, 2026 04:48

Fix FRoD sparse forward semantics

1735344

Align FRoD image example with CLIP setup

80c1870

BenjaminBossan requested changes May 29, 2026

View reviewed changes

Address FRoD layer review comments

e08e49f

BenjaminBossan requested changes Jun 1, 2026

View reviewed changes

Bane-Elvin added 2 commits June 2, 2026 13:30

Document FRoD module key parsing

3bfdc23

Test FRoD category inference

3209feb

Bane-Elvin requested a review from BenjaminBossan June 9, 2026 05:06

BenjaminBossan requested changes Jun 9, 2026

View reviewed changes

Address remaining FRoD review feedback

740f432

BenjaminBossan requested changes Jun 11, 2026

View reviewed changes

Apply FRoD style formatting

c391d35

BenjaminBossan requested changes Jun 12, 2026

View reviewed changes

Bane-Elvin and others added 3 commits June 12, 2026 21:24

Update src/peft/tuners/frod/layer.py

c7fe810

Applied the suggestion so eval-mode dropout reuses the shared base output as well. Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>

Update src/peft/tuners/frod/layer.py

7c0f785

Applied this by deriving `base_out` from `result` and lazily constructing `base_weight` only when dropout is active during training. Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>

Address FRoD memory and example feedback

3706721

BenjaminBossan requested changes Jun 15, 2026

View reviewed changes

Address FRoD runtime offload feedback

e703da2

BenjaminBossan requested changes Jun 16, 2026

View reviewed changes

Improve FRoD sparse activation runtime

b102e70

Merge branch 'main' into feature/add-frod

a380b38

BenjaminBossan approved these changes Jun 18, 2026

View reviewed changes

BenjaminBossan merged commit 53ce53f into huggingface:main Jun 19, 2026
18 of 19 checks passed

Conversation

Bane-Elvin commented May 27, 2026

Summary

Examples

Tests

Uh oh!

BenjaminBossan left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Bane-Elvin commented May 28, 2026

Uh oh!

Bane-Elvin commented May 29, 2026

Uh oh!

BenjaminBossan left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

BenjaminBossan left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!