Skip to content

Feature/add frod#3270

Merged
BenjaminBossan merged 17 commits into
huggingface:mainfrom
Bane-Elvin:feature/add-frod
Jun 19, 2026
Merged

Feature/add frod#3270
BenjaminBossan merged 17 commits into
huggingface:mainfrom
Bane-Elvin:feature/add-frod

Conversation

@Bane-Elvin

Copy link
Copy Markdown
Contributor

Summary

This PR implements the FRoD PEFT tuner proposed in #3244.

  • Add FRoD config, tuner, layer implementation, save/load integration, and PEFT exports.
  • Add Conv1D handling and ViT target-module mapping.
  • Add FRoD to PEFT test matrices.
  • Add package reference docs, runnable Trainer examples, and benchmark settings.

Examples

  • BERT on SST-2 with google-bert/bert-base-uncased and nyu-mll/glue.
  • ViT on Stanford Cars with google/vit-base-patch16-224 and tanganke/stanford_cars.

Tests

  • PYTHONPATH=src python -m ruff check ...
  • PYTHONPATH=src python -m py_compile examples/frod_finetuning/*.py
  • PYTHONPATH=src python -m pytest -o addopts='' tests/test_decoder_models.py tests/test_encoder_decoder_models.py tests/test_feature_extraction_models.py tests/test_seq_classifier.py tests/test_custom_models.py tests/test_frod.py -q -k "FROD or frod"
  • PYTHONPATH=src python -m pytest -o addopts='' tests/test_config.py -q -k "FROD or frod"

@BenjaminBossan BenjaminBossan left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR to add FRoD. This looks already quite good. In this review, I focused on the integration, so I haven't checked the docs, examples, and experiments yet.

Comment thread src/peft/tuners/frod/config.py Outdated
Comment thread src/peft/tuners/frod/config.py Outdated


@dataclass
class FRODConfig(PeftConfig):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's go with FrodConfig etc. It's easier to type and more consistent overall (e.g. we use LoraConfig, not LoRAConfig).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to FrodConfig, FrodModel, and FrodLayer to match PEFT naming conventions. I kept PeftType.FROD uppercase because it follows the enum naming style.

Comment thread src/peft/tuners/frod/layer.py Outdated
self.merged_adapters = []

base_layer = self.get_base_layer()
if isinstance(base_layer, nn.Linear):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use peft.tuners.tuners_utils._get_in_out_features.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated FrodLayer to use peft.tuners.tuners_utils._get_in_out_features() instead of manually handling nn.Linear and Conv1D.

Comment thread src/peft/tuners/frod/layer.py Outdated
self.kwargs = kwargs

@property
def merged(self) -> bool:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be removed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the duplicate merged property.

Comment thread src/peft/tuners/frod/layer.py Outdated
init_weights,
):
base_layer = self.get_base_layer()
weight = base_layer.weight.T if isinstance(base_layer, Conv1D) else base_layer.weight

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use peft.utils.other.transpose.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this to use transpose(base_layer.weight, self.fan_in_fan_out).

Comment thread src/peft/tuners/frod/model.py Outdated
import warnings
from collections import defaultdict

import numpy as np

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I would prefer to use torch throughout instead of numpy.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the FRoD initialization path to use torch.linalg throughout and removed the NumPy dependency.

Comment thread src/peft/utils/save_and_load.py Outdated
)
if config.save_projection and not has_projection:
raise ValueError(
"Specified to load FRoD projection tensors from state dictionary however they were not present!"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add instructions on how to deal with that situation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added instructions to the error message: either load with save_projection=False to regenerate the projections from the base model weights, or re-save the adapter with save_projection=True.

Comment thread tests/test_frod.py Outdated
== mlp_same_prng.base_model.model.lin2.frod_s_indices["other"].data_ptr()
)

def test_multiple_adapters_different_prng_raises(self):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned above, let's put this into test_initializaton.py.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this coverage to tests/test_initialization.py using the same structure as TestVeraInitialization.

Comment thread tests/test_frod.py Outdated
return peft_model

@staticmethod
def _make_second_adapter_different(peft_model):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be unnecessary with init_weigths=False?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's really helpful. Removed the helper and its call; init_weights=False already makes the two adapters different for this test.

Comment thread tests/test_custom_models.py Outdated
("Vanilla MLP 2 FRoD", "MLP", FRODConfig, {"target_modules": ["lin0"]}),
("Vanilla MLP 3 FRoD", "MLP", FRODConfig, {"target_modules": ["lin1"]}),
("Vanilla MLP 4 FRoD", "MLP", FRODConfig, {"target_modules": ["lin0", "lin1"]}),
("Vanilla MLP 5 FRoD", "MLP", FRODConfig, {"target_modules": ["lin0"], "modules_to_save": ["lin1"]}),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please also add FRoD to MULTIPLE_ACTIVE_ADAPTERS_TEST_CASES?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added FRoD to MULTIPLE_ACTIVE_ADAPTERS_TEST_CASES with the same-target adapter case.

@Bane-Elvin

Copy link
Copy Markdown
Contributor Author

Updated the PR in 0e0d816 to address the review feedback:

  • Renamed the public FRoD classes to follow PEFT naming conventions: FrodConfig, FrodModel, and FrodLayer.
  • Replaced custom shape/transpose handling with existing PEFT utilities where applicable.
  • Removed redundant code paths and simplified adapter initialization logic.
  • Switched FRoD initialization from NumPy to torch.linalg and removed the NumPy dependency.
  • Added comments for non-trivial implementation choices, including shared projection buffers, sparse CUDA matmul dtype handling, and pre-injection initialization.
  • Added/updated initialization tests and moved FRoD-specific initialization checks into tests/test_initialization.py.
  • Added FRoD coverage to multiple active adapter tests.
  • Improved the FRoD projection loading error message with recovery instructions.
  • Updated docs and runnable examples to use the new FrodConfig naming.

@Bane-Elvin

Copy link
Copy Markdown
Contributor Author

The two new commits fix the FRoD sparse forward semantics and align the image-classification example with the CLIP-ViT setup used in our experiments.

I also re-tested the examples with two examples: the Stanford Cars image-classification example and the SST-2 text-classification example. The CLIP-ViT Stanford Cars run converged as expected in the local 3-epoch test.

Thank you for the review and suggestions. Looking forward to your feedback.

@BenjaminBossan BenjaminBossan left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates. I did another review focusing on the integration and still found a handful of issues, please check.

Comment thread src/peft/tuners/frod/layer.py Outdated
return result

def __repr__(self) -> str:
# Match PEFT tuner convention so printed models show FRoD-wrapped layers as `frod.*`.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment can be removed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the unnecessary comment.

Comment thread src/peft/tuners/frod/layer.py
Comment thread src/peft/tuners/frod/layer.py Outdated
weight = self.get_base_layer().weight
return self._get_delta_weight(adapter)

def _get_delta_weight(self, adapter, base_weight: Optional[torch.Tensor] = None) -> torch.Tensor:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why we need _get_delta_weight with the option of passing base_weight. AFAICT, it's always self.get_base_layer().weight, so there is never a need to pass the argument explicitly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplified get_delta_weight() to always use the base layer weight. merge() now precomputes adapter deltas before mutating the base weight, so the explicit base_weight argument is no longer needed.

Comment thread src/peft/tuners/frod/layer.py Outdated
values = self.frod_lambda_s_values[active_adapter].to(device=x.device, dtype=target_dtype)
lambda_l = self.frod_lambda_l[active_adapter].to(device=x.device, dtype=target_dtype)

x = x.to(target_dtype)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

target_dtype is defined as x.dtype, so this call is useless.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the redundant cast.

Comment thread src/peft/tuners/frod/layer.py Outdated
# F.linear(h, U @ (S + diag(lambda_l)) @ V.T).
# CUDA sparse fp16/bf16 kernels are less reliable, so use fp32 here and cast the update back below.
matmul_dtype = z_flat.dtype
if z_flat.is_cuda and matmul_dtype in (torch.float16, torch.bfloat16):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering: Why is this limited to is_cuda? Wouldn't we expect the same issue on other devices?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the fp16/bf16 sparse matmul fallback to use fp32 regardless of device.

Comment thread src/peft/tuners/frod/layer.py Outdated
Comment on lines +264 to +269
base_weight = transpose(self.get_base_layer().weight, self.fan_in_fan_out).to(
device=x.device, dtype=target_dtype
)
base_out = F.linear(x, base_weight)

result = result - base_out + out_add

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part I don't understand: For each FRoD adapter, we need to remove the base result from result? So my understanding is that the FRoD result already includes the result from the base model, so we need to subtract it to prevent it being included multiple times.

However, this seems very wasteful: First we calculate result = self.base_layer(x, *args, **kwargs), then we calculate base_out = F.linear(x, base_weight), which is basically the same thing (just without bias) and remove it from the result. And then repeat it for each active adapter.

What I would like to see instead is for the FRoD result to only contain the "delta" from FRoD without the base result, so that we can simply accumulate it. This would be the cleanest implementation IMO. It would also help to simplify get_delta_weight, where there is a similar logic to remove the base weight from the FRoD weight.

If, for some reason, it's not possible, we should instead count how many active adapters we have. Then we calculate the base result only once and remove it x times, once for each active adapter. This should be a lot cheaper than the current implementation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the forward pass so the FRoD branch explicitly computes and accumulates only the adapter delta. As described in the paper, FRoD directly reconstructs the adapted weight through the joint matrix factorization, so the base-weight contribution must be subtracted; otherwise the base model contribution would be included twice. The code now computes out_add - adapter_base_out and reuses the base output when dropout is identity.

@BenjaminBossan BenjaminBossan left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the new updates. The integration is getting there, but there were still some unclear parts of the code that I commented. Please take a look.

Comment thread src/peft/tuners/frod/layer.py
Comment thread src/peft/tuners/frod/layer.py
Comment thread src/peft/tuners/frod/model.py
Comment thread src/peft/tuners/frod/layer.py Outdated
Comment on lines +89 to +91
self.frod_V[adapter_name] = frod_V[adapter_name]
self.frod_s_indices[adapter_name] = frod_s_indices[adapter_name]
self.frod_s_size[adapter_name] = frod_s_size[adapter_name]

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be deleted?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this can be deleted. This follows the same pattern as VeRA: the shared projection buffers are created at the tuner level and then referenced by each wrapped layer. In VeRA, update_layer() assigns the shared vera_A / vera_B buffers to the layer; here FRoD attaches the shared category-level frod_V, frod_s_indices, and frod_s_size entries to the layer-local non-persistent buffers. These are shared references used by forward() and get_delta_weight(), not independent persistent copies.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is that here, we add the three parameters, and then just below, we add them again and cast them. Why not do this in a single step?

Comment thread src/peft/tuners/frod/layer.py Outdated
if active_adapter not in self.frod_lambda_s_values:
continue

V = self.frod_V[active_adapter].to(device=x.device, dtype=target_dtype)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code here is partly exactly the same as in get_delta_weight. How about factoring out the common parts into a dedicated method?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some overlap in fetching/casting the FRoD tensors, but the two paths intentionally do different computations. get_delta_weight() materializes the dense delta weight for merge/unmerge, while forward() keeps the sparse activation-side computation and avoids constructing the full dense matrix.

I could factor out only the tensor-fetching part, but that helper would mostly return a long tuple of tensors and may make the forward path harder to read. I would prefer to keep the dense merge path and sparse forward path separate unless you think the small helper would still be clearer.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it would be better to have a small helper function that takes the device and dtype and returns the relevant tensors, up to the sparse tensor (before making it dense). Not only would it help to keep the two code paths in sync, IMO it also makes it easier to see what is shared between the merging and the forward path.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Factored the shared tensor fetching/casting and sparse tensor construction into _get_frod_tensors(), while keeping the dense merge path and sparse forward path separate.

Comment thread src/peft/tuners/frod/layer.py Outdated
Comment thread src/peft/tuners/frod/model.py
Comment thread src/peft/tuners/frod/model.py
Comment thread tests/test_frod.py
@Bane-Elvin Bane-Elvin requested a review from BenjaminBossan June 9, 2026 05:06
@Bane-Elvin

Copy link
Copy Markdown
Contributor Author

@BenjaminBossan Thanks for the review! I have addressed the requested changes and re-requested your review. Could you please take another look when you have time?

@HuggingFaceDocBuilderDev

Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan BenjaminBossan left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the new updates. For this review, I focused on these changes and the comments that were still open. Generally, my questions have been resolved by I still have a few comments.

Interestingly, one single FRoD test failed in one of the Windows CI runs. I restarted that run, in case it's just a flaky result, but if it fails again, we should investigate.

One more thing: We now added an image generation benchmark to PEFT: https://github.com/huggingface/peft/tree/main/method_comparison/image-gen. The idea is exactly the same as for the MetaMath benchmark. It would be nice if you could add an experiment for FRoD to this benchmark. Check the existing experimental settings for guidance.

Comment thread src/peft/tuners/frod/layer.py Outdated
Comment on lines +89 to +91
self.frod_V[adapter_name] = frod_V[adapter_name]
self.frod_s_indices[adapter_name] = frod_s_indices[adapter_name]
self.frod_s_size[adapter_name] = frod_s_size[adapter_name]

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is that here, we add the three parameters, and then just below, we add them again and cast them. Why not do this in a single step?

Comment thread src/peft/tuners/frod/layer.py
Comment thread src/peft/tuners/frod/layer.py Outdated
V = self.frod_V[adapter].to(device=device, dtype=dtype)
indices = self.frod_s_indices[adapter].to(device=U.device, dtype=torch.long)
size = tuple(int(dim) for dim in self.frod_s_size[adapter].tolist())
values = self.frod_lambda_s_values[adapter].to(U.device, U.dtype).clone()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we really need the .clone here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the unnecessary .clone() since the sparse values are not modified in place.

Comment thread src/peft/tuners/frod/layer.py Outdated
if active_adapter not in self.frod_lambda_s_values:
continue

V = self.frod_V[active_adapter].to(device=x.device, dtype=target_dtype)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it would be better to have a small helper function that takes the device and dtype and returns the relevant tensors, up to the sparse tensor (before making it dense). Not only would it help to keep the two code paths in sync, IMO it also makes it easier to see what is shared between the merging and the forward path.

@Bane-Elvin

Copy link
Copy Markdown
Contributor Author

Updated in 740f432.

This addresses the remaining FrodLayer feedback by simplifying the shared buffer assignment, adding the merge-path comment, removing the unnecessary clone, and factoring the shared FRoD tensor preparation into _get_frod_tensors() while keeping the dense merge path and sparse forward path separate.

I checked the previous PR head before pushing this update, and all 10 checks were successful, including the Windows jobs. Could you please let me know if the Windows CI issue still appears on your side after this new run starts?

I also added the FRoD experiment settings for the new image generation benchmark under method_comparison/image-gen/experiments/frod/flux2-klein-default.

@BenjaminBossan BenjaminBossan left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the recent changes, they look good. However, ruff is complaining, so could you please run make style?

I have another suggestion: In PR #3117, we recently added the foundation for generic quantization support. What that means is that with only a few changes, it should be possible for most PEFT methods to support multiple quantization backends. In other words, FRoD could support bitsandbytes, torchao, and gptqmodel without the need to add extra FRoD layer classes for each of them.

To achieve that, use the changes in the linked PR as an example and apply the same idea to FRoD. If you run into any difficulties, please let me know. This feature is not a must have, if you don't want to support in FRoD, it's fine. But I would highly recommend to add it, as many users are interested in fine-tuning quantized models.

@Bane-Elvin

Copy link
Copy Markdown
Contributor Author

Applied make style and pushed the formatting fix in c391d35. make quality passes locally now.

@Bane-Elvin

Copy link
Copy Markdown
Contributor Author

Thanks for the suggestion. I agree that generic quantization support is an important direction, and PR #3117 provides a good shared path for methods that naturally operate on quantized base layers.

For FRoD, I think the quantization design needs a separate and more careful treatment. Unlike LoRA-style methods, FRoD does not fundamentally need to keep using the original base weight after initialization. Instead, the target weight is reparameterized through U, the shared projection V, and sparse/diagonal factors, and the effective adapter weight is reconstructed from these FRoD factors. Therefore, simply supporting a quantized base layer through the generic wrapper may not be the most meaningful compression path for this method. A more relevant direction may be to study whether the FRoD factors themselves, especially U and the shared V, can be quantized while preserving the intended weight reconstruction behavior.

To avoid making this PR larger, I would prefer to keep this PR focused on adding the core FRoD method. I can investigate the quantization design separately, and if it works well, open a follow-up PR specifically for FRoD quantization support.

@BenjaminBossan BenjaminBossan left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the new updates, the tests are passing (one flaky CI run can be ignored).

Regarding the forward path, I think I found one possible optimization, please check my comment. I also checked the examples and added some comments regarding those.

About the quantization: Thanks for explaining why it's not trivially supported here. Let's drop it for now. If you could revisit it later, that would be great.

Comment thread src/peft/tuners/frod/layer.py Outdated
Comment thread src/peft/tuners/frod/layer.py Outdated
Comment on lines +15 to +25
MODEL_NAME = os.environ.get("FROD_IMAGE_MODEL_NAME", "openai/clip-vit-base-patch32")
OUTPUT_DIR = os.environ.get("FROD_IMAGE_OUTPUT_DIR", "clip-vit-base-patch32-frod-stanford-cars")
DATA_DIR = os.environ.get("FROD_STANFORD_CARS_DATA_DIR")
NUM_TRAIN_EPOCHS = int(os.environ.get("FROD_IMAGE_NUM_TRAIN_EPOCHS", "3"))
TRAIN_BATCH_SIZE = int(os.environ.get("FROD_IMAGE_TRAIN_BATCH_SIZE", "64"))
EVAL_BATCH_SIZE = int(os.environ.get("FROD_IMAGE_EVAL_BATCH_SIZE", "64"))
SPARSE_RATE = float(os.environ.get("FROD_IMAGE_SPARSE_RATE", "0.01"))
FROD_LAMBDA_L_LR = float(os.environ.get("FROD_IMAGE_LAMBDA_L_LR", "5e-4"))
FROD_LAMBDA_S_LR = float(os.environ.get("FROD_IMAGE_LAMBDA_S_LR", "5e-5"))
CLASSIFIER_LR = float(os.environ.get("FROD_IMAGE_CLASSIFIER_LR", "1e-4"))
CLIP_TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"]

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of setting all of these as environment variables. Let's please parse them as proper command line arguments. You could use HfArgumentParser(TrainingArguments) for the Trainer arguments and parse the rest manually.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced the environment-variable based configuration with HfArgumentParser-backed CLI arguments.

OUTPUT_DIR = "bert-base-uncased-frod-sst2"
FROD_LAMBDA_L_LR = 2e-2
FROD_LAMBDA_S_LR = 2e-3
CLASSIFIER_LR = 1e-2

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar argument, let's add a proper CLI.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied the same CLI structure to the text example as well.

Comment thread src/peft/tuners/frod/model.py Outdated

generator = torch.Generator(device="cpu").manual_seed(config.projection_prng_key)
categories = {category for layer_dict in weights.values() for category in layer_dict}
for category in sorted(categories):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running through this loop can be quite lengthy, without any indication to the user of what's happening. Usually, calling get_peft_model is quite fast, so this could come as a surprise. WDYT about adding some progress indication, e.g. using tqdm for the loop? You could add a progressbar: bool argument to enable/disable it. I would also document that this can be slow.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for your suggestion. We need to perform Hierarchical Joint Decomposition on the matrices that require fine-tuning, which is very time-consuming. However, the good news is that the result of this Hierarchical Joint Decomposition is measured by the model, so fine-tuning a model for different tasks only requires one decomposition. We will use tqdm for the loop and document it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to run this experiment on my machine with 24 GB and got OOM. The other tested PEFT methods usually stay below 12 GB. Do you think this is expected? For the final run, we'll use 48 GB cards, so the experiment may succeed there, but it could still be an indication that there is room for improvement. One suspicion I have is that float32 is used for the computation, not sure if that explains all of it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very serious bug, not just an issue with fp32, but also a problem encountered when using sparse matrices. I will fix this bug as soon as possible.

Bane-Elvin and others added 3 commits June 12, 2026 21:24
Applied the suggestion so eval-mode dropout reuses the shared base output as well.

Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Applied this by deriving `base_out` from `result` and lazily constructing `base_weight` only when dropout is active during training.

Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
@Bane-Elvin

Copy link
Copy Markdown
Contributor Author

@BenjaminBossan After investigating the unexpectedly high memory usage, we found a few implementation issues in the current FRoD code. This OOM was not expected and was caused by a serious implementation bug, not only by fp32 computation.

First, the shared FRoD projection buffers were not actually shared on GPU. Although V, s_indices, and s_size are initialized once per module category, the previous implementation registered them again inside each wrapped FrodLayer. After model.to(device), PyTorch materialized those registered buffers per layer, so the same category-level tensors were duplicated. This was a significant implementation bug because FRoD expects these tensors to be shared across layers in the same category.

Second, we re-tested the sparse forward path in both fp32 and fp16/bf16. In our supported setup, fp16/bf16 was not less reliable, so forcing the sparse path through fp32 was unnecessary and added avoidable memory overhead.

Third, the forward path was still too close to a PEFT delta-style implementation. FRoD semantically reconstructs the adapted weight rather than adding a LoRA-style delta, so the single-adapter/no-dropout active path does not need the target base linear output. Keeping target base weights on GPU in that path added unnecessary memory pressure, although we still need to preserve PEFT behavior for disable/merge/unmerge.

I updated the implementation accordingly. FrodLayer now keeps non-registered references to the tuner-level V, s_indices, and s_size buffers, so those tensors remain shared by category instead of being duplicated per layer.

I also updated the sparse forward path to compute the COO activation update directly in the requested dtype when sparse addmm is unavailable, instead of forcing an unconditional fp32 path. For the base-weight memory issue, I added an opt-in runtime_offload_base_weight config. Since FRoD reconstructs the adapted weight directly, this option can move target base weights to CPU during the single-adapter/no-dropout active path. It defaults to False to preserve PEFT’s usual device invariant and existing test expectations.

Could you please retry method_comparison/image-gen/experiments/frod/flux2-klein-default/adapter_config.json with the latest changes? This should be a much more representative memory test after the sharing and sparse-forward fixes.

@BenjaminBossan BenjaminBossan left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for implementing the feedback and for investigating the cause of the memory issue.

Unfortunately, when I tried to re-run the image generation benchmark, I still got OOM. Most other PEFT methods manage to stay below 12 GB. Can you replicate this issue? Note that for image generation, we have enabled gradient checkpointing, just mentioning it in case it matters. The MetaMath benchmark also OOMs with 24 GB, but that one skirts much closer to that number, so it doesn't necessarily mean something is wrong there.

Comment thread src/peft/tuners/frod/layer.py Outdated
Comment on lines +102 to +104
self.__dict__["frod_V"] = frod_V
self.__dict__["frod_s_indices"] = frod_s_indices
self.__dict__["frod_s_size"] = frod_s_size

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I wonder why we need __dict__ assignment here but not for other PEFT methods like VeRA?

"projection_prng_key": 0,
"regularization_alpha": 0.001,
"revision": null,
"runtime_offload_base_weight": true,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would make sense to add this option to the MetaMath experiments too.

Comment thread src/peft/tuners/frod/layer.py Outdated
Comment on lines +186 to +190
if merge:
self.merge(safe_merge=safe_merge, adapter_names=adapter_names)
self._move_base_weight_to_device_of_adapter(self.active_adapters[0] if self.active_adapters else None)
else:
self._move_base_weight_to_device_of_adapter(self.active_adapters[0] if self.active_adapters else None)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should do the same, right?

Suggested change
if merge:
self.merge(safe_merge=safe_merge, adapter_names=adapter_names)
self._move_base_weight_to_device_of_adapter(self.active_adapters[0] if self.active_adapters else None)
else:
self._move_base_weight_to_device_of_adapter(self.active_adapters[0] if self.active_adapters else None)
if merge:
self.merge(safe_merge=safe_merge, adapter_names=adapter_names)
self._move_base_weight_to_device_of_adapter(self.active_adapters[0] if self.active_adapters else None)

Also, please add a comment why _move_base_weight_to_device_of_adapter needs to be called.

self._move_base_weight_to_device(x.device)
result = self.base_layer(x, *args, **kwargs)
elif self.merged:
self._move_base_weight_to_device(x.device)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are the _move_base_weight_to_device calls needed?

)
},
)
runtime_offload_base_weight: bool = field(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of adding this as an option. However, it wouldn't make sense to have multiple adapters on the same base model where some have runtime_offload_base_weight=True and others have False, right? I think we should enforce the same value for all.

The best way to achieve this is to override _check_new_adapter_config on FrodModel. This check comes from the parent class, so don't forget to call super(). Then, if there are multiple configs with different runtime_offload_base_weight values, an error should be raised. Add a test to tests/test_initialization.py.

@Bane-Elvin

Copy link
Copy Markdown
Contributor Author

Thanks for re-running this and for pointing out the gradient checkpointing detail. I could reproduce the unexpectedly high memory behavior with the previous implementation, and it was not expected.

I found and fixed a few FRoD-specific memory issues. First, the shared FRoD projection buffers (V, s_indices, and s_size) were initialized once per module category, but were then registered again inside each wrapped FrodLayer, so they were duplicated on GPU after model.to(device). Second, the sparse forward path was unnecessarily forcing part of the computation through fp32; after re-testing fp16/bf16 in the supported setup, this was not needed. Third, since FRoD reconstructs the adapted weight directly, the active single-adapter/no-dropout path does not need the target base weight on GPU, so I added runtime_offload_base_weight for this case.

I re-ran the image generation benchmark with the existing benchmark script/settings, without disabling gradient checkpointing. The FRoD config I tested was:

  • model: black-forest-labs/FLUX.2-klein-base-4B
  • dataset: peft-internal-testing/cat-image-dataset
  • target modules: to_v, add_v_proj, to_qkv_mlp_proj, linear_out, to_k, to_add_out, add_q_proj, linear_in, add_k_proj, to_q, to_out.0
  • sparse_rate=0.01
  • frod_dropout=0.0
  • regularization_alpha=0.001
  • projection_prng_key=0
  • save_projection=true
  • runtime_offload_base_weight=true
  • optimizer lr: 1e-3
  • max_steps=20, eval_steps=20, valid_size=1, test_size=1

After the fix, I could not reproduce the training OOM. The benchmark-reported training memory was:

  • mem allocated: 9,128,246,938 bytes, about 8.50 GiB
  • mem reserved: 10,524,137,882 bytes, about 9.80 GiB
  • accelerator memory max: 10042 MB

For comparison, I also ran LoRA r=8, lora_alpha=16 with the same 20-step benchmark setup. It reported accelerator memory max: 9608 MB, so FRoD is now close to LoRA r=8 and below 12 GB under the benchmark training-memory metric.

One caveat: if I watch nvidia-smi during the final test/evaluation generation after Training finished after 20 steps, evaluation on test set follows, I can still see the process go to around 17 GB. That appears to be the final eval/generation phase rather than the benchmark's training memory measurement. If the OOM you saw happens specifically during that final generation/eval phase, I can investigate that separately.

@BenjaminBossan BenjaminBossan left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for investigating further to fix the memory issue. I can confirm that the benchmarks now both run on my machine.

One issue I noticed, however, is that the benchmarks run extremely slowly. Did you observe the same behavior? E.g. for the image benchmark, it takes ~600 sec for 100 steps to train and ~100 sec for the eval step. As a point of comparison, LoRA requires ~70 sec and ~13 sec, respectively.

I checked if it could be related to offloading, but after disabling it, the picture was the same. Then I changed the chunk size, e.g. by increasing it 8x. This might have had a small effect but at the cost of higher memory. If you have time, it would be great if you could investigate once more if there is something obviously wrong in the PEFT integration compared to your original code. If not, it's also okay to leave it as is, saving memory is more important for PEFT than good runtimes.

Comment thread tests/test_frod.py Outdated
assert torch.allclose(z.grad, z_expected.grad, atol=1e-3, rtol=1e-3)

@pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA is not available")
def test_frod_runtime_offload_keeps_base_weight_on_cpu_after_cuda(self, mlp):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this test requires a GPU to run, could you please move it to tests/test_gpu_examples.py?

Comment thread src/peft/tuners/frod/layer.py Outdated
class FrodLayer(BaseTunerLayer):
adapter_layer_names = ("frod_lambda_l", "frod_lambda_s_values")
other_param_names = ("frod_V", "frod_U", "frod_s_indices", "frod_s_size", "runtime_offload_base_weight")
_frod_sparse_chunk_size = 16_384

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to make this a config parameter for users to tune? It should mention the tradeoffs when increasing/decreasing that value.

args=training_args,
train_dataset=tokenized["train"],
eval_dataset=tokenized["validation"],
tokenizer=tokenizer,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
tokenizer=tokenizer,
processing_class=tokenizer,

@Bane-Elvin

Copy link
Copy Markdown
Contributor Author

I investigated this and found that the slowdown came from the custom chunked COO activation path, not from base-weight offloading. The chunked implementation reduced temporary memory, but it used Python-level index_select/index_add_ loops and became very slow for the large image-gen projection categories, especially the 9216-dimensional linear_out category.

I changed the fp16/bf16 sparse activation path to densify only the sparse S factor temporarily and then use dense GEMM. This keeps the trainable COO parameterization and gradients to the sparse values, but avoids the slow chunked Python loop. In my local image benchmark check with bf16, resolution 512, batch size 2, and gradient checkpointing enabled, 20 training steps took 21.2s and peak reserved memory was 10200MB. This should bring the 100-step training time much closer to the LoRA comparison while staying below 12GB.

@Bane-Elvin

Copy link
Copy Markdown
Contributor Author

I investigated the runtime issue and found that the slowdown came from the custom chunked COO activation path, not from base-weight offloading. I replaced that path with a faster implementation that temporarily densifies only the sparse S factor and then uses dense GEMM. This keeps the trainable COO parameterization and gradients to the sparse values, but avoids the Python-level index_select / index_add_ loop.

I pushed the fix in b102e705. I also re-ran the local image-generation benchmark with bf16, resolution 512, batch size 2, and gradient checkpointing enabled. For 100 training steps, I got 106.2s train time and 18.0s eval time, with peak reserved memory around 10.78GB and accelerator max memory 10284MB.

I also re-ran the relevant checks: tests/test_frod.py and the FRoD GPU test passed, and both make style and make quality passed.

@BenjaminBossan BenjaminBossan left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Bane-Elvin Thank you for your further experiments. I can confirm that FRoD now runs fast and in a memory efficient way. The fact that the code is even simplified is a nice bonus. From my side, there is nothing more to add and we can go ahead and merge.

Just one more thing that we try to test out: Would you like to provide a tweet-sized blurb for FRoD? Something we can put into the release notes and that allows the interested user to quickly understand when they should use FRoD. So basically: What it is, what it's especially good at, when not to use it.

@Bane-Elvin

Copy link
Copy Markdown
Contributor Author

@BenjaminBossan Thank you very much for the careful review, the additional benchmarking, and all the valuable suggestions. They helped uncover several important implementation issues and made the final FRoD integration both simpler and much more practical.

Tweet-sized blurb: FRoD is a full-rank, replacement-style PEFT method from FRoD: Full-Rank Efficient Fine-Tuning with Rotational Degrees for Fast Convergence. Instead of adding low-rank deltas, it reconstructs selected weights with shared rotational subspaces and sparse trainable coefficients. It is especially useful when fast convergence and a higher full-rank capacity ceiling are important, and its large sparse rotational subspace may also be promising for model merging. The main tradeoffs are the costly joint-decomposition initialization and slightly slower forward/backward passes than LoRA due to the sparse structured factors, so it may be less attractive for a one-off single-task fine-tune.

@BenjaminBossan BenjaminBossan merged commit 53ce53f into huggingface:main Jun 19, 2026
18 of 19 checks passed
@BenjaminBossan

Copy link
Copy Markdown
Member

Thanks a lot @Bane-Elvin for this great work and your patience with the reviews.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants