Unable to do fine tuning of parler-tts-large on float16 #2

@ysant77

Hi There,

Thank you for providing the fine-tuning code and script. I am facing the following issue when fine-tuning parler-tts-large on my custom dataset. Error message:

01/14/2025 02:59:04 - WARNING - parler_tts.modeling_parler_tts - Passing a tuple of past_key_values is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of EncoderDecoderCache instead, e.g. past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values).
01/14/2025 02:59:04 - WARNING - parler_tts.modeling_parler_tts - prompt_attention_mask is specified but attention_mask is not. A full attention_mask will be created. Make sure this is the intended behaviour.
01/14/2025 02:59:04 - WARNING - parler_tts.modeling_parler_tts - use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/intel/JellyBeanAI_sound/./training/run_parler_tts_training.py", line 1307, in <module>
[rank0]: main()
[rank0]: File "/home/intel/JellyBeanAI_sound/./training/run_parler_tts_training.py", line 1106, in main
[rank0]: optimizer.step()
[rank0]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/accelerate/optimizer.py", line 165, in step
[rank0]: self.scaler.step(self.optimizer, closure)
[rank0]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 451, in step
[rank0]: self.unscale_(optimizer)
[rank0]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 338, in unscale_
[rank0]: optimizer_state["found_inf_per_device"] = self.unscale_grads(
[rank0]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 260, in unscale_grads
[rank0]: raise ValueError("Attempting to unscale FP16 gradients.")
[rank0]: ValueError: Attempting to unscale FP16 gradients.
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/intel/JellyBeanAI_sound/./training/run_parler_tts_training.py", line 1307, in <module>
[rank1]: main()
[rank1]: File "/home/intel/JellyBeanAI_sound/./training/run_parler_tts_training.py", line 1106, in main
[rank1]: optimizer.step()
[rank1]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/accelerate/optimizer.py", line 165, in step
[rank1]: self.scaler.step(self.optimizer, closure)
[rank1]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 451, in step
[rank1]: self.unscale_(optimizer)
[rank1]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 338, in unscale_
[rank1]: optimizer_state["found_inf_per_device"] = self.unscale_grads(
[rank1]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 260, in unscale_grads
[rank1]: raise ValueError("Attempting to unscale FP16 gradients.")
[rank1]: ValueError: Attempting to unscale FP16 gradients.
Train steps ... : 0%| | 0/52 [00:18<?, ?it/s]
E0114 02:59:22.646752 1129956 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1130038) of binary: /opt/conda/envs/new_audiomodel_env/bin/python3.9
Traceback (most recent call last):
File "/opt/conda/envs/new_audiomodel_env/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
multi_gpu_launcher(args)
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./training/run_parler_tts_training.py FAILED

Failures:
[1]:
time : 2025-01-14_02:59:22
host : instance-20241203-030408.us-central1-a.c.macro-coil-438702-f5.internal
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1130039)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2025-01-14_02:59:22
host : instance-20241203-030408.us-central1-a.c.macro-coil-438702-f5.internal
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1130038)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

My config details:

GCP with 2 T4 instances
torch dtype: float16

# Define optimizer, LR scheduler, collator
optimizer = torch.optim.AdamW(
    # params=model.parameters(),
    # Changed to try to fix the optimizer fp16 issue, but it is not working:
    params=[p for p in model.parameters() if p.requires_grad],
    lr=training_args.learning_rate,
    betas=(training_args.adam_beta1, training_args.adam_beta2),
    eps=training_args.adam_epsilon,
    weight_decay=training_args.weight_decay,
)

I also tried the change above so that it skips gradient clipping, but it didn't work. Your help would be much appreciated. When I tried bfloat16, I got a CUDA out-of-memory error.
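
For anyone reproducing this: the ValueError in the traceback is raised by torch's GradScaler, which refuses to unscale gradients that are themselves stored in float16. Filtering the optimizer's parameter list on requires_grad does not change the dtype of those parameters, so a quick check along the lines of the sketch below may help confirm whether the trainable weights were loaded in float16. The helper name trainable_param_counts_by_dtype and the toy nn.Linear models are illustrative assumptions, not part of run_parler_tts_training.py.

```python
import torch
from torch import nn


def trainable_param_counts_by_dtype(model: nn.Module) -> dict:
    """Count trainable parameter elements per dtype (hypothetical helper)."""
    counts = {}
    for p in model.parameters():
        if p.requires_grad:
            counts[p.dtype] = counts.get(p.dtype, 0) + p.numel()
    return counts


# Toy stand-ins for the real model (assumption: the real model behaves the
# same way with respect to parameter dtypes).
fp16_model = nn.Linear(4, 2).to(torch.float16)  # weights stored in float16
fp32_model = nn.Linear(4, 2)                    # default float32 weights

# Filtering on requires_grad keeps the same tensors, so the dtype is unchanged.
filtered = [p for p in fp16_model.parameters() if p.requires_grad]
filtered_dtypes = {p.dtype for p in filtered}

print(trainable_param_counts_by_dtype(fp16_model))
print(trainable_param_counts_by_dtype(fp32_model))
print(filtered_dtypes)
```

If the real model reports torch.float16 here, one commonly suggested direction is to keep the weights in float32 and let mixed precision handle the float16 casts in the forward pass, although that is likely to increase memory use on a T4.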
