Here is the optimizer definition from run_parler_tts_training.py with the change I tried:
optimizer = torch.optim.AdamW(
    # params=model.parameters(),
    params=[p for p in model.parameters() if p.requires_grad],  # changed to fix the optimizer fp16 issue, but it is not working!
    lr=training_args.learning_rate,
    betas=(training_args.adam_beta1, training_args.adam_beta2),
    eps=training_args.adam_epsilon,
    weight_decay=training_args.weight_decay,
)
I also tried the change above so that it skips gradient clipping, but it didn't work. When I tried bfloat16 instead, I got a CUDA out-of-memory error. Your help would be much appreciated.
Hi There,
Thank you for providing the fine-tuning code and the script. I am facing the following issue when fine-tuning parler-tts-large on my custom dataset. Error message:
01/14/2025 02:59:04 - WARNING - parler_tts.modeling_parler_tts - Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
01/14/2025 02:59:04 - WARNING - parler_tts.modeling_parler_tts - `prompt_attention_mask` is specified but `attention_mask` is not. A full `attention_mask` will be created. Make sure this is the intended behaviour.
01/14/2025 02:59:04 - WARNING - parler_tts.modeling_parler_tts - `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
01/14/2025 02:59:04 - WARNING - parler_tts.modeling_parler_tts - Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
01/14/2025 02:59:04 - WARNING - parler_tts.modeling_parler_tts - `prompt_attention_mask` is specified but `attention_mask` is not. A full `attention_mask` will be created. Make sure this is the intended behaviour.
01/14/2025 02:59:04 - WARNING - parler_tts.modeling_parler_tts - `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/intel/JellyBeanAI_sound/./training/run_parler_tts_training.py", line 1307, in <module>
[rank0]: main()
[rank0]: File "/home/intel/JellyBeanAI_sound/./training/run_parler_tts_training.py", line 1106, in main
[rank0]: optimizer.step()
[rank0]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/accelerate/optimizer.py", line 165, in step
[rank0]: self.scaler.step(self.optimizer, closure)
[rank0]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 451, in step
[rank0]: self.unscale_(optimizer)
[rank0]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 338, in unscale_
[rank0]: optimizer_state["found_inf_per_device"] = self.unscale_grads(
[rank0]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 260, in unscale_grads
[rank0]: raise ValueError("Attempting to unscale FP16 gradients.")
[rank0]: ValueError: Attempting to unscale FP16 gradients.
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/intel/JellyBeanAI_sound/./training/run_parler_tts_training.py", line 1307, in <module>
[rank1]: main()
[rank1]: File "/home/intel/JellyBeanAI_sound/./training/run_parler_tts_training.py", line 1106, in main
[rank1]: optimizer.step()
[rank1]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/accelerate/optimizer.py", line 165, in step
[rank1]: self.scaler.step(self.optimizer, closure)
[rank1]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 451, in step
[rank1]: self.unscale_(optimizer)
[rank1]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 338, in unscale_
[rank1]: optimizer_state["found_inf_per_device"] = self.unscale_grads(
[rank1]: File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 260, in unscale_grads
[rank1]: raise ValueError("Attempting to unscale FP16 gradients.")
[rank1]: ValueError: Attempting to unscale FP16 gradients.
Train steps ... : 0%| | 0/52 [00:18<?, ?it/s]
E0114 02:59:22.646752 1129956 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1130038) of binary: /opt/conda/envs/new_audiomodel_env/bin/python3.9
Traceback (most recent call last):
File "/opt/conda/envs/new_audiomodel_env/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
multi_gpu_launcher(args)
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/new_audiomodel_env/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./training/run_parler_tts_training.py FAILED
Failures:
[1]:
time : 2025-01-14_02:59:22
host : instance-20241203-030408.us-central1-a.c.macro-coil-438702-f5.internal
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1130039)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2025-01-14_02:59:22
host : instance-20241203-030408.us-central1-a.c.macro-coil-438702-f5.internal
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1130038)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
My config details:
GCP with 2 T4 instances
torch dtype: float16
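For context on the error: as far as I understand, PyTorch's GradScaler raises "Attempting to unscale FP16 gradients." when a gradient it has to unscale is itself float16, which is what happens when the trainable weights are stored in float16. Below is a minimal sketch for checking that. It assumes the model exposes the usual `named_parameters()`; the helper names `trainable_dtype_counts` and `has_fp16_trainable_params` are hypothetical and not part of the training script.

```python
from collections import Counter


def trainable_dtype_counts(named_params):
    """Count trainable parameters per dtype.

    `named_params` is any iterable of (name, param) pairs whose params expose
    `.dtype` and `.requires_grad`, e.g. `model.named_parameters()`.
    """
    counts = Counter()
    for _name, param in named_params:
        # Frozen parameters receive no gradient, so the scaler never sees them.
        if param.requires_grad:
            counts[str(param.dtype)] += 1
    return dict(counts)


def has_fp16_trainable_params(named_params):
    """Return True if any trainable parameter is stored in float16."""
    return "torch.float16" in trainable_dtype_counts(named_params)
```

In the training script this could be called as `trainable_dtype_counts(model.named_parameters())` just before the optimizer is created. If it reports `torch.float16` entries, filtering the optimizer's parameter list by `requires_grad` would not be expected to change the outcome, since the remaining trainable weights are still float16.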