Multi-Node Training Support

First off, thanks for open-sourcing this amazing work!

I'm looking to scale `run_ce.sh` training across multiple nodes because training runs out of memory on my single 4xA100 node.

Any pointers on how to get implicit PRM training with `run_ce.sh` to work on a multi-node setup?

Are the examples in the [OpenRLHF documentation](https://openrlhf.readthedocs.io/en/latest/multi-node.html#how-to-launch-ray-ppo-on-slurm) immediately applicable?

Any insights appreciated!

Best

Multi-Node Training Support #11
