102 changes: 102 additions & 0 deletions README.md
@@ -0,0 +1,102 @@
# Examples

This repository contains examples for deploying and running distributed applications with Ray on Anyscale.

## Job Examples

### 1. Hello World Job
**Directory:** `01_job_hello_world/`

A simple "Hello World" example demonstrating how to submit and run basic jobs.

### 2. Image Processing
**Directory:** `image_processing/`

Process large-scale image datasets using Ray Data. This example demonstrates processing the ReLAION-2B dataset with over 2 billion rows.

### 3. Megatron + Ray Fault Tolerant Training
**Directory:** `megatron_ray_fault_tolerant/`

Implements PPO-style distributed training with Megatron and Ray, featuring comprehensive fault-tolerance capabilities (a minimal recovery sketch follows this list):
- Automatic actor recovery from failures
- Backup actor groups for seamless replacement
- Distributed checkpoint saving/loading
- Process group re-initialization after failures
- Support for tensor, pipeline, data, and context parallelism
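
For background, the recovery features above build on Ray's actor fault-tolerance primitives. The sketch below is illustrative only and not taken from this example's code; `TrainWorker` and `train_step` are hypothetical names.

```python
import ray

ray.init()

# Hypothetical training worker, not the example's actual class. `max_restarts`
# lets Ray recreate the actor after a crash, and `max_task_retries` re-runs
# method calls that were in flight when it died.
@ray.remote(max_restarts=3, max_task_retries=2)
class TrainWorker:
    def __init__(self):
        # The real example would restore state from a distributed checkpoint here.
        self.step = 0

    def train_step(self):
        self.step += 1
        return self.step


worker = TrainWorker.remote()
print(ray.get(worker.train_step.remote()))
```

The example layers distributed checkpointing, backup actor groups, and process-group re-initialization on top of this mechanism.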

## Service Examples

### 1. Hello World Service
**Directory:** `02_service_hello_world/`

A simple service deployment example demonstrating the basics of Ray Serve.

### 2. Deploy Llama 3.1 8B
**Directory:** `03_deploy_llama_3_8b/`

Deploy Llama 3.1 8B model using Ray Serve and vLLM with autoscaling capabilities.
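
As a rough illustration of the pattern, not this example's actual code, a Ray Serve deployment can wrap a vLLM engine and let Serve scale replicas via `autoscaling_config`; the model ID and replica bounds below are placeholders.

```python
from ray import serve
from vllm import LLM


@serve.deployment(
    ray_actor_options={"num_gpus": 1},
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},  # placeholder bounds
)
class LlamaDeployment:
    def __init__(self):
        # Placeholder model ID; the gated Llama weights require Hugging Face access.
        self.llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

    async def __call__(self, request):
        prompt = (await request.json())["prompt"]
        output = self.llm.generate([prompt])[0]
        return {"text": output.outputs[0].text}


app = LlamaDeployment.bind()
# serve.run(app)  # run locally; the example deploys via `anyscale service deploy`
```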

### 3. Deploy Llama 3.1 70B
**Directory:** `deploy_llama_3_1_70b/`

Deploy the larger Llama 3.1 70B model with optimized serving configuration.

### 4. Tensor Parallel Serving
**Directory:** `serve_tensor_parallel/`

Demonstrates tensor parallelism for serving large language models across multiple GPUs.
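
To show the core idea in isolation (the example itself serves the model behind Ray Serve), vLLM exposes tensor parallelism through a single parameter; the model ID and GPU count below are placeholders, not values from this example's configuration.

```python
from vllm import LLM, SamplingParams

# Shard the model's weights across 2 GPUs on the same node.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=2)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```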

### 5. FastVideo Generation
**Directory:** `video_generation_with_fastvideo/`

Deploy a video generation service using the FastVideo framework.

## Reinforcement Learning Examples

### SkyRL
**Directory:** `skyrl/`

Reinforcement learning training example using Ray and distributed computing.

## Getting Started

Most examples include their own README with specific instructions. In general, you'll need to:

1. Install the Anyscale CLI:
```bash
pip install -U anyscale
anyscale login
```

2. Navigate to the example directory:
```bash
cd <example_directory>
```

3. Deploy the service or submit the job:
```bash
# For services
anyscale service deploy -f service.yaml

# For jobs
anyscale job submit -f job.yaml
```

## Requirements

- Anyscale account and CLI access
- Appropriate cloud credentials configured
- GPU resources for ML/LLM examples

## Contributing

When adding new examples:
1. Create a descriptive directory name
2. Include a README.md with setup and usage instructions
3. Add appropriate YAML configuration files
4. Update this main README with your example

## License

See individual example directories for specific licensing information.

14 changes: 14 additions & 0 deletions image_processing/Dockerfile
@@ -0,0 +1,14 @@
FROM anyscale/ray:2.51.1-slim-py312-cu128

# C compiler for Triton’s runtime build step (vLLM V1 engine)
# https://github.com/vllm-project/vllm/issues/2997
RUN sudo apt-get update && \
sudo apt-get install -y --no-install-recommends build-essential

RUN curl -LsSf https://astral.sh/uv/install.sh | sh

RUN uv pip install --system huggingface_hub boto3

RUN uv pip install --system vllm==0.11.0

RUN uv pip install --system transformers==4.57.1
9 changes: 9 additions & 0 deletions image_processing/README.md
@@ -0,0 +1,9 @@
# Process images

This example uses Ray Data to process the [ReLAION-2B](https://huggingface.co/datasets/laion/relaion2B-en-research-safe) image dataset, which consists of over 2 billion rows. Each row consists of an image URL along with various metadata, including a caption and image dimensions.
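
The sketch below shows the general Ray Data pattern such a job follows; it is not the contents of `process_images.py`, and the Parquet path and column names (`url`, `width`, `height`) are assumptions about the metadata layout.

```python
import ray

ray.init()

# Hypothetical path to the dataset's metadata Parquet files; the real job
# reads the actual dataset location configured in process_images.py.
ds = ray.data.read_parquet("s3://my-bucket/relaion2b-metadata/")

# Keep rows whose recorded dimensions indicate a reasonably large image.
ds = ds.filter(lambda row: row["width"] >= 256 and row["height"] >= 256)


def add_url_length(batch):
    # Stand-in for the real per-batch work (e.g. downloading and processing images).
    batch["url_length"] = batch["url"].str.len()
    return batch


ds = ds.map_batches(add_url_length, batch_format="pandas")
print(ds.take(3))
```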

## Submit the job

Set `HF_TOKEN` to a Hugging Face access token, then submit the job with the Anyscale CLI:

```bash
anyscale job submit -f job.yaml --env HF_TOKEN=$HF_TOKEN
```
62 changes: 62 additions & 0 deletions image_processing/job.yaml
@@ -0,0 +1,62 @@
# View the docs at https://docs.anyscale.com/reference/job-api#jobconfig.

name: process-images

# When empty, use the default image. This can be an Anyscale-provided base image
# like anyscale/ray:2.43.0-slim-py312-cu125, a user-provided base image (provided
# that it meets certain specs), or you can build new images using the Anyscale
# image builder at https://console.anyscale-staging.com/v2/container-images.
# image_uri: # anyscale/ray:2.43.0-slim-py312-cu125
containerfile: ./Dockerfile

# When empty, Anyscale will auto-select the instance types. You can also specify
# minimum and maximum resources.
compute_config:
# OPTION 1: Auto-selection (current - works on Anyscale-hosted)
# Uses default disk sizes (~100GB). Cannot customize disk with auto-selection.
min_resources:
CPU: 0
GPU: 0
max_resources:
CPU: 520
GPU: 128
auto_select_worker_config: true

# OPTION 2: Explicit config with custom disk (CUSTOMER-HOSTED ONLY)
# Uncomment below and comment out the auto-selection config above to use custom disk.
# NOTE: advanced_instance_config only works on customer-hosted AWS accounts.
# See DISK_SIZE_OPTIONS.md for details.
#
# head_node:
# instance_type: m5.2xlarge
# advanced_instance_config:
# BlockDeviceMappings:
# - DeviceName: /dev/sda1
# Ebs:
# VolumeSize: 500
# VolumeType: gp3
# worker_nodes:
# - instance_type: m5.16xlarge
# min_nodes: 0
# max_nodes: 100
# advanced_instance_config:
# BlockDeviceMappings:
# - DeviceName: /dev/sda1
# Ebs:
# VolumeSize: 500
# VolumeType: gp3

# Path to a local directory or a remote URI to a .zip file (S3, GS, HTTP) that
# will be the working directory for the job. The files in the directory will be
# automatically uploaded to the job environment in Anyscale.
working_dir: .

# When empty, this uses the default Anyscale Cloud in your organization.
cloud:

# The script to run in your job. You can also do "uv run main.py" if you have a
# pyproject.toml file in your working_dir.
entrypoint: python process_images.py

# If there is an error, do not retry.
max_retries: 0