102 changes: 102 additions & 0 deletions README.md
@@ -0,0 +1,102 @@
# Examples

This repository contains examples for deploying and running distributed applications with Ray on Anyscale.

## Job Examples

### 1. Hello World Job
**Directory:** `01_job_hello_world/`

A simple "Hello World" example demonstrating how to submit and run basic jobs.

### 2. Image Processing
**Directory:** `image_processing/`

Process large-scale image datasets using Ray Data. This example demonstrates processing the ReLAION-2B dataset with over 2 billion rows.

### 3. Megatron + Ray Fault Tolerant Training
**Directory:** `megatron_ray_fault_tolerant/`

Implements PPO-style distributed training with Megatron and Ray, featuring comprehensive fault-tolerance capabilities (a minimal recovery sketch follows this list):
- Automatic actor recovery from failures
- Backup actor groups for seamless replacement
- Distributed checkpoint saving/loading
- Process group re-initialization after failures
- Support for tensor, pipeline, data, and context parallelism
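
For background, the recovery features above build on Ray's actor fault-tolerance primitives. The sketch below is illustrative only and not taken from this example's code; `TrainWorker` and `train_step` are hypothetical names.

```python
import ray

ray.init()

# Hypothetical training worker, not the example's actual class. `max_restarts`
# lets Ray recreate the actor after a crash, and `max_task_retries` re-runs
# method calls that were in flight when it died.
@ray.remote(max_restarts=3, max_task_retries=2)
class TrainWorker:
    def __init__(self):
        # The real example would restore state from a distributed checkpoint here.
        self.step = 0

    def train_step(self):
        self.step += 1
        return self.step


worker = TrainWorker.remote()
print(ray.get(worker.train_step.remote()))
```

The example layers distributed checkpointing, backup actor groups, and process-group re-initialization on top of this mechanism.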

## Service Examples

### 1. Hello World Service
**Directory:** `02_service_hello_world/`

A simple service deployment example demonstrating the basics of Ray Serve.

### 2. Deploy Llama 3.1 8B
**Directory:** `03_deploy_llama_3_8b/`

Deploy Llama 3.1 8B model using Ray Serve and vLLM with autoscaling capabilities.
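
As a rough illustration of the pattern, not this example's actual code, a Ray Serve deployment can wrap a vLLM engine and let Serve scale replicas via `autoscaling_config`; the model ID and replica bounds below are placeholders.

```python
from ray import serve
from vllm import LLM


@serve.deployment(
    ray_actor_options={"num_gpus": 1},
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},  # placeholder bounds
)
class LlamaDeployment:
    def __init__(self):
        # Placeholder model ID; the gated Llama weights require Hugging Face access.
        self.llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

    async def __call__(self, request):
        prompt = (await request.json())["prompt"]
        output = self.llm.generate([prompt])[0]
        return {"text": output.outputs[0].text}


app = LlamaDeployment.bind()
# serve.run(app)  # run locally; the example deploys via `anyscale service deploy`
```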

### 3. Deploy Llama 3.1 70B
**Directory:** `deploy_llama_3_1_70b/`

Deploy the larger Llama 3.1 70B model with optimized serving configuration.

### 4. Tensor Parallel Serving
**Directory:** `serve_tensor_parallel/`

Demonstrates tensor parallelism for serving large language models across multiple GPUs.
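
To show the core idea in isolation (the example itself serves the model behind Ray Serve), vLLM exposes tensor parallelism through a single parameter; the model ID and GPU count below are placeholders, not values from this example's configuration.

```python
from vllm import LLM, SamplingParams

# Shard the model's weights across 2 GPUs on the same node.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=2)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```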

### 5. FastVideo Generation
**Directory:** `video_generation_with_fastvideo/`

Deploy a video generation service using the FastVideo framework.

## Reinforcement Learning Examples

### SkyRL
**Directory:** `skyrl/`

Reinforcement learning training example using Ray and distributed computing.

## Getting Started

Most examples include their own README with specific instructions. In general, you'll need to:

1. Install the Anyscale CLI:
```bash
pip install -U anyscale
anyscale login
```

2. Navigate to the example directory:
```bash
cd <example_directory>
```

3. Deploy the service or submit the job:
```bash
# For services
anyscale service deploy -f service.yaml

# For jobs
anyscale job submit -f job.yaml
```

## Requirements

- Anyscale account and CLI access
- Appropriate cloud credentials configured
- GPU resources for ML/LLM examples

## Contributing

When adding new examples:
1. Create a descriptive directory name
2. Include a README.md with setup and usage instructions
3. Add appropriate YAML configuration files
4. Update this main README with your example

## License

See individual example directories for specific licensing information.

14 changes: 14 additions & 0 deletions image_processing/Dockerfile
@@ -0,0 +1,14 @@
FROM anyscale/ray:2.51.1-slim-py312-cu128

# C compiler for Triton’s runtime build step (vLLM V1 engine)
# https://github.com/vllm-project/vllm/issues/2997
RUN sudo apt-get update && \
sudo apt-get install -y --no-install-recommends build-essential

RUN curl -LsSf https://astral.sh/uv/install.sh | sh

RUN uv pip install --system huggingface_hub boto3

RUN uv pip install --system vllm==0.11.0

RUN uv pip install --system transformers==4.57.1
9 changes: 9 additions & 0 deletions image_processing/README.md
@@ -0,0 +1,9 @@
# Process images

This example uses Ray Data to process the [ReLAION-2B](https://huggingface.co/datasets/laion/relaion2B-en-research-safe) image dataset, which consists of over 2 billion rows. Each row consists of an image URL along with various metadata, including a caption and image dimensions.
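
The sketch below shows the general Ray Data pattern such a job follows; it is not the contents of `process_images.py`, and the Parquet path and column names (`url`, `width`, `height`) are assumptions about the metadata layout.

```python
import ray

ray.init()

# Hypothetical path to the dataset's metadata Parquet files; the real job
# reads the actual dataset location configured in process_images.py.
ds = ray.data.read_parquet("s3://my-bucket/relaion2b-metadata/")

# Keep rows whose recorded dimensions indicate a reasonably large image.
ds = ds.filter(lambda row: row["width"] >= 256 and row["height"] >= 256)


def add_url_length(batch):
    # Stand-in for the real per-batch work (e.g. downloading and processing images).
    batch["url_length"] = batch["url"].str.len()
    return batch


ds = ds.map_batches(add_url_length, batch_format="pandas")
print(ds.take(3))
```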

## Submit the job

Set `HF_TOKEN` to a Hugging Face access token, then submit the job with the Anyscale CLI:

```bash
anyscale job submit -f job.yaml --env HF_TOKEN=$HF_TOKEN
```
62 changes: 62 additions & 0 deletions image_processing/job.yaml
@@ -0,0 +1,62 @@
# View the docs at https://docs.anyscale.com/reference/job-api#jobconfig.

name: process-images

# When empty, use the default image. This can be an Anyscale-provided base image
# like anyscale/ray:2.43.0-slim-py312-cu125, a user-provided base image (provided
# that it meets certain specs), or you can build new images using the Anyscale
# image builder at https://console.anyscale-staging.com/v2/container-images.
# image_uri: # anyscale/ray:2.43.0-slim-py312-cu125
containerfile: ./Dockerfile

# When empty, Anyscale will auto-select the instance types. You can also specify
# minimum and maximum resources.
compute_config:
# OPTION 1: Auto-selection (current - works on Anyscale-hosted)
# Uses default disk sizes (~100GB). Cannot customize disk with auto-selection.
min_resources:
CPU: 0
GPU: 0
max_resources:
CPU: 520
GPU: 128
auto_select_worker_config: true

# OPTION 2: Explicit config with custom disk (CUSTOMER-HOSTED ONLY)
# Uncomment below and comment out the auto-selection config above to use custom disk.
# NOTE: advanced_instance_config only works on customer-hosted AWS accounts.
# See DISK_SIZE_OPTIONS.md for details.
#
# head_node:
# instance_type: m5.2xlarge
# advanced_instance_config:
# BlockDeviceMappings:
# - DeviceName: /dev/sda1
# Ebs:
# VolumeSize: 500
# VolumeType: gp3
# worker_nodes:
# - instance_type: m5.16xlarge
# min_nodes: 0
# max_nodes: 100
# advanced_instance_config:
# BlockDeviceMappings:
# - DeviceName: /dev/sda1
# Ebs:
# VolumeSize: 500
# VolumeType: gp3

# Path to a local directory or a remote URI to a .zip file (S3, GS, HTTP) that
# will be the working directory for the job. The files in the directory will be
# automatically uploaded to the job environment in Anyscale.
working_dir: .

# When empty, this uses the default Anyscale Cloud in your organization.
cloud:

# The script to run in your job. You can also do "uv run main.py" if you have a
# pyproject.toml file in your working_dir.
entrypoint: python process_images.py

# If there is an error, do not retry.
max_retries: 0