
Conversation

@andrewsykim
Member

Why are these changes needed?

Cherry-picks for v1.5.1

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

400Ping and others added 25 commits October 29, 2025 18:51
ray-project#4141)

* [Bug] Sidecar mode shouldn't restart head pod when head pod is deleted

Signed-off-by: 400Ping <[email protected]>

* [Fix] Fix e2e error

Signed-off-by: 400Ping <[email protected]>

* [Fix] fix according to rueian's comment

Signed-off-by: 400Ping <[email protected]>

* [Chore] fix ci error

Signed-off-by: 400Ping <[email protected]>

* Update ray-operator/controllers/ray/raycluster_controller.go

Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Signed-off-by: Ping <[email protected]>

* Update ray-operator/controllers/ray/rayjob_controller.go

Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Signed-off-by: Ping <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* Trigger CI

Signed-off-by: Future-Outlier <[email protected]>

---------

Signed-off-by: 400Ping <[email protected]>
Signed-off-by: Ping <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
…i-slice (ray-project#4163)

* [Feature Enhancement] Set ordered replica index label to support multi-slice

Signed-off-by: Ryan O'Leary <[email protected]>

* rename replica-id -> replica-name

Signed-off-by: Ryan O'Leary <[email protected]>

* Separate replica index feature gate logic

Signed-off-by: Ryan O'Leary <[email protected]>

* remove index arg in createWorkerPod

Signed-off-by: Ryan O'Leary <[email protected]>

---------

Signed-off-by: Ryan O'Leary <[email protected]>
…, CMD JSON args) (ray-project#4167)

* [ray-project#4166] improvement: Fix Dockerfile warnings (ENV format, CMD JSON args)

* extract the hostname from CMD

Signed-off-by: Neo Chien <[email protected]>

---------

Signed-off-by: Neo Chien <[email protected]>
Co-authored-by: cchung100m <[email protected]>
ray-project#4158)

* [Fix] Resolve int32 overflow by having the calculation in int64 and cap it if the count is over math.MaxInt32

Signed-off-by: justinyeh1995 <[email protected]>

* [Test] Add unit tests for CalculateReadyReplicas

Signed-off-by: justinyeh1995 <[email protected]>

* [Fix] Add a nosec comment to pass the Lint (pre-commit) test

Signed-off-by: justinyeh1995 <[email protected]>

* [Refactor] Add CapInt64ToInt32 to replace #nosec directives

Signed-off-by: justinyeh1995 <[email protected]>

* [Refactor] Rename function to SafeInt64ToInt32 and add underflow prevention (it also helps pass the lint test)

Signed-off-by: justinyeh1995 <[email protected]>

* [Refactor] Remove the early return as SafeInt64ToInt32 handles the int32 overflow and underflow checking.

Signed-off-by: justinyeh1995 <[email protected]>

---------

Signed-off-by: justinyeh1995 <[email protected]>
…y-project#4195)

* Make replicas configurable for kuberay-operator ray-project#4180

* Make replicas configurable for kuberay-operator ray-project#4180
* feat: check if raycluster status update in rayjob

* test: e2e test to check the rayjob raycluster status update
* Add support for Ray token auth

Signed-off-by: Andrew Sy Kim <[email protected]>

* add e2e test for Ray cluster auth

Signed-off-by: Andrew Sy Kim <[email protected]>

* address nits from Rueian

Signed-off-by: Andrew Sy Kim <[email protected]>

* update RAY_auth_mode -> RAY_AUTH_MODE

Signed-off-by: Andrew Sy Kim <[email protected]>

* configure auth for Ray autoscaler

Signed-off-by: Andrew Sy Kim <[email protected]>

---------

Signed-off-by: Andrew Sy Kim <[email protected]>
Bumps [js-yaml](https://github.com/nodeca/js-yaml) from 4.1.0 to 4.1.1.
- [Changelog](https://github.com/nodeca/js-yaml/blob/master/CHANGELOG.md)
- [Commits](nodeca/js-yaml@4.1.0...4.1.1)

---
updated-dependencies:
- dependency-name: js-yaml
  dependency-version: 4.1.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
ray-project#4201)

* update minimum Ray version required for token authentication to 2.52.0

Signed-off-by: Andrew Sy Kim <[email protected]>

* update RayCluster auth e2e test to use Ray v2.52

Signed-off-by: Andrew Sy Kim <[email protected]>

---------

Signed-off-by: Andrew Sy Kim <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
)

* dashboard client authentication support

Signed-off-by: Future-Outlier <[email protected]>

* support rayjob

Signed-off-by: Future-Outlier <[email protected]>

* update to fix api server err

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* Rayjob sidecar mode auth token mode support

Signed-off-by: Future-Outlier <[email protected]>

* RayJob support k8s job mode

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* Address Andrew's advice

Signed-off-by: Future-Outlier <[email protected]>

* add todo x-ray-authorization comments

Signed-off-by: Future-Outlier <[email protected]>

---------

Signed-off-by: Future-Outlier <[email protected]>
… verbs (ray-project#4202)

* Add authentication secret reconciliation support

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* fix flaky test

Signed-off-by: Future-Outlier <[email protected]>

* remove test fix

Signed-off-by: Rueian <[email protected]>

---------

Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Rueian <[email protected]>
Co-authored-by: Rueian <[email protected]>
justinyeh1995 and others added 9 commits November 19, 2025 20:50
…ay-project#4144)

* [Docs] Add the draft description about feature intro, configurations, and usecases

Signed-off-by: justinyeh1995 <[email protected]>

* [Fix] Update the retry walk-through

Signed-off-by: justinyeh1995 <[email protected]>

* [Doc] rewrite the first 2 sections

Signed-off-by: justinyeh1995 <[email protected]>

* [Doc] Revise documentation wording and add Observing Retry Behavior section

Signed-off-by: justinyeh1995 <[email protected]>

* [Fix] fix linting issue by running pre-commit before committing

Signed-off-by: justinyeh1995 <[email protected]>

* [Fix] fix linting errors in the Markdown linting

Signed-off-by: justinyeh1995 <[email protected]>

* [Fix] Clean up the math equation

Signed-off-by: justinyeh1995 <[email protected]>

* Update the math formula of Backoff calculation.

Co-authored-by: Nary Yeh <[email protected]>
Signed-off-by: JustinYeh <[email protected]>

* [Fix] Explicitly mentioned exponential backoff and removed the customization parts

Signed-off-by: justinyeh1995 <[email protected]>

* [Docs] Clarify naming by replacing “APIServer” with “KubeRay APIServer”

Co-authored-by: Cheng-Yeh Chung <[email protected]>
Signed-off-by: JustinYeh <[email protected]>

* [Docs] Rename retry-configuration.md to retry-behavior.md for accuracy

Signed-off-by: justinyeh1995 <[email protected]>

* Update Title to KubeRay APIServer Retry Behavior

Co-authored-by: Cheng-Yeh Chung <[email protected]>
Signed-off-by: JustinYeh <[email protected]>

* [Docs] Add a note about the limitation of retry configuration

Signed-off-by: justinyeh1995 <[email protected]>

---------

Signed-off-by: justinyeh1995 <[email protected]>
Signed-off-by: JustinYeh <[email protected]>
Co-authored-by: Nary Yeh <[email protected]>
Co-authored-by: Cheng-Yeh Chung <[email protected]>
…via proxy (ray-project#4213)

* Support X-Ray-Authorization fallback header for accepting auth token in dashboard

Signed-off-by: Future-Outlier <[email protected]>

* remove todo comment

Signed-off-by: Future-Outlier <[email protected]>

---------

Signed-off-by: Future-Outlier <[email protected]>
…ect#4196)

* [RayCluster] Status includes head container status message

Signed-off-by: Spencer Peterson <[email protected]>

* lint

Signed-off-by: Spencer Peterson <[email protected]>

* [RayCluster] Containers not ready status reflects structured reason

Signed-off-by: Spencer Peterson <[email protected]>

* nit

Signed-off-by: Spencer Peterson <[email protected]>

---------

Signed-off-by: Spencer Peterson <[email protected]>
…ter (ray-project#4215)

* [RayJob] light weight job submitter auth token support

Signed-off-by: Future-Outlier <[email protected]>

* X-Ray-Authorization

Signed-off-by: Rueian <[email protected]>

---------

Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Rueian <[email protected]>
Co-authored-by: Rueian <[email protected]>
* feat: kubectl ray get token command

Signed-off-by: Rueian <[email protected]>

* Update kubectl-plugin/pkg/cmd/get/get_token_test.go

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Rueian <[email protected]>

* Update kubectl-plugin/pkg/cmd/get/get_token.go

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Rueian <[email protected]>

* make sure the raycluster exists before getting the secret

Signed-off-by: Rueian <[email protected]>

* better ux

Signed-off-by: Rueian <[email protected]>

* Update kubectl-plugin/pkg/cmd/get/get_token.go

Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Signed-off-by: Rueian <[email protected]>

---------

Signed-off-by: Rueian <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Member

@Future-Outlier Future-Outlier left a comment


My plan for testing this branch: all combinations from #4203, with and without using the Kubernetes proxy in KubeRay.

@Future-Outlier
Member

Future-Outlier commented Nov 21, 2025

My test results when using the Kubernetes proxy in KubeRay:

args: "-leader-election-namespace default -use-kubernetes-proxy"
branch: this one
image: rayproject/ray:2.52.0.9527a5-extra-py310-cpu
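
For reference, here is a minimal sketch of how those flags are wired into the operator Deployment. Only the args mirror the test configuration above; the Deployment skeleton and the image tag are assumptions based on a typical KubeRay install, not taken from this PR.

# Hypothetical excerpt of a kuberay-operator Deployment; only the args below
# come from the test configuration above, everything else is an assumed skeleton.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kuberay-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kuberay-operator
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kuberay-operator
    spec:
      containers:
      - name: kuberay-operator
        image: quay.io/kuberay/operator:v1.5.1  # assumed tag for this release branch
        args:
        - -leader-election-namespace
        - default
        - -use-kubernetes-proxy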

HTTP mode, k8s job mode, cluster selector, and sidecar mode
(screenshot)

lightweight job submitter (k8s mode)
(screenshot)

RayService
(screenshots)

My test results when not using the Kubernetes proxy in KubeRay:

(screenshots)

My example manifests:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample-http-mode-v6
spec:
  # submissionMode specifies how RayJob submits the Ray job to the RayCluster.
  # The default value is "K8sJobMode", meaning RayJob will submit the Ray job via a submitter Kubernetes Job.
  # The alternative value is "HTTPMode", indicating that KubeRay will submit the Ray job by sending an HTTP request to the RayCluster.
  submissionMode: "HTTPMode"
  entrypoint: python /home/ray/samples/sample_code.py
  # shutdownAfterJobFinishes specifies whether the RayCluster should be deleted after the RayJob finishes. Default is false.
  # shutdownAfterJobFinishes: false

  # ttlSecondsAfterFinished specifies the number of seconds after which the RayCluster will be deleted after the RayJob finishes.
  # ttlSecondsAfterFinished: 10

  # activeDeadlineSeconds is the duration in seconds that the RayJob may be active before
  # KubeRay actively tries to terminate the RayJob; value must be positive integer.
  # activeDeadlineSeconds: 120

  # RuntimeEnvYAML represents the runtime environment configuration provided as a multi-line YAML string.
  # See https://docs.ray.io/en/latest/ray-core/handling-dependencies.html for details.
  # (New in KubeRay version 1.0.)
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
      - pendulum==2.1.2
    env_vars:
      counter_name: "test_counter"

  # Suspend specifies whether the RayJob controller should create a RayCluster instance.
  # If a job is applied with the suspend field set to true, the RayCluster will not be created and we will wait for the transition to false.
  # If the RayCluster is already created, it will be deleted. In the case of transition to false, a new RayCluster will be created.
  # suspend: false

  # rayClusterSpec specifies the RayCluster instance to be created by the RayJob controller.
  rayClusterSpec:
    rayVersion: "2.52.0" # should match the Ray version in the image of the containers+
    authOptions:
      mode: "token"
    # Ray head pod template
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams: {}
      #pod template
      template:
        spec:
          containers:
          - name: ray-head
            # image: rayproject/ray:nightly-py311-cpu
            image: rayproject/ray:2.52.0.9527a5-extra-py310-cpu
            ports:
            - containerPort: 6379
              name: gcs-server
            - containerPort: 8265 # Ray dashboard
              name: dashboard
            - containerPort: 10001
              name: client
            resources:
              requests:
                cpu: "2"
            volumeMounts:
            - mountPath: /home/ray/samples
              name: code-sample
          volumes:
          # You set volumes at the Pod level, then mount them into containers inside that Pod
          - name: code-sample
            configMap:
              # Provide the name of the ConfigMap you want to mount.
              name: ray-job-code-sample
              # An array of keys from the ConfigMap to create as files
              items:
              - key: sample_code.py
                path: sample_code.py
    workerGroupSpecs:
    # the number of pod replicas in this worker group
    - replicas: 1
      minReplicas: 1
      maxReplicas: 5
      # logical group name; here it is called small-group, but it can also be a functional name
      groupName: small-group
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams: {}
      #pod template
      template:
        spec:
          containers:
          - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
            # image: rayproject/ray:nightly-py311-cpu
            image: rayproject/ray:2.52.0.9527a5-extra-py310-cpu
            resources:
              requests:
                cpu: "2"

  # SubmitterPodTemplate is the template for the pod that will run the `ray job submit` command against the RayCluster.
  # If SubmitterPodTemplate is specified, the first container is assumed to be the submitter container.
  # submitterPodTemplate:
  #   spec:
  #     restartPolicy: Never
  #     containers:
  #     - name: my-custom-rayjob-submitter-pod
  #       image: rayproject/ray:2.46.0
  #       # If Command is not specified, the correct command will be supplied at runtime using the RayJob spec `entrypoint` field.
  #       # Specifying Command is not recommended.
  #       # command: ["sh", "-c", "ray job submit --address=http://$RAY_DASHBOARD_ADDRESS --submission-id=$RAY_JOB_SUBMISSION_ID -- echo hello world"]


######################Ray code sample#################################
# this sample is from https://docs.ray.io/en/latest/cluster/job-submission.html#quick-start-example
# it is mounted into the container and executed to show the Ray job at work
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray
    import os
    import requests

    ray.init()

    @ray.remote
    class Counter:
        def __init__(self):
            # Used to verify runtimeEnv
            self.name = os.getenv("counter_name")
            assert self.name == "test_counter"
            self.counter = 0

        def inc(self):
            self.counter += 1

        def get_counter(self):
            return "{} got {}".format(self.name, self.counter)

    counter = Counter.remote()

    for _ in range(5):
        ray.get(counter.inc.remote())
        print(ray.get(counter.get_counter.remote()))

    # Verify that the correct runtime env was used for the job.
    assert requests.__version__ == "2.26.0"
---
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-use-existing-raycluster-auth-token
spec:
  clusterSelector:
    ray.io/cluster: rayjob-sample-spn4v
  entrypoint: python -c "import ray; ray.init(); print(ray.cluster_resources())"
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
---
# Make sure to increase resource requests and limits before using this example in production.
# For examples with more realistic resource configuration, see
# ray-cluster.complete.large.yaml and
# ray-cluster.autoscaler.large.yaml.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ray-service-auth-token-3
spec:
  # serveConfigV2 takes a yaml multi-line scalar, which should be a Ray Serve multi-application config. See https://docs.ray.io/en/latest/serve/multi-app.html.
  serveConfigV2: |
    applications:
      - name: fruit_app
        import_path: fruit.deployment_graph
        route_prefix: /fruit
        runtime_env:
          working_dir: "https://github.com/ray-project/test_dag/archive/78b4a5da38796123d9f9ffff59bab2792a043e95.zip"
        deployments:
          - name: MangoStand
            num_replicas: 2
            max_replicas_per_node: 1
            user_config:
              price: 3
            ray_actor_options:
              num_cpus: 0.1
          - name: OrangeStand
            num_replicas: 1
            user_config:
              price: 2
            ray_actor_options:
              num_cpus: 0.1
          - name: PearStand
            num_replicas: 1
            user_config:
              price: 1
            ray_actor_options:
              num_cpus: 0.1
          - name: FruitMarket
            num_replicas: 1
            ray_actor_options:
              num_cpus: 0.1
      - name: math_app
        import_path: conditional_dag.serve_dag
        route_prefix: /calc
        runtime_env:
          working_dir: "https://github.com/ray-project/test_dag/archive/78b4a5da38796123d9f9ffff59bab2792a043e95.zip"
        deployments:
          - name: Adder
            num_replicas: 1
            user_config:
              increment: 3
            ray_actor_options:
              num_cpus: 0.1
          - name: Multiplier
            num_replicas: 1
            user_config:
              factor: 5
            ray_actor_options:
              num_cpus: 0.1
          - name: Router
            num_replicas: 1
  rayClusterConfig:
    rayVersion: '2.52.0' # should match the Ray version in the image of the containers
    enableInTreeAutoscaling: true
    authOptions:
      mode: token
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams: {}
      #pod template
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.52.0.9527a5-py312-cpu
            resources:
              requests:
                cpu: 3
                memory: 4Gi      # Increased from 2Gi
    workerGroupSpecs:
    # the number of pod replicas in this worker group
    - replicas: 1
      minReplicas: 1
      maxReplicas: 5
      # logical group name; here it is called small-group, but it can also be a functional name
      groupName: small-group
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams: {}
      #pod template
      template:
        spec:
          containers:
          - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
            image: rayproject/ray:2.52.0.9527a5-py312-cpu
            resources:
              requests:
                cpu: 3
                memory: 6Gi      # Increased from 2Gi
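
As an end-to-end sanity check of token auth, the authenticated dashboard can be called with and without the token. The following is a minimal, hypothetical Python sketch: the port-forward address, the RAY_AUTH_TOKEN environment variable, and the Bearer form of the Authorization header are my assumptions rather than something this PR defines. ray-project#4213 additionally adds an X-Ray-Authorization fallback header for passing the token when requests go through the Kubernetes proxy.

# Hypothetical smoke test for RayCluster token auth.
# Assumes the head dashboard is port-forwarded to localhost:8265 and the token
# has already been exported to RAY_AUTH_TOKEN (e.g. after reading it with the
# kubectl ray get token command added in this branch).
import os

import requests

DASHBOARD = "http://localhost:8265"
TOKEN = os.environ["RAY_AUTH_TOKEN"]  # hypothetical env var holding the token

# Without credentials, the authenticated dashboard should reject the request.
resp = requests.get(f"{DASHBOARD}/api/jobs/")
print("no token:", resp.status_code)     # expected to be a 4xx status

# With the token attached, listing jobs should succeed.
resp = requests.get(
    f"{DASHBOARD}/api/jobs/",
    headers={"Authorization": f"Bearer {TOKEN}"},  # Bearer format is an assumption
)
print("with token:", resp.status_code)   # expected 200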

Collaborator

@rueian rueian left a comment


LGTM

@andrewsykim andrewsykim merged commit f68857e into ray-project:release-1.5 Nov 21, 2025
27 checks passed