
Conversation

@javanlacerda
Contributor

@javanlacerda javanlacerda commented Jan 8, 2026

This PR introduces full support for scheduling and managing fuzzing tasks on Kubernetes clusters,
specifically targeting GKE. It implements a new KubernetesService to
handle batch job creation, supports Kata Containers for isolation, and includes robust testing
and configuration mechanisms.

Key Features:

  • Kubernetes Service: A new backend for RemoteTaskInterface that schedules tasks as Kubernetes
    Jobs. It supports both standard and Kata Container runtimes, automatic Service Account
    creation with Workload Identity, and intelligent job limiting to prevent cluster overload.
  • Traffic Shifting (RemoteTaskGate): A new gating mechanism that routes tasks between the
    legacy GCP Batch service and the new Kubernetes service based on configurable
    probabilities, allowing for a gradual, controlled migration.
  • Feature Flags: A new dynamic configuration system backed by Datastore to control runtime
    behaviors like job concurrency limits.
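A minimal sketch of the probability-based routing described in the Traffic Shifting item, not the code from this PR. The function names `pick_backend` and `route_tasks` and the backend labels `"kubernetes"` and `"gcp_batch"` are hypothetical; the actual RemoteTaskGate reads its probabilities from job_frequency.py and may be structured differently.

```python
import random
from typing import Dict, Iterable, List, Optional


def pick_backend(weights: Dict[str, float],
                 rng: Optional[random.Random] = None) -> str:
  """Returns one backend name, chosen in proportion to its weight."""
  chooser = rng or random
  names = list(weights)
  return chooser.choices(names, weights=[weights[n] for n in names], k=1)[0]


def route_tasks(tasks: Iterable[str],
                weights: Dict[str, float],
                rng: Optional[random.Random] = None) -> Dict[str, List[str]]:
  """Groups tasks by the backend selected independently for each task."""
  routed: Dict[str, List[str]] = {name: [] for name in weights}
  for task in tasks:
    routed[pick_backend(weights, rng)].append(task)
  return routed


if __name__ == '__main__':
  # Hypothetical 10% / 90% split, as in the rollout example in this PR.
  split = {'kubernetes': 0.1, 'gcp_batch': 0.9}
  result = route_tasks([f'task-{i}' for i in range(1000)], split,
                       random.Random(0))
  print({name: len(batch) for name, batch in result.items()})
```

Because each task is routed independently, the observed split only approximates the configured probabilities for small batches.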

Detailed Changes by Module:

  • Kubernetes Integration (src/clusterfuzz/_internal/k8s/):

    • service.py: Implemented KubernetesService for job lifecycle management (creation,
      monitoring, limiting). Includes GKE credential loading, Kata Container spec generation,
      and Service Account provisioning.
    • Tests: Added k8s_service_test.py (unit), k8s_service_limit_test.py (limits), and
      k8s_service_e2e_test.py (integration on Kind).
  • Remote Task Management (src/clusterfuzz/_internal/remote_task/):

    • __init__.py: Introduced RemoteTaskGate, a smart router that implements
      RemoteTaskInterface. It initializes both GcpBatchService and KubernetesService and
      distributes tasks between them based on probabilities defined in job_frequency.py. This
      enables traffic splitting (e.g., 10% to K8s, 90% to Batch) for safe rollout.
    • job_frequency.py: Added logic to manage task scheduling frequency and split ratios.
    • Refactored core task logic to use the generic RemoteTask and RemoteTaskInterface
      abstractions.
  • Datastore & Configuration (src/clusterfuzz/_internal/datastore/):

    • data_types.py: Added FeatureFlag model to store configuration dynamically.
    • feature_flags.py: Added FeatureFlags enum/helper for type-safe access to flags (e.g.,
      K8S_PENDING_JOBS_LIMITER).
  • Batch & Legacy Refactoring (src/clusterfuzz/_internal/batch/):

    • service.py: Updated to align with the new RemoteTask interface.
    • Removed obsolete gcp.py and google_cloud_utils/batch.py utilities in favor of the new
      structure.
  • Infrastructure & CI:

    • .github/workflows/kubernetes-e2e-tests.yaml: New workflow for running E2E tests on a Kind
      cluster.
    • Pipfile / src/Pipfile: Added kubernetes client and updated Google Cloud dependencies.
  • Bot & Metrics:

    • src/python/bot/startup/run_bot.py: Updates to support K8s-based bot execution via the new
      gate.
    • src/clusterfuzz/_internal/metrics/: Enhanced logging and monitoring for remote tasks.
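As an illustration of how a feature flag could drive the pending-jobs limit, here is a minimal sketch under stated assumptions. Only the names `FeatureFlags` and `K8S_PENDING_JOBS_LIMITER` come from this PR; `FlagValue`, `InMemoryFlagStore`, and `can_schedule` are hypothetical, the in-memory store stands in for Datastore, and the assumption that the flag's value is the maximum number of pending jobs is not confirmed by the description.

```python
import enum
from dataclasses import dataclass
from typing import Dict, Optional


class FeatureFlags(enum.Enum):
  """Flag names; only this one is mentioned in the PR description."""
  K8S_PENDING_JOBS_LIMITER = 'k8s_pending_jobs_limiter'


@dataclass
class FlagValue:
  """Hypothetical flag record: an on/off switch plus a numeric value."""
  enabled: bool
  value: Optional[float] = None


class InMemoryFlagStore:
  """Stand-in for the Datastore-backed FeatureFlag model (assumption)."""

  def __init__(self) -> None:
    self._flags: Dict[str, FlagValue] = {}

  def set(self, flag: FeatureFlags, enabled: bool,
          value: Optional[float] = None) -> None:
    self._flags[flag.value] = FlagValue(enabled=enabled, value=value)

  def get(self, flag: FeatureFlags) -> Optional[FlagValue]:
    return self._flags.get(flag.value)


def can_schedule(store: InMemoryFlagStore, pending_jobs: int) -> bool:
  """Returns False only when the limiter is enabled and the limit is reached."""
  flag = store.get(FeatureFlags.K8S_PENDING_JOBS_LIMITER)
  if flag is None or not flag.enabled or flag.value is None:
    return True  # No limiter configured: always allow scheduling.
  return pending_jobs < flag.value
```

Treating a missing or disabled flag as "no limit" keeps the scheduler working when the flag has not been created yet; the opposite default would be the safer choice if cluster overload is the bigger risk.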

Evidence:

Image: fuzzing hours for Batch and Kata Containers, showing that the Remote Gate, the Batch service, and the Kubernetes service are working properly. The feature flag is used to set job_frequency, so the chart also shows that the feature flag and its usage work.

@javanlacerda javanlacerda changed the title from "Pr/dependencies" to "Kubernetes Job service" Jan 8, 2026
@javanlacerda javanlacerda force-pushed the pr/dependencies branch 5 times, most recently from 7edfb48 to b271b83 Compare January 10, 2026 22:46
@javanlacerda javanlacerda marked this pull request as ready for review January 10, 2026 22:50
@javanlacerda javanlacerda requested review from ViniciustCosta, decoNR, hunsche and jonathanmetzman and removed request for ViniciustCosta and hunsche January 10, 2026 22:50
Collaborator

@jonathanmetzman jonathanmetzman left a comment

Left some very surface-level comments.

@javanlacerda javanlacerda force-pushed the pr/dependencies branch 4 times, most recently from d5684e5 to 44737d3 Compare January 15, 2026 00:44
This commit introduces the Kubernetes job client and service, providing a mechanism to schedule tasks on Kubernetes clusters (including GKE and Kind), supporting both standard and Kata Containers.

Key Features & Changes:
- **Kubernetes Service**: Implemented `KubernetesService` in `clusterfuzz._internal.k8s.service` to manage job creation.
- **Kata Support**: Added specialized job creation for Kata Containers (`create_kata_container_job`) with required security context (`privileged`, `capabilities: ALL`), networking (`hostNetwork: True`), and environment variables (`HOST_UID`).
- **Dependency Management**: Added `kubernetes` and necessary Google Cloud dependencies (`google-api-python-client`, `google-cloud-storage`, `google-cloud-ndb`, etc.) to `Pipfile`.
- **E2E Testing**:
    - Created `tests.core.k8s.k8s_service_e2e_test` to verify job lifecycle on a local Kind cluster.
    - Updated `local/tests/kubernetes_e2e_test.bash` to provision the test environment.
    - Updated CI workflow (`.github/workflows/kubernetes-e2e-tests.yaml`) to install JDK 21 (required for Datastore emulator).
    - Tests now verify job "Running" status to avoid timeouts with long-running commands.
    - `KubernetesService` skips default credential loading when `K8S_E2E` is set to utilize the test-provided kubeconfig.
- **Unit Tests**: Added comprehensive unit tests in `tests.core.k8s.k8s_service_test` and `tests.core.kubernetes.kubernetes_test`, including mocking of `load_kube_config` and `_load_gke_credentials` to ensure robust testing without external dependencies.
Signed-off-by: Javan Lacerda <[email protected]>
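To make the Kata settings listed in the commit message concrete, here is a minimal sketch of a Job manifest built as a plain dictionary. It is not the template from this PR: the function name `build_kata_job_manifest`, the RuntimeClass name `"kata"`, the default `HOST_UID` value, and `backoffLimit: 0` are assumptions, and `capabilities: ALL` is interpreted here as `capabilities.add: ["ALL"]`.

```python
from typing import Any, Dict, List


def build_kata_job_manifest(name: str,
                            image: str,
                            command: List[str],
                            host_uid: str = '1337',
                            runtime_class: str = 'kata') -> Dict[str, Any]:
  """Builds a batch/v1 Job manifest with the Kata-related settings."""
  container = {
      'name': name,
      'image': image,
      'command': command,
      # HOST_UID is named in the commit message; the value is an assumption.
      'env': [{'name': 'HOST_UID', 'value': host_uid}],
      'securityContext': {
          'privileged': True,
          'capabilities': {'add': ['ALL']},
      },
  }
  return {
      'apiVersion': 'batch/v1',
      'kind': 'Job',
      'metadata': {'name': name},
      'spec': {
          'backoffLimit': 0,  # Assumption: do not retry failed pods.
          'template': {
              'spec': {
                  # RuntimeClass names are cluster-specific (assumption).
                  'runtimeClassName': runtime_class,
                  'hostNetwork': True,
                  'restartPolicy': 'Never',
                  'containers': [container],
              },
          },
      },
  }
```

A dictionary like this can be serialized to YAML for inspection, or, as far as I know, passed as the `body` argument of `BatchV1Api.create_namespaced_job` in the `kubernetes` Python client.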
Comment on lines +18 to +25

pip install pipenv

# Install dependencies.
pipenv --python 3.11
pipenv install

class KubernetesJobClient(RemoteTaskInterface):
"""A remote task execution client for Kubernetes.
This class is a placeholder for a future implementation of a remote task
execution client that uses Kubernetes. It is not yet implemented.
"""
./local/install_deps.bash
Collaborator

This is only intended to be used in CI?

Contributor Author

Yes!


# If we get here the task succeeded in running. Acknowledge the message.
self._pubsub_message.ack()
if not self.do_not_ack:
Collaborator

What is this change?

Contributor Author

It's part of the job limiter for the Kubernetes service. We can probably use this to implement the job limiter for Batch as well, using the new feature they implemented for us. The rationale is that if a task cannot be scheduled on Kubernetes because the job limit has already been reached, the message should not be acked, allowing the other adapter, such as Batch, to process the message.
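A minimal sketch of that rationale, assuming a simplified task wrapper. Only `do_not_ack` and `_pubsub_message.ack()` appear in the diff; `FakeMessage`, `Task`, `finish`, and `schedule_with_limit` are hypothetical names used for illustration.

```python
class FakeMessage:
  """Stand-in for a Pub/Sub message that records whether it was acked."""

  def __init__(self) -> None:
    self.acked = False

  def ack(self) -> None:
    self.acked = True


class Task:
  """Simplified task wrapper holding the message and the do_not_ack flag."""

  def __init__(self, message: FakeMessage) -> None:
    self._pubsub_message = message
    self.do_not_ack = False

  def finish(self) -> None:
    # Acknowledge only if nothing asked us to leave the message in the queue.
    if not self.do_not_ack:
      self._pubsub_message.ack()


def schedule_with_limit(task: Task, current_jobs: int, limit: int) -> bool:
  """Schedules the task unless the limit is reached; returns True if scheduled."""
  scheduled = current_jobs < limit
  if not scheduled:
    # Leave the message unacked so another adapter (e.g. Batch) can take it.
    task.do_not_ack = True
  task.finish()
  return scheduled
```

An unacked message is redelivered after its acknowledgement deadline, which is what lets a different adapter pick it up later.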

@@ -0,0 +1,61 @@
# Copyright 2026 Google LLC
Collaborator

What's the difference between this and the next template?

Collaborator

It might have been a good idea to consider knative instead of rebuilding batch.

Contributor Author

Initially I created separate templates for raw Kubernetes Jobs and for Jobs over Kata, but I updated it to use a single template with conditionals.

Contributor Author

About Knative: it seems good, but I wouldn't like to tackle it in this PR since things are working fine as is. We should definitely explore it, though.

@jonathanmetzman
Collaborator

This is cool. I might have tried Cloud Run before Kata because 1. It is probably less management? 2. It might be more performant because, as far as I know, it doesn't use nested virt.

@jonathanmetzman
Collaborator

Are we using preemptibles btw?

# If we get here the task succeeded in running. Acknowledge the message.
self._pubsub_message.ack()
if not self.do_not_ack:
Contributor

IMO readability would be improved by using ack instead of do_not_ack (go/tott/764).

@javanlacerda javanlacerda force-pushed the pr/dependencies branch 2 times, most recently from 30cc0ae to 506f583 Compare January 20, 2026 21:10
@javanlacerda
Contributor Author

> Are we using preemptibles btw?

We're not, and using them for the clusters is not part of the plan. You can find more details at go/clusterfuzz-to-kubernetes.

4 participants