6 changes: 3 additions & 3 deletions docs/configuration.md
@@ -24,9 +24,9 @@ under the License.
Spark Operator supports different ways to configure the behavior:

* **spark-operator.properties** provided when deploying the operator. In addition to the
[property file](../build-tools/helm/spark-kubernetes-operator/conf/spark-operator.
properties), it is also possible to override or append config properties in helm [Values
files](../build-tools/helm/spark-kubernetes-operator/values.yaml).
[property file](../build-tools/helm/spark-kubernetes-operator/conf/spark-operator.properties),
it is also possible to override or append config properties in helm
[Values files](../build-tools/helm/spark-kubernetes-operator/values.yaml).
* **System Properties** : when provided as system properties (e.g. via -D options to the
  operator JVM), they override the values provided in the property file.
* **Hot property loading** : when enabled, a
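The helm-values override mechanism described above can be sketched as follows (the `operatorConfiguration` key layout is assumed from the chart's default `values.yaml`; treat the exact nesting as illustrative, and `my-values.yaml` as a hypothetical file name):

```yaml
# my-values.yaml: override/append operator properties at install time, e.g.
#   helm install ... -f my-values.yaml
# A -D system property on the operator JVM (e.g.
#   -Dspark.kubernetes.operator.reconciler.intervalSeconds=15)
# would in turn override the value set here.
operatorConfiguration:
  spark-operator.properties: |+
    spark.kubernetes.operator.reconciler.intervalSeconds=30
```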
12 changes: 6 additions & 6 deletions docs/operations.md
@@ -21,9 +21,9 @@ under the License.

## Compatibility

- Java 21, 25 and 26
- Java 21 or newer
- Kubernetes version compatibility:
- k8s version >= 1.34 is recommended. Operator attempts to be API compatible as possible, but
- k8s version >= 1.34 is recommended. Operator attempts to be as API compatible as possible, but
patch support will not be performed on k8s versions that reached EOL.
- Spark versions 3.5 or above.

@@ -122,7 +122,7 @@ following table:
| operatorConfiguration.spark-operator.properties | The default operator configuration. | |
| operatorConfiguration.metrics.properties | The default operator metrics (sink) configuration. | |
| operatorConfiguration.dynamicConfig.create | If set to true, a config map would be created & watched by the operator as the source of truth for hot properties loading. | false |
| operatorConfiguration.dynamicConfig.enable | If set to true, operator would honor the created config mapas source of truth for hot properties loading. | false |
| operatorConfiguration.dynamicConfig.enable | If set to true, operator would honor the created config map as source of truth for hot properties loading. | false |
| operatorConfiguration.dynamicConfig.annotations | Annotations to be applied for the dynamicConfig resources. | `"helm.sh/resource-policy": keep` |
| operatorConfiguration.dynamicConfig.data | Data field (key-value pairs) that acts as hot properties in the config map. | `spark.kubernetes.operator.reconciler.intervalSeconds: "60"` |
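The dynamic-config settings in the table above can be combined in a values file; a minimal sketch (key names are taken from the table, and the annotation and data values mirror the listed defaults):

```yaml
operatorConfiguration:
  dynamicConfig:
    # Create the config map and honor it as the source of truth
    # for hot properties loading
    create: true
    enable: true
    annotations:
      "helm.sh/resource-policy": keep
    data:
      spark.kubernetes.operator.reconciler.intervalSeconds: "60"
```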

@@ -172,9 +172,9 @@ Check installation.

```bash
$ helm list -A
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
us-west-1 us-west-1 1 2025-10-08 22:04:45.530136 -0700 PDT deployed spark-kubernetes-operator-1.3.0 0.5.0
us-west-2 us-west-2 1 2025-10-08 22:04:48.747434 -0700 PDT deployed spark-kubernetes-operator-1.3.0 0.5.0
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
us-west-1 us-west-1 1 2026-05-06 10:00:00.000000 -0700 PDT deployed spark-kubernetes-operator-1.7.0-dev 0.9.0-SNAPSHOT
us-west-2 us-west-2 1 2026-05-06 10:00:03.000000 -0700 PDT deployed spark-kubernetes-operator-1.7.0-dev 0.9.0-SNAPSHOT
```

Launch `pi.yaml` at `us-west-1` and `us-west-2` namespaces.
34 changes: 17 additions & 17 deletions docs/spark_custom_resources.md
@@ -40,20 +40,20 @@ kind: SparkApplication
metadata:
name: pi
spec:
# Entry point for the app
# Entry point for the app
mainClass: "org.apache.spark.examples.SparkPi"
jars: "local:///opt/spark/examples/jars/spark-examples.jar"
sparkConf:
spark.dynamicAllocation.enabled: "true"
spark.dynamicAllocation.shuffleTracking.enabled: "true"
spark.dynamicAllocation.maxExecutors: "3"
spark.kubernetes.authenticate.driver.serviceAccountName: "spark"
spark.kubernetes.container.image: "apache/spark:4.0.0"
spark.kubernetes.container.image: "apache/spark:4.1.1-scala"
applicationTolerations:
resourceRetainPolicy: OnFailure
ttlAfterStopMillis: 10000
runtimeVersions:
scalaVersion: "2.13"
sparkVersion: "4.0.0"
sparkVersion: "4.1.1"
```

After application is submitted, Operator will add status information to your application based on
@@ -204,13 +204,12 @@ are creating / managing SparkApplications with external microservices or workflo
Spark Operator recognizes "infrastructure failure" in a best-effort way. It is possible to
configure different restart policies for general failure(s) vs. potential infrastructure
failure(s). For example, you may configure the app to restart only upon infrastructure
failures. If Spark application fails as a result of `DriverStartTimedOut`,
`ExecutorsStartTimedOut`, `SchedulingFailure`.

It is more likely that the app failed as a result of infrastructure reason(s), including
scenarios like driver or executors cannot be scheduled or cannot initialize in configured
time window for scheduler reasons, as a result of insufficient capacity, cannot get IP
allocated, cannot pull images, or k8s API server issue at scheduling .etc.
failures. If a Spark application fails with `DriverStartTimedOut`, `ExecutorsStartTimedOut`,
or `SchedulingFailure`, it is more likely that the app failed for infrastructure
reasons: the driver or executors could not be scheduled or could not initialize within
the configured time window, e.g. due to insufficient capacity, IP allocation failures,
image pull failures, or k8s API server issues at scheduling time.
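
A minimal sketch of the infra-only restart behavior described above (the `OnInfrastructureFailure` policy value is an assumption here; verify the exact enum against the CRD reference):

```yaml
applicationTolerations:
  restartConfig:
    # Restart only when the failure looks like an infrastructure issue
    # (e.g. DriverStartTimedOut, ExecutorsStartTimedOut, SchedulingFailure)
    restartPolicy: OnInfrastructureFailure
    maxRestartAttempts: 3
```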

Please be advised that this is a best-effort failure identification. You may still need to
debug actual failure from the driver pods. Spark Operator would stage the last observed
@@ -250,11 +249,12 @@ The operator maintains multiple counters to track different types of restarts:
- Consecutive failure tracking: The failure-specific counters track consecutive failures
of the app, distinguishing between persistent failures (requiring intervention) and
transient issues (safe for retry).
- For Example: With `restartPolicy=Always`, `maxRestartAttempts=5` and `maxRestartOnFailure=2`:
- The app would tolerate at maximum of 3 consecutive failures, with maximal of 5 restarts
- In other words, sequence F -> F -> F would stop.
- sequence F -> S -> F -> S -> F would continue with the 5th restart as the succeeded attempts
reset the failure counter
- Example: with `restartPolicy=Always`, `maxRestartAttempts=5`, and `maxRestartOnFailure=2`:
- The app tolerates at most 2 consecutive failures; the 3rd consecutive failure stops it,
within an overall cap of 5 total restarts.
- In other words, the sequence F -> F -> F stops on the 3rd F.
- The sequence F -> S -> F -> S -> F continues, because each successful attempt
resets the consecutive-failure counter.
- Granular control over `SchedulingFailure`: similarly, it is possible to control the maximum
  number of restarts and the backoff interval for consecutive `SchedulingFailure` attempts, as
  these are often associated with API server rejections, exceeded quotas, or resource constraints.
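
The counter semantics in the example above could be expressed as the following sketch (the field names `maxRestartAttempts` and `maxRestartOnFailure` are taken from the prose; verify them against the CRD reference):

```yaml
applicationTolerations:
  restartConfig:
    restartPolicy: Always
    maxRestartAttempts: 5    # overall cap across all restarts
    maxRestartOnFailure: 2   # consecutive-failure cap; reset by a success
```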
@@ -438,7 +438,7 @@ For example, if an app with below configuration:
applicationTolerations:
restartConfig:
restartPolicy: OnFailure
maxRestartAttempts: 1
maxRestartAttempts: 1
resourceRetainPolicy: Always
resourceRetainDurationMillis: 30000
ttlAfterStopMillis: 60000