perf(pts_association): simplify cluster + tune partitioning#186

Merged
DSuveges merged 2 commits into dev from association-perf-tier1 on May 11, 2026

@d0choa d0choa commented Apr 28, 2026

Summary

Parallel solution to opentargets/issues#4375 (#185 is the other in-flight proposal). The companion algorithmic refactor lives in opentargets/pts#114. This PR is just the cluster definition.

Cluster simplification (Tier 1)

Replaces the previous 35-property block with a minimal definition that relies on Dataproc 2.2 defaults:

  • 1 master + 2 primary workers, both n2d-highmem-32
  • otg-etl-25-secondary autoscaling policy (existing)
  • master 512 GB, worker 256 GB disks
  • idle_delete_ttl: 3600, internal_ip_only: false
  • No explicit Spark/YARN/dynamic-allocation tuning

Targeted sizing overrides (Tier 2, second commit)

  • spark.sql.shuffle.partitions=2000
  • spark.sql.adaptive.advisoryPartitionSizeInBytes=64MB

These two properties produce finer-grained tasks. More tasks mean a larger pending YARN backlog, so the autoscaler ramps up more aggressively, and each task has a smaller memory footprint.
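The effect of the two overrides can be sketched roughly as follows. The snippet formats them as Dataproc cluster properties (Dataproc prefixes Spark properties with `spark:`) and estimates how many shuffle partitions remain after AQE coalescing. The estimate is a deliberate simplification that assumes AQE merges small partitions toward the advisory size and never exceeds the initial count. The helper names, the 100 GiB shuffle volume, and the 200-partition baseline (the open-source Spark default; Dataproc's effective default may differ) are illustrative assumptions, not measurements from this PR.

```python
import math

# The two overrides from this PR, as plain Spark property names.
SPARK_OVERRIDES = {
    "spark.sql.shuffle.partitions": "2000",
    "spark.sql.adaptive.advisoryPartitionSizeInBytes": "64MB",
}

MIB = 1024 * 1024
GIB = 1024 * MIB


def to_dataproc_properties(spark_props):
    """Prefix Spark properties with 'spark:' as Dataproc cluster properties expect."""
    return {f"spark:{key}": value for key, value in spark_props.items()}


def estimate_partitions(shuffle_bytes, initial_partitions, advisory_bytes):
    """Rough post-coalescing partition count (hypothetical helper).

    Assumption: AQE merges small partitions toward the advisory size,
    so the result is capped by the initial partition count.
    """
    wanted = math.ceil(shuffle_bytes / advisory_bytes)
    return max(1, min(initial_partitions, wanted))


cluster_properties = to_dataproc_properties(SPARK_OVERRIDES)

# Illustrative shuffle volume, not taken from the benchmark.
hypothetical_shuffle = 100 * GIB

coarse = estimate_partitions(hypothetical_shuffle, 200, 64 * MIB)
fine = estimate_partitions(hypothetical_shuffle, 2000, 64 * MIB)

for label, count in (("200 initial partitions", coarse), ("2000 initial partitions", fine)):
    per_task_mib = hypothetical_shuffle / count / MIB
    print(f"{label}: {count} tasks, ~{per_task_mib:.0f} MiB per task")
```

Under these assumptions the finer setting yields eight times as many tasks, each handling one eighth of the data, which is the mechanism behind both the larger YARN backlog and the lower per-task memory use.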

Manual benchmark on 26.03-ppp.1

| Run | Config | Wall time |
|---|---|---|
| Reference | Legacy pts_association (35 properties) | ~34 min |
| Run-001 | This PR Tier 1 only | 14m48s |
| Run-003 | This PR (Tier 1 + Tier 2 properties) | 11m41s |

The per-stage breakdown (Run-001 → Run-003) shows each indirect aggregation saving ~1 min from the finer partitioning. The direct stages are essentially flat.
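For reference, the relative savings implied by the wall times above can be recomputed with a short script. This is only arithmetic on the reported figures; the reference time is approximate ("~34 min"), so percentages involving it are approximate too. The function names are illustrative.

```python
import re


def parse_wall_time(text):
    """Convert strings such as '14m48s' or '34m' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(?:(\d+)s)?", text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised wall time: {text!r}")
    minutes, seconds = (int(group) if group else 0 for group in match.groups())
    return minutes * 60 + seconds


def reduction_pct(before_s, after_s):
    """Percentage of wall time saved going from before_s to after_s."""
    return 100.0 * (before_s - after_s) / before_s


reference = parse_wall_time("34m")    # approximate legacy run
run_001 = parse_wall_time("14m48s")   # Tier 1 only
run_003 = parse_wall_time("11m41s")   # Tier 1 + Tier 2

print(f"Tier 1 vs reference:     {reduction_pct(reference, run_001):.0f}%")  # 56%
print(f"Tier 2 vs Tier 1:        {reduction_pct(run_001, run_003):.0f}%")    # 21%
print(f"Tier 1+2 vs reference:   {reduction_pct(reference, run_003):.0f}%")  # 66%
```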

Test plan

  • Cluster create succeeds with these properties (verified manually on open-targets-eu-dev).
  • PTS association job completes successfully against 26.03-ppp.1 inputs.
  • Output validates structurally (rows, schema, composite keys) against the 26.03-ppp.1 reference.
  • Companion algorithmic refactor proven deterministic across cluster topologies (Run-002 vs Run-003 = 0 mismatches across 56M rows).
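A determinism check of the kind described in the last item can be sketched in plain Python as below. This is a minimal illustration, not the validation code actually used (which would more plausibly run in Spark at 56M rows). The field names `targetId`, `diseaseId`, and `score` and the sample rows are assumptions for the example, and composite keys are assumed to be unique within a run.

```python
def count_mismatches(run_a, run_b, key_fields):
    """Count composite keys whose rows differ or exist in only one run.

    Assumes each composite key appears at most once per run.
    """
    def index(rows):
        return {tuple(row[field] for field in key_fields): row for row in rows}

    a, b = index(run_a), index(run_b)
    return sum(1 for key in a.keys() | b.keys() if a.get(key) != b.get(key))


KEY = ("targetId", "diseaseId")  # hypothetical composite key

# Hypothetical rows; the two runs list them in a different order,
# as could happen when cluster topologies differ.
run_002 = [
    {"targetId": "T1", "diseaseId": "D1", "score": 0.50},
    {"targetId": "T2", "diseaseId": "D1", "score": 0.25},
]
run_003 = [
    {"targetId": "T2", "diseaseId": "D1", "score": 0.25},
    {"targetId": "T1", "diseaseId": "D1", "score": 0.50},
]

print(count_mismatches(run_002, run_003, KEY))  # 0: row order does not matter
```

Comparing by key rather than by row position is what makes the check independent of output ordering. Values are compared exactly here, so any floating-point drift between runs would be counted as a mismatch.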

Companion PR

opentargets/pts#114 — algorithmic refactor + determinism fix in _aggregate_associations. Should land together with this one.

d0choa added 2 commits May 11, 2026 15:42
…park overrides

Replaces the 35-property cluster definition with a Tier 1 minimal config
relying on Dataproc 2.2 defaults (AQE on, dynamic allocation on, auto
executor sizing). 1 master + 2 n2d-highmem-32 workers + otg-etl-25-secondary.
Companion to PTS PR (algorithmic refactor of compute_novelty).

Add two targeted Spark properties to the pts_association cluster:
- spark.sql.shuffle.partitions=2000
- spark.sql.adaptive.advisoryPartitionSizeInBytes=64MB

Manual benchmark on the 26.03-ppp.1 dataset:
- Tier 1 (defaults): 14m48s
- Tier 2 (these properties): 11m41s (-21%)

Heaviest stages (indirect by_datasource, indirect by_datatype, indirect
overall) saved ~1min each from the finer partitioning, which also
relieves memory pressure observed in the YARN node manager metrics.

DSuveges force-pushed the association-perf-tier1 branch from 76677dc to bc50776 on May 11, 2026 14:43
DSuveges marked this pull request as ready for review on May 11, 2026 14:45
DSuveges merged commit 138373e into dev on May 11, 2026
2 checks passed