perf(pts_association): simplify cluster + tune partitioning by d0choa · Pull Request #186 · opentargets/orchestration

d0choa · 2026-04-28T19:57:22Z

Summary

Parallel solution to opentargets/issues#4375 (#185 is the other in-flight proposal). The companion algorithmic refactor lives in opentargets/pts#114. This PR is just the cluster definition.

Cluster simplification (Tier 1)

Replaces the previous 35-property block with a minimal definition that relies on Dataproc 2.2 defaults:

- 1 master + 2 primary workers, both n2d-highmem-32
- otg-etl-25-secondary autoscaling policy (existing)
- master 512 GB, worker 256 GB disks
- idle_delete_ttl: 3600, internal_ip_only: false
- No explicit Spark/YARN/dynamic-allocation tuning
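
For orientation, the Tier 1 definition listed above could be expressed with the Dataproc `ClusterConfig` field names roughly as follows. This is a sketch under assumptions, not the repository's actual config file: the project and region arguments are hypothetical placeholders, the disk sizes are assumed to be boot disks, and the image version string is inferred from the "Dataproc 2.2" wording.

```python
def tier1_cluster_config(project: str, region: str) -> dict:
    """Build a minimal Dataproc cluster config mirroring the Tier 1 list.

    `project` and `region` are hypothetical inputs; the PR does not state them.
    """
    machine = "n2d-highmem-32"
    policy = "otg-etl-25-secondary"
    return {
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": machine,
            # Assumption: the 512 GB figure refers to the boot disk.
            "disk_config": {"boot_disk_size_gb": 512},
        },
        "worker_config": {
            "num_instances": 2,
            "machine_type_uri": machine,
            # Assumption: the 256 GB figure refers to the boot disk.
            "disk_config": {"boot_disk_size_gb": 256},
        },
        "autoscaling_config": {
            "policy_uri": (
                f"projects/{project}/locations/{region}"
                f"/autoscalingPolicies/{policy}"
            ),
        },
        # idle_delete_ttl is a duration; 3600 is taken to mean seconds.
        "lifecycle_config": {"idle_delete_ttl": {"seconds": 3600}},
        "gce_cluster_config": {"internal_ip_only": False},
        # No Spark/YARN properties at Tier 1: image defaults apply.
        "software_config": {"image_version": "2.2", "properties": {}},
    }
```

The empty `properties` mapping is the point of Tier 1: nothing is tuned, so whatever the image ships with is what runs.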

Targeted sizing overrides (Tier 2, second commit)

- spark.sql.shuffle.partitions=2000
- spark.sql.adaptive.advisoryPartitionSizeInBytes=64MB

These two properties produce finer-grained tasks. More tasks mean a larger YARN backlog, so the autoscaler ramps up more aggressively, and each task has a smaller memory footprint.
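
The effect on task count and per-task size can be approximated with a deliberately simplified model of AQE partition coalescing: the configured shuffle partition count acts as an upper bound, and the advisory size acts as the target size per task. The sketch below ignores skew handling and minimum partition sizes, the 100 GiB shuffle is a hypothetical figure, and 200 is the open-source Spark default for `spark.sql.shuffle.partitions` (the value in effect on the cluster before this change is not stated here).

```python
import math

MIB = 1024 * 1024


def estimate_shuffle_tasks(
    shuffle_bytes: int, shuffle_partitions: int, advisory_bytes: int
) -> int:
    """Rough post-coalescing task count for one shuffle stage.

    Simplified model: AQE merges small partitions toward the advisory size
    but never produces more partitions than were initially configured.
    """
    by_size = math.ceil(shuffle_bytes / advisory_bytes)
    return max(1, min(shuffle_partitions, by_size))


if __name__ == "__main__":
    shuffle = 100 * 1024 * MIB  # hypothetical 100 GiB shuffle
    for partitions in (200, 2000):
        tasks = estimate_shuffle_tasks(shuffle, partitions, 64 * MIB)
        per_task_mib = shuffle // tasks // MIB
        print(f"partitions={partitions} tasks={tasks} per_task={per_task_mib} MiB")
```

Under this model the hypothetical shuffle runs as 200 tasks of about 512 MiB each with 200 partitions, and as 1600 tasks of about 64 MiB each with 2000 partitions: more pending work for the autoscaler to react to, and less data held per task.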

Manual benchmark on 26.03-ppp.1

| Run | Config | Wall time |
| --- | --- | --- |
| Reference | Legacy `pts_association` (35 properties) | ~34 min |
| Run-001 | This PR Tier 1 only | 14m48s |
| Run-003 | This PR (Tier 1 + Tier 2 properties) | 11m41s |
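
The relative gains implied by these wall times can be checked with a few lines of arithmetic. The helper names below are illustrative, and the reference run is taken as exactly 34 minutes even though it is reported only approximately.

```python
import re


def parse_wall_time(text: str) -> int:
    """Convert a duration such as '14m48s' into whole seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(?:(\d+)s)?", text.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return minutes * 60 + seconds


def percent_reduction(before: float, after: float) -> float:
    """Percentage of wall time saved going from `before` to `after`."""
    return (before - after) / before * 100


if __name__ == "__main__":
    reference = 34 * 60  # "~34 min", treated as exact for this estimate
    tier1 = parse_wall_time("14m48s")
    tier2 = parse_wall_time("11m41s")
    print(f"Tier 1 vs reference: -{percent_reduction(reference, tier1):.0f}%")
    print(f"Tier 2 vs Tier 1:    -{percent_reduction(tier1, tier2):.0f}%")
    print(f"Tier 2 vs reference: -{percent_reduction(reference, tier2):.0f}%")
```

Tier 2 removes 187 s from an 888 s run, about 21%, and the combined change cuts the approximate reference time by roughly two thirds.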

The per-stage breakdown (Run-001 → Run-003) shows that each indirect aggregation saves ~1 min from the finer partitioning. Direct stages are essentially flat.

Test plan

- Cluster create succeeds with these properties (verified manually on open-targets-eu-dev).
- PTS association job completes successfully against 26.03-ppp.1 inputs.
- Output validates structurally (rows, schema, composite keys) against the 26.03-ppp.1 reference.
- Companion algorithmic refactor proven deterministic across cluster topologies (Run-002 vs Run-003 = 0 mismatches across 56M rows).
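
The mismatch count in the last item can be understood as a comparison of two outputs keyed on their composite keys. The sketch below shows that idea on small in-memory rows. It is not the check that was actually run (which would need a distributed join at 56M rows), and the column names in the usage example are hypothetical.

```python
from typing import Any, Iterable, Mapping, Sequence

Row = Mapping[str, Any]


def count_mismatches(
    left: Iterable[Row], right: Iterable[Row], key_columns: Sequence[str]
) -> int:
    """Count composite keys that are missing on one side or differ in value."""

    def index(rows: Iterable[Row]) -> dict:
        indexed: dict = {}
        for row in rows:
            key = tuple(row[column] for column in key_columns)
            if key in indexed:
                raise ValueError(f"duplicate composite key: {key}")
            indexed[key] = dict(row)
        return indexed

    left_index, right_index = index(left), index(right)
    all_keys = left_index.keys() | right_index.keys()
    return sum(1 for key in all_keys if left_index.get(key) != right_index.get(key))


if __name__ == "__main__":
    # Hypothetical column names, chosen only for illustration.
    run_a = [
        {"targetId": "T1", "diseaseId": "D1", "score": 0.5},
        {"targetId": "T2", "diseaseId": "D1", "score": 0.1},
    ]
    run_b = list(reversed(run_a))  # same rows, different order
    print(count_mismatches(run_a, run_b, ["targetId", "diseaseId"]))
```

Because rows are indexed by key before comparison, row order does not affect the result, which is what makes the check meaningful across cluster topologies that may emit rows in different orders.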

Companion PR

opentargets/pts#114: algorithmic refactor + determinism fix in _aggregate_associations. It should land together with this PR.

…park overrides

Replaces the 35-property cluster definition with a Tier 1 minimal config relying on Dataproc 2.2 defaults (AQE on, dynamic allocation on, auto executor sizing). 1 master + 2 n2d-highmem-32 workers + otg-etl-25-secondary. Companion to PTS PR (algorithmic refactor of compute_novelty).

Add two targeted Spark properties to the pts_association cluster:
- spark.sql.shuffle.partitions=2000
- spark.sql.adaptive.advisoryPartitionSizeInBytes=64MB

Manual benchmark on the 26.03-ppp.1 dataset:
- Tier 1 (defaults): 14m48s
- Tier 2 (these properties): 11m41s (-21%)

The heaviest stages (indirect by_datasource, indirect by_datatype, indirect overall) each saved ~1 min from the finer partitioning, which also relieves the memory pressure observed in the YARN node manager metrics.

d0choa mentioned this pull request Apr 28, 2026

perf(association): algorithmic refactor + cluster simplification opentargets/pts#114

Merged

5 tasks

d0choa changed the base branch from main to dev April 28, 2026 20:20

d0choa added 2 commits May 11, 2026 15:42

DSuveges force-pushed the association-perf-tier1 branch from 76677dc to bc50776 Compare May 11, 2026 14:43

DSuveges marked this pull request as ready for review May 11, 2026 14:45

DSuveges approved these changes May 11, 2026


DSuveges merged commit 138373e into dev May 11, 2026
2 checks passed

DSuveges merged 2 commits into dev from association-perf-tier1
