fix: suppress SFN for dry-run pipelines and deduplicate JOB_COMPLETED #67
Merged
dwsmith1983 merged 4 commits into main on Mar 13, 2026
…ilure paths

handleRerunRequest and handleJobFailure did not check cfg.DryRun before calling startSFNWithName, so rerun requests and job-failure retries could start real Step Functions executions for observation-only pipelines. Added dry-run guards in both handlers, a defense-in-depth check in startSFNWithName, and a watchdog reconciliation skip to prevent orphaned trigger locks.
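The layered guard described in this commit can be sketched roughly as follows. This is a minimal illustration, not the repository's code: only the names handleRerunRequest, startSFNWithName, and cfg.DryRun come from the commit message. The Config fields, the Starter interface, the ErrDryRun sentinel, and the function signatures are assumptions made so the example runs without AWS.

```go
package main

import (
	"errors"
	"fmt"
)

// Config is a hypothetical stand-in for the pipeline config.
// Only the DryRun flag is taken from the commit message.
type Config struct {
	PipelineID string
	DryRun     bool
}

// ErrDryRun is a hypothetical sentinel for a suppressed start.
var ErrDryRun = errors.New("dry-run pipeline: SFN start suppressed")

// Starter abstracts the Step Functions client (assumed interface).
type Starter interface {
	Start(name string) error
}

// recordingStarter records execution names instead of calling AWS.
type recordingStarter struct{ started []string }

func (r *recordingStarter) Start(name string) error {
	r.started = append(r.started, name)
	return nil
}

// startSFNWithName is the defense-in-depth layer: even if a caller
// forgets its own guard, a dry-run config never reaches the client.
func startSFNWithName(cfg Config, s Starter, name string) error {
	if cfg.DryRun {
		return ErrDryRun
	}
	return s.Start(name)
}

// handleRerunRequest carries the handler-level guard and returns
// early for observation-only pipelines.
func handleRerunRequest(cfg Config, s Starter) error {
	if cfg.DryRun {
		return nil // nothing to rerun for a dry-run pipeline
	}
	return startSFNWithName(cfg, s, cfg.PipelineID+"-rerun")
}

func main() {
	s := &recordingStarter{}
	_ = handleRerunRequest(Config{PipelineID: "p1", DryRun: true}, s)
	_ = handleRerunRequest(Config{PipelineID: "p2"}, s)
	fmt.Println(s.started) // only the non-dry-run pipeline was started
}
```

Guarding in both places is redundant by design: the handler check keeps the common path cheap and explicit, while the check inside the start function covers any call site added later.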
handleCheckJob published JOB_COMPLETED directly when polling detected success, but the stream-router's handleJobSuccess also published it when the JOB# record arrived via DynamoDB stream. This caused duplicate Slack alerts for polled jobs (Glue/EMR) while sync jobs only got one. The stream-router is now the single canonical source for JOB_COMPLETED across all job types (sync and polled).
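A small in-memory model may help show why a single publisher removes the duplicate. Everything here is a sketch under assumptions: the Bus, Table, and JobRecord types are hypothetical, the Table callback merely imitates a DynamoDB stream, and it is assumed that the polling path still writes the JOB# record that the stream-router reacts to. Only the handler names and the JOB_COMPLETED event come from the commit message.

```go
package main

import "fmt"

// Event and Bus are hypothetical stand-ins for the event bus.
type Event struct{ Type, JobID string }

type Bus struct{ events []Event }

func (b *Bus) Publish(e Event) { b.events = append(b.events, e) }

// JobRecord models the JOB# item (fields are assumed).
type JobRecord struct{ JobID, Status string }

// Table imitates a table with a stream: every Put invokes onWrite.
type Table struct{ onWrite func(JobRecord) }

func (t *Table) Put(r JobRecord) {
	if t.onWrite != nil {
		t.onWrite(r)
	}
}

// handleCheckJob is the orchestrator's polling step. After the fix
// it only records the outcome and publishes nothing itself.
func handleCheckJob(t *Table, jobID string, succeeded bool) {
	if succeeded {
		t.Put(JobRecord{JobID: jobID, Status: "success"})
	}
}

// handleJobSuccess is the stream-router side and the only place
// that emits JOB_COMPLETED, for sync and polled jobs alike.
func handleJobSuccess(b *Bus, r JobRecord) {
	if r.Status == "success" {
		b.Publish(Event{Type: "JOB_COMPLETED", JobID: r.JobID})
	}
}

func main() {
	bus := &Bus{}
	table := &Table{}
	table.onWrite = func(r JobRecord) { handleJobSuccess(bus, r) }

	handleCheckJob(table, "glue-123", true)                    // polled job
	table.Put(JobRecord{JobID: "sync-456", Status: "success"}) // sync job

	fmt.Println(len(bus.events)) // one event per job
}
```

With the orchestrator no longer publishing, both job types pass through the same emission point, so each job produces one event regardless of how its success was detected.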
Add DRY_RUN_COMPLETED terminal event after WOULD_TRIGGER + SLA_PROJECTION to close the observation loop for each evaluation period. It carries the SLA verdict (met/breach/n/a) so operators see each period resolve.

Add cfg.DryRun guards to all seven watchdog functions: scheduleSLAAlerts, detectMissedSchedules, detectMissedInclusionSchedules, checkTriggerDeadlines, detectMissingPostRunSensors, detectRelativeSLABreaches, and detectStaleTriggers. Without these, dry-run pipelines received real SLA_WARNING/SLA_BREACH alerts via EventBridge Scheduler.

Harden the triggeredAt parse in the late-data path to warn and return on bad data instead of silently producing garbage durations.
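The triggeredAt hardening might look something like the sketch below. The function name lateDataDelay, its signature, the sampleArrival value, and the RFC 3339 timestamp format are all assumptions for illustration; the commit message only says the parse now warns and returns on bad data. In Go, ignoring a parse error leaves a zero time.Time, and subtracting it from a current timestamp overflows and saturates to the maximum Duration, which is one plausible source of a garbage value.

```go
package main

import (
	"fmt"
	"log"
	"time"
)

// sampleArrival is a fixed arrival time used only by this example.
var sampleArrival = time.Date(2026, 3, 13, 12, 30, 0, 0, time.UTC)

// lateDataDelay reports how long after the trigger the late data
// arrived. The boolean is false when triggeredAt cannot be parsed,
// so the caller can skip the late-data path instead of acting on
// a meaningless duration.
func lateDataDelay(triggeredAt string, arrived time.Time) (time.Duration, bool) {
	t, err := time.Parse(time.RFC3339, triggeredAt) // assumed format
	if err != nil {
		log.Printf("WARN: bad triggeredAt %q, skipping late-data handling: %v", triggeredAt, err)
		return 0, false
	}
	return arrived.Sub(t), true
}

func main() {
	d, ok := lateDataDelay("2026-03-13T12:00:00Z", sampleArrival)
	fmt.Println(d, ok)

	_, ok = lateDataDelay("not-a-timestamp", sampleArrival)
	fmt.Println(ok)
}
```

Returning an explicit ok flag keeps the failure visible at the call site, which a zero-valued duration on its own would not.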
Add v0.8.0 events (SENSOR_DEADLINE_EXPIRED, IRREGULAR_SCHEDULE_MISSED, RELATIVE_SLA_WARNING, RELATIVE_SLA_BREACH) and all DRY_RUN_* events to the alert rule so they route to SQS and reach Slack via alert-dispatcher.
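As a rough in-code analogue of the routing change, the check below accepts the four v0.8.0 event names listed in this commit plus anything with a DRY_RUN_ prefix. It is not the project's EventBridge rule: the routesToAlerts function is hypothetical, the set covers only the events this commit adds (the real rule presumably matches others as well), and treating "DRY_RUN_*" as a prefix match is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// addedEvents holds only the v0.8.0 event names named in the commit.
var addedEvents = map[string]bool{
	"SENSOR_DEADLINE_EXPIRED":   true,
	"IRREGULAR_SCHEDULE_MISSED": true,
	"RELATIVE_SLA_WARNING":      true,
	"RELATIVE_SLA_BREACH":       true,
}

// routesToAlerts reports whether an event type added by this commit
// would be forwarded to the alert queue (hypothetical helper).
func routesToAlerts(detailType string) bool {
	return addedEvents[detailType] || strings.HasPrefix(detailType, "DRY_RUN_")
}

func main() {
	for _, dt := range []string{"DRY_RUN_COMPLETED", "RELATIVE_SLA_BREACH", "UNLISTED_EVENT"} {
		fmt.Println(dt, routesToAlerts(dt))
	}
}
```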
Summary
- `handleRerunRequest` and `handleJobFailure` lacked `cfg.DryRun` checks, allowing rerun requests and job failure retries to start real SFN executions for dry-run pipelines. Added guards in both handlers, defense-in-depth in `startSFNWithName`, and watchdog reconciliation skip to prevent orphaned trigger locks.
- `handleCheckJob` in the orchestrator published `JOB_COMPLETED` when polling detected success, but the stream-router also published it when the JOB# record arrived via DynamoDB stream, causing duplicate Slack alerts for polled jobs (Glue/EMR). Removed the orchestrator emission; stream-router is now the single canonical source.