Add per-job hourly log quota enforced on runner #3668

Open
peterschmidt85 wants to merge 2 commits into master from log-quota-per-job-hour

@peterschmidt85
Contributor

Summary

  • Adds a per-job hourly log quota (default 50MB) enforced on the runner side, preventing runaway log costs (e.g., the $194+/day CloudWatch incident from excessive training job logs)
  • Quota is configurable via DSTACK_SERVER_LOG_QUOTA_PER_JOB_HOUR env var (bytes, 0 disables)
  • Jobs exceeding the quota are killed with status error and reason log quota exceeded
  • Byte counting happens post-ANSI-stripping, matching what gets stored in CloudWatch
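
The counting rule in the summary can be illustrated with a small sketch. This is not the runner's Go code: the class name HourlyLogQuota, the QuotaExceeded exception, the injected clock, and the escape-sequence regex are all hypothetical, and the sketch assumes the quota window is a wall-clock hour bucket that resets when the hour changes.

```python
import re

# Hypothetical pattern for CSI escape sequences; the runner's ansistrip
# package may strip a different set of sequences.
ANSI_CSI = re.compile(rb"\x1b\[[0-9;?]*[ -/]*[@-~]")


class QuotaExceeded(Exception):
    """Raised when a job writes more than its hourly log quota."""


class HourlyLogQuota:
    def __init__(self, limit_bytes, clock):
        self.limit_bytes = limit_bytes  # 0 disables the quota
        self.clock = clock              # callable returning seconds since epoch
        self.bucket = None              # index of the hour currently counted
        self.used = 0                   # bytes counted in the current hour

    def add(self, chunk: bytes) -> bytes:
        # Count bytes after stripping escape sequences, so the total
        # reflects what would actually be stored.
        stripped = ANSI_CSI.sub(b"", chunk)
        hour = int(self.clock() // 3600)
        if hour != self.bucket:
            # A new hour started: reset the counter.
            self.bucket = hour
            self.used = 0
        self.used += len(stripped)
        if self.limit_bytes > 0 and self.used > self.limit_bytes:
            raise QuotaExceeded("log quota exceeded")
        return stripped
```

Injecting the clock keeps the sketch testable without waiting for a real hour boundary.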

Changes

Go (runner)

  • types.go: Add TerminationReasonLogQuotaExceeded constant
  • schemas.go: Add LogQuotaHour field to JobSpec
  • logs.go: Add quota tracking to appendWriter with out-of-band signaling via channel (needed because ansistrip is async and swallows downstream errors)
  • executor.go: Add copyOutputWithQuota() method, wire quota via SetJob(), handle quota error in Run()
  • executor_test.go: Add TestExecutor_LogQuota
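
The reason for out-of-band signaling in logs.go can be sketched in Python, purely as an illustration of the pattern. QuotaWriter and AsyncStage are hypothetical stand-ins (the real code is Go and uses a channel, where this sketch uses threading.Event): when an asynchronous stage sits between the producer and the quota-aware writer and swallows downstream errors, the error never reaches the caller, so the writer has to raise a separate flag.

```python
import queue
import threading


class QuotaWriter:
    """Counts bytes and signals out of band when the limit is crossed."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.written = 0
        self.exceeded = threading.Event()  # plays the role of the Go channel

    def write(self, data: bytes):
        self.written += len(data)
        if self.limit_bytes > 0 and self.written > self.limit_bytes:
            self.exceeded.set()  # out-of-band signal
            raise IOError("log quota exceeded")  # in-band error, may be lost


class AsyncStage:
    """Stand-in for an async filter that swallows downstream errors."""

    def __init__(self, downstream):
        self.downstream = downstream
        self.chunks = queue.Queue()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def write(self, data: bytes):
        self.chunks.put(data)  # always succeeds from the caller's view

    def close(self):
        self.chunks.put(None)  # sentinel: stop the worker
        self.worker.join()

    def _run(self):
        while True:
            item = self.chunks.get()
            if item is None:
                return
            try:
                self.downstream.write(item)
            except IOError:
                pass  # the error is swallowed here, hence the extra signal
```

In this sketch the caller would watch writer.exceeded rather than rely on a write error to decide when to stop the job.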

Python (server)

  • runs.py: Add LOG_QUOTA_EXCEEDED to JobTerminationReason (maps to FAILED), add log_quota_hour to JobSpec
  • settings.py: Add SERVER_LOG_QUOTA_PER_JOB_HOUR setting (default 50MB)
  • runner.py: Add log_quota_hour to SubmitBody include set
  • base.py: Add _log_quota_hour() method, wire into _get_job_spec()
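
A minimal sketch of the server-side pieces, under stated assumptions: the helper read_log_quota, the two enums, and the to_status method are hypothetical names, and the default is assumed to be 50 * 1024 * 1024 bytes (the PR only says "50MB"). Only the environment variable name, the LOG_QUOTA_EXCEEDED member, and its mapping to FAILED come from the description above.

```python
from enum import Enum

ENV_VAR = "DSTACK_SERVER_LOG_QUOTA_PER_JOB_HOUR"
# Assumption: "50MB" means 50 MiB; the actual default constant may differ.
DEFAULT_LOG_QUOTA = 50 * 1024 * 1024


def read_log_quota(environ) -> int:
    """Return the per-job hourly quota in bytes; 0 disables the quota."""
    raw = environ.get(ENV_VAR)
    if raw is None:
        return DEFAULT_LOG_QUOTA
    value = int(raw)
    if value < 0:
        raise ValueError(f"{ENV_VAR} must be >= 0, got {value}")
    return value


class JobStatus(Enum):
    # Hypothetical, reduced to the one status this sketch needs.
    FAILED = "failed"


class JobTerminationReason(Enum):
    # Hypothetical, reduced to the member added in this PR.
    LOG_QUOTA_EXCEEDED = "log_quota_exceeded"

    def to_status(self) -> JobStatus:
        # The PR states that LOG_QUOTA_EXCEEDED maps to FAILED.
        mapping = {JobTerminationReason.LOG_QUOTA_EXCEEDED: JobStatus.FAILED}
        return mapping[self]
```

For example, read_log_quota(os.environ) would be called once at settings load time, and the resulting value passed into the job spec as log_quota_hour.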

Test plan

  • TestExecutor_LogQuota passes
  • All Go tests pass (go test ./...)
  • All Python tests pass (2222 passed)
  • E2E with remote backend: 1000-byte quota — job terminates immediately with log quota exceeded
  • E2E with remote backend: 50MB quota — job runs ~5 min, terminates at ~52MB with log quota exceeded
  • dstack ps -v shows status error and error log quota exceeded
  • dstack logs shows partial logs captured before termination

🤖 Generated with Claude Code

Prevents runaway log costs by limiting log output per job per calendar hour.
Default quota is 50MB/hour, configurable via DSTACK_SERVER_LOG_QUOTA_PER_JOB_HOUR
(0 disables). Jobs exceeding the quota are terminated with reason log_quota_exceeded.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@peterschmidt85 peterschmidt85 requested a review from un-def March 16, 2026 11:41