[2186](WIP) scaling plots branch #2231

Draft
florianscheidl wants to merge 45 commits into ecmwf:develop from
florianscheidl:ekfs/scaling-plots-20260417


@florianscheidl florianscheidl commented Apr 17, 2026

NO REVIEW REQUESTED

Description

Branch for April 2026 scaling plots (no merge).

Issue Number

#2186

Is this PR a draft? Mark it as draft.

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a hedgedoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

- Track startup_time_seconds: time from run() start to training loop
- Track total_training_time_seconds: time in training/validation cycles
- Track overall_time_seconds: total wall-clock time from launch to finish
- All metrics logged only on root rank to avoid file contention
- Metrics written to metrics.json, automatically uploaded to MLflow
- Console logs show timing summaries for quick monitoring
- Created .hermes/ directory with skills/, tasks/, docs/ subfolders
- Added skills overview (README.md) with task-type skills
- Implemented 'planning' and 'metrics' skills
- Documented timing metrics task in tasks/2026-04-17-timing-metrics/
- Added agent structure documentation
- Updated .gitignore with optional .hermes/ entry
- Added 2-3 month review cycle recommendation
- Defined criteria for skill consolidation
- Included usage frequency thresholds
- Documented when to merge or remove skills
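
For reference, the timing metrics described in the notes above (startup_time_seconds, total_training_time_seconds, overall_time_seconds, written on the root rank only) could be sketched roughly as follows. This is an illustrative sketch, not the code in this branch: the `RunTimer` class, the `write_metrics` helper, the `rank` argument, and the one-JSON-object-per-line layout of metrics.json are all assumptions made for the example.

```python
import json
import time
from pathlib import Path


class RunTimer:
    """Hypothetical helper collecting the three timings named in the notes."""

    def __init__(self, launch_time=None, clock=time.monotonic):
        self._clock = clock
        # Clock reading at the start of run().
        self._run_start = clock()
        # Assumption: if no launch timestamp is passed in, launch == run() start.
        self._launch = self._run_start if launch_time is None else launch_time
        self._loop_start = None
        self._training_total = 0.0

    def mark_training_loop_start(self):
        """Call once, right before the first training/validation cycle."""
        self._loop_start = self._clock()

    def add_cycle(self, seconds):
        """Accumulate the duration of one training/validation cycle."""
        self._training_total += seconds

    def summary(self):
        now = self._clock()
        # If the loop never started, count everything so far as startup.
        loop_start = now if self._loop_start is None else self._loop_start
        return {
            "startup_time_seconds": loop_start - self._run_start,
            "total_training_time_seconds": self._training_total,
            "overall_time_seconds": now - self._launch,
        }


def write_metrics(metrics, path, rank):
    """Append metrics as one JSON line, on the root rank only.

    Returns True if the file was written. Restricting writes to rank 0
    avoids several processes contending for the same file.
    """
    if rank != 0:
        return False
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(metrics) + "\n")
    return True
```

A caller would create the timer at the top of run(), call `mark_training_loop_start()` before the first epoch, add each cycle's duration with `add_cycle(...)`, and pass `summary()` to `write_metrics(...)` at the end. The `clock` parameter is injectable so the arithmetic can be checked without real waiting.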
