Skip to content

Add winstress_vm, linstress_vm workloads and WORKLOAD_CONFIG override#1252

Open
ebattat wants to merge 3 commits into
mainfrom
winstress-linstress
Open

Add winstress_vm, linstress_vm workloads and WORKLOAD_CONFIG override#1252
ebattat wants to merge 3 commits into
mainfrom
winstress-linstress

Conversation

@ebattat

@ebattat ebattat commented Jul 2, 2026

Copy link
Copy Markdown
Member

Summary

  • Add winstress_vm workload: Windows CPU+memory stress test with Python multiprocessing, supports scale via BootstormVM, configurable stress_cpu/stress_memory/stress_duration
  • Add linstress_vm workload: Linux Fedora equivalent using cloud-init embedded stress script, supports scale via BootstormVM, same metrics for direct Windows/Linux comparison
  • Add WORKLOAD_CONFIG env var to override template defaults (requests_cpu, requests_memory, sockets, cores, threads) for any workload at runtime
  • Add pin_node0 to pin logic for node selector support

Test plan

  • make golden_files — golden files regenerated
  • make test_golden_files — golden file tests pass
  • PYTHONPATH=. python3 -m pytest -v tests/unittest/ — all 26 unit tests pass
  • Manual test: WORKLOAD=winstress_vm with scale on Windows VMs
  • Manual test: WORKLOAD=linstress_vm with scale on Fedora VMs
  • Verify WORKLOAD_CONFIG override works for different workloads

🤖 Assisted-by: Claude Code

Summary by CodeRabbit

  • New Features
    • Added VM-based stress workloads for Linux (linstress) and Windows (winstress), including new workload templates and end-to-end runners.
    • Introduced an optional workload configuration override via an environment variable to customize rendered templates.
  • Bug Fixes
    • Improved node pinning behavior so pinning activates when either node pin setting is provided.
    • Updated workload-to-namespace mapping for the new Linux/Windows VM runs.
  • Tests
    • Added/updated golden manifests and unit test fixtures for the new templates and scenarios.

- Add winstress_vm: Windows CPU+memory stress test with Python multiprocessing,
  supports scale via BootstormVM, configurable stress_cpu/stress_memory/stress_duration
- Add linstress_vm: Linux Fedora equivalent using cloud-init embedded stress script,
  supports scale via BootstormVM, same metrics for direct Win/Lin comparison
- Add WORKLOAD_CONFIG env var to override template defaults (requests_cpu,
  requests_memory, sockets, cores, threads) for any workload at runtime
- Add pin_node0 to pin logic for node selector support

Assisted-by: Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-ci openshift-ci Bot requested a review from RobertKrawitz July 2, 2026 12:10
@openshift-ci

openshift-ci Bot commented Jul 2, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ebattat

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved label Jul 2, 2026
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

The PR adds linstress and winstress VM workload support across environment wiring, template rendering, workload templates, workload runners, and updated golden manifests for the generated Kubernetes and KubeVirt resources.

Changes

VM workload support

Layer / File(s) Summary
Rendering and environment wiring
benchmark_runner/main/environment_variables.py, benchmark_runner/common/template_operations/template_operations.py
Template rendering accepts WORKLOAD_CONFIG overrides, and the new VM workloads, namespaces, and pinning behavior are registered in environment variables.
Linstress templates
benchmark_runner/common/template_operations/templates/linstress/...
The linstress template data and VM template define the cloud-init stress workload, VM topology, and secret-backed boot wiring.
Winstress templates
benchmark_runner/common/template_operations/templates/winstress/...
The winstress template data, PowerShell stress script, DataVolume, and VM template define the Windows stress workload and storage wiring.
Linstress runner
benchmark_runner/workloads/linstress_vm.py
The linstress runner creates VMs, collects stress reports, normalizes results, uploads Elasticsearch data, and handles cleanup.
Winstress runner
benchmark_runner/workloads/winstress_vm.py
The winstress runner copies the stress script, prepares the VM, collects results, uploads Elasticsearch data, and orchestrates cleanup.
Linstress and winstress golden manifests
tests/unittest/benchmark_runner/common/template_operations/golden_files/...
The golden fixtures add the namespace, DataVolume, Secret, and VM manifests for the linstress and winstress scenarios across the CI, perf_ci, chaos_ci, func_ci, test_ci, and release variants.

Estimated code review effort: 5 (Critical) | ~120 minutes

Possibly related PRs

Suggested labels: enhancement

Suggested reviewers: RobertKrawitz, arpsharm

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: two new workloads plus the WORKLOAD_CONFIG override.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch winstress-linstress

Comment @coderabbitai help to get the list of available commands.

@ebattat ebattat self-assigned this Jul 2, 2026
@ebattat ebattat requested a review from rishubhjain July 2, 2026 12:12
ebattat and others added 2 commits July 2, 2026 15:38
Assisted-by: Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…om linstress

Assisted-by: Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/unittest/benchmark_runner/common/template_operations/golden_files/perf_ci_linstress_vm_ODF_PVC_False/linstress_vm.yaml (1)

9-14: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Remove the hardcoded VM login credential. benchmark_runner/common/template_operations/templates/linstress/internal_data/linstress_vm_template.yaml still sets password: fedora with ssh_pwauth: true; use key-based auth or a per-run secret instead of a fixed, well-known password.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/unittest/benchmark_runner/common/template_operations/golden_files/perf_ci_linstress_vm_ODF_PVC_False/linstress_vm.yaml`
around lines 9 - 14, The VM template still hardcodes a known login password in
the linstress cloud-init data, so update the relevant template generation path
in linstress_vm_template.yaml to remove the fixed password and disable password
auth by default. Use key-based authentication or inject a per-run secret through
the template parameters instead, and ensure the generated golden file reflects
the updated behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In
`@tests/unittest/benchmark_runner/common/template_operations/golden_files/perf_ci_linstress_vm_ODF_PVC_False/linstress_vm.yaml`:
- Around line 9-14: The VM template still hardcodes a known login password in
the linstress cloud-init data, so update the relevant template generation path
in linstress_vm_template.yaml to remove the fixed password and disable password
auth by default. Use key-based authentication or inject a per-run secret through
the template parameters instead, and ensure the generated golden file reflects
the updated behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5db5e8e5-18b9-4126-8efc-8c48195cf61b

📥 Commits

Reviewing files that changed from the base of the PR and between ee29469 and dfbc898.

📒 Files selected for processing (4)
  • benchmark_runner/common/template_operations/templates/linstress/linstress_data_template.yaml
  • benchmark_runner/main/environment_variables.py
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/perf_ci_linstress_vm_ODF_PVC_False/linstress_vm.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/perf_ci_linstress_vm_ODF_PVC_True/linstress_vm.yaml
💤 Files with no reviewable changes (1)
  • benchmark_runner/common/template_operations/templates/linstress/linstress_data_template.yaml
✅ Files skipped from review due to trivial changes (1)
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/perf_ci_linstress_vm_ODF_PVC_True/linstress_vm.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • benchmark_runner/main/environment_variables.py

@ebattat

ebattat commented Jul 2, 2026

Copy link
Copy Markdown
Member Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown
✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 13

🧹 Nitpick comments (6)
benchmark_runner/common/template_operations/template_operations.py (1)

156-173: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Duplicate override logic — extract a shared helper.

The new workload_config block (Line 165-173) duplicates the hammerdb_config block (Line 156-163) almost verbatim (filter empties, diff unknown keys, warn, log, merge). Consider factoring this into a single private helper to avoid future divergence between the two override mechanisms.

♻️ Suggested refactor
+    `@staticmethod`
+    def __apply_config_override(config: dict, config_name: str, render_data: dict):
+        config = {k: v for k, v in config.items() if v != ''}
+        if not config:
+            return
+        unknown_keys = set(config.keys()) - set(render_data.keys())
+        if unknown_keys:
+            logger.warning(f'{config_name} unknown keys (will be ignored): {unknown_keys}')
+        logger.info(f'{config_name} override: {config}')
+        render_data.update(config)
+
         hammerdb_config = self.__environment_variables_dict.get('hammerdb_config', {})
         if hammerdb_config and ('hammerdb' in self.__workload_name or 'winmssql' in self.__workload_name):
-            hammerdb_config = {k: v for k, v in hammerdb_config.items() if v != ''}
-            unknown_keys = set(hammerdb_config.keys()) - set(render_data.keys())
-            if unknown_keys:
-                logger.warning(f'HAMMERDB_CONFIG unknown keys (will be ignored): {unknown_keys}')
-            logger.info(f'HAMMERDB_CONFIG override: {hammerdb_config}')
-            render_data.update(hammerdb_config)
+            self.__apply_config_override(hammerdb_config, 'HAMMERDB_CONFIG', render_data)

         workload_config = self.__environment_variables_dict.get('workload_config', {})
-        if workload_config:
-            workload_config = {k: v for k, v in workload_config.items() if v != ''}
-            unknown_keys = set(workload_config.keys()) - set(render_data.keys())
-            if unknown_keys:
-                logger.warning(f'WORKLOAD_CONFIG unknown keys (will be ignored): {unknown_keys}')
-            logger.info(f'WORKLOAD_CONFIG override: {workload_config}')
-            render_data.update(workload_config)
+        if workload_config:
+            self.__apply_config_override(workload_config, 'WORKLOAD_CONFIG', render_data)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmark_runner/common/template_operations/template_operations.py` around
lines 156 - 173, The new workload_config handling duplicates the existing
hammerdb_config override flow in TemplateOperations, so extract the shared
“filter empty values, detect unknown keys, log, and merge into render_data”
behavior into a private helper. Reuse that helper from the hammerdb_config and
workload_config branches in template_operations.py so both paths stay consistent
and future changes only need to be made once. Keep the workload_name condition
for hammerdb_config in place, and preserve the existing
logger.warning/logger.info behavior through the shared helper.
benchmark_runner/common/template_operations/templates/winstress/internal_data/winstress_vm_template.yaml (1)

4-12: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Repeated name-computation logic across the template (DRY).

The {% if scale %}...{% else %}...{% endif %} pattern for VM name/labels and root-disk name is duplicated across 4+ locations. Consider computing these once via Jinja {% set %} at the top and reusing the variables, reducing risk of divergence on future edits.

♻️ Example refactor
 metadata:
-  {% if scale -%}
-  name: winstress-{{ kind }}-{{ trunc_uuid }}-{{ scale }}
-  labels:
-    kubevirt.io/vm: winstress-{{ kind }}-{{ trunc_uuid }}-{{ scale }}
-  {%- else -%}
-  name: winstress-{{ kind }}-{{ trunc_uuid }}
-  labels:
-    kubevirt.io/vm: winstress-{{ kind }}-{{ trunc_uuid }}
-  {%- endif %}
+  {%- set vm_name = 'winstress-' ~ kind ~ '-' ~ trunc_uuid ~ (('-' ~ scale) if scale else '') -%}
+  {%- set root_disk_name = 'winstress-' ~ kind ~ '-root-disk-' ~ trunc_uuid ~ (('-' ~ scale) if scale else '') -%}
+  name: {{ vm_name }}
+  labels:
+    kubevirt.io/vm: {{ vm_name }}

Then reference {{ vm_name }} / {{ root_disk_name }} at the other occurrences (labels, disks, volumes, dataVolumeTemplates).

Also applies to: 23-27, 53-57, 113-122, 135-139

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@benchmark_runner/common/template_operations/templates/winstress/internal_data/winstress_vm_template.yaml`
around lines 4 - 12, The VM name/label logic in the template is duplicated in
multiple places and should be centralized for consistency. In
winstress_vm_template.yaml, add Jinja {% set %} variables near the top for the
computed VM name and root disk name, then replace the repeated {% if scale
%}...{% else %}...{% endif %} blocks in the template sections with those shared
variables (for labels, disks, volumes, and dataVolumeTemplates). Use the
existing template symbols like kind, trunc_uuid, and scale to build vm_name and
root_disk_name once, then reference them everywhere else.
benchmark_runner/workloads/linstress_vm.py (1)

53-53: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Remove unnecessary f-string prefix.

Ruff flags F541: the string has no placeholders.

-            check_cmd = f'test -f /tmp/stress_report.json && echo found'
+            check_cmd = 'test -f /tmp/stress_report.json && echo found'
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmark_runner/workloads/linstress_vm.py` at line 53, The check_cmd
assignment in linstress_vm.py is using an unnecessary f-string prefix even
though there are no placeholders. Update the check_cmd string literal in the
workload setup code to a plain string so it no longer triggers Ruff F541,
keeping the same shell command content.

Source: Linters/SAST tools

benchmark_runner/main/environment_variables.py (1)

107-128: 📐 Maintainability & Code Quality | 🔵 Trivial

Document new workloads in help text/README.

winstress_vm and linstress_vm were added to environment_variables_dict['workloads'], but the workload list documented in the coding guidelines' README content doesn't mention them yet. As per coding guidelines, **/*.py: "Document supported workload types in code comments or help text for WORKLOAD parameter (workloads: stressng_pod, stressng_vm, uperf_pod, uperf_vm, sysbench_pod, sysbench_vm, hammerdb variants, vdbench variants, fio variants, winfio_vm, bootstorm_vm, windows_vm, winmssql_vm, clusterbuster)" — please update the corresponding documentation to include the two new workload names.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmark_runner/main/environment_variables.py` around lines 107 - 128, The
workload list in the help text/README is missing the newly supported
`winstress_vm` and `linstress_vm` entries added in `EnvironmentVariables`;
update the documentation that describes the WORKLOAD parameter so it matches the
values in `environment_variables_dict['workloads']`. Locate the workload
documentation section referenced by the coding guidelines and add these two
names alongside the existing workload variants, keeping the list consistent with
the implementation.

Source: Coding guidelines

tests/unittest/benchmark_runner/common/template_operations/golden_files/chaos_ci_winstress_vm_ODF_PVC_False/winstress_vm.yaml (1)

96-102: 🔒 Security & Privacy | 🔵 Trivial | ⚖️ Poor tradeoff

Temporary Everyone:F grant on SSH key directory.

Line 100 grants Everyone:F on .ssh and authorized_keys (needed to take ownership/delete a pre-existing file) before line 101 restores restrictive ACLs. This creates a brief window where any local principal can read/write the SSH authorized_keys file during boot. Consider using a SYSTEM-owned/administrator-only takeover approach (e.g., takeown as SYSTEM without the Everyone:F grant) to avoid the permissive window.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/unittest/benchmark_runner/common/template_operations/golden_files/chaos_ci_winstress_vm_ODF_PVC_False/winstress_vm.yaml`
around lines 96 - 102, The `winstress_vm.yaml` cloud-init `runcmd` block briefly
opens `.ssh` and `authorized_keys` to `Everyone:F`, which should be removed.
Update the `userData` setup steps so the existing file can be taken over and
deleted using a SYSTEM/Administrator-only approach in the same `runcmd`
sequence, then keep the final restrictive ACL restoration in place. Reference
the `userData` SSH setup commands to locate the temporary permission grant and
replace it with a non-permissive ownership reset flow.
tests/unittest/benchmark_runner/common/template_operations/golden_files/func_ci_linstress_vm_ODF_PVC_True/linstress_vm.yaml (1)

126-129: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Redundant package install.

dnf install -y python3-psutil || true duplicates the packages: - python3-psutil cloud-config directive above; if intentional as a fallback for module-ordering issues, a brief comment would help future readers.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/unittest/benchmark_runner/common/template_operations/golden_files/func_ci_linstress_vm_ODF_PVC_True/linstress_vm.yaml`
around lines 126 - 129, The runcmd in the linstress_vm.yaml template is
redundantly reinstalling python3-psutil even though it is already declared in
the packages section. Update the runcmd block in the linstress_vm template to
remove the duplicate dnf install step, or keep it only if truly needed as a
fallback and add a brief inline comment explaining why. Use the runcmd and
packages entries in func_ci_linstress_vm_ODF_PVC_True/linstress_vm.yaml to
locate the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@benchmark_runner/common/template_operations/templates/linstress/internal_data/linstress_vm_template.yaml`:
- Around line 78-96: The monitoring loop in the linstress template is
oversleeping for short runs because `intervals = max(1, duration // 30)` always
forces at least one 30-second wait, so `burn_cpu` and `burn_memory` can finish
before sampling starts. Update the loop in the monitoring section to use
elapsed-time tracking against the requested `duration` instead of a fixed
interval count, and only sleep for the remaining time needed before each sample.
Keep the sampling logic and `cpu_procs`/`mem_proc` joins in place, but make sure
the first sample is taken within the actual stress window even when `duration <
30`.

In
`@benchmark_runner/common/template_operations/templates/winstress/internal_data/01_run_stress_template.ps1`:
- Line 6: The $reportFile variable in 01_run_stress_template.ps1 is assigned but
never reused, while the report path is hardcoded again later in the script.
Update the logic around $reportFile so the same variable is used wherever the
stress report path is needed, including the report-writing and report-reading
spots in the script, and remove the duplicated literal path to keep it
consistent with $stressDir.

In
`@benchmark_runner/common/template_operations/templates/winstress/internal_data/windows_dv_template.yaml`:
- Line 6: The DataVolume name in the Windows clone template is hardcoded and
shared across multiple Windows workloads, which can cause collisions during
concurrent runs. Update the name in windows_dv_template.yaml to be unique per
workload or per run by incorporating a suffix such as the existing template
uuid/identifier, and make sure the Windows clone path stays consistent in the
template that defines the DataVolume object.

In `@benchmark_runner/workloads/linstress_vm.py`:
- Around line 35-43: _remove the per-VM error-log call from the multiprocessing
phase helpers and let the top-level run() failure handling own it. In
_create_vm, _collect_results, and _delete_vm, avoid calling
self.save_error_logs() inside the child-process path; instead, just propagate
the exception and keep failure logging centralized in run()’s except block,
which already runs in the main process. Use the existing _run_parallel_phases
flow and the BootstormVM-style pattern as the reference for locating these step
functions._
- Around line 64-70: The SCP/report path in linstress_vm’s result handling is
swallowing failures by only logging a warning and returning from _scp_file
handling, which lets _upload_results skip failed VMs and still complete
successfully. Update the failure path around the stress report fetch in
linstress_vm to track missing-result/SCP failures per VM, and have the workload
lifecycle logic raise an exception or otherwise mark the run failed when no
usable results are produced. Make sure the fix is wired through the existing
_upload_results flow so repeated VM failures are not silently ignored.
- Around line 54-62: The polling loop in linstress_vm is using a range counter
with OC.DELAY, but the virtctl_ssh call can take much longer than that, so the
real wait can exceed self._timeout. Update the timeout logic in the stress
report check loop to use a wall-clock deadline or elapsed-time tracking around
virtctl_ssh and time.sleep, so the loop exits once the overall timeout is
reached. Keep the change localized to the polling block that calls
self._virtctl.virtctl_ssh and logs the “Waiting for stress_report.json” message.

In `@benchmark_runner/workloads/winstress_vm.py`:
- Around line 65-91: The failure handling in `_prepare_vm` and `_run_stress` is
inconsistent with `_create_vm`, `_collect_results`, and `_delete_vm`, because
these methods only log warnings and swallow exceptions. Update `_prepare_vm` and
`_run_stress` in `winstress_vm.py` to propagate failures by re-raising after
logging, so a VM that never becomes SSH-ready or fails during stress does not
continue into later phases and produce misleading results.
- Around line 98-106: The wait loop in winstress_vm’s stress_report.json polling
uses a fixed iteration count instead of real elapsed time, so the combined
virtctl_ssh timeout and sleep can exceed the intended self._timeout budget.
Update the polling in the method that builds check_cmd and calls
virtctl.virtctl_ssh to track wall-clock elapsed time (or a deadline) and stop
once the total time reaches self._timeout, keeping the per-check SSH timeout
bounded as well. Ensure the logger message and break condition still reflect the
same stress_report.json check logic.
- Around line 57-63: The worker helper error handlers in _create_vm(),
_collect_results(), and _delete_vm() are doing failure logging via
save_error_logs(), which should not run inside multiprocessing.Process workers.
Remove the save_error_logs() call from those except blocks and keep them only
re-raising the exception, then update run() to perform the error logging after
_run_parallel_phases() returns so the Elasticsearch upload happens in the parent
process.

In
`@tests/unittest/benchmark_runner/common/template_operations/golden_files/func_ci_linstress_vm_ODF_PVC_False/linstress_vm.yaml`:
- Around line 8-12: The cloud-init settings in the linstress VM template use a
hardcoded `user`/`password` pair with `ssh_pwauth` enabled, which should be
removed from this benchmark fixture. Update the `linstress_vm.yaml` template to
stop using fixed credentials in this cloud-config block and instead follow the
same safer approach used by the Windows workload (prefer SSH public key
injection and disable password auth), or at minimum generate a per-run random
password rather than a literal default.

In
`@tests/unittest/benchmark_runner/common/template_operations/golden_files/func_ci_linstress_vm_ODF_PVC_True/linstress_vm.yaml`:
- Around line 8-12: The cloud-init rendered by the linstress VM template is
using hardcoded weak credentials and enabling SSH password login. Update the
linstress cloud-init/template that produces linstress_vm.yaml so it no longer
sets a fixed user/password pair and does not enable ssh_pwauth; switch the
template to key-based auth only or generate a unique per-run password. Use the
cloud-init generation path associated with linstress_vm and the template/data
source that fills user, password, and ssh_pwauth to locate the fix.

In
`@tests/unittest/benchmark_runner/common/template_operations/golden_files/perf_ci_linstress_vm_ODF_PVC_False/linstress_vm.yaml`:
- Around line 6-12: The VM userdata in the linstress template hardcodes the
fedora account password, keeps it non-expiring, and enables password SSH auth.
Update the cloud-config in the linstress VM template so it relies on the SSH key
flow used by linstress_vm.py, and remove or template the password-related fields
and ssh_pwauth setting instead of shipping fixed credentials.

In
`@tests/unittest/benchmark_runner/common/template_operations/golden_files/release_linstress_vm_ODF_PVC_False/linstress_vm.yaml`:
- Around line 9-12: The cloud-init template for the linstress VM is using
hardcoded weak SSH credentials and password auth, which should be removed for
benchmark_runner artifacts. Update the template behind the linstress_vm.yaml
asset so the user/password values are not static (for example, inject a per-run
secret through the data template or switch the VM to SSH key auth), and keep the
relevant cloud-init fields such as ssh_pwauth and chpasswd aligned with that
change.

---

Nitpick comments:
In `@benchmark_runner/common/template_operations/template_operations.py`:
- Around line 156-173: The new workload_config handling duplicates the existing
hammerdb_config override flow in TemplateOperations, so extract the shared
“filter empty values, detect unknown keys, log, and merge into render_data”
behavior into a private helper. Reuse that helper from the hammerdb_config and
workload_config branches in template_operations.py so both paths stay consistent
and future changes only need to be made once. Keep the workload_name condition
for hammerdb_config in place, and preserve the existing
logger.warning/logger.info behavior through the shared helper.

In
`@benchmark_runner/common/template_operations/templates/winstress/internal_data/winstress_vm_template.yaml`:
- Around line 4-12: The VM name/label logic in the template is duplicated in
multiple places and should be centralized for consistency. In
winstress_vm_template.yaml, add Jinja {% set %} variables near the top for the
computed VM name and root disk name, then replace the repeated {% if scale
%}...{% else %}...{% endif %} blocks in the template sections with those shared
variables (for labels, disks, volumes, and dataVolumeTemplates). Use the
existing template symbols like kind, trunc_uuid, and scale to build vm_name and
root_disk_name once, then reference them everywhere else.

In `@benchmark_runner/main/environment_variables.py`:
- Around line 107-128: The workload list in the help text/README is missing the
newly supported `winstress_vm` and `linstress_vm` entries added in
`EnvironmentVariables`; update the documentation that describes the WORKLOAD
parameter so it matches the values in `environment_variables_dict['workloads']`.
Locate the workload documentation section referenced by the coding guidelines
and add these two names alongside the existing workload variants, keeping the
list consistent with the implementation.

In `@benchmark_runner/workloads/linstress_vm.py`:
- Line 53: The check_cmd assignment in linstress_vm.py is using an unnecessary
f-string prefix even though there are no placeholders. Update the check_cmd
string literal in the workload setup code to a plain string so it no longer
triggers Ruff F541, keeping the same shell command content.

In
`@tests/unittest/benchmark_runner/common/template_operations/golden_files/chaos_ci_winstress_vm_ODF_PVC_False/winstress_vm.yaml`:
- Around line 96-102: The `winstress_vm.yaml` cloud-init `runcmd` block briefly
opens `.ssh` and `authorized_keys` to `Everyone:F`, which should be removed.
Update the `userData` setup steps so the existing file can be taken over and
deleted using a SYSTEM/Administrator-only approach in the same `runcmd`
sequence, then keep the final restrictive ACL restoration in place. Reference
the `userData` SSH setup commands to locate the temporary permission grant and
replace it with a non-permissive ownership reset flow.

In
`@tests/unittest/benchmark_runner/common/template_operations/golden_files/func_ci_linstress_vm_ODF_PVC_True/linstress_vm.yaml`:
- Around line 126-129: The runcmd in the linstress_vm.yaml template is
redundantly reinstalling python3-psutil even though it is already declared in
the packages section. Update the runcmd block in the linstress_vm template to
remove the duplicate dnf install step, or keep it only if truly needed as a
fallback and add a brief inline comment explaining why. Use the runcmd and
packages entries in func_ci_linstress_vm_ODF_PVC_True/linstress_vm.yaml to
locate the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5d3c4db4-ec96-4cf5-a93f-f2de947eb4b1

📥 Commits

Reviewing files that changed from the base of the PR and between f35145c and dfbc898.

📒 Files selected for processing (64)
  • benchmark_runner/common/template_operations/template_operations.py
  • benchmark_runner/common/template_operations/templates/linstress/__init__.py
  • benchmark_runner/common/template_operations/templates/linstress/internal_data/__init__.py
  • benchmark_runner/common/template_operations/templates/linstress/internal_data/linstress_vm_template.yaml
  • benchmark_runner/common/template_operations/templates/linstress/linstress_data_template.yaml
  • benchmark_runner/common/template_operations/templates/winstress/__init__.py
  • benchmark_runner/common/template_operations/templates/winstress/internal_data/01_run_stress_template.ps1
  • benchmark_runner/common/template_operations/templates/winstress/internal_data/__init__.py
  • benchmark_runner/common/template_operations/templates/winstress/internal_data/windows_dv_template.yaml
  • benchmark_runner/common/template_operations/templates/winstress/internal_data/winstress_vm_template.yaml
  • benchmark_runner/common/template_operations/templates/winstress/winstress_data_template.yaml
  • benchmark_runner/main/environment_variables.py
  • benchmark_runner/workloads/linstress_vm.py
  • benchmark_runner/workloads/winstress_vm.py
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/chaos_ci_linstress_vm_ODF_PVC_False/linstress_vm.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/chaos_ci_linstress_vm_ODF_PVC_False/namespace.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/chaos_ci_linstress_vm_ODF_PVC_True/linstress_vm.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/chaos_ci_linstress_vm_ODF_PVC_True/namespace.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/chaos_ci_winstress_vm_ODF_PVC_False/namespace.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/chaos_ci_winstress_vm_ODF_PVC_False/windows_dv.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/chaos_ci_winstress_vm_ODF_PVC_False/winstress_vm.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/chaos_ci_winstress_vm_ODF_PVC_True/namespace.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/chaos_ci_winstress_vm_ODF_PVC_True/windows_dv.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/chaos_ci_winstress_vm_ODF_PVC_True/winstress_vm.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/func_ci_linstress_vm_ODF_PVC_False/linstress_vm.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/func_ci_linstress_vm_ODF_PVC_False/namespace.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/func_ci_linstress_vm_ODF_PVC_True/linstress_vm.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/func_ci_linstress_vm_ODF_PVC_True/namespace.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/func_ci_winstress_vm_ODF_PVC_False/namespace.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/func_ci_winstress_vm_ODF_PVC_False/windows_dv.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/func_ci_winstress_vm_ODF_PVC_False/winstress_vm.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/func_ci_winstress_vm_ODF_PVC_True/namespace.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/func_ci_winstress_vm_ODF_PVC_True/windows_dv.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/func_ci_winstress_vm_ODF_PVC_True/winstress_vm.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/perf_ci_linstress_vm_ODF_PVC_False/linstress_vm.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/perf_ci_linstress_vm_ODF_PVC_False/namespace.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/perf_ci_linstress_vm_ODF_PVC_True/linstress_vm.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/perf_ci_linstress_vm_ODF_PVC_True/namespace.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/perf_ci_winstress_vm_ODF_PVC_False/namespace.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/perf_ci_winstress_vm_ODF_PVC_False/windows_dv.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/perf_ci_winstress_vm_ODF_PVC_False/winstress_vm.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/perf_ci_winstress_vm_ODF_PVC_True/namespace.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/perf_ci_winstress_vm_ODF_PVC_True/windows_dv.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/perf_ci_winstress_vm_ODF_PVC_True/winstress_vm.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/release_linstress_vm_ODF_PVC_False/linstress_vm.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/release_linstress_vm_ODF_PVC_False/namespace.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/release_linstress_vm_ODF_PVC_True/linstress_vm.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/release_linstress_vm_ODF_PVC_True/namespace.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/release_winstress_vm_ODF_PVC_False/namespace.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/release_winstress_vm_ODF_PVC_False/windows_dv.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/release_winstress_vm_ODF_PVC_False/winstress_vm.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/release_winstress_vm_ODF_PVC_True/namespace.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/release_winstress_vm_ODF_PVC_True/windows_dv.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/release_winstress_vm_ODF_PVC_True/winstress_vm.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/test_ci_linstress_vm_ODF_PVC_False/linstress_vm.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/test_ci_linstress_vm_ODF_PVC_False/namespace.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/test_ci_linstress_vm_ODF_PVC_True/linstress_vm.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/test_ci_linstress_vm_ODF_PVC_True/namespace.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/test_ci_winstress_vm_ODF_PVC_False/namespace.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/test_ci_winstress_vm_ODF_PVC_False/windows_dv.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/test_ci_winstress_vm_ODF_PVC_False/winstress_vm.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/test_ci_winstress_vm_ODF_PVC_True/namespace.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/test_ci_winstress_vm_ODF_PVC_True/windows_dv.yaml
  • tests/unittest/benchmark_runner/common/template_operations/golden_files/test_ci_winstress_vm_ODF_PVC_True/winstress_vm.yaml

Comment on lines +78 to +96
intervals = max(1, duration // 30)
samples = []
for i in range(intervals):
time.sleep(30)
mem = psutil.virtual_memory()
cpu_pct = psutil.cpu_percent(interval=1)
sample = {
'time_sec': (i+1)*30,
'cpu_percent': cpu_pct,
'mem_percent': mem.percent,
'mem_used_gb': round(mem.used / (1024**3), 1),
'mem_total_gb': round(mem.total / (1024**3), 1)
}
samples.append(sample)
print(f"At {sample['time_sec']}s: CPU={cpu_pct}% MEM={mem.percent}% ({sample['mem_used_gb']}GB/{sample['mem_total_gb']}GB)")

[p.join() for p in cpu_procs]
if mem_proc:
mem_proc.join()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Monitoring loop can sleep past the actual stress duration.

intervals = max(1, duration // 30) forces at least one time.sleep(30) iteration even when duration < 30. In that case burn_cpu/burn_memory workers finish long before the monitor wakes up, so the captured CPU/memory samples reflect post-stress idle state rather than the actual stress window — producing misleading benchmark data for short-duration runs (a common case for quick/test_ci iterations).

🐛 Proposed fix using elapsed-time tracking instead of a fixed iteration count
-              intervals = max(1, duration // 30)
-              samples = []
-              for i in range(intervals):
-                  time.sleep(30)
+              samples = []
+              elapsed_total = 0
+              while elapsed_total < duration:
+                  step = min(30, duration - elapsed_total)
+                  time.sleep(step)
+                  elapsed_total += step
                   mem = psutil.virtual_memory()
                   cpu_pct = psutil.cpu_percent(interval=1)
                   sample = {
-                      'time_sec': (i+1)*30,
+                      'time_sec': elapsed_total,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
intervals = max(1, duration // 30)
samples = []
for i in range(intervals):
time.sleep(30)
mem = psutil.virtual_memory()
cpu_pct = psutil.cpu_percent(interval=1)
sample = {
'time_sec': (i+1)*30,
'cpu_percent': cpu_pct,
'mem_percent': mem.percent,
'mem_used_gb': round(mem.used / (1024**3), 1),
'mem_total_gb': round(mem.total / (1024**3), 1)
}
samples.append(sample)
print(f"At {sample['time_sec']}s: CPU={cpu_pct}% MEM={mem.percent}% ({sample['mem_used_gb']}GB/{sample['mem_total_gb']}GB)")
[p.join() for p in cpu_procs]
if mem_proc:
mem_proc.join()
samples = []
elapsed_total = 0
while elapsed_total < duration:
step = min(30, duration - elapsed_total)
time.sleep(step)
elapsed_total += step
mem = psutil.virtual_memory()
cpu_pct = psutil.cpu_percent(interval=1)
sample = {
'time_sec': elapsed_total,
'cpu_percent': cpu_pct,
'mem_percent': mem.percent,
'mem_used_gb': round(mem.used / (1024**3), 1),
'mem_total_gb': round(mem.total / (1024**3), 1)
}
samples.append(sample)
print(f"At {sample['time_sec']}s: CPU={cpu_pct}% MEM={mem.percent}% ({sample['mem_used_gb']}GB/{sample['mem_total_gb']}GB)")
[p.join() for p in cpu_procs]
if mem_proc:
mem_proc.join()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@benchmark_runner/common/template_operations/templates/linstress/internal_data/linstress_vm_template.yaml`
around lines 78 - 96, The monitoring loop in the linstress template is
oversleeping for short runs because `intervals = max(1, duration // 30)` always
forces at least one 30-second wait, so `burn_cpu` and `burn_memory` can finish
before sampling starts. Update the loop in the monitoring section to use
elapsed-time tracking against the requested `duration` instead of a fixed
interval count, and only sleep for the remaining time needed before each sample.
Keep the sampling logic and `cpu_procs`/`mem_proc` joins in place, but make sure
the first sample is taken within the actual stress window even when `duration <
30`.

$stressDir = 'C:\tools\stress'
$pythonExe = 'C:\Program Files\Python312\python.exe'
$stressScript = "$stressDir\stress.py"
$reportFile = "$stressDir\stress_report.json"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Unused $reportFile variable; report path duplicated instead of reused.

Static analysis flags $reportFile as assigned but never used. The literal path C:\tools\stress\stress_report.json is hardcoded again at Lines 114 and 116, risking drift if $stressDir changes.

♻️ Proposed fix
-    with open(r'C:\tools\stress\stress_report.json', 'w') as f:
+    with open(r'$reportFile', 'w') as f:
         json.dump(report, f, indent=2)
-    print('Report saved to C:\\tools\\stress\\stress_report.json')
+    print('Report saved to $reportFile')
Based on learnings, PSScriptAnalyzer flagged: "The variable 'reportFile' is assigned but never used."

Also applies to: 114-116

🧰 Tools
🪛 PSScriptAnalyzer (1.25.0)

[warning] 6-6: The variable 'reportFile' is assigned but never used.

(PSUseDeclaredVarsMoreThanAssignments)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@benchmark_runner/common/template_operations/templates/winstress/internal_data/01_run_stress_template.ps1`
at line 6, The $reportFile variable in 01_run_stress_template.ps1 is assigned
but never reused, while the report path is hardcoded again later in the script.
Update the logic around $reportFile so the same variable is used wherever the
stress report path is needed, including the report-writing and report-reading
spots in the script, and remove the duplicated literal path to keep it
consistent with $stressDir.

Source: Linters/SAST tools

metadata:
annotations:
cdi.kubevirt.io/storage.deleteAfterCompletion: "false"
name: windows-clone-dv

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check whether another workload template defines a DV with the same name
rg -n "windows-clone-dv" benchmark_runner/common/template_operations/templates

Repository: redhat-performance/benchmark-runner

Length of output: 1243


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the relevant Windows templates for uniqueness/uuid usage
sed -n '1,220p' benchmark_runner/common/template_operations/templates/windows/internal_data/windows_dv_template.yaml
printf '\n---\n'
sed -n '1,240p' benchmark_runner/common/template_operations/templates/windows/internal_data/windows_vm_template.yaml
printf '\n---\n'
sed -n '1,220p' benchmark_runner/common/template_operations/templates/winstress/internal_data/windows_dv_template.yaml
printf '\n---\n'
sed -n '1,260p' benchmark_runner/common/template_operations/templates/winstress/internal_data/winstress_vm_template.yaml
printf '\n---\n'
sed -n '1,220p' benchmark_runner/common/template_operations/templates/winmssql/internal_data/windows_dv_template.yaml
printf '\n---\n'
sed -n '1,260p' benchmark_runner/common/template_operations/templates/winmssql/internal_data/winmssql_vm_template.yaml
printf '\n---\n'
sed -n '1,220p' benchmark_runner/common/template_operations/templates/winfio/internal_data/windows_dv_template.yaml
printf '\n---\n'
sed -n '1,260p' benchmark_runner/common/template_operations/templates/winfio/internal_data/winfio_vm_template.yaml

# Search for uuid usage in Windows templates
printf '\n=== uuid hits ===\n'
rg -n "uuid|windows-clone-dv" benchmark_runner/common/template_operations/templates/{windows,winstress,winmssql,winfio} -g '*.yaml'

Repository: redhat-performance/benchmark-runner

Length of output: 37077


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect Windows workload data templates for URL/uuid and whether the clone DV is intended to be shared
sed -n '1,220p' benchmark_runner/common/template_operations/templates/windows/windows_data_template.yaml
printf '\n---\n'
sed -n '1,220p' benchmark_runner/common/template_operations/templates/winstress/winstress_data_template.yaml
printf '\n---\n'
sed -n '1,220p' benchmark_runner/common/template_operations/templates/winmssql/winmssql_data_template.yaml
printf '\n---\n'
sed -n '1,220p' benchmark_runner/common/template_operations/templates/winfio/winfio_data_template.yaml

printf '\n=== references to windows-clone-dv in Python code ===\n'
rg -n "windows-clone-dv|dataVolumeTemplates|source:\s*$|source:\n\s*pvc:" benchmark_runner -g '*.py' -g '*.yaml'

printf '\n=== any cleanup/delete logic for windows clone DV ===\n'
rg -n "deleteAfterCompletion|delete.*DataVolume|DataVolume.*delete|windows-clone" benchmark_runner -g '*.py' -g '*.yaml'

Repository: redhat-performance/benchmark-runner

Length of output: 3365


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check whether namespace is run-specific or shared, and whether clone DV names are reused intentionally
rg -n "namespace:|name: windows-clone-dv|deleteAfterCompletion|uuid: \{\{ uuid \}\}" benchmark_runner/common/template_operations/templates -g '*.yaml' -g '*.py' -g '*.json'

printf '\n=== common template data / namespace sources ===\n'
sed -n '1,220p' benchmark_runner/common/template_operations/templates/common_template.yaml
printf '\n---\n'
rg -n "namespace" benchmark_runner/main benchmark_runner/common/template_operations -g '*.py' -g '*.yaml'

Repository: redhat-performance/benchmark-runner

Length of output: 32316


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect workload namespace grouping for the Windows workloads
sed -n '120,220p' benchmark_runner/main/environment_variables.py
printf '\n---\n'
sed -n '430,470p' benchmark_runner/main/environment_variables.py

printf '\n=== windows-related workload names ===\n'
rg -n "\"(windows|winstress|winmssql|winfio)[^\"]*\"" benchmark_runner/main/environment_variables.py

Repository: redhat-performance/benchmark-runner

Length of output: 8725


Make the Windows clone DataVolume name unique

windows-clone-dv is shared by windows, winstress, winmssql, and winfio in the same benchmark-runner namespace, so concurrent Windows runs will contend for the same DataVolume object and can fail. Add a workload/run suffix such as {{ uuid }} or otherwise make this name unique.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@benchmark_runner/common/template_operations/templates/winstress/internal_data/windows_dv_template.yaml`
at line 6, The DataVolume name in the Windows clone template is hardcoded and
shared across multiple Windows workloads, which can cause collisions during
concurrent runs. Update the name in windows_dv_template.yaml to be unique per
workload or per run by incorporating a suffix such as the existing template
uuid/identifier, and make sure the Windows clone path stays consistent in the
template that defines the DataVolume object.

Comment on lines +35 to +43
def _create_vm(self, vm_num: str):
try:
vm_name = self._get_vm_name(vm_num)
self._oc.create_async(yaml=self._get_vm_yaml(vm_num))
self._oc.wait_for_vm_create(vm_name=vm_name)
logger.info(f'VM {vm_name} created, stress script running via cloud-init')
except Exception as err:
self.save_error_logs()
raise err

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🔴 Critical | ⚡ Quick win

Don't call save_error_logs() from per-VM step functions run via _run_parallel_phases.

_create_vm, _collect_results, and _delete_vm are dispatched as parallel phases (likely multiprocessing child processes, same pattern as BootstormVM). Calling self.save_error_logs() from within these child-process functions risks the documented OpenSSL/SIGSEGV crash across fork boundaries. run()'s own except block (lines 170-172) already handles failure logging after phases complete, in the main process — the per-step calls are redundant and unsafe.

Based on learnings, "avoid calling _save_error_logs() from within multiprocessing child processes... this can trigger the documented SSL/SIGSEGV multiprocessing crash," and the recommended pattern is to do all HTTPS work in the main process after child processes complete or entirely within child processes consistently — not scattered ad hoc in both.

🔧 Proposed fix: drop per-step error-log calls, let `run()` handle it
     def _create_vm(self, vm_num: str):
         try:
             vm_name = self._get_vm_name(vm_num)
             self._oc.create_async(yaml=self._get_vm_yaml(vm_num))
             self._oc.wait_for_vm_create(vm_name=vm_name)
             logger.info(f'VM {vm_name} created, stress script running via cloud-init')
         except Exception as err:
-            self.save_error_logs()
             raise err
         except Exception as err:
-            self.save_error_logs()
             raise err

     def _delete_vm(self, vm_num: str):
         try:
             vm_name = self._get_vm_name(vm_num)
             self._oc.delete_vm_sync(yaml=self._get_vm_yaml(vm_num), vm_name=vm_name)
         except Exception as err:
-            self.save_error_logs()
             raise err

Also applies to: 103-105, 107-113

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmark_runner/workloads/linstress_vm.py` around lines 35 - 43, _remove the
per-VM error-log call from the multiprocessing phase helpers and let the
top-level run() failure handling own it. In _create_vm, _collect_results, and
_delete_vm, avoid calling self.save_error_logs() inside the child-process path;
instead, just propagate the exception and keep failure logging centralized in
run()’s except block, which already runs in the main process. Use the existing
_run_parallel_phases flow and the BootstormVM-style pattern as the reference for
locating these step functions._

Source: Learnings

Comment on lines +54 to +62
for elapsed in range(0, self._timeout, OC.DELAY):
check = self._virtctl.virtctl_ssh(vm_name=vm_name, command=check_cmd,
namespace=self.__namespace, key_path=self.__ssh_key_path,
username=self.__username, timeout=60)
if check and 'found' in check:
break
if elapsed > 0 and elapsed % 300 == 0:
logger.info(f'Waiting for stress_report.json on VM {vm_name}... ({elapsed}s)')
time.sleep(OC.DELAY)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Polling loop can overrun the declared timeout.

Each iteration performs an SSH call (up to 60s) and then sleeps OC.DELAY, but the loop counter advances only by OC.DELAY per iteration — so if SSH calls are slow, actual wall-clock wait can far exceed self._timeout. As per path instructions, timeout handling for workloads should be checked carefully; here the bound is effectively iterations * (ssh_call_time + OC.DELAY) rather than self._timeout.

⏱️ Proposed fix: use wall-clock deadline instead of loop-counter
-            for elapsed in range(0, self._timeout, OC.DELAY):
-                check = self._virtctl.virtctl_ssh(vm_name=vm_name, command=check_cmd,
-                                                  namespace=self.__namespace, key_path=self.__ssh_key_path,
-                                                  username=self.__username, timeout=60)
-                if check and 'found' in check:
-                    break
-                if elapsed > 0 and elapsed % 300 == 0:
-                    logger.info(f'Waiting for stress_report.json on VM {vm_name}... ({elapsed}s)')
-                time.sleep(OC.DELAY)
+            deadline = time.time() + self._timeout
+            while time.time() < deadline:
+                check = self._virtctl.virtctl_ssh(vm_name=vm_name, command=check_cmd,
+                                                  namespace=self.__namespace, key_path=self.__ssh_key_path,
+                                                  username=self.__username, timeout=60)
+                if check and 'found' in check:
+                    break
+                logger.info(f'Waiting for stress_report.json on VM {vm_name}...')
+                time.sleep(OC.DELAY)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for elapsed in range(0, self._timeout, OC.DELAY):
check = self._virtctl.virtctl_ssh(vm_name=vm_name, command=check_cmd,
namespace=self.__namespace, key_path=self.__ssh_key_path,
username=self.__username, timeout=60)
if check and 'found' in check:
break
if elapsed > 0 and elapsed % 300 == 0:
logger.info(f'Waiting for stress_report.json on VM {vm_name}... ({elapsed}s)')
time.sleep(OC.DELAY)
deadline = time.time() + self._timeout
while time.time() < deadline:
check = self._virtctl.virtctl_ssh(vm_name=vm_name, command=check_cmd,
namespace=self.__namespace, key_path=self.__ssh_key_path,
username=self.__username, timeout=60)
if check and 'found' in check:
break
logger.info(f'Waiting for stress_report.json on VM {vm_name}...')
time.sleep(OC.DELAY)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmark_runner/workloads/linstress_vm.py` around lines 54 - 62, The polling
loop in linstress_vm is using a range counter with OC.DELAY, but the virtctl_ssh
call can take much longer than that, so the real wait can exceed self._timeout.
Update the timeout logic in the stress report check loop to use a wall-clock
deadline or elapsed-time tracking around virtctl_ssh and time.sleep, so the loop
exits once the overall timeout is reached. Keep the change localized to the
polling block that calls self._virtctl.virtctl_ssh and logs the “Waiting for
stress_report.json” message.

Source: Path instructions

Comment on lines +98 to +106
check_cmd = f'powershell -Command "if (Test-Path \'{self.__remote_dir}/stress_report.json\') {{ echo found }}"'
for elapsed in range(0, self._timeout, OC.DELAY):
check = self._virtctl.virtctl_ssh(vm_name=vm_name, command=check_cmd,
namespace=self.__namespace, key_path=self.__ssh_key_path,
username=self.__username, timeout=60)
if check and 'found' in check:
break
logger.info(f'Waiting for stress_report.json on VM {vm_name}... ({elapsed}s)')
time.sleep(OC.DELAY)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Wait loop can exceed the intended self._timeout budget.

The loop iterates self._timeout // OC.DELAY times, but each iteration also issues a virtctl_ssh check with its own independent timeout=60 plus time.sleep(OC.DELAY). If the SSH check is slow/hanging, real elapsed time is iterations * (up_to_60 + OC.DELAY), which can far exceed self._timeout — the loop bound only counts iterations, not actual elapsed wall-clock time.

🕒 Proposed fix using actual elapsed time
-            check_cmd = f'powershell -Command "if (Test-Path \'{self.__remote_dir}/stress_report.json\') {{ echo found }}"'
-            for elapsed in range(0, self._timeout, OC.DELAY):
+            check_cmd = f'powershell -Command "if (Test-Path \'{self.__remote_dir}/stress_report.json\') {{ echo found }}"'
+            deadline = time.time() + self._timeout
+            while time.time() < deadline:
                 check = self._virtctl.virtctl_ssh(vm_name=vm_name, command=check_cmd,
                                                   namespace=self.__namespace, key_path=self.__ssh_key_path,
                                                   username=self.__username, timeout=60)
                 if check and 'found' in check:
                     break
-                logger.info(f'Waiting for stress_report.json on VM {vm_name}... ({elapsed}s)')
+                logger.info(f'Waiting for stress_report.json on VM {vm_name}...')
                 time.sleep(OC.DELAY)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
check_cmd = f'powershell -Command "if (Test-Path \'{self.__remote_dir}/stress_report.json\') {{ echo found }}"'
for elapsed in range(0, self._timeout, OC.DELAY):
check = self._virtctl.virtctl_ssh(vm_name=vm_name, command=check_cmd,
namespace=self.__namespace, key_path=self.__ssh_key_path,
username=self.__username, timeout=60)
if check and 'found' in check:
break
logger.info(f'Waiting for stress_report.json on VM {vm_name}... ({elapsed}s)')
time.sleep(OC.DELAY)
check_cmd = f'powershell -Command "if (Test-Path \'{self.__remote_dir}/stress_report.json\') {{ echo found }}"'
deadline = time.time() + self._timeout
while time.time() < deadline:
check = self._virtctl.virtctl_ssh(vm_name=vm_name, command=check_cmd,
namespace=self.__namespace, key_path=self.__ssh_key_path,
username=self.__username, timeout=60)
if check and 'found' in check:
break
logger.info(f'Waiting for stress_report.json on VM {vm_name}...')
time.sleep(OC.DELAY)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmark_runner/workloads/winstress_vm.py` around lines 98 - 106, The wait
loop in winstress_vm’s stress_report.json polling uses a fixed iteration count
instead of real elapsed time, so the combined virtctl_ssh timeout and sleep can
exceed the intended self._timeout budget. Update the polling in the method that
builds check_cmd and calls virtctl.virtctl_ssh to track wall-clock elapsed time
(or a deadline) and stop once the total time reaches self._timeout, keeping the
per-check SSH timeout bounded as well. Ensure the logger message and break
condition still reflect the same stress_report.json check logic.

Source: Path instructions

Comment on lines +8 to +12
#cloud-config
user: fedora
password: fedora
chpasswd: { expire: False }
ssh_pwauth: true

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Hardcoded weak default credentials with password SSH auth enabled.

Cloud-init sets user: fedora / password: fedora with ssh_pwauth: true, enabling password-based SSH login with a well-known default credential pair. Recommend disabling ssh_pwauth in favor of injecting an SSH public key (as done for the Windows workload), or at minimum generating a random per-run password rather than a fixed literal.

As per coding guidelines, benchmark_runner path instructions state to "Flag any hardcoded credentials or secrets."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/unittest/benchmark_runner/common/template_operations/golden_files/func_ci_linstress_vm_ODF_PVC_False/linstress_vm.yaml`
around lines 8 - 12, The cloud-init settings in the linstress VM template use a
hardcoded `user`/`password` pair with `ssh_pwauth` enabled, which should be
removed from this benchmark fixture. Update the `linstress_vm.yaml` template to
stop using fixed credentials in this cloud-config block and instead follow the
same safer approach used by the Windows workload (prefer SSH public key
injection and disable password auth), or at minimum generate a per-run random
password rather than a literal default.

Source: Path instructions

Comment on lines +8 to +12
#cloud-config
user: fedora
password: fedora
chpasswd: { expire: False }
ssh_pwauth: true

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Hardcoded weak credentials + password SSH auth enabled.

The generated cloud-init sets user: fedora, password: fedora, and ssh_pwauth: true. This bakes a trivially guessable password into every VM instance with SSH password authentication enabled, exposing the VM to unauthorized access on the pod network. Since this is a rendered golden file, the root cause is the linstress data/cloud-init template — recommend switching to key-based auth only (disable ssh_pwauth) or generating a random per-run password.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/unittest/benchmark_runner/common/template_operations/golden_files/func_ci_linstress_vm_ODF_PVC_True/linstress_vm.yaml`
around lines 8 - 12, The cloud-init rendered by the linstress VM template is
using hardcoded weak credentials and enabling SSH password login. Update the
linstress cloud-init/template that produces linstress_vm.yaml so it no longer
sets a fixed user/password pair and does not enable ssh_pwauth; switch the
template to key-based auth only or generate a unique per-run password. Use the
cloud-init generation path associated with linstress_vm and the template/data
source that fills user, password, and ssh_pwauth to locate the fix.

Source: Path instructions

Comment on lines +6 to +12
stringData:
userdata: |
#cloud-config
user: fedora
password: fedora
chpasswd: { expire: False }
ssh_pwauth: true

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate linstress-related templates and workload code
git ls-files | rg 'linstress|Linstress|lin_stress'

echo '--- template files ---'
git ls-files | rg 'benchmark_runner/common/template_operations/templates/.+linstress|benchmark_runner/common/template_operations/templates/.+lin_stress'

echo '--- workload files ---'
git ls-files | rg 'benchmark_runner/workloads/.+linstress|benchmark_runner/workloads/.+lin_stress'

echo '--- tests for golden files ---'
git ls-files | rg 'tests/.+linstress|tests/.+golden_files/.+linstress'

Repository: redhat-performance/benchmark-runner

Length of output: 3288


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '--- linstress_vm_template.yaml ---'
wc -l benchmark_runner/common/template_operations/templates/linstress/internal_data/linstress_vm_template.yaml
sed -n '1,220p' benchmark_runner/common/template_operations/templates/linstress/internal_data/linstress_vm_template.yaml | cat -n

echo '--- linstress_data_template.yaml ---'
wc -l benchmark_runner/common/template_operations/templates/linstress/linstress_data_template.yaml
sed -n '1,220p' benchmark_runner/common/template_operations/templates/linstress/linstress_data_template.yaml | cat -n

echo '--- linstress_vm.py ---'
wc -l benchmark_runner/workloads/linstress_vm.py
sed -n '1,260p' benchmark_runner/workloads/linstress_vm.py | cat -n

Repository: redhat-performance/benchmark-runner

Length of output: 19449


Hardcoded VM credentials and password SSH auth
benchmark_runner/common/template_operations/templates/linstress/internal_data/linstress_vm_template.yaml:8-13 hardcodes fedora/fedora, leaves the password non-expiring, and enables ssh_pwauth. If the VM only needs the SSH key path used by linstress_vm.py, drop password login or make these values templated.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/unittest/benchmark_runner/common/template_operations/golden_files/perf_ci_linstress_vm_ODF_PVC_False/linstress_vm.yaml`
around lines 6 - 12, The VM userdata in the linstress template hardcodes the
fedora account password, keeps it non-expiring, and enables password SSH auth.
Update the cloud-config in the linstress VM template so it relies on the SSH key
flow used by linstress_vm.py, and remove or template the password-related fields
and ssh_pwauth setting instead of shipping fixed credentials.

Source: Path instructions

Comment on lines +9 to +12
user: fedora
password: fedora
chpasswd: { expire: False }
ssh_pwauth: true

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Hardcoded weak default credentials for a network-exposed VM.

The cloud-init template sets user: fedora / password: fedora and enables ssh_pwauth: true, meaning every linstress_vm deployment boots with a publicly-known, static SSH password. Since this VM has SSH exposed via a masquerade interface, this is a real attack surface, especially since these test clusters can be long-lived.

Recommend generating a random password per-run (like the deadbeef suffix used for names) and injecting it via the data template, or switching to SSH-key-based auth instead of password auth.

As per path instructions, "Flag any hardcoded credentials or secrets" for benchmark_runner/**.

🔒 Suggested direction
-    user: fedora
-    password: fedora
-    chpasswd: { expire: False }
-    ssh_pwauth: true
+    user: fedora
+    ssh_pwauth: false
+    ssh_authorized_keys:
+      - "{{ linstress_ssh_public_key }}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/unittest/benchmark_runner/common/template_operations/golden_files/release_linstress_vm_ODF_PVC_False/linstress_vm.yaml`
around lines 9 - 12, The cloud-init template for the linstress VM is using
hardcoded weak SSH credentials and password auth, which should be removed for
benchmark_runner artifacts. Update the template behind the linstress_vm.yaml
asset so the user/password values are not static (for example, inject a per-run
secret through the data template or switch the VM to SSH key auth), and keep the
relevant cloud-init fields such as ssh_pwauth and chpasswd aligned with that
change.

Source: Path instructions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant