@bedroge
Collaborator

@bedroge bedroge commented Nov 21, 2025

This will be fun.

@bedroge bedroge added the 2025.06-software.eessi.io 2025.06 version of software.eessi.io label Nov 21, 2025
@bedroge
Collaborator Author

bedroge commented Nov 21, 2025

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace

@eessi-bot-jsc

eessi-bot-jsc bot commented Nov 21, 2025

New job on instance eessi-bot-jsc for repository eessi.io-2025.06-software
Building on: nvidia-grace
Building for: aarch64/nvidia/grace
Job dir: /p/project1/ceasybuilders/eessibot/jobs/2025.11/pr_1314/14243427

date | job status | comment
Nov 21 22:08:33 UTC 2025 submitted job id 14243427 awaits release by job manager
Nov 21 22:08:55 UTC 2025 released job awaits launch by Slurm scheduler
Nov 21 22:09:59 UTC 2025 running job 14243427 is running
Nov 22 22:10:03 UTC 2025 finished
🤷 UNKNOWN (click triangle for detailed information)
  • Job results file _bot_job14243427.result does not exist in job directory, or parsing it failed.
  • No artefacts were found/reported.
Nov 22 22:10:03 UTC 2025 test result
🤷 UNKNOWN (click triangle for detailed information)
  • Job test file _bot_job14243427.test does not exist in job directory, or parsing it failed.

edit: job exceeded its 1-day time limit. According to https://gist.github.com/boegelbot/b64a7290ab9a66973b6aed13ec38a1dd, this could take ~2 days.
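The time-limit observation can be checked directly from the status table. The following is a minimal sketch, assuming the timestamp layout shown in the bot comment above; the helper names `parse_bot_time` and `elapsed_hours` are hypothetical and not part of the bot.

```python
from datetime import datetime, timezone

# Timestamp layout used in the bot's status table, e.g. "Nov 21 22:09:59 UTC 2025".
# Month names are parsed with %b, so this assumes an English/C locale.
BOT_TIME_FORMAT = "%b %d %H:%M:%S UTC %Y"


def parse_bot_time(stamp: str) -> datetime:
    """Parse one status-table timestamp into a timezone-aware UTC datetime."""
    return datetime.strptime(stamp, BOT_TIME_FORMAT).replace(tzinfo=timezone.utc)


def elapsed_hours(started: str, finished: str) -> float:
    """Hours between the 'running' and 'finished' rows of a job."""
    delta = parse_bot_time(finished) - parse_bot_time(started)
    return delta.total_seconds() / 3600


# Values copied from the 'running' and 'finished' rows of job 14243427.
hours = elapsed_hours("Nov 21 22:09:59 UTC 2025", "Nov 22 22:10:03 UTC 2025")
print(f"job ran for {hours:.2f} h")
```

For this job the difference is 24 hours and 4 seconds, which is consistent with the job being stopped at a 1-day limit.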

@bedroge
Collaborator Author

bedroge commented Nov 24, 2025

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=x86_64/amd/zen2

@eessi-bot-aws

eessi-bot-aws bot commented Nov 24, 2025

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: amd-zen2
Building for: x86_64/amd/zen2
Job dir: /project/def-users/SHARED/jobs/2025.11/pr_1314/107414

date | job status | comment
Nov 24 08:37:16 UTC 2025 submitted job id 107414 awaits release by job manager
Nov 24 08:38:05 UTC 2025 released job awaits launch by Slurm scheduler
Nov 24 08:44:08 UTC 2025 running job 107414 is running
Nov 25 08:43:51 UTC 2025 finished
🤷 UNKNOWN (click triangle for detailed information)
  • Job results file _bot_job107414.result does not exist in job directory or reading it failed.
  • No artefacts were found/reported.
Nov 25 08:43:51 UTC 2025 test result
🤷 UNKNOWN (click triangle for detailed information)
  • Job test file _bot_job107414.test does not exist in job directory or reading it failed.

@bedroge
Collaborator Author

bedroge commented Nov 26, 2025

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=x86_64/amd/zen2

@eessi-bot-aws

eessi-bot-aws bot commented Nov 26, 2025

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: amd-zen2
Building for: x86_64/amd/zen2
Job dir: /project/def-users/SHARED/jobs/2025.11/pr_1314/108012

date | job status | comment
Nov 26 09:18:13 UTC 2025 submitted job id 108012 awaits release by job manager
Nov 26 09:18:58 UTC 2025 released job awaits launch by Slurm scheduler
Nov 26 09:27:16 UTC 2025 running job 108012 is running
Nov 27 15:18:12 UTC 2025 finished
🤷 UNKNOWN (click triangle for detailed information)
  • Job results file _bot_job108012.result does not exist in job directory or reading it failed.
  • No artefacts were found/reported.
Nov 27 15:18:12 UTC 2025 test result
🤷 UNKNOWN (click triangle for detailed information)
  • Job test file _bot_job108012.test does not exist in job directory or reading it failed.
Nov 27 15:55:27 UTC 2025 released job awaits launch by Slurm scheduler
Nov 27 15:55:29 UTC 2025 running job 108012 is running
Nov 30 10:37:22 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-108012.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-x86_64-amd-zen2-17644988710.tar.zst
size: 0 MiB (22 bytes)
entries: 0
modules under 2025.06/software/linux/x86_64/amd/zen2/modules/all
no module files in tarball
software under 2025.06/software/linux/x86_64/amd/zen2/software
no software packages in tarball
reprod directories under 2025.06/software/linux/x86_64/amd/zen2/reprod
no reprod directories in tarball
other under 2025.06/software/linux/x86_64/amd/zen2
no other files in tarball
Nov 30 10:37:22 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:x86_64_amd_zen2+default
P: latency: 1.33 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:x86_64_amd_zen2+default
P: latency: 2.03 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:x86_64_amd_zen2+default
P: latency: 0.18 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:x86_64_amd_zen2+default
P: bandwidth: 7944.29 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-108012.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
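The build checks in the Details list above amount to searching the Slurm output for a handful of patterns. Below is a rough sketch of that idea, not the bot's actual implementation: the pattern strings are taken from the comment, while the `BUILD_CHECKS` table, the `evaluate` function and the sample log are assumptions made for illustration.

```python
import re

# (pattern, must_match): must_match=True means the message is expected to be
# present; False means its presence counts as a failed check.
# The good/bad direction is inferred from the check marks in the bot comment.
BUILD_CHECKS = [
    (r"FATAL:", False),
    (r"ERROR:", False),
    (r"FAILED:", False),
    (r"required modules missing:", False),
    (r"No missing installations", True),
    (r"\.tar\..* created!", True),
]


def evaluate(log_text: str, checks=BUILD_CHECKS) -> dict[str, bool]:
    """Return {pattern: passed} for every check against the given log text."""
    results = {}
    for pattern, must_match in checks:
        found = re.search(pattern, log_text) is not None
        results[pattern] = found == must_match
    return results


# Hypothetical two-line log, only to exercise the checks.
sample_log = "ERROR: build failed\nsome-name.tar.zst created!\n"
for pattern, passed in evaluate(sample_log).items():
    print("ok  " if passed else "FAIL", pattern)
```

With a table like this, a job can report a created tarball and still fail overall, which matches the 22-byte, zero-entry artefact reported above.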

@bedroge
Collaborator Author

bedroge commented Dec 1, 2025

= 1083 passed, 216 skipped, 2274 deselected, 33 xfailed in 74793.77s (20:46:33) =
The following tests failed and then succeeded when run in a new process['test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_nn_functional_max_pool2d_cpu_float32']

FINISHED PRINTING LOG FILE of inductor/test_torchinductor_opinfo 1/1 (test/test-reports/inductor.test_torchinductor_opinfo_1.1_596a3f74e8b7a107_.log)

Finished inductor/test_torchinductor_opinfo 1/1 in 4647 minutes
Running test batch 'tests to run' cost 343554.5 seconds
dynamo/test_inline_inbuilt_nn_modules 1/1 failed!
dynamo/test_misc 1/1 failed!
dynamo/test_dynamic_shapes 1/1 failed!
inductor/test_cpu_select_algorithm 1/1 failed!
inductor/test_aot_inductor_arrayref 1/1 failed!
inductor/test_minifier 1/1 failed!
inductor/test_torchinductor 1/1 failed!
inductor/test_torchinductor_codegen_dynamic_shapes 1/1 failed!
inductor/test_torchinductor_dynamic_shapes 1/1 failed!
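To get an overview of which areas are affected, the "failed!" lines can be pulled out of the log and grouped by their top-level directory. This is a small sketch that assumes the line layout shown above; the function names are hypothetical and the excerpt is a shortened copy of the log.

```python
import re
from collections import Counter

# Matches lines such as "inductor/test_minifier 1/1 failed!".
FAILED_SUITE = re.compile(r"^(?P<suite>\S+) \d+/\d+ failed!$")


def failed_suites(log_text: str) -> list[str]:
    """Collect the names of all test suites reported as failed."""
    suites = []
    for line in log_text.splitlines():
        match = FAILED_SUITE.match(line.strip())
        if match:
            suites.append(match.group("suite"))
    return suites


def failures_by_area(suites: list[str]) -> Counter:
    """Count failed suites per top-level test directory (dynamo, inductor, ...)."""
    return Counter(suite.split("/", 1)[0] for suite in suites)


excerpt = """\
Running test batch 'tests to run' cost 343554.5 seconds
dynamo/test_misc 1/1 failed!
inductor/test_minifier 1/1 failed!
inductor/test_torchinductor 1/1 failed!
"""
suites = failed_suites(excerpt)
print(suites)
print(failures_by_area(suites))
```

Applied to the full list above, this would give three dynamo suites and six inductor suites.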

I'm seeing lots of errors that look like the following one (i.e. with RuntimeError: Error in dlopen: /tmp/eb-t_juiyuo/eb-aellvs1_/tmpapw1hb5b/aoti_eager/aten/cpu/lib/c45k6o4iha4bbbscgaq7t2fd6caoqtits2gxv7jnufcgenkvpvl6/cdibapb6eia4jjqas5njvxctz6dw4dscm6ydmurozcgsbgn4pzxu.so: cannot enable executable stack as shared object requires: Invalid argument):

_____ DynamicShapesCpuTests.test_aoti_eager_with_scalar_dynamic_shapes_cpu _____
Traceback (most recent call last):
  File "/tmp/eb-t_juiyuo/eb-aellvs1_/tmp14e4l5qq/lib/python3.12/site-packages/torch/testing/_comparison.py", line 1232, in not_close_error_metas
    pair.compare()
  File "/tmp/eb-t_juiyuo/eb-aellvs1_/tmp14e4l5qq/lib/python3.12/site-packages/torch/testing/_comparison.py", line 711, in compare
    self._compare_values(actual, expected)
  File "/tmp/eb-t_juiyuo/eb-aellvs1_/tmp14e4l5qq/lib/python3.12/site-packages/torch/testing/_comparison.py", line 841, in _compare_values
    compare_fn(
  File "/tmp/eb-t_juiyuo/eb-aellvs1_/tmp14e4l5qq/lib/python3.12/site-packages/torch/testing/_comparison.py", line 1020, in _compare_regular_values_close
    matches = torch.isclose(
              ^^^^^^^^^^^^^^
RuntimeError: Error in dlopen: /tmp/eb-t_juiyuo/eb-aellvs1_/tmpapw1hb5b/aoti_eager/aten/cpu/lib/c45k6o4iha4bbbscgaq7t2fd6caoqtits2gxv7jnufcgenkvpvl6/cdibapb6eia4jjqas5njvxctz6dw4dscm6ydmurozcgsbgn4pzxu.so: cannot enable executable stack as shared object requires: Invalid argument
Exception raised from DynamicLibrary at /tmp/bot/easybuild/build/PyTorch/2.6.0/foss-2024a/pytorch-v2.6.0/aten/src/ATen/DynamicLibrary.cpp:36 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 at::DynamicLibrary::DynamicLibrary(char const*, char const*, bool) [clone .cold] from DynamicLibrary.cpp:0
#7 torch::inductor::AOTIModelContainerRunner::AOTIModelContainerRunner(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0
#8 torch::inductor::AOTIModelContainerRunnerCpu::AOTIModelContainerRunnerCpu(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long) from ??:0
#9 torch::inductor::AOTIPythonKernelHolder::load_aoti_model_runner(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0
#10 torch::inductor::AOTIPythonKernelHolder::cache_miss(c10::OperatorHandle const&, c10::DispatchKeySet const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from ??:0
#11 torch::inductor::AOTIPythonKernelHolder::operator()(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from ??:0
#12 at::_ops::add_Scalar::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) from ??:0
#13 torch::autograd::VariableType::(anonymous namespace)::add_Scalar(c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) from VariableType_2.cpp:0
#14 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&, c10::Scalar const&), &torch::autograd::VariableType::(anonymous namespace)::add_Scalar>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&, c10::Scalar const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&, c10::Scalar const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) from VariableType_2.cpp:0
#15 at::_ops::add_Scalar::call(at::Tensor const&, c10::Scalar const&, c10::Scalar const&) from ??:0
#16 at::native::isclose(at::Tensor const&, at::Tensor const&, double, double, bool) from ??:0
#17 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&, double, double, bool), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__isclose>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, double, double, bool> >, at::Tensor (at::Tensor const&, at::Tensor const&, double, double, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, double, double, bool) from RegisterCompositeImplicitAutograd.cpp:0
#18 at::_ops::isclose::call(at::Tensor const&, at::Tensor const&, double, double, bool) from ??:0
#19 torch::autograd::THPVariable_isclose(_object*, _object*, _object*) from python_torch_functions_0.cpp:0
#20 cfunction_call from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/methodobject.c:537
#21 _PyObject_MakeTpCall from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/call.c:240
#22 _PyEval_EvalFrameDefault from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Python/bytecodes.c:2706
#23 _PyFunction_Vectorcall from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/call.c:419
#24 PyCFunction_Call from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/call.c:387
#25 _PyFunction_Vectorcall from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/call.c:419
#26 _PyVectorcall_Call from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/call.c:283
#27 PyCFunction_Call from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/call.c:387
#28 _PyObject_FastCallDictTstate from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/call.c:144
#29 _PyObject_Call_Prepend from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/call.c:508
#30 slot_tp_call from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/typeobject.c:8770
#31 _PyObject_MakeTpCall from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/call.c:240
#32 _PyEval_EvalFrameDefault from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Python/bytecodes.c:2706
#33 _PyObject_FastCallDictTstate from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/call.c:144
#34 _PyObject_Call_Prepend from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/call.c:508
#35 slot_tp_call from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/typeobject.c:8770
#36 _PyObject_Call from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/call.c:367
#37 PyCFunction_Call from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/call.c:387
#38 _PyObject_FastCallDictTstate from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/call.c:144
#39 _PyObject_Call_Prepend from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/call.c:508
#40 slot_tp_call from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/typeobject.c:8770
#41 _PyObject_MakeTpCall from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/call.c:240
#42 _PyEval_EvalFrameDefault from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Python/bytecodes.c:2706
#43 _PyObject_FastCallDictTstate from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/call.c:144
#44 _PyObject_Call_Prepend from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/call.c:508
#45 slot_tp_call from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/typeobject.c:8770
#46 _PyObject_MakeTpCall from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/call.c:240
#47 _PyEval_EvalFrameDefault from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Python/bytecodes.c:2706
#48 _PyObject_FastCallDictTstate from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/call.c:144
#49 _PyObject_Call_Prepend from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/call.c:508
#50 slot_tp_call from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/typeobject.c:8770
#51 _PyObject_MakeTpCall from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Objects/call.c:240
#52 _PyEval_EvalFrameDefault from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Python/bytecodes.c:2706
#53 PyEval_EvalCode from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Python/ceval.c:578
#54 run_eval_code_obj from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Python/pythonrun.c:1722
#55 run_mod from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Python/pythonrun.c:1743
#56 pyrun_file from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Python/pythonrun.c:1643
#57 _PyRun_SimpleFileObject from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Python/pythonrun.c:433
#58 _PyRun_AnyFileObject from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Python/pythonrun.c:78
#59 pymain_run_file_obj from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Modules/main.c:360
#60 Py_BytesMain from /tmp/bot/easybuild/build/Python/3.12.3/GCCcore-13.3.0/Python-3.12.3/Modules/main.c:763
#61 __libc_init_first from ??:0
#62 __libc_start_main from ??:0
#63 _start from ??:0


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/cvmfs/software.eessi.io/versions/2025.06/software/linux/x86_64/amd/zen2/software/Python/3.12.3-GCCcore-13.3.0/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/cvmfs/software.eessi.io/versions/2025.06/software/linux/x86_64/amd/zen2/software/Python/3.12.3-GCCcore-13.3.0/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/cvmfs/software.eessi.io/versions/2025.06/software/linux/x86_64/amd/zen2/software/Python/3.12.3-GCCcore-13.3.0/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/tmp/eb-t_juiyuo/eb-aellvs1_/tmp14e4l5qq/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3108, in wrapper
    method(*args, **kwargs)
  File "/tmp/bot/easybuild/build/PyTorch/2.6.0/foss-2024a/pytorch-v2.6.0/test/inductor/test_torchinductor.py", line 11945, in new_test
    return value(self)
           ^^^^^^^^^^^
  File "/tmp/eb-t_juiyuo/eb-aellvs1_/tmp14e4l5qq/lib/python3.12/site-packages/torch/_dynamo/testing.py", line 413, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/tmp/eb-t_juiyuo/eb-aellvs1_/tmp14e4l5qq/lib/python3.12/site-packages/torch/testing/_internal/inductor_utils.py", line 95, in inner
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/bot/easybuild/build/PyTorch/2.6.0/foss-2024a/pytorch-v2.6.0/test/inductor/test_torchinductor.py", line 769, in wrapper
    return fn(self)
           ^^^^^^^^
  File "/tmp/bot/easybuild/build/PyTorch/2.6.0/foss-2024a/pytorch-v2.6.0/test/inductor/test_torchinductor.py", line 804, in wrapper
    return fn(self)
           ^^^^^^^^
  File "/tmp/eb-t_juiyuo/eb-aellvs1_/tmp14e4l5qq/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 1955, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/tmp/bot/easybuild/build/PyTorch/2.6.0/foss-2024a/pytorch-v2.6.0/test/inductor/test_torchinductor.py", line 1171, in test_aoti_eager_with_scalar
    self.assertEqual(ref_values, res_values)
  File "/tmp/eb-t_juiyuo/eb-aellvs1_/tmp14e4l5qq/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3977, in assertEqual
    error_metas = not_close_error_metas(
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/eb-t_juiyuo/eb-aellvs1_/tmp14e4l5qq/lib/python3.12/site-packages/torch/testing/_comparison.py", line 1238, in not_close_error_metas
    raise RuntimeError(
RuntimeError: Comparing

TensorOrArrayPair(
    id=(0,),
    actual=tensor([-0.3749, -1.0398, -1.4775,  0.0478,  1.3446,  0.9296, -0.6458, -1.9780,
        -2.2307,  1.5803, -1.4260, -1.2113,  0.1866, -1.9466,  1.1303,  0.1007,
         0.1859,  0.1307, -0.2456, -3.3054, -0.9886, -0.2530,  0.5858, -0.3569,
        -1.6881, -1.8160,  1.1661, -1.9253, -2.1287,  0.2440, -0.8848, -0.8743,
        -1.6719, -0.0171, -2.1621, -0.1776, -0.2201,  0.0083, -2.2329,  0.9893,
        -0.9836, -1.3098,  0.9148, -0.4542,  1.8239, -1.3004,  0.5746,  0.1402,
         1.7108, -0.7070, -2.1513, -2.7559, -1.1750,  0.1005, -1.8574,  1.5740,
         2.0443,  2.0384,  0.4070,  1.5856, -0.0286,  0.4480, -2.8703,  2.1453,
         0.1312,  1.5391, -0.0905,  1.5824,  1.7087, -0.4310,  0.4309, -0.2101,
         0.7917, -1.0633, -0.2346, -0.1480,  0.7494, -1.0332, -1.8817, -1.1881,
         0.3736, -1.7951, -1.8632, -1.4780, -0.5484,  1.1147, -0.4983,  0.3642,
         1.5814, -1.8542,  0.7143, -1.4033,  1.1840,  4.1290, -1.8060, -0.0688,
         1.3019,  1.6532, -0.2587,  2.7877,  1.5062,  1.4293, -0.0522, -0.2692,
        -0.3193, -2.4854, -3.0724,  2.0743,  0.3470,  0.1846, -1.1006,  1.1392,
         0.7374,  0.0552,  2.7090,  2.1603, -0.4780,  0.0124,  0.8840, -0.1060,
         0.1078, -0.8637, -1.4056,  1.6709, -0.3648, -0.7818,  1.7600,  0.0783]),
    expected=tensor([-0.3749, -1.0398, -1.4775,  0.0478,  1.3446,  0.9296, -0.6458, -1.9780,
        -2.2307,  1.5803, -1.4260, -1.2113,  0.1866, -1.9466,  1.1303,  0.1007,
         0.1859,  0.1307, -0.2456, -3.3054, -0.9886, -0.2530,  0.5858, -0.3569,
        -1.6881, -1.8160,  1.1661, -1.9253, -2.1287,  0.2440, -0.8848, -0.8743,
        -1.6719, -0.0171, -2.1621, -0.1776, -0.2201,  0.0083, -2.2329,  0.9893,
        -0.9836, -1.3098,  0.9148, -0.4542,  1.8239, -1.3004,  0.5746,  0.1402,
         1.7108, -0.7070, -2.1513, -2.7559, -1.1750,  0.1005, -1.8574,  1.5740,
         2.0443,  2.0384,  0.4070,  1.5856, -0.0286,  0.4480, -2.8703,  2.1453,
         0.1312,  1.5391, -0.0905,  1.5824,  1.7087, -0.4310,  0.4309, -0.2101,
         0.7917, -1.0633, -0.2346, -0.1480,  0.7494, -1.0332, -1.8817, -1.1881,
         0.3736, -1.7951, -1.8632, -1.4780, -0.5484,  1.1147, -0.4983,  0.3642,
         1.5814, -1.8542,  0.7143, -1.4033,  1.1840,  4.1290, -1.8060, -0.0688,
         1.3019,  1.6532, -0.2587,  2.7877,  1.5062,  1.4293, -0.0522, -0.2692,
        -0.3193, -2.4854, -3.0724,  2.0743,  0.3470,  0.1846, -1.1006,  1.1392,
         0.7374,  0.0552,  2.7090,  2.1603, -0.4780,  0.0124,  0.8840, -0.1060,
         0.1078, -0.8637, -1.4056,  1.6709, -0.3648, -0.7818,  1.7600,  0.0783]),
    rtol=1.3e-06,
    atol=1e-05,
    equal_nan=True,
    check_device=False,
    check_dtype=True,
    check_layout=False,
    check_stride=False,
)

resulted in the unexpected exception above. If you are a user and see this message during normal operation please file an issue at https://github.com/pytorch/pytorch/issues. If you are a developer and working on the comparison functions, please except the previous error and raise an expressive `ErrorMeta` instead.

To execute this test, run the following from the base repo dir:
    python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesCpuTests.test_aoti_eager_with_scalar_dynamic_shapes_cpu

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

@Flamefire have you seen such errors before by any chance?

@bedroge
Collaborator Author

bedroge commented Dec 2, 2025

This may be due to glibc being too new, according to conda-forge/pytorch-cpu-feedstock#350 (comment)?

@Flamefire
Contributor

RuntimeError: Error in dlopen: /tmp/eb-t_juiyuo/eb-aellvs1_/tmpapw1hb5b/aoti_eager/aten/cpu/lib/c45k6o4iha4bbbscgaq7t2fd6caoqtits2gxv7jnufcgenkvpvl6/cdibapb6eia4jjqas5njvxctz6dw4dscm6ydmurozcgsbgn4pzxu.so

is /tmp mounted with noexec? That would be similar to failures I've seen

What is really strange: The dlopen error leads to a tensor comparison error which doesn't make sense to me

You could also try with 2.7.1: easybuilders/easybuild-easyconfigs#23923

@bedroge
Collaborator Author

bedroge commented Dec 2, 2025

is /tmp mounted with noexec? That would be similar to failures I've seen

I checked in our build container, but /tmp doesn't seem to be mounted with that option:

/dev/mapper/rocky-root on /tmp type xfs (rw,nosuid,nodev,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/mapper/rocky-root on /tmp/bedroge/EESSI/bot_job_tmp_Gyw type xfs (rw,nosuid,nodev,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
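The same check can be scripted when it needs to run inside many build jobs. This is a minimal sketch that parses one line of `mount` output in the layout pasted above; `MOUNT_LINE`, `mount_options` and `is_noexec` are hypothetical names.

```python
import re

# Layout of one line of `mount` output as pasted above:
#   <device> on <mount point> type <fstype> (<comma-separated options>)
MOUNT_LINE = re.compile(
    r"^(?P<device>\S+) on (?P<target>.+) type (?P<fstype>\S+) \((?P<options>[^)]*)\)$"
)


def mount_options(line: str) -> set[str]:
    """Return the set of mount options found in one `mount` output line."""
    match = MOUNT_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised mount line: {line!r}")
    return set(match.group("options").split(","))


def is_noexec(line: str) -> bool:
    """True if the file system on this line is mounted with noexec."""
    return "noexec" in mount_options(line)


# First line from the comment above.
tmp_line = (
    "/dev/mapper/rocky-root on /tmp type xfs "
    "(rw,nosuid,nodev,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)"
)
print(is_noexec(tmp_line))
```

For the line shown, the option set contains `nosuid` and `nodev` but not `noexec`, in line with the conclusion above.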

@bedroge
Collaborator Author

bedroge commented Dec 2, 2025

Also see a lot of these:

/cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/usr/bin/ld: warning: /tmp/eb-t_juiyuo/eb-aellvs1_/tmp849zzs9r/aoti_eager/aten/cpu/lib/c7x4oj4izckovwev6hadgddxoy6btcszbrgfthgctqkjpdxrxo3t/c2yzlj3ykdhw6x6cfv6hvl2y7guossiczufdgvvt5x4xdyr6tfb3/cgxre62qccm2hvjt32u3oobekucw5zks6ukxlzod6km7a4telymf.o: missing .note.GNU-stack section implies executable stack
/cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/usr/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
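Whether a linked shared object actually requests an executable stack is recorded in its `PT_GNU_STACK` program header, so the generated `.so` files could be inspected directly. The sketch below reads that header from a 64-bit little-endian ELF image. It assumes that a missing header means "executable", which is the traditional x86_64 default that the linker warning alludes to; `fake_elf` is a hypothetical helper that only builds a header-sized test image.

```python
import struct

PT_GNU_STACK = 0x6474E551  # program header type that records stack permissions
PF_X = 0x1                 # "executable" bit in p_flags


def requests_exec_stack(elf: bytes) -> bool:
    """Return True if a 64-bit little-endian ELF image asks for an executable stack.

    A missing PT_GNU_STACK header is treated as "executable" (assumption, see above).
    """
    if elf[:4] != b"\x7fELF" or elf[4] != 2 or elf[5] != 1:
        raise ValueError("expected a 64-bit little-endian ELF image")
    (e_phoff,) = struct.unpack_from("<Q", elf, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", elf, 0x36)
    for index in range(e_phnum):
        # p_type and p_flags are the first two 32-bit fields of an ELF64 program header.
        p_type, p_flags = struct.unpack_from("<II", elf, e_phoff + index * e_phentsize)
        if p_type == PT_GNU_STACK:
            return bool(p_flags & PF_X)
    return True


def fake_elf(stack_flags):
    """Build a header-only ELF image for demonstration (hypothetical helper)."""
    header = bytearray(64)
    header[:6] = b"\x7fELF\x02\x01"
    struct.pack_into("<Q", header, 0x20, 64)  # program headers follow the ELF header
    struct.pack_into("<HH", header, 0x36, 56, 0 if stack_flags is None else 1)
    if stack_flags is None:
        return bytes(header)
    phdr = bytearray(56)
    struct.pack_into("<II", phdr, 0, PT_GNU_STACK, stack_flags)
    return bytes(header + phdr)


print(requests_exec_stack(fake_elf(0x6)))  # flags R+W
print(requests_exec_stack(fake_elf(0x7)))  # flags R+W+X
```

On a real file the bytes would come from `open(path, "rb").read()`; `readelf -lW` shows the same `GNU_STACK` entry.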

@bedroge
Collaborator Author

bedroge commented Dec 2, 2025

You could also try with 2.7.1: easybuilders/easybuild-easyconfigs#23923

That will require some more work, e.g. #1278 needs to be deployed first. We don't have a CPU-only version of 2.7.1 as far as I can see?

@Flamefire
Contributor

Also see a lot of these:

/cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/usr/bin/ld: warning: /tmp/eb-t_juiyuo/eb-aellvs1_/tmp849zzs9r/aoti_eager/aten/cpu/lib/c7x4oj4izckovwev6hadgddxoy6btcszbrgfthgctqkjpdxrxo3t/c2yzlj3ykdhw6x6cfv6hvl2y7guossiczufdgvvt5x4xdyr6tfb3/cgxre62qccm2hvjt32u3oobekucw5zks6ukxlzod6km7a4telymf.o: missing .note.GNU-stack section implies executable stack
/cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/usr/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker

This could be the culprit: I see some Google results suggesting that "cannot enable executable stack as shared object requires: Invalid argument" can be fixed with execstack, which patches(?) a binary

ld -z noexecstack seems to be an option to avoid it

But isn't ld supposed to be from the binutils module?
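When linking goes through the compiler driver rather than calling `ld` directly, the option is usually passed as `-Wl,-z,noexecstack`. The helper below is only a sketch of adding that flag to an existing flags string without duplicating it. The name `with_noexecstack` is hypothetical, and whether the compile step that produces these `.so` files honours such flags has not been established in this thread.

```python
import shlex

# GCC/Clang driver spelling of `ld -z noexecstack`.
NOEXECSTACK = "-Wl,-z,noexecstack"


def with_noexecstack(ldflags: str) -> str:
    """Return the flags string with the noexecstack linker option appended once."""
    parts = shlex.split(ldflags)
    if NOEXECSTACK not in parts:
        parts.append(NOEXECSTACK)
    return shlex.join(parts)


print(with_noexecstack("-L/opt/lib -lm"))
```

The function is idempotent, so it can be applied to an environment variable such as `LDFLAGS` more than once without growing the string.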

@Flamefire
Contributor

That will require some more work, e.g. #1278 needs to be deployed first. We don't have a CPU-only version of 2.7.1 as far as I can see?

I stopped creating a CPU-only version after users complained that the "strong(ly) GPU-accelerated" module doesn't support GPUs at all.
As the GPU version works on CPU-only machines, IMO there isn't much reason to do the work required to remove the CUDA dependencies and make sure the tests can handle that.

@bedroge
Collaborator Author

bedroge commented Dec 2, 2025

Also see a lot of these:

/cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/usr/bin/ld: warning: /tmp/eb-t_juiyuo/eb-aellvs1_/tmp849zzs9r/aoti_eager/aten/cpu/lib/c7x4oj4izckovwev6hadgddxoy6btcszbrgfthgctqkjpdxrxo3t/c2yzlj3ykdhw6x6cfv6hvl2y7guossiczufdgvvt5x4xdyr6tfb3/cgxre62qccm2hvjt32u3oobekucw5zks6ukxlzod6km7a4telymf.o: missing .note.GNU-stack section implies executable stack
/cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/usr/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker

This could be the culprit: I see some Google results suggesting that "cannot enable executable stack as shared object requires: Invalid argument" can be fixed with execstack, which patches(?) a binary

ld -z noexecstack seems to be an option to avoid it

I also found similar results, and when searching in the PyTorch repo I also found a commit that adds that flag here:
pytorch/pytorch@a2d0ef2#diff-4ee3c0dada05422c0338ca187f9157805e616f6f5bece2fa0752fdef16cc3733R185
But since that's part of convert_cubin_to_obj, I guess it may not be relevant for non-CUDA builds (?).

But isn't ld supposed to be from the binutils module?

We filter binutils in EESSI (https://github.com/EESSI/software-layer-scripts/blob/main/EESSI-extend-easybuild.eb#L48), so in that sense it's correct that it's picking up this ld.

I stopped creating a CPU-only version after users complained that the "strong(ly) GPU-accelerated" module doesn't support GPUs at all.
As the GPU version works on CPU-only machines, IMO there isn't much reason to do the work required to remove the CUDA dependencies and make sure the tests can handle that.

That definitely makes sense! I wanted to try the CPU-only version first, as I imagined it would cause fewer build issues 😅. In that case I'll wait until CUDA for 2025.06 is ingested, and will then give 2.7.1 a try.


Labels

2025.06-software.eessi.io 2025.06 version of software.eessi.io


2 participants