Check if the compute capability of the host GPU matches the requested one #258

Open
bedroge wants to merge 16 commits into EESSI:main from bedroge:accel_cc_check

bedroge commented Jun 19, 2026

Solves https://gitlab.com/eessi/support/-/work_items/257.

While working on this, I also found a small issue with the way multiple accelerator builds are done: each subsequent build appends another --resume flag, because the flags keep being added to BUILD_STEP_ARGS. That is now solved by resetting the variable on every iteration; the accelerator-specific flags are then added inside the loop.
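
To illustrate the reset described above, here is a minimal sketch, not the actual code from bot/build.sh. Only BUILD_STEP_ARGS and --resume come from this PR; COMMON_ARGS, the accelerator list, and the tarball paths are made up for illustration.

```shell
# Flags shared by every accelerator build (hypothetical values).
COMMON_ARGS="--save /tmp/build_step"

for accel in nvidia/cc70 nvidia/cc90; do
    # Reset on every iteration so flags from the previous build do not pile up.
    BUILD_STEP_ARGS="${COMMON_ARGS}"

    # Add the accelerator-specific flags inside the loop.
    # The tarball path is made up for illustration.
    resume_tgz="/tmp/resume-$(echo "${accel}" | tr '/' '-').tgz"
    BUILD_STEP_ARGS="${BUILD_STEP_ARGS} --resume ${resume_tgz}"

    echo "${accel}: ${BUILD_STEP_ARGS}"
done
```

Without the reset at the top of the loop body, the second iteration would carry two --resume flags.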

@casparvl

Can you provide a way to test this PR? E.g. add an easystack for which the build was failing without this feature, and then prove that it works with this feature?

casparvl commented Jun 23, 2026

Support meeting:
@bedroge initially had the issue on an older version of GROMACS, but that PR is already merged, so we can't use it for testing. We can try to trigger EESSI/software-layer#1524 on Grace + CC70, which should then hopefully fail. If it does, we can use the same easyconfig to test the fix here. Also, the logs should make clear that the --nv flag is then only passed for native builds, and not for cross-compiled builds.

@bedroge: what if you're running a Slurm job in which you request 1 GPU, but then see all GPUs inside the job? For now, it just checks the first one, which is probably the one you'd run on anyway.
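
A rough sketch of what "checking the first one" could look like, assuming the compute capability is read with `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` (one line per visible GPU, e.g. `9.0`). The function names are hypothetical and this is not necessarily how the PR implements it; sample output is piped in so the snippet also runs on a machine without a GPU.

```shell
# Extract the digits after "cc" from a target such as "nvidia/cc70" -> "70".
requested_cc() {
    echo "${1##*cc}"
}

# Read nvidia-smi style output on stdin, keep only the first GPU,
# and drop the dot: "9.0" -> "90".
first_gpu_cc() {
    head -n 1 | tr -d '. '
}

# Succeeds if the first GPU on stdin matches the requested target in $1.
cc_matches() {
    [ "$(first_gpu_cc)" = "$(requested_cc "$1")" ]
}

# Sample output for a node with two GPUs (assumed values).
sample_output='9.0
9.0'

if printf '%s\n' "${sample_output}" | cc_matches "nvidia/cc70"; then
    echo "match"
else
    echo "mismatch"
fi
```

On a real node the sample output would be replaced by the nvidia-smi call itself. Any GPUs beyond the first are ignored, which is exactly the limitation raised in the question above.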

@casparvl

Could not reproduce the original issue with EESSI/software-layer#1524 (comment)

Oh well, I guess we then just check if the --nv flag is indeed only set for non-cross-compiled targets. Let's try...

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc on:arch=aarch64/nvidia/grace for:arch=aarch64/nvidia/grace,accel=nvidia/cc70
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace,accel=nvidia/cc90

eessi-bot-jsc Bot commented Jun 23, 2026

New job on instance eessi-bot-jsc for repository eessi.io-2025.06-software
Building on: nvidia-grace
Building for: aarch64/nvidia/grace and accelerator nvidia/cc70
Job dir: /p/project1/ceasybuilders/eessibot/jobs/2026.06/pr_258/15245239

| date | job status | comment |
| --- | --- | --- |
| Jun 23 14:06:41 UTC 2026 | submitted | job id 15245239 awaits release by job manager |
| Jun 23 14:07:39 UTC 2026 | released | job awaits launch by Slurm scheduler |
| Jun 23 14:08:44 UTC 2026 | running | job 15245239 is running |

Jun 23 14:14:12 UTC 2026 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-15245239.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-nvidia-grace-accel-nvidia-cc70-17822237420.tar.gz
size: 0 MiB (2176 bytes)
entries: 1
modules under 2025.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc70/modules/all
no module files in tarball
software under 2025.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc70/software
no software packages in tarball
reprod directories under 2025.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc70/reprod
no reprod directories in tarball
other under 2025.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc70
no other files in tarball
Jun 23 14:14:12 UTC 2026 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite produced failures.
ReFrame Summary
[ FAILED ] Ran 17/29 test case(s) from 29 check(s) (4 failure(s), 12 skipped, 0 aborted)
Details
✅ job output file slurm-15245239.out
❌ found message matching ERROR:
❌ found message matching [\s*FAILED\s*].*Ran .* test case

eessi-bot-jsc Bot commented Jun 23, 2026

New job on instance eessi-bot-jsc for repository eessi.io-2025.06-software
Building on: nvidia-grace and accelerator nvidia/cc90
Building for: aarch64/nvidia/grace and accelerator nvidia/cc90
Job dir: /p/project1/ceasybuilders/eessibot/jobs/2026.06/pr_258/15245240

| date | job status | comment |
| --- | --- | --- |
| Jun 23 14:06:47 UTC 2026 | submitted | job id 15245240 awaits release by job manager |
| Jun 23 14:07:35 UTC 2026 | released | job awaits launch by Slurm scheduler |
| Jun 23 14:08:48 UTC 2026 | running | job 15245240 is running |

Jun 23 14:13:07 UTC 2026 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-15245240.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-nvidia-grace-accel-nvidia-cc90-17822237300.tar.gz
size: 0 MiB (2176 bytes)
entries: 1
modules under 2025.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90/modules/all
no module files in tarball
software under 2025.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90/software
no software packages in tarball
reprod directories under 2025.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90/reprod
no reprod directories in tarball
other under 2025.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90
no other files in tarball
Jun 23 14:13:07 UTC 2026 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite produced failures.
ReFrame Summary
[ FAILED ] Ran 17/29 test case(s) from 29 check(s) (4 failure(s), 12 skipped, 0 aborted)
Details
✅ job output file slurm-15245240.out
❌ found message matching ERROR:
❌ found message matching [\s*FAILED\s*].*Ran .* test case

casparvl left a comment

On the Grace + CC70 run:

NVIDIA-SMI version  : 595.71.05
NVML version        : 595.71
DRIVER version      : 595.71.05
CUDA Version        : 13.2
Command 'nvidia-smi' found.
ESC[31mError: the compute capability of the GPU (90) does not match the requested compute capability (70).ESC[0m
bot/build.sh: EESSI_ACCELERATOR_TARGET_OVERRIDE='accel/nvidia/cc70'
Executing command to build software:
/p/project1/ceasybuilders/eessibot/jobs/2026.06/pr_258/event_bea374b0-6f0c-11f1-92d2-7f511b57d0af/run_000/aarch64/nvidia/grace/nvidia/cc70/eessi.io-2025.06-software/eessi_container.sh --verbose --access rw --mode run --container docker://ghcr.io/eessi/build-node:debian12 --repository eessi.io-2025.06-software --extra-bind-paths /p/project1/ceasybuilders/eessibot/jobs/2026.06/pr_258/event_bea374b0-6f0c-11f1-92d2-7f511b57d0af/run_000/aarch64/nvidia/grace/nvidia/cc70/eessi.io-2025.06-software,/dev --pass-through --contain --save /p/project1/ceasybuilders/eessibot/jobs/2026.06/pr_258/event_bea374b0-6f0c-11f1-92d2-7f511b57d0af/run_000/aarch64/nvidia/grace/nvidia/cc70/eessi.io-2025.06-software/previous_tmp/build_step --storage /local/scratch/eessibot/EESSI/eessi_job.Ya4mxkDSZO --host-injections /p/project1/ceasybuilders/eessibot/shared_fs_path/host-injections --nvidia install
                     -- /p/project1/ceasybuilders/eessibot/jobs/2026.06/pr_258/event_bea374b0-6f0c-11f1-92d2-7f511b57d0af/run_000/aarch64/nvidia/grace/nvidia/cc70/eessi.io-2025.06-software/install_software_layer.sh "--build-logs-dir /p/project1/ceasybuilders/eessibot/build_logs_dir --shared-fs-path /p/project1/ceasybuilders/eessibot/shared_fs_path" "" 2>&1 | tee -a build.outerr.5jTj
...
singularity  exec --contain --fusemount container:cvmfs2 cvmfs-config.cern.ch /cvmfs/cvmfs-config.cern.ch --fusemount container:cvmfs2 software.eessi.io /cvmfs_ro/software.eessi.io --fusemount container:unionfs -o cow /tmp/software.eessi.io/overlay-upper=RW:/cvmfs_ro/software.eessi.io=RO /cvmfs/software.eessi.io /local/scratch/eessibot/EESSI/eessi_job.Ya4mxkDSZO/eessi.gzS0v4LKEe/ghcr.io_eessi_build_node_debian12.sif /p/project1/ceasybuilders/eessibot/jobs/2026.06/pr_258/event_bea374b0-6f0c-11f1-92d2-7f511b57d0af/run_000/aarch64/nvidia/grace/nvidia/cc70/eessi.io-2025.06-software/install_software_layer.sh --build-logs-dir /p/project1/ceasybuilders/eessibot/build_logs_dir --shared-fs-path /p/project1/ceasybuilders/eessibot/shared_fs_path

On the Grace + CC90 run:

bot/build.sh: EESSI_VERSION_OVERRIDE: 2025.06
NVIDIA-SMI version  : 595.71.05
NVML version        : 595.71
DRIVER version      : 595.71.05
CUDA Version        : 13.2
Command 'nvidia-smi' found.
ESC[32mRequested compute capability matches the one from the GPU.ESC[0m
bot/build.sh: EESSI_ACCELERATOR_TARGET_OVERRIDE='accel/nvidia/cc90'
Executing command to build software:
/p/project1/ceasybuilders/eessibot/jobs/2026.06/pr_258/event_bea374b0-6f0c-11f1-92d2-7f511b57d0af/run_001/aarch64/nvidia/grace/nvidia/cc90/eessi.io-2025.06-software/eessi_container.sh --verbose --access rw --mode run --container docker://ghcr.io/eessi/build-node:debian12 --repository eessi.io-2025.06-software --extra-bind-paths /p/project1/ceasybuilders/eessibot/jobs/2026.06/pr_258/event_bea374b0-6f0c-11f1-92d2-7f511b57d0af/run_001/aarch64/nvidia/grace/nvidia/cc90/eessi.io-2025.06-software,/dev --pass-through --contain --save /p/project1/ceasybuilders/eessibot/jobs/2026.06/pr_258/event_bea374b0-6f0c-11f1-92d2-7f511b57d0af/run_001/aarch64/nvidia/grace/nvidia/cc90/eessi.io-2025.06-software/previous_tmp/build_step --storage /local/scratch/eessibot/EESSI/eessi_job.l3MgdWYlh0 --host-injections /p/project1/ceasybuilders/eessibot/shared_fs_path/host-injections --nvidia all
                     -- /p/project1/ceasybuilders/eessibot/jobs/2026.06/pr_258/event_bea374b0-6f0c-11f1-92d2-7f511b57d0af/run_001/aarch64/nvidia/grace/nvidia/cc90/eessi.io-2025.06-software/install_software_layer.sh "--build-logs-dir /p/project1/ceasybuilders/eessibot/build_logs_dir --shared-fs-path /p/project1/ceasybuilders/eessibot/shared_fs_path" "" 2>&1 | tee -a build.outerr.NLi0
...
singularity  exec --nv --contain --fusemount container:cvmfs2 cvmfs-config.cern.ch /cvmfs/cvmfs-config.cern.ch --fusemount container:cvmfs2 software.eessi.io /cvmfs_ro/software.eessi.io --fusemount container:unionfs -o cow /tmp/software.eessi.io/overlay-upper=RW:/cvmfs_ro/software.eessi.io=RO /cvmfs/software.eessi.io /local/scratch/eessibot/EESSI/eessi_job.l3MgdWYlh0/eessi.df8oEcdzqY/ghcr.io_eessi_build_node_debian12.sif /p/project1/ceasybuilders/eessibot/jobs/2026.06/pr_258/event_bea374b0-6f0c-11f1-92d2-7f511b57d0af/run_001/aarch64/nvidia/grace/nvidia/cc90/eessi.io-2025.06-software/install_software_layer.sh --build-logs-dir /p/project1/ceasybuilders/eessibot/build_logs_dir --shared-fs-path /p/project1/ceasybuilders/eessibot/shared_fs_path

I think that looks pretty much as intended.
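
For reference, the behaviour visible in the two logs above (`--nvidia install` and no `--nv` on the CC70 run, `--nvidia all` plus `--nv` on the CC90 run) boils down to the selection sketched below. The function name `nvidia_mode` is hypothetical; only the two `--nvidia` values are taken from the logs, and the real script may well structure this differently.

```shell
# Pick the value for the --nvidia option of eessi_container.sh.
# $1: compute capability of the host GPU, $2: requested compute capability.
nvidia_mode() {
    if [ "$1" = "$2" ]; then
        # Native build: expose the GPU inside the build container.
        echo "all"
    else
        # Cross-compilation: only install, do not expose the GPU.
        echo "install"
    fi
}

echo "host 90, requested 70: --nvidia $(nvidia_mode 90 70)"
echo "host 90, requested 90: --nvidia $(nvidia_mode 90 90)"
```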

One thing I don't fully understand, though: you print 'Error: the compute capability...' to stdout, and this should result in the bot returning a FAILURE for the build stage, yet it doesn't. Maybe that's because it's Error: and not ERROR:; I'm not sure how specific the pattern is that the bot searches for. In any case, I don't think we should print this with echo_red as an error. Having your script return a non-zero exit code is fine, but having such a string in stdout feels misleading. It is not an error: the build procedure behaves exactly as intended.
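
A quick way to check the case-sensitivity guess, using plain grep as a stand-in. How the bot actually matches its patterns is not shown in this thread, so this only demonstrates that a case-sensitive search for `ERROR:` would skip the message.

```shell
log_line='Error: the compute capability of the GPU (90) does not match the requested compute capability (70).'

# Case-sensitive search, which is grep's default behaviour.
if printf '%s\n' "${log_line}" | grep -q 'ERROR:'; then
    strict_result="match"
else
    strict_result="no match"
fi

# Case-insensitive search for comparison.
if printf '%s\n' "${log_line}" | grep -qi 'ERROR:'; then
    loose_result="match"
else
    loose_result="no match"
fi

echo "case-sensitive: ${strict_result}"
echo "case-insensitive: ${loose_result}"
```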

I suggest using echo_yellow to print a warning that this build will proceed without making the GPU (of architecture ABC) available in the build container, and will thus effectively be a cross-compilation for target architecture XYZ.
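
Something along these lines, as a sketch only. The `echo_yellow` defined here is a hypothetical stand-in for the existing helper, `warn_cross_compile` is a made-up name, and the exact wording is of course up for discussion.

```shell
# Hypothetical stand-in for the existing echo_yellow helper:
# print the arguments in yellow using ANSI escape codes.
echo_yellow() {
    printf '\033[33m%s\033[0m\n' "$*"
}

# $1: compute capability of the host GPU, $2: requested compute capability.
warn_cross_compile() {
    echo_yellow "Warning: the compute capability of the GPU ($1) differs from the requested one ($2). The GPU will not be made available in the build container, so this build is effectively a cross-compilation for cc$2."
}

warn_cross_compile 90 70
```

This keeps the string `Error:` out of stdout while still telling the reader of the log why the GPU is not passed through.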
