Check if the compute capability of the host GPU matches the requested one #258
bedroge wants to merge 16 commits into
|
Can you provide a way to test this PR? E.g. add an easystack for which the build was failing without this feature, and then prove that it works with this feature? |
|
Support meeting: @bedroge: what if you're running a Slurm job in which you request 1 GPU, but then in your job you see all of them? For now, it just checks the first one, which is probably the one you'd run on anyway. |
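A minimal sketch of the "just check the first GPU" idea, not the code from this PR. It assumes the installed driver's nvidia-smi supports the compute_cap query field; the helper name first_gpu_cc is made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read one compute capability per line on stdin and
# print the first one without the dot (e.g. "9.0" becomes "90").
first_gpu_cc() {
    local first
    # Accept a final line without a trailing newline; fail on empty input.
    IFS= read -r first || [ -n "$first" ] || return 1
    first="${first//[[:space:]]/}"
    [ -n "$first" ] || return 1
    printf '%s\n' "${first/./}"
}

# Only query the driver if nvidia-smi is present on this host.
if command -v nvidia-smi >/dev/null 2>&1; then
    # Assumption: the driver is recent enough to know the compute_cap field.
    host_cc="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | first_gpu_cc)" || host_cc="unknown"
    echo "Compute capability of the first visible GPU: ${host_cc}"
fi
```

If a job sees more GPUs than it requested, this still reports only the first line of the query output, which matches the behaviour described above.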
|
Could not reproduce the original issue with EESSI/software-layer#1524 (comment). Oh well, I guess we then just check if the
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc on:arch=aarch64/nvidia/grace for:arch=aarch64/nvidia/grace,accel=nvidia/cc70 |
|
New job on instance
|
|
New job on instance
|
casparvl
left a comment
On the grace + CC70 run:
NVIDIA-SMI version : 595.71.05
NVML version : 595.71
DRIVER version : 595.71.05
CUDA Version : 13.2
Command 'nvidia-smi' found.
ESC[31mError: the compute capability of the GPU (90) does not match the requested compute capability (70).ESC[0m
bot/build.sh: EESSI_ACCELERATOR_TARGET_OVERRIDE='accel/nvidia/cc70'
Executing command to build software:
/p/project1/ceasybuilders/eessibot/jobs/2026.06/pr_258/event_bea374b0-6f0c-11f1-92d2-7f511b57d0af/run_000/aarch64/nvidia/grace/nvidia/cc70/eessi.io-2025.06-software/eessi_container.sh --verbose --access rw --mode run --container docker://ghcr.io/eessi/build-node:debian12 --repository eessi.io-2025.06-software --extra-bind-paths /p/project1/ceasybuilders/eessibot/jobs/2026.06/pr_258/event_bea374b0-6f0c-11f1-92d2-7f511b57d0af/run_000/aarch64/nvidia/grace/nvidia/cc70/eessi.io-2025.06-software,/dev --pass-through --contain --save /p/project1/ceasybuilders/eessibot/jobs/2026.06/pr_258/event_bea374b0-6f0c-11f1-92d2-7f511b57d0af/run_000/aarch64/nvidia/grace/nvidia/cc70/eessi.io-2025.06-software/previous_tmp/build_step --storage /local/scratch/eessibot/EESSI/eessi_job.Ya4mxkDSZO --host-injections /p/project1/ceasybuilders/eessibot/shared_fs_path/host-injections --nvidia install
-- /p/project1/ceasybuilders/eessibot/jobs/2026.06/pr_258/event_bea374b0-6f0c-11f1-92d2-7f511b57d0af/run_000/aarch64/nvidia/grace/nvidia/cc70/eessi.io-2025.06-software/install_software_layer.sh "--build-logs-dir /p/project1/ceasybuilders/eessibot/build_logs_dir --shared-fs-path /p/project1/ceasybuilders/eessibot/shared_fs_path" "" 2>&1 | tee -a build.outerr.5jTj
...
singularity exec --contain --fusemount container:cvmfs2 cvmfs-config.cern.ch /cvmfs/cvmfs-config.cern.ch --fusemount container:cvmfs2 software.eessi.io /cvmfs_ro/software.eessi.io --fusemount container:unionfs -o cow /tmp/software.eessi.io/overlay-upper=RW:/cvmfs_ro/software.eessi.io=RO /cvmfs/software.eessi.io /local/scratch/eessibot/EESSI/eessi_job.Ya4mxkDSZO/eessi.gzS0v4LKEe/ghcr.io_eessi_build_node_debian12.sif /p/project1/ceasybuilders/eessibot/jobs/2026.06/pr_258/event_bea374b0-6f0c-11f1-92d2-7f511b57d0af/run_000/aarch64/nvidia/grace/nvidia/cc70/eessi.io-2025.06-software/install_software_layer.sh --build-logs-dir /p/project1/ceasybuilders/eessibot/build_logs_dir --shared-fs-path /p/project1/ceasybuilders/eessibot/shared_fs_path
On the Grace + CC90 run:
bot/build.sh: EESSI_VERSION_OVERRIDE: 2025.06
NVIDIA-SMI version : 595.71.05
NVML version : 595.71
DRIVER version : 595.71.05
CUDA Version : 13.2
Command 'nvidia-smi' found.
ESC[32mRequested compute capability matches the one from the GPU.ESC[0m
bot/build.sh: EESSI_ACCELERATOR_TARGET_OVERRIDE='accel/nvidia/cc90'
Executing command to build software:
/p/project1/ceasybuilders/eessibot/jobs/2026.06/pr_258/event_bea374b0-6f0c-11f1-92d2-7f511b57d0af/run_001/aarch64/nvidia/grace/nvidia/cc90/eessi.io-2025.06-software/eessi_container.sh --verbose --access rw --mode run --container docker://ghcr.io/eessi/build-node:debian12 --repository eessi.io-2025.06-software --extra-bind-paths /p/project1/ceasybuilders/eessibot/jobs/2026.06/pr_258/event_bea374b0-6f0c-11f1-92d2-7f511b57d0af/run_001/aarch64/nvidia/grace/nvidia/cc90/eessi.io-2025.06-software,/dev --pass-through --contain --save /p/project1/ceasybuilders/eessibot/jobs/2026.06/pr_258/event_bea374b0-6f0c-11f1-92d2-7f511b57d0af/run_001/aarch64/nvidia/grace/nvidia/cc90/eessi.io-2025.06-software/previous_tmp/build_step --storage /local/scratch/eessibot/EESSI/eessi_job.l3MgdWYlh0 --host-injections /p/project1/ceasybuilders/eessibot/shared_fs_path/host-injections --nvidia all
-- /p/project1/ceasybuilders/eessibot/jobs/2026.06/pr_258/event_bea374b0-6f0c-11f1-92d2-7f511b57d0af/run_001/aarch64/nvidia/grace/nvidia/cc90/eessi.io-2025.06-software/install_software_layer.sh "--build-logs-dir /p/project1/ceasybuilders/eessibot/build_logs_dir --shared-fs-path /p/project1/ceasybuilders/eessibot/shared_fs_path" "" 2>&1 | tee -a build.outerr.NLi0
...
singularity exec --nv --contain --fusemount container:cvmfs2 cvmfs-config.cern.ch /cvmfs/cvmfs-config.cern.ch --fusemount container:cvmfs2 software.eessi.io /cvmfs_ro/software.eessi.io --fusemount container:unionfs -o cow /tmp/software.eessi.io/overlay-upper=RW:/cvmfs_ro/software.eessi.io=RO /cvmfs/software.eessi.io /local/scratch/eessibot/EESSI/eessi_job.l3MgdWYlh0/eessi.df8oEcdzqY/ghcr.io_eessi_build_node_debian12.sif /p/project1/ceasybuilders/eessibot/jobs/2026.06/pr_258/event_bea374b0-6f0c-11f1-92d2-7f511b57d0af/run_001/aarch64/nvidia/grace/nvidia/cc90/eessi.io-2025.06-software/install_software_layer.sh --build-logs-dir /p/project1/ceasybuilders/eessibot/build_logs_dir --shared-fs-path /p/project1/ceasybuilders/eessibot/shared_fs_path
I think that looks pretty much as intended.
One thing I don't fully understand, though: you print 'Error: the compute capability...' to stdout, which should result in the bot reporting a FAILURE for the build stage, yet it doesn't. Maybe that's because it's Error: and not ERROR:; I'm not sure how specific the pattern is that the bot searches for. In any case, I don't think we should print this with echo_red as an error. Having your script return a non-zero exit code is fine, but having such a string in stdout feels misleading. It is not an error: the build procedure behaves exactly as intended.
I suggest using echo_yellow to print a warning that this build will proceed without making the GPU (of architecture ABC) available in the build container, and will thus effectively be a cross-compilation for target architecture XYZ.
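A rough sketch of what that suggestion could look like, not the PR's actual code. The echo_yellow and echo_green functions defined here are local stand-ins for the repository's helpers, and check_requested_cc is a hypothetical name. The mapping to --nvidia install versus --nvidia all is inferred only from the two log excerpts above.

```bash
#!/usr/bin/env bash
# Stand-ins for the colour helpers; the real ones are assumed to be provided
# by the repository's utility scripts.
echo_yellow() { printf '\033[33m%s\033[0m\n' "$1"; }
echo_green() { printf '\033[32m%s\033[0m\n' "$1"; }

# Hypothetical check: returns 0 on a match and 1 on a mismatch.
#   $1: accelerator target, e.g. accel/nvidia/cc70
#   $2: compute capability of the host GPU without the dot, e.g. 90
check_requested_cc() {
    local requested_cc="${1##*/cc}"   # "accel/nvidia/cc70" becomes "70"
    local host_cc="$2"
    if [ "$requested_cc" = "$host_cc" ]; then
        echo_green "Requested compute capability matches the one from the GPU."
        return 0
    fi
    echo_yellow "Warning: the host GPU has compute capability ${host_cc}, but ${requested_cc} was requested. The build will proceed without making the GPU available in the build container, so it is effectively a cross-compilation for cc${requested_cc}."
    return 1
}

# Usage: pick the container mode based on the outcome of the check.
if check_requested_cc "accel/nvidia/cc70" "90"; then
    nvidia_mode="all"
else
    nvidia_mode="install"
fi
echo "Would pass: --nvidia ${nvidia_mode}"
```

The non-zero return value still lets the caller branch on the mismatch, while the output no longer contains a string that looks like a build error.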
Solves https://gitlab.com/eessi/support/-/work_items/257.
While I was working on this, I also found a small issue with how multiple accelerator builds are done: each subsequent build appended another --resume flag, because the flags kept being added to BUILD_STEP_ARGS. That's solved now by resetting that variable on every iteration; the accelerator-specific flags are then added inside the loop.
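The reset-per-iteration pattern can be sketched roughly as follows. This is an illustration, not the actual build script: BUILD_STEP_ARGS is assumed to be a bash array, and COMMON_ARGS, ACCEL_TARGETS, and RESUME_FROM are hypothetical names with placeholder values.

```bash
#!/usr/bin/env bash
# Illustrative values only; the real flags and paths come from the build script.
COMMON_ARGS=(--verbose --access rw)
ACCEL_TARGETS=(nvidia/cc70 nvidia/cc90)
RESUME_FROM="/tmp/previous_step"   # hypothetical path

for accel in "${ACCEL_TARGETS[@]}"; do
    # Start from a clean copy on every iteration, so flags added for the
    # previous target (such as --resume) do not pile up.
    BUILD_STEP_ARGS=("${COMMON_ARGS[@]}")
    BUILD_STEP_ARGS+=(--resume "${RESUME_FROM}")
    # Accelerator-specific flags are added inside the loop.
    BUILD_STEP_ARGS+=(--nvidia all)
    echo "${accel}: ${BUILD_STEP_ARGS[*]}"
done
```

Without the reset at the top of the loop body, the second iteration would carry two --resume flags, the third three, and so on.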