{2025.06}[2024a] PyTorch 2.9.1 #1389
|
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=x86_64/amd/zen4 |
|
New job on instance
|
|
New job on instance
|
|
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=x86_64/amd/zen4 |
|
New job on instance
|
|
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=x86_64/amd/zen4 |
|
New job on instance
|
Errors are quite similar to the ones observed in #1314, many of these: |
|
The updated hooks file with a fix for PyTorch has been ingested (EESSI/software-layer-scripts#172), so let's try again.
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-aws-eu-south for:arch=x86_64/amd/zen5 |
|
New job on instance
|
|
New job on instance
|
|
New job on instance
|
|
The neoverse v1 build ran out of memory: |
|
I've modified bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=aarch64/neoverse_v1 |
|
New job on instance
|
|
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace |
|
New job on instance
|
|
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=aarch64/generic |
|
New job on instance
|
|
New job on instance
|
|
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=aarch64/neoverse_v1 |
|
New job on instance
|
|
No more memory issues for the neoverse v1 build, but too many failing tests: |
That's the major blocker. Anything in the logs?
That's the one with the most failures (so far). Do you have a build without ACL to compare? Can you attach the log so we can check whether it might be a single issue affecting many tests? |
Here's the log of the grace build. I'll have to check if I can still access the detailed log file of a successful build without ACL. |
|
Found it, here's the log of the successful build without ACL (#1389 (comment)). |
|
OK, thanks. I found the origin of 54 new failures: it is a single test with variations. I added a patch to the PR.
I'm now just skipping the first 2 on AARCH64 as they don't seem to get tested upstream either. |
How would I do that exactly? I tried pip installing that version, extracting the source tarball, and running |
|
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace |
|
New job on instance
|
Exactly like that. The error looks like a conflict with our hypothesis version. I guess you can just apply |
|
Thanks, that solved the issue. Now I get this: |
|
Ok, then I consider this a known issue and skip the test. PR updated |
|
Thanks! Let me try again:
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace |
|
Oops, forgot to update the commit, but looks like the bot is down anyway. Will retry later. |
|
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace |
|
New job on instance
|
|
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-deucalion for:arch=aarch64/a64fx |
|
New job on instance
|
|
41 test failures now for a64fx: |
I would call that a spectacular success... That's literally 99.98% of tests passing, on a platform that we know is notoriously difficult to build stuff on. |
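As a rough sanity check on that figure, here is a small sketch of the arithmetic. The thread gives only the 41 failures and the 99.98% pass rate, not the total number of tests, so the total below is inferred from the rounded percentage; the helper names are made up for illustration.

```python
def pass_rate(failed: int, total: int) -> float:
    """Percentage of tests that passed."""
    return 100.0 * (total - failed) / total


def implied_total_range(failed: int, rate_pct: float, decimals: int = 2):
    """Range of total test counts consistent with a pass rate that was
    rounded to `decimals` decimal places (assumption: plain rounding)."""
    half_step = 0.5 * 10 ** -decimals
    max_fail_fraction = (100.0 - (rate_pct - half_step)) / 100.0
    min_fail_fraction = (100.0 - (rate_pct + half_step)) / 100.0
    # More failures per test means fewer tests overall, and vice versa.
    return failed / max_fail_fraction, failed / min_fail_fraction


low, high = implied_total_range(failed=41, rate_pct=99.98)
print(f"total tests between about {low:,.0f} and {high:,.0f}")
print(f"pass rate at 205,000 tests: {pass_rate(41, 205_000):.2f}%")
```

So 41 failures at a rounded 99.98% is consistent with a test suite of very roughly 164,000 to 273,000 tests.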
|
I've updated the hook in EESSI/software-layer-scripts#181 to allow for 41 failing tests. |
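For readers unfamiliar with what such a hook change looks like, here is a minimal, self-contained sketch. It is not the code from EESSI/software-layer-scripts#181: the function name, the `cpu_target` argument, and the lookup table are hypothetical, and a `SimpleNamespace` stands in for the EasyBuild easyblock object. It assumes the allowance is expressed through the PyTorch easyblock's `max_failed_tests` parameter; only the value 41 for a64fx comes from this thread.

```python
from types import SimpleNamespace

# Hypothetical per-target allowances; only the a64fx value is from this thread.
MAX_FAILED_TESTS = {"aarch64/a64fx": 41}


def pre_test_hook_pytorch(self, cpu_target):
    """Raise the number of tolerated test failures for PyTorch on selected CPU targets."""
    if self.name != "PyTorch":
        return
    allowed = MAX_FAILED_TESTS.get(cpu_target)
    if allowed is None:
        return
    # Never lower a limit that the easyconfig already set higher.
    current = self.cfg.get("max_failed_tests", 0)
    self.cfg["max_failed_tests"] = max(current, allowed)


# Stand-in for an easyblock instance (assumption: the real object exposes name and cfg).
easyblock = SimpleNamespace(name="PyTorch", version="2.9.1", cfg={"max_failed_tests": 0})
pre_test_hook_pytorch(easyblock, "aarch64/a64fx")
print(easyblock.cfg["max_failed_tests"])
```

Keeping the allowance in a per-target table makes it explicit that the relaxed limit applies only to a64fx and leaves other architectures at their existing thresholds.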
|
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=aarch64/neoverse_v1 |
|
New job on instance
|
@bedroge Can you attach the log, or at least the names of those failures? But yeah, we can consider this a success |
Sure, here it is. |
|
There are a couple weird failures. The 20 And for
That's an internal assert which I guess should never fail. If those failures do not occur without ACL, we may consider removing it again. Or at least test the PyPI version to check whether they fail there too, so we can judge how serious they might be. |