AArch64: NEON omatcopy CT/RT kernels (s/d) #5843

Merged
martin-frbg merged 1 commit into OpenMathLib:develop from artem-dmitriev:omatcopy on Jun 21, 2026
Conversation

@artem-dmitriev

Contributor

AArch64 has no vectorized transpose copy: all variants fall back to the scalar generic kernel. This adds NEON CT/RT kernels for single and double precision (s/d), built on an in-register transpose plus stnp stores.
Passes the utest extension suite (1460/1460).
Bench on Neoverse-N1 (domatcopy, 1 thread): the scalar path degrades as the matrix size grows while the NEON kernel stays flat, giving roughly 1.2x at 2k and up to ~4.5x at 18k. The gap in single precision is larger.
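
For readers unfamiliar with the operation, here is a minimal portable C sketch of what a transposed out-of-place copy computes, assuming the row-major transposed convention B[j*ldb + i] = alpha * A[i*lda + j]. It is not the code from this PR: the function name `omatcopy_rt_sketch` and the 4x4 tile size are hypothetical, the small `t` buffer only stands in for the in-register transpose, and the plain stores stand in for stnp (the AArch64 non-temporal store-pair instruction).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile edge; the PR's actual blocking is not stated in the thread. */
enum { TILE = 4 };

/*
 * Illustrative scalar model of a transposed, scaled out-of-place copy:
 *   B[j*ldb + i] = alpha * A[i*lda + j]   for i < rows, j < cols
 * A is rows x cols (row-major, leading dimension lda),
 * B is cols x rows (row-major, leading dimension ldb).
 */
void omatcopy_rt_sketch(size_t rows, size_t cols, double alpha,
                        const double *a, size_t lda,
                        double *b, size_t ldb)
{
    assert(lda >= cols && ldb >= rows);

    for (size_t i0 = 0; i0 < rows; i0 += TILE) {
        size_t imax = (i0 + TILE < rows) ? i0 + TILE : rows;

        for (size_t j0 = 0; j0 < cols; j0 += TILE) {
            size_t jmax = (j0 + TILE < cols) ? j0 + TILE : cols;
            double t[TILE][TILE];

            /* Read the tile of A row by row (contiguous loads), scale it,
             * and write it transposed into the small buffer. A NEON kernel
             * would do this step inside vector registers. */
            for (size_t i = i0; i < imax; i++)
                for (size_t j = j0; j < jmax; j++)
                    t[j - j0][i - i0] = alpha * a[i * lda + j];

            /* Write each buffer row to B as a contiguous run. This is
             * where a NEON kernel could issue paired stores. */
            for (size_t j = j0; j < jmax; j++)
                for (size_t i = i0; i < imax; i++)
                    b[j * ldb + i] = t[j - j0][i - i0];
        }
    }
}
```

Tiling keeps both the reads from A and the writes to B contiguous within a tile, which is the usual reason a blocked transpose degrades less with matrix size than a naive strided loop.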

@martin-frbg martin-frbg added this to the 0.3.34 milestone Jun 21, 2026
@martin-frbg

Collaborator

Thank you

@martin-frbg martin-frbg merged commit f986fd3 into OpenMathLib:develop Jun 21, 2026
179 of 180 checks passed
@martin-frbg

martin-frbg commented Jun 21, 2026

Collaborator

(The addition probably needs copying to kernel/arm64/KERNEL to actually make it available on NEOVERSEN1 and other targets that don't include KERNEL.ARMV8. I'll see to it tomorrow after further testing.)
