AArch64: NEON omatcopy CT/RT kernels (s/d) by artem-dmitriev · Pull Request #5843 · OpenMathLib/OpenBLAS

artem-dmitriev · 2026-06-20T20:12:55Z

AArch64 has no vectorized transpose copy: all omatcopy variants fall through to the scalar generic kernel. This PR adds NEON CT/RT kernels for single and double precision (in-register transpose + stnp stores).
Passes the utest extension suite (1460/1460).
Benchmark on Neoverse-N1 (domatcopy, 1 thread): the scalar path degrades with matrix size while the NEON kernel stays flat, giving a speedup of roughly 1.2x at 2k, rising to ~4.5x at 18k. The single-precision gap is larger.
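
For readers unfamiliar with the operation, here is a minimal sketch of a transposing, scaling copy (B = alpha * A^T, both column-major) with a 2x2 in-register transpose on AArch64. It is not the code from this PR: the function names `transpose_scale_ref` and `transpose_scale_blocked` are hypothetical, the tile size is chosen only for illustration, and the non-temporal stnp stores are not modelled (plain vector stores are used instead). On other architectures the same tiles fall back to scalar code.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Scalar reference: B = alpha * A^T, both column-major.
 * A is rows x cols (leading dimension lda), B is cols x rows (ldb). */
static void transpose_scale_ref(size_t rows, size_t cols, double alpha,
                                const double *a, size_t lda,
                                double *b, size_t ldb)
{
    assert(lda >= rows && ldb >= cols);
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            b[j + i * ldb] = alpha * a[i + j * lda];
}

/* Illustrative blocked variant: 2x2 tiles, transposed in registers
 * on AArch64, with a scalar cleanup pass for odd edges. */
static void transpose_scale_blocked(size_t rows, size_t cols, double alpha,
                                    const double *a, size_t lda,
                                    double *b, size_t ldb)
{
    assert(lda >= rows && ldb >= cols);
    size_t rows2 = rows & ~(size_t)1;   /* rows rounded down to even */
    size_t cols2 = cols & ~(size_t)1;   /* cols rounded down to even */

    for (size_t j = 0; j < cols2; j += 2) {
        for (size_t i = 0; i < rows2; i += 2) {
            const double *a0 = a + i + j * lda;  /* column j of A   */
            const double *a1 = a0 + lda;         /* column j+1 of A */
            double *b0 = b + j + i * ldb;        /* column i of B   */
            double *b1 = b0 + ldb;               /* column i+1 of B */
#if defined(__aarch64__)
            float64x2_t c0 = vld1q_f64(a0);      /* {A(i,j),   A(i+1,j)}   */
            float64x2_t c1 = vld1q_f64(a1);      /* {A(i,j+1), A(i+1,j+1)} */
            float64x2_t r0 = vtrn1q_f64(c0, c1); /* {A(i,j),   A(i,j+1)}   */
            float64x2_t r1 = vtrn2q_f64(c0, c1); /* {A(i+1,j), A(i+1,j+1)} */
            vst1q_f64(b0, vmulq_n_f64(r0, alpha));
            vst1q_f64(b1, vmulq_n_f64(r1, alpha));
#else
            b0[0] = alpha * a0[0]; b0[1] = alpha * a1[0];
            b1[0] = alpha * a0[1]; b1[1] = alpha * a1[1];
#endif
        }
    }

    /* Cleanup: the last row (if rows is odd) for tiled columns,
     * and the whole last column (if cols is odd). */
    for (size_t j = 0; j < cols; j++)
        for (size_t i = (j < cols2 ? rows2 : 0); i < rows; i++)
            b[j + i * ldb] = alpha * a[i + j * lda];
}
```

Calling `transpose_scale_blocked(5, 3, 2.0, a, 6, b, 4)` on a 5x3 matrix stored with `lda = 6` fills a 3x5 result with `ldb = 4` and leaves the padding row of `b` untouched. A real kernel would likely use larger tiles to amortize loads and stores.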

martin-frbg · 2026-06-21T21:09:39Z

Thank you

martin-frbg · 2026-06-21T21:16:10Z

(The addition probably needs to be copied to kernel/arm64/KERNEL as well, to actually make it available on NEOVERSEN1 and other targets that don't include KERNEL.ARMV8. I'll see to it tomorrow after further testing.)

AArch64: NEON omatcopy CT/RT kernels

46aa158

martin-frbg added this to the 0.3.34 milestone Jun 21, 2026

martin-frbg merged commit f986fd3 into OpenMathLib:develop Jun 21, 2026
179 of 180 checks passed

martin-frbg merged 1 commit into OpenMathLib:develop from artem-dmitriev:omatcopy
