

This is an answer for x86_64 with the AVX2 instruction set present, though something similar may apply to ARM/AArch64 with SIMD.

On a Ryzen 1800X with a single memory channel filled completely (2 slots, 16 GB DDR4 in each), the following code is 1.56 times faster than memcpy() with the MSVC++2017 compiler. If you fill both memory channels with 2 DDR4 modules each, i.e. you have all 4 DDR4 slots busy, you may get a further 2 times faster memory copying. For triple-(quad-)channel memory systems, you can get a further 1.5(2.0) times faster memory copying if the code is extended to analogous AVX512 code. AVX2-only triple/quad-channel systems with all slots busy are not expected to be faster, because to load them fully you need to load/store more than 32 bytes at once (48 bytes for triple- and 64 bytes for quad-channel systems), while AVX2 can load/store no more than 32 bytes at once. Multithreading on some systems can alleviate this, though, without AVX512 or even AVX2.
