Most optimized function to copy/store every 4th byte (SSE2)

amyipdev · Post by **amyipdev** » Thu Dec 16, 2021 3:49 pm

Essentially, I'm looking for the most optimized way (within SSE2) to copy and re-store every fourth byte of a data stream.

If the data stream was:
ABCDABCDABCDABCD

Then the four new outputs would be:
AAAA
BBBB
CCCC
DDDD

Likewise, if I had four data streams:
EEEE
FFFF
GGGG
HHHH

They could output as:
EFGHEFGHEFGHEFGH
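
To pin down the behaviour being asked for, here is a minimal scalar sketch in C. It is only a reference for checking a faster version against, not an optimized answer. The names `split4` and `join4` are made up for illustration, and `n` is assumed to be the number of 4-byte groups.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference helper: split an interleaved stream
   (ABCDABCD...) into four separate streams of n bytes each. */
void split4(const uint8_t *src, uint8_t *a, uint8_t *b,
            uint8_t *c, uint8_t *d, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = src[4 * i + 0];
        b[i] = src[4 * i + 1];
        c[i] = src[4 * i + 2];
        d[i] = src[4 * i + 3];
    }
}

/* Hypothetical reference helper: the inverse, joining four streams
   of n bytes each back into one interleaved stream of 4 * n bytes. */
void join4(uint8_t *dst, const uint8_t *a, const uint8_t *b,
           const uint8_t *c, const uint8_t *d, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[4 * i + 0] = a[i];
        dst[4 * i + 1] = b[i];
        dst[4 * i + 2] = c[i];
        dst[4 * i + 3] = d[i];
    }
}
```

Calling `split4` on the 16 bytes `ABCDABCDABCDABCD` with `n = 4` fills the four outputs with `AAAA`, `BBBB`, `CCCC` and `DDDD`; `join4` does the reverse.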

Octocontrabass · Post by **Octocontrabass** » Thu Dec 16, 2021 8:25 pm

Why do you want to do this?

If it has anything to do with your other question about image processing, you shouldn't separate the color channels at all - process them in parallel instead.
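
As a rough illustration of what processing the channels in parallel can look like, the sketch below averages two rows of interleaved 4-channel pixels with SSE2 intrinsics in C. Because the operation is per byte, the channels never have to be separated. The function name `average_rows_rgba` and the choice of a two-row average are assumptions picked for the example, not something from the other thread.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: out = rounded average of row_a and row_b.
   Every byte is treated alike, so all four channels of four pixels
   are handled by one pavgb (_mm_avg_epu8) per 16 bytes. */
void average_rows_rgba(const uint8_t *row_a, const uint8_t *row_b,
                       uint8_t *out, size_t pixels)
{
    size_t bytes = pixels * 4;
    size_t i = 0;

    /* Unaligned loads/stores, so no alignment is assumed here. */
    for (; i + 16 <= bytes; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(row_a + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(row_b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_avg_epu8(a, b));
    }

    /* Scalar tail with the same rounding as pavgb: (a + b + 1) >> 1. */
    for (; i < bytes; i++)
        out[i] = (uint8_t)((row_a[i] + row_b[i] + 1) >> 1);
}
```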

amyipdev · Post by **amyipdev** » Fri Dec 17, 2021 12:32 am

Some of the algorithms I've found (like the Gaussian blur one) work on single color channels; I could modify them to work with offsets for each channel, but I can also see this being useful in the future for other algorithms.
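
The offset idea could look roughly like the C sketch below: a single-channel filter reads and writes the interleaved buffer directly with a stride of 4, so nothing is copied out first. A 3-tap box blur stands in for the Gaussian blur here to keep it short; the name `box_blur3_channel`, the edge clamping and the 4-bytes-per-pixel layout are all assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: horizontal 3-tap box blur of ONE channel of an
   interleaved 4-byte-per-pixel row. channel is the byte offset (0..3)
   inside each pixel; the other three channels of dst are left alone. */
void box_blur3_channel(const uint8_t *src, uint8_t *dst,
                       size_t pixels, size_t channel)
{
    assert(channel < 4);
    for (size_t i = 0; i < pixels; i++) {
        size_t left  = (i > 0) ? i - 1 : i;           /* clamp at the edges */
        size_t right = (i + 1 < pixels) ? i + 1 : i;
        unsigned sum = src[left * 4 + channel]
                     + src[i * 4 + channel]
                     + src[right * 4 + channel];
        dst[i * 4 + channel] = (uint8_t)(sum / 3);
    }
}
```

Calling it once per channel offset covers the whole image without any intermediate per-channel buffers.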

Solar · Post by **Solar** » Fri Dec 17, 2021 7:05 am

The most optimized way to copy is to avoid the copy, so... yes, context matters.

Gigasoft · Post by **Gigasoft** » Fri Dec 17, 2021 10:50 am

This is the best I could come up with.

Extracting:

Code:

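; Initial conditions: ecx = pixel count (nonzero multiple of 4), esi = source (16-byte aligned),
; edx = channel 1, ebx = channel 2, edi = channel 3, ebp = channel 4
shr ecx, 2               ; each iteration handles 4 pixels
sub ebx, edx             ; channels 2-4 become offsets relative to channel 1
sub edi, edx
sub ebp, edx
sub edx, 4               ; pre-bias: the loop increments edx before storing
mov eax, 0ffh
movd xmm4, eax           ; xmm4 = 000000ffh in the low dword
punpckldq xmm4, xmm4     ; ... in the low two dwords
punpckldq xmm4, xmm4     ; ... in all four dwords
extractloop:
add edx, 4
movdqa xmm0, [esi]       ; load 4 interleaved pixels
movdqa xmm1, xmm0
movdqa xmm2, xmm0
movdqa xmm3, xmm0
psrld xmm1, 8            ; bring channels 2, 3, 4 down to the low byte of each dword
psrld xmm2, 16
psrld xmm3, 24
pand xmm0, xmm4          ; keep only the low byte of each dword
pand xmm1, xmm4
pand xmm2, xmm4
pand xmm3, xmm4
packssdw xmm0, xmm0      ; dwords -> words (values are 0..255, so nothing saturates)
packssdw xmm1, xmm1
packssdw xmm2, xmm2
packssdw xmm3, xmm3
packuswb xmm0, xmm0      ; words -> bytes; the low dword now holds 4 channel bytes
packuswb xmm1, xmm1
packuswb xmm2, xmm2
packuswb xmm3, xmm3
add esi, 16
dec ecx                  ; sets ZF for the jnz below; movd leaves the flags untouched
movd [edx], xmm0
movd [edx+ebx], xmm1
movd [edx+edi], xmm2
movd [edx+ebp], xmm3
jnz extractloop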
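
For anyone who would rather not write this by hand in assembly, the same shift, mask and pack sequence might be sketched with SSE2 intrinsics in C roughly as follows. This is a sketch under assumptions: the names `extract4_sse2` and `low4_bytes` are made up, the pixel count is assumed to be a multiple of 4, and it uses unaligned loads where the listing above uses `movdqa`.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: each dword holds a value 0..255; narrow the four
   dwords to four bytes and return them packed in one 32-bit value. */
static inline uint32_t low4_bytes(__m128i dwords)
{
    __m128i words = _mm_packs_epi32(dwords, dwords);   /* packssdw */
    __m128i bytes = _mm_packus_epi16(words, words);    /* packuswb */
    return (uint32_t)_mm_cvtsi128_si32(bytes);         /* movd */
}

/* Hypothetical wrapper: split interleaved 4-byte pixels into 4 planes. */
void extract4_sse2(const uint8_t *src, uint8_t *c0, uint8_t *c1,
                   uint8_t *c2, uint8_t *c3, size_t pixels)
{
    assert(pixels % 4 == 0);  /* assumption: no tail handling in this sketch */
    const __m128i mask = _mm_set1_epi32(0xff);

    for (size_t i = 0; i < pixels; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + 4 * i));
        uint32_t p0 = low4_bytes(_mm_and_si128(v, mask));
        uint32_t p1 = low4_bytes(_mm_and_si128(_mm_srli_epi32(v, 8), mask));
        uint32_t p2 = low4_bytes(_mm_and_si128(_mm_srli_epi32(v, 16), mask));
        uint32_t p3 = low4_bytes(_mm_and_si128(_mm_srli_epi32(v, 24), mask));
        memcpy(c0 + i, &p0, 4);  /* memcpy avoids alignment/aliasing issues */
        memcpy(c1 + i, &p1, 4);
        memcpy(c2 + i, &p2, 4);
        memcpy(c3 + i, &p3, 4);
    }
}
```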

Merging:

Code:

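; Initial conditions: ecx = pixel count (nonzero multiple of 4), esi = destination (16-byte aligned),
; edx = channel 1, ebx = channel 2, edi = channel 3, ebp = channel 4
shr ecx, 2               ; each iteration handles 4 pixels
sub esi, 16              ; pre-bias: the loop increments esi before storing
sub ebx, edx             ; channels 2-4 become offsets relative to channel 1
sub edi, edx
sub ebp, edx
mergeloop:
add esi, 16
movd xmm0, [edx]         ; 4 bytes of channel 1
movd xmm2, [edx+ebx]     ; 4 bytes of channel 2
punpcklbw xmm0, xmm2     ; 1 2 1 2 1 2 1 2
movd xmm1, [edx+edi]     ; 4 bytes of channel 3
movd xmm3, [edx+ebp]     ; 4 bytes of channel 4
punpcklbw xmm1, xmm3     ; 3 4 3 4 3 4 3 4
punpcklwd xmm0, xmm1     ; 1 2 3 4 repeated for 4 pixels
movdqa [esi], xmm0
add edx, 4
dec ecx
jnz mergeloop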
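
The merge direction could be sketched the same way with SSE2 intrinsics in C. Again this is only a sketch: `merge4_sse2` and `load4` are hypothetical names, the pixel count is assumed to be a multiple of 4, and an unaligned store is used instead of `movdqa`.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load 4 bytes into the low dword of a register. */
static inline __m128i load4(const uint8_t *p)
{
    int32_t v;
    memcpy(&v, p, 4);              /* safe for any alignment */
    return _mm_cvtsi32_si128(v);   /* movd */
}

/* Hypothetical wrapper: interleave 4 planes into 4-byte pixels. */
void merge4_sse2(uint8_t *dst, const uint8_t *c0, const uint8_t *c1,
                 const uint8_t *c2, const uint8_t *c3, size_t pixels)
{
    assert(pixels % 4 == 0);  /* assumption: no tail handling in this sketch */

    for (size_t i = 0; i < pixels; i += 4) {
        /* punpcklbw: bytes of c0/c1 interleaved, likewise c2/c3 */
        __m128i ab = _mm_unpacklo_epi8(load4(c0 + i), load4(c1 + i));
        __m128i cd = _mm_unpacklo_epi8(load4(c2 + i), load4(c3 + i));
        /* punpcklwd: byte pairs interleaved into full 4-byte pixels */
        __m128i px = _mm_unpacklo_epi16(ab, cd);
        _mm_storeu_si128((__m128i *)(dst + 4 * i), px);
    }
}
```

With the four input streams `EEEE`, `FFFF`, `GGGG` and `HHHH` and `pixels = 4`, the destination receives `EFGHEFGHEFGHEFGH`.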

amyipdev · Post by **amyipdev** » Fri Dec 17, 2021 12:35 pm

Gigasoft wrote: This is the best I could come up with.

Thank you so much!

OSDev.org
