Most optimized function to copy/store every 4th byte (SSE2)

amyipdev
Posts: 18
Joined: Tue Nov 09, 2021 11:40 am

Post by amyipdev »

Essentially, I'm looking for the most optimized way (within SSE2) to copy and re-store every fourth byte of a data stream.

If the data stream were:
ABCDABCDABCDABCD

Then the four outputs would be:
AAAA
BBBB
CCCC
DDDD

Likewise, if I had four data streams:
EEEE
FFFF
GGGG
HHHH

They could output as:
EFGHEFGHEFGHEFGH
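
For reference, the two operations described above can be written as a plain scalar baseline in C. This is only a sketch to pin down the intended behaviour before optimizing; the names `deinterleave4` and `interleave4` are made up for illustration, and `count` is assumed to be the number of 4-byte groups.

```c
#include <stddef.h>
#include <stdint.h>

/* Split an interleaved stream ABCDABCD... into four separate streams.
   `count` is the number of 4-byte groups in src (hypothetical parameter). */
void deinterleave4(const uint8_t *src, size_t count,
                   uint8_t *a, uint8_t *b, uint8_t *c, uint8_t *d)
{
    for (size_t i = 0; i < count; i++) {
        a[i] = src[4 * i + 0];
        b[i] = src[4 * i + 1];
        c[i] = src[4 * i + 2];
        d[i] = src[4 * i + 3];
    }
}

/* Inverse: merge four streams back into one interleaved stream. */
void interleave4(uint8_t *dst, size_t count,
                 const uint8_t *a, const uint8_t *b,
                 const uint8_t *c, const uint8_t *d)
{
    for (size_t i = 0; i < count; i++) {
        dst[4 * i + 0] = a[i];
        dst[4 * i + 1] = b[i];
        dst[4 * i + 2] = c[i];
        dst[4 * i + 3] = d[i];
    }
}
```

Any SSE2 version should produce byte-for-byte the same output as these loops, which makes them handy as a test oracle.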
Octocontrabass
Member
Posts: 5568
Joined: Mon Mar 25, 2013 7:01 pm

Post by Octocontrabass »

Why do you want to do this?

If it has anything to do with your other question about image processing, you shouldn't separate the color channels at all - process them in parallel instead.

Post by amyipdev »

Some of the algorithms I've found (like the Gaussian blur one) work on single color channels; I could modify them to work with offsets for each channel, but I can also see this being useful in the future for other algorithms.
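
To illustrate the "offsets for each channel" alternative, here is a rough C sketch of a single-channel filter that reads and writes interleaved data directly through a stride. It uses a 3-tap box filter as a simple stand-in, not the Gaussian blur itself, and the function name `box3_channel` and its parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* 3-tap horizontal box filter over one channel of an interleaved row.
   `channel` is the byte offset of the channel inside a pixel (0..3 for
   4-byte pixels) and `stride` is the number of bytes per pixel.
   Edge pixels reuse the nearest valid neighbour. Other channels in dst
   are left untouched. */
void box3_channel(const uint8_t *src, uint8_t *dst, size_t pixels,
                  size_t channel, size_t stride)
{
    for (size_t i = 0; i < pixels; i++) {
        size_t l = (i == 0) ? 0 : i - 1;
        size_t r = (i + 1 == pixels) ? i : i + 1;
        unsigned sum = (unsigned)src[l * stride + channel]
                     + (unsigned)src[i * stride + channel]
                     + (unsigned)src[r * stride + channel];
        dst[i * stride + channel] = (uint8_t)(sum / 3);
    }
}
```

Calling it once per channel with `channel` set to 0, 1, 2 and 3 avoids the split and merge passes entirely, at the cost of strided memory access.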
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany

Post by Solar »

The most optimized way to copy is to avoid the copy, so... yes, context matters. 8)
Every good solution is obvious once you've found it.
Gigasoft
Member
Posts: 856
Joined: Sat Nov 21, 2009 5:11 pm

Post by Gigasoft »

This is the best I could come up with.

Extracting:

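; Initial conditions: ecx = pixel count (at least 4; any remainder modulo 4 is not processed),
; esi = source (16-byte aligned), edx = channel 1, ebx = channel 2, edi = channel 3, ebp = channel 4
shr ecx, 2              ; four pixels per iteration
sub ebx, edx            ; turn the channel 2..4 pointers into offsets from channel 1
sub edi, edx
sub ebp, edx
sub edx, 4              ; pre-bias, because the loop increments edx first
mov eax, 0ffh
movd xmm4, eax
punpckldq xmm4, xmm4
punpckldq xmm4, xmm4    ; xmm4 = 000000ffh in every dword
extractloop:
add edx, 4
movdqa xmm0, [esi]      ; load four pixels
movdqa xmm1, xmm0
movdqa xmm2, xmm0
movdqa xmm3, xmm0
psrld xmm1, 8           ; move channels 2..4 down to the low byte of each dword
psrld xmm2, 16
psrld xmm3, 24
pand xmm0, xmm4         ; keep only the low byte of each dword
pand xmm1, xmm4
pand xmm2, xmm4
pand xmm3, xmm4
packssdw xmm0, xmm0     ; dwords -> words (values are 0..255, so no saturation)
packssdw xmm1, xmm1
packssdw xmm2, xmm2
packssdw xmm3, xmm3
packuswb xmm0, xmm0     ; words -> bytes, four channel bytes end up in the low dword
packuswb xmm1, xmm1
packuswb xmm2, xmm2
packuswb xmm3, xmm3
add esi, 16
dec ecx                 ; sets ZF for the jnz below; movd does not touch flags
movd [edx], xmm0
movd [edx+ebx], xmm1
movd [edx+edi], xmm2
movd [edx+ebp], xmm3
jnz extractloop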
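
For anyone working from C, the same shift, mask and pack sequence can be sketched with SSE2 intrinsics. This is an illustration rather than a tuned implementation: the names `extract4_sse2` and `pack_low4` are hypothetical, `pixels` is assumed to be a multiple of 4, and unaligned loads are used so the source does not need the 16-byte alignment that `movdqa` requires.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: narrow four 0..255 dwords to four bytes and
   return them in one 32-bit value (packssdw, packuswb, movd). */
static inline uint32_t pack_low4(__m128i x)
{
    x = _mm_packs_epi32(x, x);
    x = _mm_packus_epi16(x, x);
    return (uint32_t)_mm_cvtsi128_si32(x);
}

/* Split `pixels` 4-byte pixels into four channel buffers.
   Assumption: pixels is a multiple of 4. */
void extract4_sse2(const uint8_t *src, size_t pixels,
                   uint8_t *c1, uint8_t *c2, uint8_t *c3, uint8_t *c4)
{
    const __m128i mask = _mm_set1_epi32(0xff);
    for (size_t i = 0; i < pixels; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + 4 * i));
        uint32_t w1 = pack_low4(_mm_and_si128(v, mask));
        uint32_t w2 = pack_low4(_mm_and_si128(_mm_srli_epi32(v, 8), mask));
        uint32_t w3 = pack_low4(_mm_and_si128(_mm_srli_epi32(v, 16), mask));
        /* A logical shift by 24 already leaves 0..255, so no mask is needed. */
        uint32_t w4 = pack_low4(_mm_srli_epi32(v, 24));
        memcpy(c1 + i, &w1, 4);
        memcpy(c2 + i, &w2, 4);
        memcpy(c3 + i, &w3, 4);
        memcpy(c4 + i, &w4, 4);
    }
}
```

The `memcpy` calls are there to keep the 4-byte stores free of alignment and aliasing problems; compilers typically turn each one into a single store.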
Merging:

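; Initial conditions: ecx = pixel count (at least 4; any remainder modulo 4 is not processed),
; esi = destination (16-byte aligned), edx = channel 1, ebx = channel 2, edi = channel 3, ebp = channel 4
shr ecx, 2                  ; four pixels per iteration
sub esi, 16                 ; pre-bias, because the loop increments esi first
sub ebx, edx                ; turn the channel 2..4 pointers into offsets from channel 1
sub edi, edx
sub ebp, edx
mergeloop:
add esi, 16
; punpcklbw with an XMM destination takes a 16-byte-aligned 128-bit memory
; operand, so each channel's four bytes are loaded with movd first.
movd xmm0, [edx]            ; channel 1
movd xmm2, [edx+ebx]        ; channel 2
movd xmm1, [edx+edi]        ; channel 3
movd xmm3, [edx+ebp]        ; channel 4
punpcklbw xmm0, xmm2        ; 1 2 1 2 1 2 1 2
punpcklbw xmm1, xmm3        ; 3 4 3 4 3 4 3 4
punpcklwd xmm0, xmm1        ; 1 2 3 4 1 2 3 4 ...
movdqa [esi], xmm0          ; store four merged pixels
add edx, 4
dec ecx
jnz mergeloop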
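
A rough C equivalent of the merge loop, again using SSE2 intrinsics, might look like the following. The names `merge4_sse2` and `load4` are hypothetical, `pixels` is assumed to be a multiple of 4, and an unaligned store is used so the destination does not have to be 16-byte aligned.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load four bytes into the low dword of an XMM
   register, zeroing the rest (the movd xmm, m32 form). */
static inline __m128i load4(const uint8_t *p)
{
    int32_t w;
    memcpy(&w, p, 4);
    return _mm_cvtsi32_si128(w);
}

/* Merge four channel buffers into `pixels` interleaved 4-byte pixels.
   Assumption: pixels is a multiple of 4. */
void merge4_sse2(uint8_t *dst, size_t pixels,
                 const uint8_t *c1, const uint8_t *c2,
                 const uint8_t *c3, const uint8_t *c4)
{
    for (size_t i = 0; i < pixels; i += 4) {
        /* punpcklbw: bytes of channels 1 and 2, then 3 and 4, interleaved */
        __m128i c12 = _mm_unpacklo_epi8(load4(c1 + i), load4(c2 + i));
        __m128i c34 = _mm_unpacklo_epi8(load4(c3 + i), load4(c4 + i));
        /* punpcklwd: interleave the 16-bit pairs into full pixels */
        __m128i px = _mm_unpacklo_epi16(c12, c34);
        _mm_storeu_si128((__m128i *)(dst + 4 * i), px);
    }
}
```

If both source and destination are known to be 16-byte aligned, `_mm_store_si128` can replace the unaligned store, matching the `movdqa` in the assembly.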

Post by amyipdev »

Gigasoft wrote: This is the best I could come up with.
Thank you so much!