Fast copy using unrolled ldi loop

Página 1/2
| 2

Por aoineko

Paladin (693)

imagem de aoineko

06-07-2022, 01:17

Hello,

Reading MAP's article about fast copies using unrolled ldi loop and the provided example with code auto-modification, I decided to make a version that also works in a ROM for the MSXgl C library.

Here is the function:

void Mem_FastCopy(const void* src, void* dest, u16 size)
{
    src;    // HL
    dest;   // DE
    size;   // SP+2
    __asm
        // Get parameters
        ld      iy, #2
        add     iy, sp
        ld      c, 0(iy)
        ld      b, 1(iy)
        // Handle size that is not multiples of 16
        ld      a, c                    //  5 cc
        and     #0x0F                   //  8 cc
        jp      z, mem_fastcopy_loop    // 11 cc - total 24 cc (break-even at 6 loops)
        neg                             // 10 cc
        add     #16                     //  8 cc
        add     a                       //  5 cc
        exx                             //  5 cc
        ld      b, #0                   //  8 cc
        ld      c, a                    //  5 cc
        ld      hl, #mem_fastcopy_loop  // 11 cc
        add     hl, bc                  // 12 cc
        push    hl                      // 12 cc
        exx                             //  5 cc
        pop     iy                      // 16 cc
        jp      (iy)                    // 10 cc - total 131 cc (break-even at 31 loops)
        // Fast LDIR (with 16x unrolled LDI)
    mem_fastcopy_loop:
        .rept 16
        ldi                             // 18 cc
        .endm
        jp      pe, mem_fastcopy_loop   // 11 cc (0,6875 cc per ldi)
    __endasm;
}

This function converges to a speed gain of 18,75% compare to classic ldir loop (we gain 4.31 cc per loop).
The break-even is reached at 6 loops for multiple of 16 size or at 31 loops otherwise.

I'm not very good in assembler, so it's likely that we can do even faster (at least for the initialization part).
So I'm interested in your proposals.

Of course, one solution would be to say that this function only copies sizes that are multiples of 16, but that's not the point here (even if it's a good point ^^).

Entrar ou registrar-se para comentar

Por Grauw

Ascended (10633)

imagem de Grauw

06-07-2022, 01:48

Instead of pop iy, jp iy you can do ret.

Instead of ld a,c, and 15, neg, jp z, add a,16 you can do xor a, sub c, and 15, jr z.

Lastly add a,mem_fastcopy_loop & 0ffh, ld l,a, ld a,0, adc a,mem_fastcopy_loop >> 8, ld h,a is minimally faster than the 16-bit addition.

Por aoineko

Paladin (693)

imagem de aoineko

06-07-2022, 02:02

Grauw wrote:

Instead of pop iy, jp iy you can do ret.

I don't understand this one. :-/

EDIT: Oh... ret is "ld PC, (SP); SP += 2"... understood. Smile

Por gdx

Enlighted (5822)

imagem de gdx

06-07-2022, 02:26

aoineko wrote:

This function converges to a speed gain of 18,75% compare to classic ldir loop (we gain 4.31 cc per loop).

The gain must depend on the size of the block to be copied, isn't it?

Por aoineko

Paladin (693)

imagem de aoineko

06-07-2022, 02:45

Yes. The larger the size, the closer this gain is to this value.

@Grauw: It's working just fine. Thank you!
The new version is 101 cc long instead of 131 cc.
The break-even for multiple of 16 size is reached at 7 loops (instead of 6) but the otherwise break-even is at 24 loops (instead of 31).

Por Metalion

Paragon (1587)

imagem de Metalion

06-07-2022, 08:24

The problem with variable unrolling is that you never know in advance if you gain or lose by using the subroutine. Unless you put a test on the length, and that adds again some overhead that, in turn, increase the break-even.

I find unrolling to be the best solution when you know the length you're going to need.
But not when you don't.

Por aoineko

Paladin (693)

imagem de aoineko

06-07-2022, 09:00

You know exactly when the algorithm become faster than otir based one.
In the last version:
size >= 7 (for multiple of 16)
size >= 24 (otherwise)
It's not dynamic but if the doc is clear enough, it should be a programmer's choice to use this one or the default one.

But yes, I don't see it as a generic algorithm I may use in any case. In MSXgl, I added it as optional function for fast copy of large data.

Por Metalion

Paragon (1587)

imagem de Metalion

06-07-2022, 11:19

In your calculation of the break-even, you forgot the //Get Parameters part.
It adds 75 t-states to the total.

Por Bengalack

Hero (660)

imagem de Bengalack

06-07-2022, 12:34

Metalion wrote:

In your calculation of the break-even, you forgot the //Get Parameters part.
It adds 75 t-states to the total.

To fix this, you add this keyword to the signature: __z88dk_callee

and then you move the pop iy to the "parameters" location, and add pop bc. By doing this, you need to keep the jp (iy), and not replace by ret as Grauw suggested.

You remove all of these (expensive) lines:

        ld      iy, #2
        add     iy, sp
        ld      c, 0(iy)
        ld      b, 1(iy)

This cuts the 75 cycles down to 11.

And, yes, you need jp (iy) as the final instruction in the code as well, as ret will not work in this case. Which in turn might make you want to declare the function as __naked. Note that using jp (iy) instead of ret will sometimes confuse Grauw's profiler-script Smile

Por aoineko

Paladin (693)

imagem de aoineko

06-07-2022, 13:23

@Bengalack Thank you. I'll study the __z88dk_callee directive. Smile

Metalion wrote:

In your calculation of the break-even, you forgot the //Get Parameters part.
It adds 75 t-states to the total.

This is also use for my "normal" ldir version so I only counted the extra code between the 2 versions.

Por Grauw

Ascended (10633)

imagem de Grauw

06-07-2022, 13:57

Surely the ret to jump indirectly will still work? You push the jump address right before the ret.

Página 1/2
| 2