Z80Babel: C++, D, Rust, Zig and Fortran

Página 4/4
1 | 2 | 3 |

Por santiontanon

Paragon (1832)

imagem de santiontanon

25-02-2022, 18:50

Very interesting numbers, so the hand-coded asm version is 4x of what the C compiler can get, wow!

Por Ped7g

Expert (67)

imagem de Ped7g

25-02-2022, 19:33

hm, the 92ms vs 222ms is more relevant, because the C version can also use odd-only work buffer, etc... I can maybe try to backport my second asm to C, maybe even using global buffer variables to give compiler somewhat similar situation? To compare red apples vs red apples, and not red apples vs yellow apples... Wink Big smile

Por Ped7g

Expert (67)

imagem de Ped7g

26-02-2022, 01:04

so this is somewhat close to what my second asm code does, although some things are difficult to translate, and I have no idea what would be better for sdcc (I did just debug and verify it in x86_64 linux):

// globals defined like this;
// #define C4_WORK_SIZE 1024        // same as 2048 work_size for sieve_c_1
// uint16_t g_primes[4000];         // makes sure there's enough room for all results
// uint8_t g_work[C4_WORK_SIZE];    // C4_WORK_SIZE must be >= 2 (to be on par with assumptions used in asm version)

uint16_t sieve_c_4() {
    // using "work" buffer to mark only odd numbers, work[0] belongs to prime 5
    // (work[-1] is prime 3, work[1] is prime 7, work[2] is 9, ...)

    uint16_t* p = g_primes;
    *p++ = 2;                       // hard-code prime 2 into result

    for (uint16_t i = 0; i < C4_WORK_SIZE; ++i) g_work[i] = 0x00;   // reset "work" array

    uint8_t* w = g_work;
    uint8_t* const w_end = g_work + C4_WORK_SIZE;

    // phase 1, search for primes which can mark new fields in sieve, enter directly into marking (for prime 3)
    do {
        // w points at work for i+2 number, and i is prime, mark it in sieve
        uint16_t i = 2 * (w - g_work + 1) + 1;
        *p++ = i;

        uint8_t* mark = w - 1 + i;  // initial mark pointer, check also for 16b overflow on ZX (mark < w) case
        if (w_end <= mark || mark < w) break;   // already outside of work, stop first phase
        while (w < mark && mark < w_end) {      // mark g_work at multiplies of i
            *mark = 1;
            mark += i;
        }
        // search from "w" for next prime
        while (*w++) ;              // and move past the found w[i] == 0
    } while (1);                    // infinite loop breaks when marking would go outside of buffer

    // phase 2, search for remaining large primes, which don't need to mark sieve
    do {
        if (*w++) continue;
        uint16_t i = 2 * (w - g_work + 1) + 1;
        *p++ = i;
    } while (w < w_end);

    return p - g_primes;
}

uint16_t sieve_c_4_end() { return (uint16_t)&sieve_c_4; }

edit: maybe keeping track of actual prime number "i" instead of calculating it would make it easier for compilers, as they will probably run out of registers any way, and the calculation is just extra strain on them, while in asm it's worth it just because keeping track of prime number itself is more costly. Keeping it explicitly would look like this:

uint16_t sieve_c_5() {
    // using "work" buffer to mark only odd numbers, work[0] belongs to prime 5
    // (work[-1] is prime 3, work[1] is prime 7, work[2] is 9, ...)

    uint16_t* p = g_primes;
    *p++ = 2;                       // hard-code prime 2 into result

    for (uint16_t i = 0; i < C4_WORK_SIZE; ++i) g_work[i] = 0x00;   // reset "work" array

    uint8_t* w = g_work;
    uint8_t* const w_end = g_work + C4_WORK_SIZE;
    uint16_t prime = 3;

    // phase 1, search for primes which can mark new fields in sieve, enter directly into marking (for prime 3)
    do {
        // w points at work for i+2 number, and i is prime, mark it in sieve
        *p++ = prime;
        uint8_t* mark = w - 1 + prime;  // initial mark pointer, check also for 16b overflow on ZX (mark < w) case
        if (w_end <= mark || mark < w) break;   // already outside of work, stop first phase
        while (w < mark && mark < w_end) {      // mark g_work at multiplies of i
            *mark = 1;
            mark += prime;
        }
        // search from "w" for next prime
        while (prime += 2, *w++) ;                // and move past the found w[i] == 0
    } while (1);                    // infinite loop breaks when marking would go outside of buffer

    // phase 2, search for remaining large primes, which don't need to mark sieve
    do {
        if (prime += 2, *w++) continue;
        *p++ = prime;
    } while (w < w_end);

    return p - g_primes;
}

uint16_t sieve_c_5_end() { return (uint16_t)&sieve_c_5; }

the stuff which didn't translate well:
- mark in asm is done with high byte of "w" address, as the "work" buffer is not expected to start in $0000..$00FF address region (there's ROM in ZX Spectrum), so not needed explicit "1" in asm
- the overflow while advancing "w" during marking is easier to check in asm, I did add extra condition to source to make it check 16b pointer overflow, but I didn't test it in x64, so I just hope these work also when compiled for Z80.

Por Ped7g

Expert (67)

imagem de Ped7g

26-02-2022, 16:52

I have been looking a bit into hand-written asm quicksort, and it dawned on me how difficult the 16bit pointer math is with Z80, for example if you have two pointers for word array: Abegin, Aend, and you want to test if those pointers contain empty or one-word size only, but you don't want to do any assumption about the particular absolute address , it becomes such a pain... even just having array aligned to end of 64ki address space makes such construct fail (end pointer is 0x10000) and requires either 24/32b pointer math, or I don't know how current C compilers deal with it.

Or do you must write C code for Z80 in a way like the original test is done, with the A pointer + length in elements, because two pointer stuff would generate correct, but huge code?

@salutte:
does it make sense to produce asm variant with certain assumptions (the array is never too close to 64ki boundaries so going +-2 is safe), just for the sake of comparing C output to hand-asm-with-assumptions, or only "correct" asm variants make sense?

Por Ped7g

Expert (67)

imagem de Ped7g

26-02-2022, 18:43

Been thinking about it more, and it's impossible the Z80 C compilers deal with it by generating fully correct code treating also edge cases, that would bloat the resulting machine code to the point nobody would ever use them.

So I guess taking some shortcuts in asm is just as fine, and just documenting the constraints is then enough and it's still comparable.

Por salutte

Master (165)

imagem de salutte

26-02-2022, 19:24

Hi Ped7g!

I'm travelling this days and will have limited access to internet until tuesday. I will add and test all your codes asap! Thanks a lot!

And about the shortcuts, I fully agree, its perfectly acceptable.

Por Ped7g

Expert (67)

imagem de Ped7g

01-03-2022, 14:19

So, asm quicksort, didn't think about any constraints and just went after fastest asm I can do (except maybe tampering with stack, but considering the result, it would be difficult to gain anything by that trick) (so I don't think any compiler will ever produce this).
https://gist.github.com/ped7g/0c4e94796b474994ed88d0bdd1bf2f25

@salutte: no pressure, I have fun writing code, and I very appreciate you take your time to incorporate it and measure it under your conditions (with MSX machine I guess? I'm "home" at ZX, so we have slightly different instructions timings), so you are saving me from all the boring stuff. Thank you.

I will skip md5, already spent lot of time on the previous two, and md5 does exist in asm in InterNestor Lite: https://github.com/Konamiman/MSX/blob/master/SRC/INL1/INL-RE...

Por salutte

Master (165)

imagem de salutte

04-03-2022, 13:02

I updated z80_babel with better explanations of how to replicate the results (still very hard though), and I added the codes from Ped7g. I also tried to add the md5 from konamiman, but it uses macros extensively and I did not have the time to translate it in detail.

There are a lot of interesting take outs. For example, the carefully optimized C versions of the Sieve by Ped7G create smaller and faster binaries... if we use SDCC as a front and back end! Conversely, if we use llvm as a front end, which tends to provide better results for generic code, the optimized C versions end up being slower!

I tested approximately a googol combinations of languages, compiler options, and algorithm variants. In short, ASM is faster than compiled languages by a factor of 3x-6x and smaller by a factor of 2x. Also, there is no consistent language/ compiler /compilation option that outperforms the rest, with huge variations depending on the algorithm. But using llvm as a front end, sdcc as a back end, -Os as a compiler flag, and 50K allocations seems to provide adequate results overall.

I'll keep experimenting with MDL, but I do not plan to add more features to z80_babel, as this project has already fulfilled my needs. Next step would be, maybe, to work on a decent z80 backend for llvm, but this will be left for future work.

Página 4/4
1 | 2 | 3 |