Assembly Z80, best way to divide by 16

Page 1/2
| 2

By albs_br

Champion (456)

18-09-2022, 22:07

Hi guys, I need to improve this code. So far I have these results (using standard Z80 times, not M1 times).

The goal is to divide the value in the H register by 16 (ideally without destroying the HL or DE registers):

    ; ; divide by 16 (32 cycles)
    ; srl     h
    ; srl     h
    ; srl     h
    ; srl     h

    ; ; divide by 16 (27 cycles)
    ; ld      a, h
    ; and     11110000b
    ; rrca
    ; rrca
    ; rrca
    ; rrca

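    ; divide by 16 (18 cycles)
    ld      b, base_addr_div_by_16_LUT  ; B = high byte of the LUT address
    ld      c, h                        ; C = value to divide (index into the LUT)
    ld      a, (bc)                     ; A = LUT[H] = H / 16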

The third one needs a 256-byte lookup table with all values pre-loaded (it also needs to be aligned to a 256-byte boundary, i.e. the low byte of its address must be 0x00).
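For reference, a minimal sketch of how such a table could be filled at startup rather than stored pre-computed. The label div16_lut and the address 0C000h are assumptions for illustration only (any RAM address whose low byte is 00h would do), and the routine name is hypothetical:

    div16_lut   equ 0C000h          ; hypothetical 256-byte aligned RAM area

    ; Fill div16_lut so that div16_lut[n] = n / 16 for n = 0..255.
    ; Destroys A and HL.
    init_div16_lut:
        ld      hl, div16_lut       ; L = 0 because the table is 256-byte aligned
    init_div16_loop:
        ld      a, l                ; A = index n
        rrca
        rrca
        rrca
        rrca                        ; swap the two nibbles of A
        and     0Fh                 ; keep only n >> 4
        ld      (hl), a             ; table[n] = n / 16
        inc     l                   ; next index; Z is set when L wraps to 0
        jr      nz, init_div16_loop
        ret

Because only L is incremented, the loop cannot run past the end of the table, which is one more reason to keep it aligned.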

Speed is more crucial here than size.

Is there a better solution?

By theNestruo

Champion (387)

18-09-2022, 23:01

If speed is more crucial here than size, go for the look up table. If you do more than one division, you can even "reuse" b.
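A rough sketch of what reusing b could look like, assuming a hypothetical label div16_lut at a 256-byte aligned address (the address 0C000h and the "/ 256" expression for the high byte are illustrative; your assembler may spell the latter differently):

    div16_lut   equ 0C000h          ; hypothetical aligned table, table[n] = n / 16

        ld      b, div16_lut / 256  ; high byte of the table, loaded only once

        ld      c, h
        ld      a, (bc)             ; A = H / 16

        ; ... use A here, as long as B is left untouched ...

        ld      c, l
        ld      a, (bc)             ; A = L / 16

Each division after the first then costs only ld c,r plus ld a,(bc): 4 + 7 = 11 cycles with standard Z80 timing, or 13 with the MSX M1 wait states.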

Btw, as we are in an MSX forum, I think you are not taking into account M1 wait states in your measures. They should read: 40 cycles, 33 cycles, and 21 cycles.
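Spelled out per instruction, the arithmetic behind those figures looks like this. On MSX every M1 (opcode fetch) cycle gets one extra wait state, and CB-prefixed instructions such as srl have two M1 cycles. The label div16_lut and its address are assumptions for illustration:

    div16_lut   equ 0C000h          ; hypothetical 256-byte aligned table

    ; Variant 1: 4 x srl h
        srl     h                   ;  8 cycles, 2 x M1 -> 10 on MSX
        srl     h                   ;  8 -> 10
        srl     h                   ;  8 -> 10
        srl     h                   ;  8 -> 10   total 32 -> 40

    ; Variant 2: mask and rotate
        ld      a, h                ;  4 ->  5
        and     0F0h                ;  7 ->  8
        rrca                        ;  4 ->  5
        rrca                        ;  4 ->  5
        rrca                        ;  4 ->  5
        rrca                        ;  4 ->  5   total 27 -> 33

    ; Variant 3: lookup table
        ld      b, div16_lut / 256  ;  7 ->  8
        ld      c, h                ;  4 ->  5
        ld      a, (bc)             ;  7 ->  8   total 18 -> 21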

By albs_br

Champion (456)

19-09-2022, 00:03

theNestruo wrote:

Btw, as we are in an MSX forum, I think you are not taking into account M1 wait states in your measures. They should read: 40 cycles, 33 cycles, and 21 cycles.

The VS Code extension that I use shows standard Z80 timing, is there one that shows MSX times?

By theNestruo

Champion (387)

19-09-2022, 08:55

albs_br wrote:

The VS Code extension that I use shows standard Z80 timing, is there one that shows MSX times?

If you are using Z80 Assembly meter, there is a z80-asm-meter.platform setting. Set it to msx.

Quote:
  • z80-asm-meter.platform: Controls the instruction set to use and the timing information to display:
    • z80 (default): Uses the default Z80 instruction set and shows default timing information.
    • msx: For MSX developers. Uses the default Z80 instruction set and shows Z80+M1 timing information (MSX standard).
    • (...)

By gdx

Enlighted (5848)

19-09-2022, 09:03

Another method:

	ld	hl,0D000h
	ld	(hl),Value	; Put the value to divide by 16 at 0D000h
	xor	a
	rld			; A = the value divided by 16

The division takes 22 cycles (+ 3 for M1).

By Metalion

Paragon (1596)

19-09-2022, 16:55

gdx wrote:

Another method:

	ld	hl,0D000h
	ld	(hl),Value	; Put the value to divide by 16 at 0D000h
	xor	a
	rld			; A = the value divided by 16

The division takes 22 cycles (+ 3 for M1).

Interesting, but I count 44 cycles, and 47 if the value is not a register.

By theNestruo

Champion (387)

19-09-2022, 17:26

rld (and rrd) was my first idea, but they take 20 cycles (including M1), so any previous setup will make that solution slower than the LUT solution above. Particularly if you need to preserve the HL pair.

By gdx

Enlighted (5848)

20-09-2022, 11:01

I gave the method with RLD because it makes it possible to do a series of divisions without having to set HL or do XOR A each time. Once is enough. Then the value to divide can be supplied in register B, C, D or E (e.g. LD (HL),B). It is even possible to use IX instead of HL. It's slower, but in some cases it can avoid manipulations and ultimately save time.
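A minimal sketch of such a series, assuming a hypothetical scratch byte at 0D000h (the address used in the earlier example) and values arriving in B and C:

    scratch     equ 0D000h          ; hypothetical scratch byte in RAM

        ld      hl, scratch         ; set up once
        xor     a                   ; high nibble of A must be 0; set up once

        ld      (hl), b
        rld                         ; A = B / 16, scratch byte is overwritten

        ; ... use A here without setting its high nibble ...

        ld      (hl), c
        rld                         ; A = C / 16

RLD only replaces the low nibble of A, so the high nibble stays 0 from one division to the next as long as the code in between does not change it. Each division after the setup costs ld (hl),r plus rld: 7 + 18 = 25 cycles, or 28 with M1 wait states.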

Also, the method below is almost the same as the one with rrca when the M1 cycle is taken into account.

     ld     a,h
     srl     a
     srl     a
     srl     a
     srl     a

By santiontanon

Paragon (1698)

20-09-2022, 16:24

I just asked MDL to produce the optimal sequence for this (without using memory), and it came up with these two alternatives. The second is the same as @gdx proposed, but the first was curious (both of them with the same timing).

    ; alternative 1
    xor     a
    xor     h
    rra
    sra     a
    sra     a
    sra     a

    ; alternative 2
    ld      a, h
    srl     a
    srl     a
    srl     a
    srl     a

It comes up with a few other alternatives with the same timing, but they are basically variations :)
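For anyone wondering why the xor/rra/sra sequence gives an unsigned result even though sra is an arithmetic shift, here is a trace with H = 0A7h (an arbitrary example value, loaded here only for illustration):

        ld      h, 0A7h             ; example input, 10100111b
        xor     a                   ; A = 00h, CF = 0
        xor     h                   ; A = 0A7h, CF = 0 (xor always clears carry)
        rra                         ; A = 53h: the cleared carry enters bit 7
        sra     a                   ; A = 29h: bit 7 is 0, so sra acts like srl
        sra     a                   ; A = 14h
        sra     a                   ; A = 0Ah = 0A7h / 16

The rra guarantees that bit 7 is 0 before the first sra, so the sign extension never brings in a 1.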

By wouter_

Champion (492)

20-09-2022, 17:13

santiontanon wrote:

I just asked MDL to produce the optimal sequence for this (without using memory) ...

Did you maybe also exclude some immediate values from the search-space(*)? I'm asking because albs_br's original solution (ld a,h ; and 0xf0 ; 4x rrca) is faster than these "optimal" solutions.

(*) It's very typical for super-optimizers to only allow a limited number of immediate values. Otherwise the search-space explodes.

By Ped7g

Resident (61)

20-09-2022, 22:07

I have been trying to port the upkr unpacker to Z80 recently, and part of that decompression algorithm is the expression `((prob+8)>>4)`. My initial version was:

    and     $F8
    rra
    rra
    rra
    rra
    adc     a,something             ; something is something else to be added anyway
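As a worked example of how the rounding falls out of the carry, here is a trace with prob = 9Ch (an arbitrary value), using 0 as a stand-in for "something":

        ld      a, 9Ch              ; example prob, 10011100b
        and     0F8h                ; A = 98h, CF = 0 (low three bits dropped)
        rra                         ; A = 4Ch, CF = 0
        rra                         ; A = 26h, CF = 0
        rra                         ; A = 13h, CF = 0
        rra                         ; A = 09h, CF = 1 (bit 3 of prob)
        adc     a, 0                ; A = 0Ah = (9Ch + 8) >> 4

Bit 3 of prob is the last bit shifted out, so it ends up in the carry, and adc adds it back as the +8 rounding term.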

The final version is a bit different because I also have a use for the carry coming in as input (making it +0/+16 depending on the initial carry), so it's:

    rra                             ; + (bit<<4) ; part of -prob_offset, needs another -16
    and     $FC                     ; clear/keep correct bits to get desired (prob>>4) + extras, CF=0
    rra
    rra
    rra                             ; A = (bit<<4) + (prob>>4), CF=(prob & 8)
    adc     a,-16                   ; A = (bit<<4) - 16 + ((prob + 8)>>4) ; -prob_offset = (bit<<4) - 16

https://github.com/ped7g/upkr/blob/z80_ped7g/z80_unpacker/un...

(The example snapshot is for the ZX Spectrum, but technically the code should also work on MSX. You need to build the packer from the Rust source to test it with your own data. I asked exoticorn, the author of upkr, to provide me with a few packed ZX screens to avoid dealing with that part, as I was only interested in writing the Z80 code and having fun with that. :) )

santiontanon wrote:

...

BTW, if you are bored, would you try MDL on unpack.asm? Maybe I overlooked some further optimisation.
