z80 gurus - can this be further optimized?

Page 1/3
| 2 | 3

By Sdw

Resident (50)

Sdw's picture

16-10-2008, 13:48

I'm trying to accomplish the following with as few T-states as possible:

...
var1=table[idx++]
var1=var1-const1
var1=abs(var1)
var2=table[idx++]
var2=var2-const2
var2=abs(var2)
result=var1+var2
...

I've come up with the following Z80 code:

...
		ld a,(hl)		
		inc hl
		sub d
		jp p,noneg1
		neg	
noneg1:		
		ld c,a
		ld a,(hl)		
		inc hl
		sub e
		jp p,noneg2
		neg			
noneg2:
		add a,c
...

Anyone see any further optimizations possible?

Login or register to post comments

By Arjan

Paladin (714)

Arjan's picture

16-10-2008, 13:58

You could change the JP to JR, depending on how often you have to neg the result. JR takes 7 cycles if no jump is needed, and 12 cycles if the jump is needed. JP always takes 10 cycles.

By Sdw

Resident (50)

Sdw's picture

16-10-2008, 14:07

I don't think the Z80 has a JR on the P condition, or?

By Arjan

Paladin (714)

Arjan's picture

16-10-2008, 14:33

Hmm, you're right. If the data consists of unsigned bytes, then you could use JR NC instead, but if the data are signed bytes then I guess this is pretty much the best you can do.

By Sdw

Resident (50)

Sdw's picture

16-10-2008, 14:52

Yes, data is signed bytes. Well, if this is the best solution I'm satisfied, it means that my Z80-coding is up to par, so thanks! Smile

By jltursan

Prophet (2190)

jltursan's picture

16-10-2008, 23:15

And just in case you're timing tight your code and need to count the exact number of t-states of every instruction, keep in mind that on MSX systems every opcode is penalized with 1 or 2 T-states (the so called M1 wait state). So, for example, the JR instruction uses 8 T-states or 13 T-states...

By wouter_

Champion (418)

wouter_'s picture

17-10-2008, 23:37

...
		ld a,(hl)		
		inc hl
		sub d
		jp p,noneg1
		neg	
noneg1:		
		ld c,a
		ld a,(hl)		
		inc hl
		sub e
		jp p,noneg2
		neg			
noneg2:
		add a,c
...

I see two possible optimizations:

1) Use a lookup table to implement the abs function:
Instead of
jp m / neg
You can do
ld l,a / ld a,(hl)
Register h must already be filled with the high byte of a 256-byte aligned lookup table.
The first approach takes 11 or 21 cycles (depending on whether jump is taken or not). The second approach always takes 13 cycles.

2) You can 'abuse' the stack to read data from a table and adjust an index. Though you'll have to store the data in reverse order compared to the original code.

This is the routine i came up with:

        ld      (save_sp),sp
        ld      sp,table_end
        ld      h,abs_table / 256 ; must be 256 bytes aligned
        ld      de,constants


        pop     bc              ; 11
        ld      a,b             ;  5
        sub     d               ;  5
        ld      l,a             ;  5
        ld      b,(hl)          ;  8
        ld      a,c             ;  5
        sub     e               ;  5
        ld      l,a             ;  5
        ld      a,(hl)          ;  8
        add     a,b             ;  5
                                ; ==
                                ; 62

        ld      sp,(save_sp)

If i counted correctly your routine took between 72 and 92 cycles. This routine always takes 62 cycles.

By dvik

Prophet (2200)

dvik's picture

18-10-2008, 04:22

.

By wouter_

Champion (418)

wouter_'s picture

18-10-2008, 08:53

If an error of +/- 1 on the result is acceptable, an even faster version is possible:

	ld	(save_sp),sp
	ld	sp,table_end
	ld	d,abs_table / 256
	ld	bc,negative_constants

	pop	hl		; 11
	add	hl,bc		; 12   **
	ld	e,h		;  5
	ld	h,d		;  5
	ld	a,(de)		;  8
	add	a,(hl)		;  8
				; ==
				; 49

The error comes from the 'add hl,bc' instruction, because there can be a carry from bit 7 to bit 8 (so register H can be one too high),

By dvik

Prophet (2200)

dvik's picture

18-10-2008, 18:35

Sdw, how are you using this routine. wouter_'s suggestion works well if you execute the calculations many times in a row without any call/ret or other stack mods. I was looking at something similar that didn't use the stack but I made an error, hence my blank post.

By Sdw

Resident (50)

Sdw's picture

19-10-2008, 00:16

Thanks wouter, that's some very nice tricks with the sp-abuse!

dvik:
Yes, I'm sorry, I should have explained the context of the routine better. To make it simple, I just cut out some parts in the middle and simplified before I posted, but to really be able to tell how far it can be optimized you would probably need the whole thing, since some registers are not free for use due to the overall setup, for example I run this in a loop, using dnjz, so using bc is probably out. Also, the stuff I called 'constants' are actually changed between loops, so it is a bit more complex.
Anyway here's the whole thing, just in case anyone is interested:


routine:
		ld b,32
C1:
		ld d,0		
C2:		
		ld e,0
loop:		
		ld a,(hl)		
		inc hl
		sub e
		jp p,noneg1	
		neg	
noneg1:		
		ld  c,a
		ld a,(hl)		
		inc hl
		sub d
		jp p,noneg2
		neg			
noneg2:
		add	a,c
		ld ixl,a
		ld a,(hl)		
		inc hl		
C3:		
		sub	0
		jp p,noneg3
		neg	
noneg3:
		ld c,a		
		ld a,(hl)		
		inc hl
		sub	d
		jp p,noneg4
		neg	
noneg4:		
		add	a,c
		ld iyl,a		
		ld a,(ix+0)
		or (iy+0)				
		out (0x98),a				
		inc	d		
		djnz	loop
		ret

The usage then looks like this:

ld hl,table1
ld ix,table2
ld iy,table3

(set up VRAM output adr)
ld a,XX
ld (C1+1),a
ld a,XX
ld (C2+1),a
ld a,XX
ld (C3+1),a
call routine
(set up VRAM output adr)
ld a,XX
ld (C1+1),a
ld a,XX
ld (C2+1),a
ld a,XX
ld (C3+1),a
call routine
(set up VRAM output adr)
ld a,XX
ld (C1+1),a
ld a,XX
ld (C2+1),a
ld a,XX
ld (C3+1),a
call routine
.. 
repeat a number of times

So it will probably be hard to use the SP-modifying approach, but it might be possible, I will give it some thought!

Page 1/3
| 2 | 3