# z80 gurus - can this be further optimized?


I'm trying to accomplish the following with as few T-states as possible:

```
...
var1=table[idx++]
var1=var1-const1
var1=abs(var1)
var2=table[idx++]
var2=var2-const2
var2=abs(var2)
result=var1+var2
...
```

I've come up with the following Z80 code:

```
...
ld a,(hl)
inc hl
sub d
jp p,noneg1
neg
noneg1:
ld c,a
ld a,(hl)
inc hl
sub e
jp p,noneg2
neg
noneg2:
...
```

Anyone see any further optimizations possible?

You could change the JP to JR, depending on how often you have to neg the result. JR takes 7 cycles if no jump is needed, and 12 cycles if the jump is needed. JP always takes 10 cycles.

I don't think the Z80 has a JR on the P condition, does it?

Hmm, you're right. If the data consists of unsigned bytes, then you could use JR NC instead, but if the data are signed bytes then I guess this is pretty much the best you can do.

Yes, data is signed bytes. Well, if this is the best solution I'm satisfied, it means that my Z80 coding is up to par, so thanks!

And just in case you're timing your code tightly and need to count the exact number of T-states of every instruction, keep in mind that on MSX systems every instruction is penalized with 1 or 2 T-states (the so-called M1 wait state: one extra T-state per opcode fetch, so prefixed instructions pay it twice). So, for example, the JR instruction uses 8 T-states or 13 T-states...

```
...
ld a,(hl)
inc hl
sub d
jp p,noneg1
neg
noneg1:
ld c,a
ld a,(hl)
inc hl
sub e
jp p,noneg2
neg
noneg2:
...
```

I see two possible optimizations:

1) Use a lookup table to implement the abs function. Instead of
jp p,noneg / neg
you can do
ld l,a / ld a,(hl)
Register H must already hold the high byte of a 256-byte aligned lookup table.
The first approach takes 11 or 21 cycles (11 when the jump is taken, 21 when it falls through to the neg). The second approach always takes 13 cycles. These counts include the MSX M1 wait state.

2) You can 'abuse' the stack to read data from a table and adjust an index. Though you'll have to store the data in reverse order compared to the original code.
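A minimal sketch of how the abs table from point 1 might be filled once at startup. The routine name `build_abs_table` and the address `0xC000` are made up for illustration; any address whose low byte is zero would do, and the `equ` syntax varies between assemblers. Note that -128 has no positive counterpart in a byte, so entry 0x80 stays 0x80.

```
abs_table:      equ     0xC000          ; hypothetical address, low byte must be 0

build_abs_table:
        ld      hl,abs_table
fill:
        ld      a,l             ; index = the signed byte value itself
        or      a               ; set the sign flag from a
        jp      p,store         ; positive values are stored unchanged
        neg                     ; negative: a = -a (0x80 stays 0x80)
store:
        ld      (hl),a
        inc     l               ; next entry; wraps to 0 after 0xFF
        jr      nz,fill         ; all 256 entries are written once l is 0 again
        ret
```

With the table in place, `ld l,a / ld a,(hl)` returns abs(a) as long as H still holds the high byte of `abs_table`.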

This is the routine I came up with:

```
        ld      (save_sp),sp
        ld      sp,table_end
        ld      h,abs_table / 256 ; must be 256 bytes aligned
        ld      de,constants

        pop     bc              ; 11
        ld      a,b             ;  5
        sub     d               ;  5
        ld      l,a             ;  5
        ld      b,(hl)          ;  8
        ld      a,c             ;  5
        sub     e               ;  5
        ld      l,a             ;  5
        ld      a,(hl)          ;  8
        add     a,b             ;  5
                                ; ==
                                ; 62

        ld      sp,(save_sp)
```

If I counted correctly, your routine took between 72 and 92 cycles. This routine always takes 62 cycles.
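For reference, the 72 to 92 figure can be reproduced roughly as follows. The counts assume MSX timing (the M1 wait state mentioned earlier in the thread) and assume that the elided tail of the original routine is a single `add a,c` forming the sum. That last instruction is a guess, not something taken from the original post.

```
        ld      a,(hl)          ;  8
        inc     hl              ;  7
        sub     d               ;  5
        jp      p,noneg1        ; 11
        neg                     ; 10  (only when the difference is negative)
noneg1:
        ld      c,a             ;  5
        ld      a,(hl)          ;  8
        inc     hl              ;  7
        sub     e               ;  5
        jp      p,noneg2        ; 11
        neg                     ; 10  (only when the difference is negative)
noneg2:
        add     a,c             ;  5  (assumed: result = var1 + var2)
                                ; ==
                                ; 72 with no neg, 82 with one, 92 with both
```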

.

If an error of +/- 1 on the result is acceptable, an even faster version is possible:

```
        ld      (save_sp),sp
        ld      sp,table_end
        ld      d,abs_table / 256
        ld      bc,negative_constants

        pop     hl              ; 11
        add     hl,bc           ; 12
        ld      e,h             ;  5
        ld      h,d             ;  5
        ld      a,(de)          ;  8
        add     a,(hl)          ;  8
                                ; ==
                                ; 49
```

The error comes from the 'add hl,bc' instruction, because there can be a carry from bit 7 to bit 8 (so register H can be one too high).
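To make the carry concrete, here is a small worked case. The values are made up for illustration: table bytes 5 and 16, constants 3 and 8, so the negated constants packed into BC are 0xFD and 0xF8.

```
        ld      hl,0x0510       ; as if popped from the table: h = 5, l = 16
        ld      bc,0xFDF8       ; b = -3 (0xFD), c = -8 (0xF8)
        add     hl,bc           ; hl = 0x0308
                                ; l = 0x08 = 16 - 8, correct
                                ; h = 0x03, but 5 - 3 = 2: the carry out of
                                ; the low byte made h one too high
```

The low-byte addition carries whenever L, read as an unsigned byte, is at least the second constant (and that constant is nonzero), so H is off by one in exactly those cases.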

Sdw, how are you using this routine? wouter_'s suggestion works well if you execute the calculations many times in a row without any call/ret or other stack modifications in between. I was looking at something similar that didn't use the stack, but I made an error, hence my blank post.

Thanks wouter, that's some very nice tricks with the sp-abuse!

dvik:
Yes, I'm sorry, I should have explained the context of the routine better. To keep it simple, I cut out some parts in the middle and simplified before posting, but to really tell how far it can be optimized you would probably need the whole thing, since some registers are not free due to the overall setup. For example, I run this in a loop using djnz, so using bc is probably out. Also, the values I called 'constants' actually change between loops, so it is a bit more complex.
Anyway, here's the whole thing, in case anyone is interested:

```
routine:
ld b,32
C1:
ld d,0
C2:
ld e,0
loop:
ld a,(hl)
inc hl
sub e
jp p,noneg1
neg
noneg1:
ld  c,a
ld a,(hl)
inc hl
sub d
jp p,noneg2
neg
noneg2:
ld ixl,a
ld a,(hl)
inc hl
C3:
sub	0
jp p,noneg3
neg
noneg3:
ld c,a
ld a,(hl)
inc hl
sub	d
jp p,noneg4
neg
noneg4:
ld iyl,a
ld a,(ix+0)
or (iy+0)
out (0x98),a
inc	d
djnz	loop
ret
```

The usage then looks like this:

```
ld hl,table1
ld ix,table2
ld iy,table3

ld a,XX
ld (C1+1),a
ld a,XX
ld (C2+1),a
ld a,XX
ld (C3+1),a
call routine
ld a,XX
ld (C1+1),a
ld a,XX
ld (C2+1),a
ld a,XX
ld (C3+1),a
call routine
ld a,XX
ld (C1+1),a
ld a,XX
ld (C2+1),a
ld a,XX
ld (C3+1),a
call routine
; ... repeated a number of times
```

So it will probably be hard to use the SP-modifying approach, but it might be possible, I will give it some thought!
