Code optimization for speed

Page 1/2
| 2

By Roukan

Supporter (8)

Roukan's picture

20-10-2012, 16:05

Hi All

I'm still a relative z80 novice, and was wondering if any of the gurus here could cast their beady eye over this code snippet and suggest any optimization for max speed.

The code takes 8 bytes (for example a character definition ) and extracts the 8 bits for each of the 8 bytes and stores them in a 64 byte buffer as a 0 or 1 for later use.

Many thanks in advance for your time and effort.

Regards

Roukan

Magnify:	ld b,8	        ; Number of bytes to process
		ld de,MagBuff	; 64 byte buffer
		ld hl,CharBuff	; 8 byte buffer

.ByteLoop:	ld c,(hl)	; Get a byte
		push bc		; Save byte counter
		ld b,8		; Number of bits to process

.BitLoop:	xor a		; Zero accumulator
		rlc c		; Move bit into carry flag
		adc a,0		; Add carry to accumulator
		ld (de),a	; Store result in buffer
		inc de	        ; Point to next byte in buffer
		djnz .BitLoop	; Loop if <8 bits processed
		
		inc hl		; Point to next byte in buffer
		pop bc		; Recall byte counter
		djnz .ByteLoop	; Loop if <8 bytes processed
			
		ret

CharBuff:	DS 8
MagBuff:	DS 64
Login or register to post comments

By Roukan

Supporter (8)

Roukan's picture

20-10-2012, 16:48

One thing I came up with after some thought was the: "ADC A,0" could be replaced with "RLA".

I believe this saves 3 cycles (x 64) and 1 byte overall.

By ARTRAG

Enlighted (6923)

ARTRAG's picture

20-10-2012, 17:17

I assume zero in the buffer and that the char definition can be corrupted

Magnify:	
                ld a,8	        ; Number of bytes to process
		ld hl,MagBuff	; 64 byte buffer
		ld de,CharBuff	; 8 byte buffer

.ByteLoop:	ex af,af'
                ld a,(de)	; Get a byte
                ld c,a
		ld b,8		; Number of bits to process
	        
.BitLoop:	
		rlc c		; Move bit into carry flag
		rl (hl)		; Add carry to accumulator
		inc hl	        ; Point to next byte in buffer
		djnz .BitLoop	; Loop if <8 bits processed
		
		inc de		; Point to next byte in buffer
                ex af,af'
		dec a		; Recall byte counter
		jr nz .ByteLoop	; Loop if <8 bytes processed
			
		ret

CharBuff:	DS 8
MagBuff:	DS 64

Warning: untested code

By GhostwriterP

Paladin (683)

GhostwriterP's picture

20-10-2012, 17:12

ld c,(de) Question does that work?

By ARTRAG

Enlighted (6923)

ARTRAG's picture

20-10-2012, 17:16

very untested Tongue
I patched the code, thanks ghost

By wouter_

Champion (508)

wouter_'s picture

20-10-2012, 18:30

ARTRAG wrote:

I assume zero in the buffer and ...

That's quite a big assumption. Also rl (hl) is a slow instruction (17 cycles), so if I counted correctly the inner loop in ARTAG's code is just 1 cycle faster compared to Roukan's improved version. (And if you first need to zero the whole buffer, then it will be slower in total).

Here's my attempt:

First a small question would it be OK to write 0/255 (instead of 0/1) to the output buffer? In some cases 0/255 is also better than 0/1, because it allows faster (branch-less) code when you're actually using the values in the output buffer. E.g. to select between two values (e.g. two colors) depending on the value in the buffer, you can now do ld a,(hl) ; and b ; add c (with register C containing one color value and register B containing the difference between the two color values).

If 0/255 is allowed you can in Roukan's routine replace xor a ; rlc c ; rla with rlc c ; sbc a,a. This already saves 5 cycles in the inner loop.

You can save another 4 cycles per iteration by abusing the stack pointer (actually you save 8 cycles per 2 iterations). I also unrolled the inner loop 4x, that eliminates the need for the inner djnz instruction (saves approx 14 cycles per iteration). Here's my final routine:

Magnify:	di
		ld	(save_sp),sp
	
		ld	hl,CharBuff+8
		ld	sp,MagBuff+64
		ld	b,8
	
ByteLoop:	dec	hl
		ld	c,(hl)
	
		rrc	c
		sbc	a,a
		ld	d,a
		rrc	c
		sbc	a,a
		ld	e,a
		push	de
	
		rrc	c
		sbc	a,a
		ld	d,a
		rrc	c
		sbc	a,a
		ld	e,a
		push	de
	
		rrc	c
		sbc	a,a
		ld	d,a
		rrc	c
		sbc	a,a
		ld	e,a
		push	de
	
		rrc	c
		sbc	a,a
		ld	d,a
		rrc	c
		sbc	a,a
		ld	e,a
		push	de
	
		djnz	ByteLoop
	
		ld	sp,(save_sp)
		ei
		ret

save_sp:	dw	0
CharBuff:	ds	8
MagBuff:	ds	64

BTW i also have a version that writes 0/1 (not 0/255) in the output buffer, it's 1 byte longer and 3 cycles slower (3 cycles in total, not per output byte). But I like this version better.

By Roukan

Supporter (8)

Roukan's picture

20-10-2012, 19:49

Quote:

ARTRAG: I assume zero in the buffer and that the char definition can be corrupted

Thanks for the reply ARTRAG.

Whilst I was sat eating my tea, I thought I should of mentioned these two points in my original post.

The char definition cannot be corrupted and I had no plan on the buffer being initialised, and it will be constantly changed by numerous calls to this subroutine.

By Roukan

Supporter (8)

Roukan's picture

20-10-2012, 19:54

Quote:

Wouter: First a small question would it be OK to write 0/255 (instead of 0/1) to the output buffer?

Thanks also for your reply Wouter.

I don't see a problem with 0/255 - It wouldn't really affect the way the buffer will be utilised - so that would be fine.

By Roukan

Supporter (8)

Roukan's picture

20-10-2012, 20:27

wouter_ wrote:

Here's my attempt:

First a small question would it be OK to write 0/255 (instead of 0/1) to the output buffer? In some cases 0/255 is also better than 0/1, because it allows faster (branch-less) code when you're actually using the values in the output buffer. E.g. to select between two values (e.g. two colors) depending on the value in the buffer, you can now do ld a,(hl) ; and b ; add c (with register C containing one color value and register B containing the difference between the two color values).

If 0/255 is allowed you can in Roukan's routine replace xor a ; rlc c ; rla with rlc c ; sbc a,a. This already saves 5 cycles in the inner loop.

You can save another 4 cycles per iteration by abusing the stack pointer (actually you save 8 cycles per 2 iterations). I also unrolled the inner loop 4x, that eliminates the need for the inner djnz instruction (saves approx 14 cycles per iteration). Here's my final routine:

BTW i also have a version that writes 0/1 (not 0/255) in the output buffer, it's 1 byte longer and 3 cycles slower (3 cycles in total, not per output byte). But I like this version better.

Wow, quite some optimisation you have managed there Smile

I read through your code to make sure I understood what it was all doing - and I managed that.

Loop unrolling / Stack pointer abuse:
A couple of things I shall always try to bear in mind when writing code in the future, being what I would call a bit of a novice on the z80 front (mainly due to working with it for a while / then long period of absence / when I come back - I'm refreshing what I previously learned and not picking up much new) this sort of feedback is always very welcome indeed.

I've been into this stuff on and off since the mid 80's, though I don't have any actual real hardware ATM (emulation only for now) - I still love tinkering around with MSX / Tatung Einstein coding - my girlfriend thinks I'm mad and live in the past Tongue - lol

Sadly the only Tatung Einstein emulator I can find (MESS) appears to be broken where interrupts are concerned - which is a real pain Sad - at least the MSX is in good shape from an emulation point of view.

Thanks once again for the replies - I'm sure I'll have a few more bits and pieces to run by you in the future - maybe one day I'll feel qualified enough to offer advice to others in my situation - we can but hope Smile

Regards

Roukan

By hit9918

Prophet (2927)

hit9918's picture

20-10-2012, 22:12

A way to further optimize this thing maybe could go like this: not do this thing Smile

@Roukan, what do you need it for? It smells like big letters scrolling onscreen?

For example the border fill of a scroller:
With the current code, every 8th frame you get a hiccup that got to extract all 8 bits of a byte in that frame.
Having an 8 byte buffer where you roll the bytes, every scroll step needs to just extract the top bit of the byte and then rotate the byte.

By RetroTechie

Paragon (1563)

RetroTechie's picture

20-10-2012, 23:01

Good thinking hit9918! Smile2 and I couldn't agree more.

Normally you'd do something like this if you need some intermediate 'value', you need it often, and it takes lots of work to obtain. So you'd do the heavy lifting once, store the result, and use that often.

In this case I don't see the point: you start with a sequence of 8 bytes, and for each byte you're (apparently) only interested in the 8 bits in there, each bit as an on/off value. That's a) very compact, and b) as 'ready-to-eat' as it gets. For example (and just an example!): put start in IX, and you can test each bit in a single Z80 instruction: bit n,(IX+d)

(and then do something or not according to Z flag)
So what's the point of the above routine? What does that extra 64 byte buffer with yes/no values bring you, other than wasting 64 bytes? Question (apart from wasting space, CPU cycles & effort on above routine)

Page 1/2
| 2