z80 speedup by struct of arrays / 8bit style


By hit9918

Prophet (2927)

02-08-2010, 01:41

say you have a sprite struct { x,y,c,p,xlo,dx,ylo,dy } (the first 4 members are for 9918 usage).

struct of arrays style: x[256], y[256], c[256], ... the addresses of all arrays must be 256-byte aligned:

C000 spr_x equ 0xC0 ; take care your definitions are right for LD H,label usage.
C100 spr_y equ 0xC1
C200 spr_c ...
C300 spr_p
C400 spr_xlo
C500 spr_dx
C600 spr_ylo
C700 spr_dy
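
A rough C model of this layout, just to make the addressing explicit (the helper name soa_addr is made up for illustration): the high address byte selects the member, the low byte selects the sprite, so reaching any field needs no multiply and no 16-bit add.

```c
#include <stdint.h>

/* Page numbers (high address byte) of each member array, as in the table above. */
enum { SPR_X = 0xC0, SPR_Y = 0xC1, SPR_C = 0xC2, SPR_P = 0xC3,
       SPR_XLO = 0xC4, SPR_DX = 0xC5, SPR_YLO = 0xC6, SPR_DY = 0xC7 };

/* Hypothetical helper: the address HL holds after LD H,member and LD L,index. */
static uint16_t soa_addr(uint8_t member_page, uint8_t index)
{
    return (uint16_t)((member_page << 8) | index);
}
```

For example, soa_addr(SPR_DX, 5) is 0xC505: member dx of sprite 5.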


comparison: move sprites by adding dx,dy. to keep the example short, the code handles only x, and only an 8bit fractional dx, so dx=255 means a speed of 255/256, about 0.996.

	ld l,0		;L = struct index = 0
	ld b,spr_count
loop:
8	ld h,spr_dx	;ld h: select random struct member
8	ld a,(hl)
8	ld h,spr_xlo	;ld h: select random struct member
8	add (hl)
8	ld (hl),a
8	ld h,spr_x	;ld h: select random struct member
8	ld a,(hl)
8	adc 0
8	ld (hl),a

5	inc l		;inc l: next struct
--
77 cycles total

	djnz loop
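
For reference, here is what the loop computes, written as a small C model (the array and function names are assumptions, not part of the original code): xlo is the fractional byte, x the pixel byte, and the carry out of xlo + dx bumps x.

```c
#include <stdint.h>

/* One 256-byte array per member; in the Z80 version each sits on its own page. */
static uint8_t spr_x[256], spr_xlo[256], spr_dx[256];

/* Add the 8-bit fractional speed dx to the 8.8 fixed-point position x:xlo
   for sprites 0..count-1. */
static void move_sprites_x(uint8_t count)
{
    for (uint8_t i = 0; i < count; i++) {
        uint16_t sum = (uint16_t)(spr_xlo[i] + spr_dx[i]);
        spr_xlo[i] = (uint8_t)sum;          /* ADD (HL) then LD (HL),A */
        spr_x[i]  += (uint8_t)(sum >> 8);   /* ADC 0: add the carry into x */
    }
}
```

Unlike DJNZ, a count of 0 here means no iterations rather than 256.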


and now in classic 16bit comfort style, using IX, array of struct:

C000 spr

C000 spr_x
C001 spr_y
C002 spr_c
C003 spr_p
C004 spr_xlo
C005 spr_dx
C006 spr_ylo
C007 spr_dy

	ld ix,spr
	ld b,spr_count
	ld de,8 	;size of struct
loop:
15	ld a,(ix + dx)  
15	add (ix + xlo)
15	ld (ix + xlo),a

15	ld a,(ix + x)
8	adc 0
15	ld (ix + x),a

16	add ix,de	;next struct in 16 cycles AND waste DE for sizeof struct offset
--
99 cycles total

	djnz loop
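
The same update in the array-of-struct layout, as a C sketch (the struct and function names are assumptions): stepping to the next sprite means adding sizeof(struct sprite) to a full pointer, which is the job ADD IX,DE does in the listing above.

```c
#include <stddef.h>
#include <stdint.h>

/* Array-of-struct layout: members in the order given in the post. */
struct sprite {
    uint8_t x, y, c, p;        /* the four bytes reserved for the 9918 */
    uint8_t xlo, dx, ylo, dy;
};

/* x:xlo += dx for count consecutive sprites, advancing a full pointer each step. */
static void move_sprites_x_aos(struct sprite *spr, uint8_t count)
{
    for (; count != 0; count--, spr++) {
        uint16_t sum = (uint16_t)(spr->xlo + spr->dx);
        spr->xlo = (uint8_t)sum;
        spr->x  += (uint8_t)(sum >> 8);
    }
}
```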


slow, must optimize:

C000 spr

C000 spr_x
C001 spr_y
C002 spr_c
C003 spr_p
C004 spr_xlo
C005 spr_dx
C006 spr_ylo
C007 spr_dy

	ld b,spr_count
	ld hl,spr_xlo	;main set: HL walks xlo (dx sits at +1)
	ld de,8 	;size of struct
	exx
	ld hl,spr_x	;alternate set: HL walks x
	ld de,8 	;size of struct
	exx
loop:
8	ld a,(hl)	;xlo
8	inc hl
8	add (hl)	;dx
8	dec hl
8	ld (hl),a	;xlo

5	exx
8	ld a,(hl)
8	adc 0
8	ld (hl),a	
11	add hl,de	;next struct
5	exx
	
11	add hl,de	;next struct
--
96 cycles total

	djnz loop

rewrite everything to use HL, and result is just a few cycles less?

sidenote: if you moved the x member closer to xlo,dx to get faster member addressing,
you would wreck the 9918 header. sending the bytes to the VDP then gets slow,
just to make this code look good. this would be kind of cheating.
and sooner or later, members will be further away than just an INC HL.

-- summary: --

z80 was always called the nice 16bit comfort in comparison to 6502 and also 8080.
but if you want really fast code, z80 is the hardest cpu:
register allocation including EXX is a lot of work with lots of special purpose registers,
8/16 bit multiparadigm, HL/IX multiparadigm.

but the 256 byte aligned array style (where applicable)
is fast and also easy to program once you have wrapped your head around it.
game enemy structs are a very good candidate for this paradigm, what do you think?

it turns your z80 into a faster machine that is more comfortable to program :D

-- APPENDIX: limits, null pointer, pointer types etc --

limits: you can have max 255 objects of the same struct type.
max 255 sprites, max 255 charset BOBs, etc... I think 255 sprites is ok ;)

null pointer: one prefers to start at index L = 0, so L = 0xFF is a good candidate for the null pointer.
with between 0 and 255 objects, the number of objects fits in 8 bits.
the indexes of 255 objects go from L = 0 to L = 254, and L = 255 is the NULL pointer, nice match.
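
A minimal C sketch of the 0..254 / 0xFF convention (the spr_used flag array and the spr_alloc helper are hypothetical, only there to show the index range and the null value):

```c
#include <stdint.h>

#define SPR_NULL 0xFF          /* index 255 is reserved as the "null pointer" */

static uint8_t spr_used[256];  /* hypothetical per-sprite "in use" flag array */

/* Return the first free index in 0..254, or SPR_NULL if all are taken. */
static uint8_t spr_alloc(void)
{
    for (uint8_t i = 0; i < SPR_NULL; i++) {
        if (!spr_used[i]) {
            spr_used[i] = 1;
            return i;
        }
    }
    return SPR_NULL;
}
```

The 256th call returns SPR_NULL, so both the index and the object count stay within one byte.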

variable name style: to avoid confusion, name variables struct_member,
so you are always aware which array belongs to which struct:

spr_x equ 0xC0
spr_y equ 0xC1
bob_x equ 0xC8
bob_y equ 0xC9

an index meant to be used for spr has no meaning in bob, the variable naming helps here too.

storing only 8bit indexes in most cases is as useful as full pointers in classic style.
with index in L, you need to know the type of the struct to get a full address:
nothing really new, in classic style you too need to know the struct type to make sane struct offsets.
example: index in L, you know that it is a spr, so the full address is made with LD H,spr_x (taking the first member of the struct).

exception: a code capable of working on x,y members, no matter whether it is a spr or a bob.
in this case you need to pass the full pointers to the code in question.

setxy0: ;set members x and y to 0. full address in HL
	ld (hl),0	;member x = 0
	inc H		;to next member in 5 cycles: inc HI byte of address.
	ld (hl),0	;member y = 0
	ret

this is kind of a mixed mode paradigm: you work with full pointers,
but the next byte of the struct is still 256 bytes further.
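
The same routine as a C model (set_xy0 and MEMBER_STRIDE are illustrative names): the pointer is a full address, and the next member is reached by adding 256 instead of 1.

```c
#include <stdint.h>

enum { MEMBER_STRIDE = 256 };   /* distance between consecutive members */

/* C model of setxy0: p is a full pointer to member x of some object;
   member y of the same object lives one page (256 bytes) further on. */
static void set_xy0(uint8_t *p)
{
    p[0] = 0;                   /* member x = 0 */
    p[MEMBER_STRIDE] = 0;       /* INC H, then member y = 0 */
}
```

Since it only relies on y following x by one page, it works for spr and bob alike.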


By PingPong

Prophet (4093)

02-08-2010, 08:50

A few considerations:
- If you do not need full 16-bit addressing, why use the IX register? It's better to use 256-byte aligned structs.
- The Z80 gives you the ability to point at a structure using the IX register plus a 16-bit offset. Slow, but comfortable. Not a 'panacea', however.
- The 6502 has sophisticated addressing modes, yes, but they are there only to compensate for the lack of registers.

Writing a memory move routine without the 256-byte limit on the 6502 (like an LDIR on the Z80) takes a lot more effort, and it runs like a dead snail.

Summary: for these 8-bit beasts not all operations are easy. They require careful planning. They require tricks.

;)

By PingPong

Prophet (4093)

02-08-2010, 16:39

Sorry, the value added to IX is an 8-bit offset, not a 16-bit one.
Plus, if the data is 256-byte aligned and sized, inc hl can become inc l, which is a bit faster.

By hit9918

Prophet (2927)

02-08-2010, 23:03


Quote:
The 6502 has sophisticated addressing modes, yes, but they are there only to compensate for the lack of registers.

The 6502 version of the code:

4	lda spr_dx,x
2	clc 		;clear carry
4	adc spr_xlo,x
4	sta spr_xlo,x
4	lda spr_x,x
2	adc #0		;immediate 0: add the carry into x
4	sta spr_x,x
--
24 cycles total

The classically programmed 3.57 MHz MSX took 99 cycles; a 1.79 MHz NES does this in 48 MSX cycles:
the lack of registers, the lack of 16bit - more than compensated, twice as fast! oO

Things are different with the C64, however: in real life the shared video memory
needs the 6502 throttled to half speed. I think this is an actual 6502 weakness.

The C64 needs 86.4 MSX cycles, still faster than the MSX in 16bit style - the biggest
surprise given that the C64 CPU runs at half speed. However, you can beat it with 77 cycles
if you program the Z80 in 6502 style, oh the irony ;)
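
The conversion behind these figures is a plain clock ratio. A minimal C sketch, assuming rounded clock values (3.58 MHz for the MSX, 1.79 MHz for the NES, about 1 MHz for the C64; the function name is made up):

```c
/* Clock rates in MHz; rounded values, taken as assumptions for this sketch. */
#define MSX_MHZ 3.58
#define NES_MHZ 1.79
#define C64_MHZ 1.00

/* Convert a cycle count on a CPU running at cpu_mhz into the number of
   MSX Z80 cycles that pass in the same time. */
static double to_msx_cycles(double cycles, double cpu_mhz)
{
    return cycles * MSX_MHZ / cpu_mhz;
}
```

With the 24-cycle 6502 routine this gives 48 MSX cycles on the NES and roughly 86 on the C64, to be compared with the 99 and 77 cycles of the two Z80 versions.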

As I load H and L, I wonder whether this is the style the makers of 8080 had in mind? my favourite is this:


;struct list { list *next; list *prev; bla bla }
list_next equ 0xC0
list_prev equ 0xC1
list_bla equ 0xC2

;HL points to a list item (H = 0xC0). get next item.

ld l,(hl)

;DONE!

followed a pointer in 8 cycles B-)
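
In C terms the "pointer" is just a byte index into a next[] array. A small sketch (using 0xFF as the end-of-list marker is an assumption borrowed from the null pointer convention; the names are illustrative):

```c
#include <stdint.h>

#define LIST_NULL 0xFF               /* assumed end-of-list marker */

static uint8_t list_next[256];       /* page 0xC0 in the Z80 version */

/* Count the items reachable from head; each step is the C equivalent of LD L,(HL). */
static unsigned list_length(uint8_t head)
{
    unsigned n = 0;
    for (uint8_t i = head; i != LIST_NULL; i = list_next[i])
        n++;
    return n;
}
```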

By ARTRAG

Enlighted (6923)

02-08-2010, 23:38

hit9918
If I understand correctly:

inc/dec H would change field
inc/dec L would change record

I see the pros, but your solution ties up 256*N bytes of RAM (where N is the number of fields in the sprite record), most of it wasted, even if I have only 20 sprites... quite a lot.

By PingPong

Prophet (4093)

03-08-2010, 12:36



Quote:
The 6502 has sophisticated addressing modes, yes, but they are there only to compensate for the lack of registers.

The 6502 version of the code:

4	lda spr_dx,x
2	clc 		;clear carry
4	adc spr_xlo,x
4	sta spr_xlo,x
4	lda spr_x,x
2	adc #0		;immediate 0: add the carry into x
4	sta spr_x,x
--
24 cycles total

The classically programmed 3.57 MHz MSX took 99 cycles; a 1.79 MHz NES does this in 48 MSX cycles:
the lack of registers, the lack of 16bit - more than compensated, twice as fast! oO

As I said, a lot depends on what one is doing.


Quote:
Things are different with the C64, however: in real life the shared video memory
needs the 6502 throttled to half speed. I think this is an actual 6502 weakness.

I don't agree. The C64 CPU is not stopped that much. Every 8th scanline there is a 40-cycle steal, but otherwise the 6510 runs at full speed (1 MHz, of course). Sprites take an additional 3*number_of_sprites_on_current_scanline cycles, needed to fetch the sprite data.

(Regarding this sprite time: I would like to point out how much better the VIC's sprite implementation is compared with the crappy TMS one:
- Max number of accesses per scanline: 3*8 sprites = 24 accesses at full load, without sprite "erasing"
- The TMS needs 32 accesses only to *KNOW* which sprites it should display!!!!!!! :x, then 6 accesses for each of 4 sprites = 24)
- The VIC does better with a lot less work.

Assuming an average of 4 sprites on every scanline of the entire screen (which is unlikely), every scanline has a penalty of 4*3 + (40/8) = 12 + 5 = 17 stolen cycles. A scanline lasts 64 us, that is 64 cycles. So, even in this unlikely scenario, the CPU is stopped for 17 out of 64 cycles, and only in the active area. It is not halved.
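
The same estimate as a small C sketch, using only the assumptions stated in this post (3 cycles per sprite, a 40-cycle steal every 8th line, 64 cycles per line; the function names are made up):

```c
/* Cycles stolen per scanline: 3 cycles per sprite on the line, plus the
   40-cycle steal of every 8th line averaged over those 8 lines. */
static int stolen_per_line(int sprites_on_line)
{
    return 3 * sprites_on_line + 40 / 8;
}

/* Stolen share of a scanline in percent, for a line of 64 CPU cycles. */
static double stolen_percent(int sprites_on_line)
{
    return 100.0 * stolen_per_line(sprites_on_line) / 64.0;
}
```

For 4 sprites per line that is 17 cycles, a bit under 27% of the line.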

The real weakness is the *EXTREMELY COUPLED* timing between the VIC-II and the 6510. It is not easy to double the CPU speed without getting into trouble with the VIC-II. In fact the C128 must disable the VIC to double the CPU speed, otherwise VIC accesses would always clash with CPU ones.

:P

By hit9918

Prophet (2927)

03-08-2010, 16:40

@ARTRAG:

Quote:
hit9918
If I understand correctly:

inc/dec H would change field
inc/dec L would change record

Yes, struct member = select field = H, index = select record = L.


Quote:
I see the pros, but your solution ties up 256*N bytes of RAM (where N is the number of fields in the sprite record), most of it wasted, even if I have only 20 sprites... quite a lot.

This is where 6502 shines again, because the 8bit x y registers are relative to an arbitrary 16bit base in the opcode.
However I see a solution:

spr_x equ 0xC0
spr_y equ 0xC1
bob_x equ 0xC0
bob_y equ 0xC1

the x of sprites and bobs are in the same array C000 - C0FF.
now let the indexes (L register) of sprites go from 0 to 31, bobs from 32 to 64.
you lose the feature that the index always starts at 0. you do NOT lose any of the INC L and LD H,member features.

so this leads to the style recommendation not to do LD L,0 like in the intro examples,
but LD A,(spr_start) : LD L,A. With such code you can pack arrays later.

Take care that all arrays meant to belong to the same struct,
e.g. spr_x, spr_y, spr_c... must share the same index range.
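
A C sketch of the packed arrangement (the range variables and the helper are assumptions; the ranges used here are example values, sprites 0..31 and bobs 32..63): both object kinds share one page per member, and code takes its start index from a variable instead of assuming 0.

```c
#include <stdint.h>

static uint8_t obj_x[256];                      /* shared page: spr_x and bob_x are both 0xC0 */

/* Hypothetical range variables; code loads them instead of assuming index 0. */
static uint8_t spr_start = 0,  spr_end = 32;    /* sprites use 0..31 */
static uint8_t bob_start = 32, bob_end = 64;    /* bobs use 32..63   */

/* Set x of every object in [start, end); the loop body is the same for both kinds. */
static void set_x_range(uint8_t start, uint8_t end, uint8_t value)
{
    for (uint8_t i = start; i != end; i++)
        obj_x[i] = value;
}
```

Moving the bobs elsewhere later only means changing bob_start and bob_end.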

I favour a model where everything would be in one store, i.e. sprites and bobs share the same struct.
In a bob the p variable is not sent to VDP sprites and could contain a bob graphics selector.

mem consumption example, a bullet struct with fixed-point speed is a big one: x,y,c,p,xlo,ylo,dx,dy,dxlo,dylo

10 bytes. I would need a 2560 byte area for my variable store.
That is not asking too much for a Salamander with 4 options and hell action, is it? :)
If I limit the max index for critters (sprites + bobs) to 128,
I have 10 arrays of 128 bytes left for further "misc" arrays in this 2560 byte area.
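
The arithmetic, spelled out as a C sketch (function names are illustrative):

```c
/* Bytes needed for a struct-of-arrays store: one 256-byte page per member. */
static unsigned store_bytes(unsigned members)
{
    return members * 256u;
}

/* Spare bytes left on each page when only indexes 0..max_objects-1 are used. */
static unsigned spare_per_page(unsigned max_objects)
{
    return 256u - max_objects;
}
```

store_bytes(10) is 2560, and with 128 critters each of the 10 pages keeps 128 bytes free, 1280 bytes in total.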

By hit9918

Prophet (2927)

03-08-2010, 17:03

C000 spr

C000 spr_x
C001 spr_y
C002 spr_c
C003 spr_p
C004 spr_xlo
C005 spr_dx
C006 spr_ylo
C007 spr_dy

;HL points to dx, and D = H and B = H

8 ld a,xlo-dx
5 add l
5 ld e,a
8 add x-xlo
5 ld c,a

8 ld a,(de) ;xlo
8 add (hl) ;dx
8 ld (de),a ;xlo

8 ld a,(bc) ;x
8 adc 0
8 ld (bc),a ;x

8 ld a,sizeofstruct
5 add l
5 ld l,a

--
97 cycles total

another attempt with the multiparadigm Z80.
it is amazing how, after a huge effort doing everything differently, I get practically identical cycles :)

By hit9918

Prophet (2927)

03-08-2010, 21:29

@PingPong

We are having a rare discussion: you speak for z80 and VIC, and I speak for 6502 and 9918 :)

I recently have become a strong 9918 fan: because of texas instruments sprite gameplay.
I was bored by Amiga(!) games like Katakis, Xenon Megablast, XR35, because they lack gradius gameplay.

I now understand this happened because other homecomputers got no texas instruments sprites.
This results in different gameplay style.

When in a hard gradius gameplay situation the flicker flicker monochrome sprite MSX 1 got more than 8 sprites per scanline,
the following just happened:
#1 poor sprite demo
#2 C64 is INCAPABLE of this gameplay.

The C64 cannot flicker. Sprites must be sorted by Y.
rotate the list before the sort: no effect. rotate after the sort: you just wrecked the sort.

C64 gameplay is crippled even further by multiplexer Y regions.

http://www.youtube.com/watch?v=RoVXdhIeZ_g "sprite multiplexer".
this does look like an attempt at a general purpose multiplexer without Y region limits. 32 sprites take 50% cpu.


Quote:
(Regarding this sprite time: I would like to point out how much better the VIC's sprite implementation is compared with the crappy TMS one:
- Max number of accesses per scanline: 3*8 sprites = 24 accesses at full load, without sprite "erasing"
- The TMS needs 32 accesses only to *KNOW* which sprites it should display!!!!!!! :x, then 6 accesses for each of 4 sprites = 24)
- The VIC does better with a lot less work.

I can feel your pain ;)
I did run the same calculation of the 3*8 = 24 bytes versus 32 bytes just for Y. ouch ._,

What you get: explained in Commodore Terminology: MSX 1 got 4 sprites and a HARDWARE sprite multiplexer B-)

Additionally the hardware multiplexer got the feature that it does not scramble the priority,
which is needed for going beyond the scanline limit alias flicker.

MSX is the only homecomputer with texas instruments sprites. Amiga sprites too got Y region pain.
TI sprites give you random movable sprites - NO bad influence on gameplay style - at very low cpu.
This is why the humble monochrome sprites of MSX games got tricky moves and crisp collision detection -
not only Gradius, but in general.

By PingPong

Prophet (4093)

04-08-2010, 00:23

@hit9918

Quote:
@PingPong

....
When in a hard gradius gameplay situation the flicker flicker monochrome sprite MSX 1 got more than 8 sprites per scanline,
the following just happened:
#1 poor sprite demo
#2 C64 is INCAPABLE of this gameplay.

The C64 cannot flicker. Sprites must be sorted by Y.
rotate the list before the sort: no effect. rotate after the sort: you just wrecked the sort.

C64 gameplay is crippled even further by multiplexer Y regions.

Maybe C64 sprite subsystem is incapable etc. but look at this at 0:08

http://www.youtube.com/watch?v=oyf2W239eRY&feature=related

not sure about the main monster but...
Now, tell me how to do the same with TMS.....
;)

By hit9918

Prophet (2927)

04-08-2010, 23:51

@PingPong

Quote:
@hit9918
Maybe C64 sprite subsystem is incapable etc. but look at this at 0:08

http://www.youtube.com/watch?v=oyf2W239eRY&feature=related

not sure about the main monster but...
Now, tell me how to do the same with TMS.....
;)

Real Gradius tank bullets really do target you and don't have the wrong direction and speed.
Now, tell me how to do the same with C64 ;)

The two different sprite hardware approaches strongly influence the kind of games that can be made.
Discussing the fat C64 monster sprites is easy. Arguing lack of gameplay is not so easy.
