Grauw’s RPG in development

Page 8/22

By PingPong

Prophet (3460)

02-04-2018, 11:20

Grauw wrote:

Interesting test! I just tried it, commented out the jr c,WaitReady in VDPCommand_Execute_HL, and the CPU time of the diagonal tile rendering code goes down from 33.55% (11.18 ms) to 26.90% (8.97 ms). Looks like it’s definitely VDP bound there for 2.22 ms while drawing the 32 tile fragments.

@Grauw: what is the size (W x H) of the command you are sending to the VDP during diagonal scrolling? 248 cycles is about a scanline!
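
For reference, the 248-cycle figure can be reproduced from the numbers in the quote. This is a minimal sketch, assuming a 3.579545 MHz Z80 clock and that the profiler percentages are relative to one 30 fps frame (1/30 s); the constant and function names are made up for illustration:

```python
# Assumed constants (not stated in the thread): standard MSX Z80 clock,
# and profiler percentages relative to one 30 fps frame.
Z80_HZ = 3_579_545
FRAME_S = 1 / 30

def percent_to_ms(percent):
    """Convert a share of the 30 fps frame to milliseconds."""
    return percent / 100 * FRAME_S * 1000

def ms_to_cycles(ms):
    """Convert milliseconds to Z80 clock cycles."""
    return ms / 1000 * Z80_HZ

with_wait = percent_to_ms(33.55)        # about 11.18 ms
without_wait = percent_to_ms(26.90)     # about 8.97 ms
waiting_ms = with_wait - without_wait   # about 2.22 ms spent waiting on the VDP
per_fragment = ms_to_cycles(waiting_ms) / 32  # spread over the 32 tile fragments

print(f"{with_wait:.2f} ms, {without_wait:.2f} ms, {waiting_ms:.2f} ms")
print(f"{per_fragment:.0f} cycles of waiting per fragment")
```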

By Grauw

Ascended (8515)

02-04-2018, 13:13

I posted the vdpcmdtrace of all the copies I do each 30 fps frame here:

VDPCmd  YMMM-IMP (0,782)->(0,1008),0 [256,2]   -- player sprite patterns copy
VDPCmd  HMMM-IMP (88,560)->(200,16),0 [4,16]   -- tile copies for diagonal scroll
VDPCmd  HMMM-IMP (8,512)->(200,32),0 [4,16]    -- ...
VDPCmd  HMMM-IMP (8,512)->(200,48),0 [4,16]
VDPCmd  HMMM-IMP (8,512)->(200,64),0 [4,16]
VDPCmd  HMMM-IMP (8,512)->(200,80),0 [4,16]
VDPCmd  HMMM-IMP (8,512)->(200,96),0 [4,16]
VDPCmd  HMMM-IMP (24,608)->(200,112),0 [4,16]
VDPCmd  HMMM-IMP (8,512)->(200,128),0 [4,16]
VDPCmd  HMMM-IMP (8,512)->(200,144),0 [4,16]
VDPCmd  HMMM-IMP (88,512)->(200,160),0 [4,16]
VDPCmd  HMMM-IMP (88,544)->(200,176),0 [4,16]
VDPCmd  HMMM-IMP (88,528)->(200,192),0 [4,16]
VDPCmd  HMMM-IMP (88,560)->(200,208),0 [4,16]
VDPCmd  HMMM-IMP (8,512)->(200,224),0 [4,16]
VDPCmd  HMMM-IMP (8,512)->(200,240),0 [4,16]
VDPCmd  HMMM-IMP (8,512)->(200,0),0 [4,16]
VDPCmd  HMMM-IMP (84,544)->(212,16),0 [4,16]
VDPCmd  HMMM-IMP (4,512)->(228,16),0 [4,16]
VDPCmd  HMMM-IMP (4,512)->(244,16),0 [4,16]
VDPCmd  HMMM-IMP (4,512)->(4,16),0 [4,16]
VDPCmd  HMMM-IMP (4,512)->(20,16),0 [4,16]
VDPCmd  HMMM-IMP (4,512)->(36,16),0 [4,16]
VDPCmd  HMMM-IMP (4,512)->(52,16),0 [4,16]
VDPCmd  HMMM-IMP (4,512)->(68,16),0 [4,16]
VDPCmd  HMMM-IMP (4,512)->(84,16),0 [4,16]
VDPCmd  HMMM-IMP (4,512)->(100,16),0 [4,16]
VDPCmd  HMMM-IMP (4,512)->(116,16),0 [4,16]
VDPCmd  HMMM-IMP (4,512)->(132,16),0 [4,16]
VDPCmd  HMMM-IMP (4,512)->(148,16),0 [4,16]
VDPCmd  HMMM-IMP (4,512)->(164,16),0 [4,16]
VDPCmd  HMMM-IMP (4,512)->(180,16),0 [4,16]
VDPCmd  HMMM-IMP (4,512)->(196,16),0 [4,16]
VDPCmd  YMMM-IMP (0,984)->(0,976),0 [256,4]    -- sprite colour table copy

Those 4x16 copies should take about 700 cycles to complete.

Might be nice to extend the profiler script at some point to show a chart like this, including a bar for the VDP commands executing…

By PingPong

Prophet (3460)

02-04-2018, 15:45

Sorry, I didn't notice this: "VDPCmd HMMM-IMP (88,560)->(200,16),0 [4,16] -- tile copies for diagonal scroll"
So those are the ones!
The VDP continues to surprise me with its slowness. I think the overhead is bigger because of the small width x "larger" height format...
It is moving a byte every 20 Z80 T-states. The Z80 can do better with unrolled OUTI :-(

By DarkSchneider

Paladin (880)

02-04-2018, 15:57

That is a problem, but there is no solution. The mask only gives you an 8px black bar, so the scroll process spreads the work by drawing 4px-wide strips. Maybe a 16px mask would have been better. And with blocks that small, the only thing you can fit between those 4x16 copies is setting up the next command.

By Grauw

Ascended (8515)

02-04-2018, 16:57

If the mask were 16 pixels wide I would still need to do sixteen 16x16 copies within one frame. A mask of 32 would give me sufficient buffer to do a single copy per pixel scrolled, but that would sacrifice a bit too much horizontal screen space :D. For that one should just use the 2-page horizontal scroll mode, at the cost of VRAM. So my use of 4x16 copies is due to my choice not to use the 2-page scroll mode.

But I’m not too bothered by the copy speed currently. It’s nice at least that it executes in parallel, and the CPU isn’t waiting for it excessively much, just a bit. If I had to move all that data with the CPU (which I did consider earlier) I would’ve been in a lot more trouble with my frame time!

By Grauw

Ascended (8515)

02-04-2018, 17:00

Grauw wrote:

If I had to move all that data with the CPU (which I did consider earlier) I would’ve been in a lot more trouble with my frame time!

Thinking about this a bit more… I just measured that currently each tile spends ~900 cycles in VDP command set-up, wait and execution code. If I replaced that with a CPU->VRAM transfer via HMMC and stored the tiles in memory as both 16x4 and 4x16 data, I could OUTI those 32 bytes straight. With the math and paging overhead, it would probably end up at a comparable speed.

So for me I don’t think it’s interesting to pursue that approach currently. However maybe it’s a more attractive proposition in screen 7, 8 or 11, because you wouldn’t need to store the tile set in VRAM (at the cost of 128K RAM/ROM memory for 256 tiles).
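
The 128K figure can be checked with some quick arithmetic. This is a rough sketch, assuming 16x16 tiles, one byte per pixel in screen 8 and 11, half a byte per pixel in screen 7, and each tile stored twice (once per strip layout); the function name is hypothetical:

```python
def tileset_bytes(tiles, bytes_per_pixel, layouts=2, tile_w=16, tile_h=16):
    """RAM/ROM needed to keep every tile in each strip layout."""
    return int(tiles * tile_w * tile_h * bytes_per_pixel * layouts)

screen8_kb = tileset_bytes(256, 1) // 1024    # screen 8 / 11
screen7_kb = tileset_bytes(256, 0.5) // 1024  # screen 7

print(screen8_kb)  # 128
print(screen7_kb)  # 64
```

Under these assumptions the 128K applies to the one-byte-per-pixel modes; screen 7 would need half of that.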

PingPong wrote:

I think the overhead is bigger because of the small width x "larger" height format...

The overhead of four 4x16 copies (640 cycles each) compared to one 16x16 copy (2048 cycles) is 25%. [1]
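
Spelled out, the 25% comes from needing four 4x16 copies to cover the area of one 16x16 copy. A quick check using only the two cycle counts from the post, plus the assumption of 4 bits per pixel (so that a 4x16 fragment is the 32 bytes mentioned earlier):

```python
CYCLES_4X16 = 640      # one 4x16 copy, from the post
CYCLES_16X16 = 2048    # one 16x16 copy, from the post
BYTES_PER_PIXEL = 0.5  # assumption: 4 bits per pixel

strips = 16 // 4       # four 4px-wide strips per 16px-wide tile
overhead = strips * CYCLES_4X16 / CYCLES_16X16 - 1
per_byte_4x16 = CYCLES_4X16 / (4 * 16 * BYTES_PER_PIXEL)
per_byte_16x16 = CYCLES_16X16 / (16 * 16 * BYTES_PER_PIXEL)

print(f"overhead: {overhead:.0%}")  # overhead: 25%
print(f"{per_byte_4x16:.0f} vs {per_byte_16x16:.0f} cycles per byte")  # 20 vs 16
```

The 20 cycles per byte for the narrow copy matches PingPong's "a byte every 20 T-states".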

By MOA

Champion (293)

02-04-2018, 17:59

I did a test with a custom ISR, using im 2, to see what happens if the ISR always makes sure s#2 is selected, since that reduces the VDP wait code to something much tighter:

; with our custom ISR, s#2 is always selected instead of s#0
WaitReady:
	in a,(VDP_PORT_1)
	rra
	jr c,WaitReady

It speeds things up quite a bit, so you can try it out:

ISR (I placed it behind Application_Main):

	org #4040
Application_ISR:
	push af
	xor a
	out (VDP_PORT_1),a  ; select s#0
	ld a,15|128
	out (VDP_PORT_1),a
	in a,(VDP_PORT_1)      ; read s#0
	and a                  ; does INT originate from VDP (b7=1 - True)
	ld a,2			; select s#2 for fast VDP command ready checks
	out (VDP_PORT_1),a
	ld a,15|128
	out (VDP_PORT_1),a
	jp p,notFromVDP        ; no vdp interrupt
	
	push bc
	push de
	push hl
	push ix
	push iy
	exx
	ex af,af'
	push af
	push bc
	push de
	push hl
	call H.TIMI
	pop hl
	pop de
	pop bc
	pop af
	ex af,af'
	exx
	pop iy
	pop ix
	pop hl
	pop de
	pop bc	

notFromVDP:
	pop af
	ei
	reti

Custom ISR setup code:

	; custom ISR
	ld a,#e0  ; ivec table @ #e000..#e100
	ld i,a
	ld bc,256
	ld h,a ; #e000
	ld l,c
	ld d,h ; #e001
	ld e,b
	ld (hl),#40 ; ISR routine @ #4040
	ldir
	im 2	
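
Why 257 identical bytes: in IM 2 the Z80 builds the vector address from the I register (high byte) and whatever byte is on the data bus (low byte), and on MSX that bus value is not guaranteed. A small Python simulation of the setup code above (not Z80 code; the helper names are invented for this sketch) shows that every possible bus value ends up at #4040:

```python
def ldir(mem, hl, de, bc):
    """Emulate Z80 LDIR: copy bc bytes from hl to de, ascending."""
    while bc:
        mem[de] = mem[hl]
        hl, de, bc = hl + 1, de + 1, bc - 1

def im2_target(mem, i_reg, bus_byte):
    """Address the CPU jumps to in IM 2 for a given data-bus byte."""
    vec = (i_reg << 8) | bus_byte
    return mem[vec] | (mem[vec + 1] << 8)  # little-endian pointer

mem = bytearray(0x10000)
mem[0xE000] = 0x40              # ld (hl),#40
ldir(mem, 0xE000, 0xE001, 256)  # hl=#e000, de=#e001, bc=256

targets = {im2_target(mem, 0xE0, bus) for bus in range(256)}
print([hex(t) for t in targets])  # ['0x4040']
```

Because source and destination overlap by one byte, the LDIR smears the first #40 across #e001..#e100, which is why the ISR has to sit at an address whose high and low bytes are equal.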

By Grauw

Ascended (8515)

02-04-2018, 20:39

Cool, I tried it, looks like it saves 2% CPU time (0.67 ms, 2400 cycles) when scrolling diagonally. Good info.

I happened to implement a custom ISR with IM2 two days ago, for screensplits… :) Pre-selecting status register 1 may be needed to get them tight, but let’s see.

By MOA

Champion (293)

03-04-2018, 02:49

Playtime is over, as my long Easter weekend has come to an end and real-life games programming/optimizing is back on the table :)

Good luck with your project; it has real potential. (My own attempt at such an engine, a long, long time ago, tried a two-layer approach so that the sprites could move behind trees, walls, etc. The engine would update the sprite pattern tables depending on what was in the foreground. Layer drawing was also made cheaper by allowing certain tiles to be simple LINE,BF-type VDP commands, but that's something that has to fit the art style of the game.)

My final results (43+% free CPU time when scrolling diagonally):

Idle:
[image: idle3]

The boost to idle mode is because you have a minor bug in your code: you still do tile collision testing when the player doesn't move and I fixed it in my local version.

Move diagonally:
[image: movediag2]

Move vertically:
[image: movevert2]

Move horizontally:
[image: movehor2]

By Grauw

Ascended (8515)

03-04-2018, 10:33

MOA wrote:

Playtime is over, as my long Easter weekend has come to an end and real-life games programming/optimizing is back on the table :)

Haha, cool though that you checked it out :).

MOA wrote:

Good luck with your project; it has real potential. (My own attempt at such an engine, a long, long time ago, tried a two-layer approach so that the sprites could move behind trees, walls, etc. The engine would update the sprite pattern tables depending on what was in the foreground. Layer drawing was also made cheaper by allowing certain tiles to be simple LINE,BF-type VDP commands, but that's something that has to fit the art style of the game.)

Thanks. I’ve thought about that, but it seems a bit complicated to do with sprites. One approach would be to store bitmasks for the terrain and manually mask out the sprite patterns every frame. Seems expensive though, since it needs to process 256 bytes of pattern data on the 60 fps loop.

For now my plan is to simply disallow the player from moving behind objects. If I want to have some narrow overhang in specific places (like an archway over a 1-tile wide passage) I can put a static sprite object there. Up to four sprites per line remain for those kinds of things.

MOA wrote:

My final results (43+% free CPU time when scrolling diagonally):

Nice :). So let’s see if I understood correctly how you got there...

3% frame time from inlining stuff. Esp. in the sprite attribute update it seems there’s over a whole percent gained. I think it might be a good idea to start using macros for my getter functions.

2% frame time from copy wait loops defaulting to status register 2.

2% frame time from... other general optimisations. I especially see 1% gain in the player move update / collision handling. More inlining?

MOA wrote:

The boost to idle mode is because you have a minor bug in your code: you still do tile collision testing when the player doesn't move and I fixed it in my local version.

Actually that’s intentional... I don’t optimise for those cases and just always run it, because it reduces the amount of variation in code paths, and I need to optimise for the worst case anyway. I prefer things to just always run so that I have a constant budget and the meters don’t jump so much. I might even start doing all the scrolling copies when idling!

Maybe at some point if I want to do some framerate-dependent things (like doing more tile animations if you’re idle) I would optimise those things, but for now I think it’s more beneficial to have the frame budget allocation be constant, so I can be sure there will be no frame drops.

One thing that can be optimised though is that collision only needs to be checked every other frame, because player input is sampled on the "slow tick", so as far as the game is concerned the player sprite only moves in steps of 4 pixels per frame. Should halve the time spent there, saving about 3%.
