Tile drawing performance tests

By Grauw

Ascended (9379)

17-03-2016, 23:48

I did some performance tests drawing 16x16 blocks in screen 5 with HMMM at 60 Hz (16.68 ms / frame), starting at VBLANK.

Drawing 12 16x16 tiles:

Sprites: 10.42 ms (63%)
No sprites: 9.05 ms (54%)

Drawing 16 16x16 tiles:

Sprites: 14.19 ms (85%)
No sprites: 12.08 ms (72%)

Additionally, I did a test copying large horizontal blocks, which should be faster because the VDP operates more efficiently. This is due to the reduced number of draw calls (during which it is idle) and line wraps (which slow it down a bit).

Drawing 192x16 block:

Sprites: 8.59 ms (52%)
No sprites: 7.05 ms (42%)

Drawing 256x16 block:

Sprites: 11.86 ms (71%)
No sprites: 9.37 ms (56%)

Interesting to see that it makes such a big difference. It shows that when implementing a tile engine, it’s well worth writing some extra code that tries to combine adjacent copies on horizontal lines. When the whole row moves by one tile, breaking up the copy to skip unchanged tiles (a common optimisation) is only worth it when more than 2 tiles are unmodified.
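
The merging idea can be sketched as follows. This is only an illustration in Python, not code from an actual tile engine: the function name `merge_dirty_runs`, the boolean row representation and the `max_gap` parameter are all assumptions made for the example. Gaps of up to `max_gap` unchanged tiles are copied along with their neighbours; going by the 16x16 measurements above, that threshold would be 2.

```python
def merge_dirty_runs(dirty, max_gap):
    """Group the dirty tile columns of one row into horizontal copy runs.

    dirty   -- list of booleans, True where the tile must be redrawn
    max_gap -- longest stretch of unchanged tiles that is still copied
               along with its neighbours instead of splitting the run
    Returns a list of (first_column, tile_count) tuples.
    """
    runs = []
    start = None   # first column of the run being built
    last = None    # last dirty column seen in that run
    for col, is_dirty in enumerate(dirty):
        if not is_dirty:
            continue
        if start is None:
            start = last = col
        elif col - last - 1 <= max_gap:
            last = col            # gap is small enough: extend the run across it
        else:
            runs.append((start, last - start + 1))
            start = last = col
    if start is not None:
        runs.append((start, last - start + 1))
    return runs


row = [True, True, False, False, True, False, False, False, True]
print(merge_dirty_runs(row, max_gap=2))   # → [(0, 5), (8, 1)]
```

Each resulting run would then be issued as a single copy command, trading a few redundantly copied tiles for fewer draw calls.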

Finally I did a test with 8x8 tiles:

Drawing 48 8x8 tiles:

Sprites: 15.43 ms (93%)
No sprites: 13.80 ms (83%)

Goes to show that this tile size is quite inefficient, and combining adjacent copies is even more important. Breaking up a row copy to skip unchanged tiles is only worth it when more than 21 tiles are unmodified.

A final nice bit of information: due to VDP slowdown, the turboR in R800 mode is actually quite a bit slower than the Z80; the difference is 0.4 ms (2.4%) for 12 16x16 tiles and 1.8 ms (10.8%) for 48 8x8 tiles.

By Grauw

Ascended (9379)

27-03-2016, 17:27

Grauw wrote:

Additionally, I did a test copying large horizontal blocks, which should be faster because the VDP operates more efficiently. This is due to the reduced number of draw calls (during which it is idle) and line wraps (which slow it down a bit).

I did an additional test to determine which part of this is due to the reduction in line wraps:

Drawing 16x192 block:

Sprites: 8.97 ms (54%)
No sprites: 7.59 ms (46%)

So the draw call reduction accounts for the majority of the difference (more so with sprites enabled).

By the way, the percentages I mention are percentages of the total frame time. Note that the VDP is faster in the VBLANK period; from the 12 vs 16 tile measurements you can derive exactly how much, if interested.

Tiles summary table:

             12 16x16   16 16x16   192x16    256x16     48 8x8     16x192
--------------------------------------------------------------------------
Sprites      10.42 ms   14.19 ms   8.59 ms   11.86 ms   15.43 ms   8.97 ms
No sprites   9.05 ms    12.08 ms   7.05 ms   9.37 ms    13.80 ms   7.59 ms
--------------------------------------------------------------------------
Sprites      63%        85%        52%       71%        93%        54%
No sprites   54%        72%        42%       56%        83%        46%

Sorted by speed:

             48 8x8     12 16x16   16x192     192x16
-----------------------------------------------------
Sprites      15.43 ms   10.42 ms   8.97 ms    8.59 ms
No sprites   13.80 ms   9.05 ms    7.59 ms    7.05 ms
-----------------------------------------------------
Sprites      93%        63%        54%        52%
No sprites   83%        54%        46%        42%

By PingPong

Prophet (3556)

18-03-2016, 08:19

@grauw. Interesting. Would be nice to know how many 16x16 sprites one can move @30fps assuming the usual double buffer pattern that involves these steps:
Bksave
Tpset sprite draw
Bkrestore

By ARTRAG

Enlighted (6453)

18-03-2016, 09:35

Nice, but for emulating borders while scrolling you need vertical slices.

By wouter_

Champion (437)

18-03-2016, 10:15

Very nice performance tests!
I assume these are measurements from a real MSX machine? I wonder how different MSX emulators perform in this test.

By Grauw

Ascended (9379)

18-03-2016, 11:09

No, they’re on openMSX, under the assumption that it emulates the real machine well enough (I trust in that :)). I was interested in the draw budgets per frame so I can decide what’s possible and what’s not on MSX2 and MSX2+.

By wouter_

Champion (437)

18-03-2016, 12:02

Grauw wrote:

No, they’re on openMSX, under the assumption that it emulates the real machine well enough (I trust in that :)).

To the best of my knowledge that's indeed the case. But it would be nice to have an (independent) confirmation of this.

By PingPong

Prophet (3556)

18-03-2016, 21:09

Grauw wrote:

Finally I did a test with 8x8 tiles:

Drawing 48 8x8 tiles:

Sprites: 15.43 ms (93%)
No sprites: 13.80 ms (83%)

Goes to show that this tile size is quite inefficient, and combining adjacent copies is even more important. Breaking up a row copy to skip unchanged tiles is only worth it when more than 21 tiles are unmodified.

The difference between 8x8 and 16x16 is not only due to the VDP. With 8x8 tiles you do four times the number of OUTs. So to know how much slower the VDP is with small tiles, you need to take into account the 4x OUTs that are executed by the CPU, perhaps in a condition that leaves the VDP idle. When setting up the command you cannot take advantage of parallelism.

So if you want to compare VDP performance on 8x8 and 16x16, you need to subtract the time needed to set up the command from the time needed for the command.

By Grauw

Ascended (9379)

27-03-2016, 17:34

PingPong wrote:

So if you want to compare vdp perfs on 8x8 and 16x16 you need to subtract from the time needed for a command the time needed to setup the command

No, you must not, because there is no way to make the VDP execute them without it. This is meant to be a realistic measurement to determine what my budgets are. It shows, among other things, how much of a performance disadvantage there is to using 8x8 tiles (quite a lot, as it turns out), and how important it is to aggregate draws, even if it means copying some unnecessary data.

Just quickly interpreting the data above, the draw call overhead is ~0.15 ms (1% of frame time).

p.s. If you’re interested in the performance cost of line wraps on the VDP, you can compare the 192x16 and 16x192 measurements and check the excellent V9938 VRAM timings article by Wouter, Command engine timing section.

PingPong wrote:

Would be nice to know how many 16x16 sprites one can move @30fps assuming the usual double buffer pattern that involve these steps:
Bksave
Tpset sprite draw
Bkrestore

I did a test drawing softsprites using save (HMMM) / draw (LMMM-TIMP) / restore (HMMM).

Drawing two 16x16 softsprites:

Sprites: 7.24 ms (43%)
No sprites: 6.53 ms (39%)

Drawing four 16x16 softsprites:

Sprites: 15.09 ms (91%)
No sprites: 12.96 ms (78%)

Drawing two 16x32 softsprites:

Sprites: 14.23 ms (85%)
No sprites: 12.34 ms (74%)

Drawing two 16x24 softsprites:

Sprites: 10.79 ms (65%)
No sprites: 9.37 ms (56%)

The save step can be omitted by redrawing the tiles the sprite was on, however that comes at the cost of additional draw calls… Either way, it’s fair to say that soft sprites are expensive.

Softsprites summary table:

             2 16x16    4 16x16    2 16x32    2 16x24
------------------------------------------------------
Sprites      7.24 ms    15.09 ms   14.23 ms   10.79 ms
No sprites   6.53 ms    12.96 ms   12.34 ms   9.37 ms
------------------------------------------------------
Sprites      43%        91%        85%        65%
No sprites   39%        78%        74%        56%
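
One way to read the table is to look at the marginal cost of going from two to four 16x16 softsprites. The sketch below only does that subtraction on the measured values; the helper name `marginal_cost_ms` is made up for the example. Because the VDP runs faster during VBLANK, where the first draws land, this is a rough figure rather than a fixed per-sprite price.

```python
# Measured times in ms from the softsprites table: (two sprites, four sprites)
measured = {
    "Sprites": (7.24, 15.09),
    "No sprites": (6.53, 12.96),
}


def marginal_cost_ms(two_sprites_ms, four_sprites_ms):
    """Cost of each additional 16x16 softsprite when going from two to four."""
    return (four_sprites_ms - two_sprites_ms) / 2


for mode, (two, four) in measured.items():
    print(f"{mode}: {marginal_cost_ms(two, four):.2f} ms per extra softsprite")
```

Both results come out between 3 and 4 ms per extra softsprite, which is more than half of the time measured for the first two together.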

By Grauw

Ascended (9379)

27-03-2016, 17:55

Grauw wrote:

Just quickly interpreting the data above, the draw call overhead is ~0.15 ms (1% of frame time).

Worth noting that my code restores the status register and enables interrupts during the CE polling. By keeping interrupts disabled during polling and restoring the status register after issuing the command, the draw call overhead can be reduced to ~0.11 ms.

Keeping interrupts disabled for prolonged periods of time is not something I like to do, however it’s an option if you’re only doing small copies and don’t use line interrupts etc. Alternatively a custom ISR could have s#2 enabled by default.

By Grauw

Ascended (9379)

27-03-2016, 19:56

I did some calculations on the draw call overhead…

Keeping interrupts enabled during polling and command execution:

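VDPCommand_Execute_EI:
	ld c,VDP_PORT_3
	ld a,32
	di
	out (VDP_PORT_1),a
	ld a,17 | 128
	ei
	out (VDP_PORT_1),a  ; r#17 = 32: indirect writes start at r#32
WaitReady:
	ld a,2
	di
	out (VDP_PORT_1),a  ; select s#2
	ld a,15 | 128
	out (VDP_PORT_1),a
	in a,(VDP_PORT_1)
	rra                 ; CE (bit 0 of s#2) into carry
	ld a,0              ; ld leaves the carry flag intact
	out (VDP_PORT_1),a  ; select s#0
	ld a,15 | 128
	ei
	out (VDP_PORT_1),a
	jr c,WaitReady      ; loop while the previous command is executing
	REPT 15
	outi                ; send command registers r#32-r#46 from (HL)
	ENDM
	ret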

Best case overhead: 12 + 5 + 8 + 12 + 8 + 5 + 12 + 8 + 15 * 18 = 340 cycles
Worst case overhead: 340 + 5 + 8 + 12 + 8 + 5 + 12 + 13 + 8 + 5 + 12 + 8 + 12 = 448 cycles

Average overhead: 394 cycles, 0.110 ms, 0.66% frame time

Keeping interrupts disabled during polling and command execution:

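VDPCommand_Execute_DI:
	ld c,VDP_PORT_3
	ld a,32
	di
	out (VDP_PORT_1),a
	ld a,17 | 128
	out (VDP_PORT_1),a  ; r#17 = 32: indirect writes start at r#32
	ld a,2
	out (VDP_PORT_1),a  ; select s#2
	ld a,15 | 128
	out (VDP_PORT_1),a
WaitReady:
	in a,(VDP_PORT_1)
	rra                 ; CE (bit 0 of s#2) into carry
	jr c,WaitReady      ; loop while the previous command is executing
	REPT 15
	outi                ; send command registers r#32-r#46 from (HL)
	ENDM
	xor a
	out (VDP_PORT_1),a  ; select s#0
	ld a,15 | 128
	ei
	out (VDP_PORT_1),a
	ret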

Best case overhead: 12 + 5 + 8 + 15 * 18 = 295 cycles
Worst case overhead: 295 + 5 + 13 = 313 cycles

Average overhead: 304 cycles, 0.085 ms, 0.51% frame time

These numbers are without VDP command setup time, also they do not take into account availability of access slots and missing valuable blanking time, which I assume accounts for the difference between these numbers and the measurements.
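
For reference, the cycle sums and conversions for both routines can be reproduced with a few lines of Python. The per-instruction cycle counts are taken from the sums above; the 3.579545 MHz Z80 clock and the frame time of roughly 16.7 ms are assumptions of this sketch, and the helper name `overhead` is made up for the example.

```python
Z80_CLOCK_HZ = 3_579_545   # assumption: standard MSX Z80 clock
FRAME_MS = 16.7            # assumption: approximate frame time at 60 Hz


def overhead(best_cycles, extra_worst_cycles):
    """Return (worst, average, milliseconds, percent of frame) for one draw call."""
    worst = best_cycles + extra_worst_cycles
    average = (best_cycles + worst) / 2
    ms = average / Z80_CLOCK_HZ * 1000
    percent = ms / FRAME_MS * 100
    return worst, average, ms, percent


# EI version: interrupts enabled during polling and command execution
ei = overhead(12 + 5 + 8 + 12 + 8 + 5 + 12 + 8 + 15 * 18,
              5 + 8 + 12 + 8 + 5 + 12 + 13 + 8 + 5 + 12 + 8 + 12)
# DI version: interrupts disabled during polling and command execution
di = overhead(12 + 5 + 8 + 15 * 18, 5 + 13)

for name, (worst, average, ms, percent) in (("EI", ei), ("DI", di)):
    print(f"{name}: {average:.0f} cycles, {ms:.3f} ms, {percent:.2f}%")
# → EI: 394 cycles, 0.110 ms, 0.66%
# → DI: 304 cycles, 0.085 ms, 0.51%
```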

I redid three of the previous measurements with the first and second version of the Execute routine (discrepancy of EI version with the previous test is due to removing two calls):

             1 192x16 draw    12 16x16 draws   48 8x8 draws
-------------------------------------------------------------
EI version   8.65 ms (52%)    10.39 ms (62%)   14.84 ms (89%)
DI version   8.65 ms (52%)    9.94 ms (60%)    13.43 ms (81%)

Overhead per draw:

For 12 16x16 draws EI version: (10.39 - 8.65) / 11 = 0.16 ms (0.95%)
For 12 16x16 draws DI version: (9.94 - 8.65) / 11 = 0.12 ms (0.70%)
For 48 8x8 draws EI version: (14.84 - 8.65) / 47 = 0.13 ms (0.79%)
For 48 8x8 draws DI version: (13.43 - 8.65) / 47 = 0.10 ms (0.61%)
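
The same arithmetic as a small Python sketch. All three test cases cover the same 192x16 pixel area, so the time beyond the single-draw baseline is attributed to the extra draw calls. The frame time of roughly 16.7 ms is an assumption of this sketch, and `overhead_per_draw` is a name made up for the example.

```python
FRAME_MS = 16.7       # assumption: approximate frame time at 60 Hz
BASELINE_MS = 8.65    # one 192x16 draw, identical for the EI and DI versions


def overhead_per_draw(total_ms, draws):
    """Extra time per additional draw call compared to one big copy."""
    return (total_ms - BASELINE_MS) / (draws - 1)


cases = [
    ("12 16x16 draws, EI", 10.39, 12),
    ("12 16x16 draws, DI", 9.94, 12),
    ("48 8x8 draws, EI", 14.84, 48),
    ("48 8x8 draws, DI", 13.43, 48),
]

for label, total_ms, draws in cases:
    ms = overhead_per_draw(total_ms, draws)
    print(f"{label}: {ms:.2f} ms ({ms / FRAME_MS * 100:.2f}%)")
```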
