Which of these is faster?

Page 2/3
1 | | 3

By theNestruo

Master (198)

21-09-2020, 20:34

Just tested the routine I wrote earlier... and there is a HUGE difference:

(Please note that the code also tricks the BIOS KEYINT handler into skipping the keyboard scan, etc., which leaves even more room in every frame.)

By Grauw

Ascended (9340)

21-09-2020, 21:01

Note that when counting scanlines, if you’re using a 60 Hz machine in openMSX then 22 lines are hidden in the overscan, and if you’re using a 50 Hz machine then 73 lines are hidden in the overscan. Of course a large difference nevertheless.

By albs_br

Master (147)

21-09-2020, 21:18

Pretty clever solution, 3 x 256 bytes loops, theNestruo. There are lots of cool code tricks on msxlib.

By gdx

Prophet (3748)

22-09-2020, 09:14

TheNestruo's routine is the fastest for your specific use.

This version, inspired by the routines above, is probably the fastest for general use:

; Fast RAM to VRAM transfer (LDIRVM via direct I/O, for MSX1 screen modes)

;Input: HL = source address in RAM
;       DE = destination address in VRAM (0-3FFFh)
;       BC = number of bytes to transfer
;Note:	DI and EI have to be removed when used during vblank

VDP_DW	equ	00007h	; BIOS address that holds the VDP data write port number


ldirvmIO:
	push	bc

	ld	a,(VDP_DW)	;read the port number from the BIOS (incompatible under DOS)
	ld	c,a
	inc	c		;C = VDP command port (data port + 1)

	di
	out	(c),e		;bits 0 to 7 of VRAM address
	ld	a,d
	and	03fh		;keep bits 8 to 13 of the address
	or	040h		;set bit 6 to select VRAM write
	out	(c),a		;bits 8-13 of VRAM address
	ei
	dec	c		;C = VDP data port
	pop	de		;DE = number of bytes to transfer
	xor	a
	cp	d
	jr	z,ldirvm_jp	;jump if bytes to transfer <256
	ld	b,a		;B = 0: 256 bytes per pass
ldirvm_lp1:
	outi
	jr	nz,ldirvm_lp1
	dec	d
	jr	nz,ldirvm_lp1
ldirvm_jp:
	cp	e		;A is still 0
	ret	z		;no remainder: avoid copying 256 extra bytes
	ld	b,e
ldirvm_lp2:
	outi
	jr	nz,ldirvm_lp2
	ret

CALL SETWRT is not necessary for MSX1 screen modes.

Paragon, I think PUSH BC and POP BC are useless in your routine. It also does not seem to work when fewer than 256 bytes have to be transferred.

By theNestruo

Master (198)

22-09-2020, 10:09

gdx wrote:

This version, inspired by the routines above, is probably the fastest for general use (...)

I would try to swap the loops: first copy E bytes, then copy Dx256 bytes.
This way, you can save a few comparisons and LDs. D must be adjusted accordingly (and that's a little tricky if E is 00).
Untested code:

	(...)
	pop	de		; DE = number of bytes to transfer

; Adjusts DE
	dec	de		; (so edge cases 0100, 0300, ... become 00ff, 02ff, ...)
	inc	d		; (for 00xx to stop the loop the first time DEC D is reached)
	inc	e		; (restores the value of E for the first loop)

; Copies E bytes the first time
	ld	b,e
ldirvm_lp:
	outi
	jp	nz,ldirvm_lp	; (jp is faster than jr when taken; that is 255 of 256 times)

; Copies the remaining Dx256 bytes (repeats a 256b loop D times)
	dec	d
	jp	nz,ldirvm_lp
	ret
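
To make the DE adjustment easier to follow, here is a hand trace for a few byte counts. This is a working-through of the untested code above under its own assumptions, not something taken from the original post:

```asm
; Hand trace of: dec de / inc d / inc e, then the single OUTI loop.
; B = 0 means 256 iterations; D counts the passes through the loop.
;
;  DE in   after adjust    passes            bytes copied
;  0005h   D=01h E=05h     5                 5
;  0100h   D=01h E=00h     256               256
;  0105h   D=02h E=05h     5 + 256           261
;  0300h   D=03h E=00h     256 + 256 + 256   768
;
; DE = 0000h becomes D=00h E=00h; DEC D then wraps to FFh, so the loop
; runs 256 passes of 256 bytes. Callers should avoid a zero count.
```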

Minor additional optimizations can be made at the end of the routine if it is known that it will be used mostly for <=256b (RET Z / JP LDIRVM_LP) rather than >256b.
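
A minimal sketch of that alternative ending, assuming the same `ldirvm_lp` label and register use as the untested routine above (T-state figures are plain Z80 values, before the extra MSX wait states):

```asm
; Ending that favours transfers of <=256 bytes (sketch, untested)
	dec	d
	ret	z		; usual case: nothing left, 11 T-states instead of JP NZ + RET (20)
	jp	ldirvm_lp	; otherwise run another 256-byte pass (B is 0 here)
```

The trade-off is that each additional 256-byte pass now costs a not-taken RET Z plus a JP (5 + 10 T-states) instead of a single JP NZ (10).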

By santiontanon

Paragon (1092)

22-09-2020, 10:54

Yep, that last piece of code uses exactly the same idea as the code snippet I pasted before, except I use "a" instead of "d" in the final loop. But it's basically the same.

If speed is the most important thing here, and if the copy is always the name table, I'd go with something like TheNestruo's 3*256 loops specific routine above. When you know exactly what you are going to copy, specialized routines are always faster! You can go even further: if you know you are calling it during vblank, some parts of the loops can be unrolled to make it faster if needed.
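
As an illustration of partial unrolling, here is a sketch of a fixed 768-byte copy with the OUTI unrolled 8 times. The labels `copy768` and `copy_lp` are made up for this example. It assumes C already holds the VDP data port, HL points to the source, the VRAM write address has been set, and the routine runs during vblank, where back-to-back OUTIs are assumed to be safe:

```asm
; Sketch: copy 768 bytes (3 x 256) with 8x unrolled OUTI (hypothetical labels)
copy768:
	ld	a,3		; three 256-byte blocks
	ld	b,0		; B = 0: OUTI counts 256 bytes before Z is set
copy_lp:
	outi
	outi
	outi
	outi
	outi
	outi
	outi
	outi
	jp	nz,copy_lp	; Z reflects the last OUTI; B only hits 0 on a multiple of 8
	dec	a
	jp	nz,copy_lp	; B is 0 again, so the next block is another 256 bytes
	ret
```

Because 256 is a multiple of 8, B can only reach zero on the last OUTI of a group, so testing the flag once per group is enough.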

By gdx

Prophet (3748)

22-09-2020, 13:00

theNestruo wrote:

I would try to swap the loops: first copy E bytes, then copy Dx256 bytes.

Good idea!

	(...)
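	pop	de		;DE = number of bytes to transfer
	ld	b,e		;note: E = 0 makes the first loop copy 256 bytes, so exact
				;multiples of 256 need DE adjusted first (dec de / inc d / inc e)
	xor	a
ldirvm_lp1:
	outi
	jr	nz,ldirvm_lp1
	cp	d
	ret	z	;back if bytes to transfer <256
ldirvm_lp2:
	outi
	jr	nz,ldirvm_lp2
	dec	d
	jr	nz,ldirvm_lp2
	ret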

By albs_br

Master (147)

22-09-2020, 15:37

santiontanon wrote:

(...)if you know you are calling it during vblank some parts of the loops can be unrolled to make it faster if needed.

Is it acceptable, recommended or good practice (I mean, is it usual) to use such a large number (768) of unrolled OUTI instructions? To me, it looks like trading space for speed.

Also, is there a REPEAT directive in tniASM 0.45? I didn't find one in the documentation.

By albs_br

Master (147)

22-09-2020, 15:53

Found here: it looks like a good tradeoff between size and speed:

"Making LDIR 21% faster", with loops for multiples of 16 and any number:

http://map.grauw.nl/articles/fast_loops.php#unrolling

By Metalion

Paragon (1203)

22-09-2020, 16:50

albs_br wrote:

Found here: it looks like a good tradeoff between size and speed:
"Making LDIR 21% faster", with loops for multiples of 16 and any number:
http://map.grauw.nl/articles/fast_loops.php#unrolling

Yes, but an unrolled OUTI outside a loop takes only 18 cycles, so it is usually too fast for the MSX1 VDP. Of course, it depends on the screen mode and on whether you are inside or outside VBLANK. But on MSX1 screen 2, with sprites and outside VBLANK, you have to wait 29 cycles between writes.
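
For comparison with that 29-cycle figure, here is a rough cycle count of the two loop styles used in this thread. It assumes the usual MSX timing of one extra wait state per M1 cycle (so OUTI is 16 + 2, a taken JP is 10 + 1, a taken JR is 12 + 1), and the labels are only for illustration:

```asm
; Cycles per byte written, counting MSX M1 wait states (illustrative labels)
loop_jp:
	outi			; 18 cycles
	jp	nz,loop_jp	; 11 cycles -> 29 per byte
loop_jr:
	outi			; 18 cycles
	jr	nz,loop_jr	; 13 cycles when taken -> 31 per byte
```

Under those assumptions both loops leave at least 29 cycles between writes, while back-to-back OUTIs at 18 cycles each do not.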
