Super Fast binary to decimal routine

Страница 1/2
| 2

By retrocanada76

Hero (542)

Аватар пользователя retrocanada76

14-10-2011, 05:02

I've been searching for a fast binary to decimal routine and the widespread solution is always dividing by 10 and all posible optmizations around division by 10.
But our beloved Z-80 doesn't like divisions. Although the BCD numbers are useful they can't be used everywhere specially for arithmetic intensive variables.
So I decided to make my own algorithm that converts from binary to BCD and then prints BCDs.

Googling around binary to BCDs is another bottleneck as the most common solution is the double-dabble. But for full 32-bit numbers that means shifting along 72 bits sequences for 32 times. Also conditional additions of 3 are performed between shifts for each nibble.
So I decided to make my very own binary to BCD using a good old LUT. For each bit I have a BCD number where I add using BCD math to the result. Then finally I convert the result BCD to a string.

.bios
.page 2
.rom
.start INIT

db		"DECIMAL PRINT", 0x1a

BCD_LUT:
	db		0x00, 0x00, 0x00, 0x00, 0x01
	db		0x00, 0x00, 0x00, 0x00, 0x02
	db		0x00, 0x00, 0x00, 0x00, 0x04
	db		0x00, 0x00, 0x00, 0x00, 0x08
	db		0x00, 0x00, 0x00, 0x00, 0x16
	db		0x00, 0x00, 0x00, 0x00, 0x32
	db		0x00, 0x00, 0x00, 0x00, 0x64
	db		0x00, 0x00, 0x00, 0x01, 0x28
	db		0x00, 0x00, 0x00, 0x02, 0x56
	db		0x00, 0x00, 0x00, 0x05, 0x12
	db		0x00, 0x00, 0x00, 0x10, 0x24
	db		0x00, 0x00, 0x00, 0x20, 0x48
	db		0x00, 0x00, 0x00, 0x40, 0x96
	db		0x00, 0x00, 0x00, 0x81, 0x92
	db		0x00, 0x00, 0x01, 0x63, 0x84
	db		0x00, 0x00, 0x03, 0x27, 0x68
	db		0x00, 0x00, 0x06, 0x55, 0x36
	db		0x00, 0x00, 0x13, 0x10, 0x72
	db		0x00, 0x00, 0x26, 0x21, 0x44
	db		0x00, 0x00, 0x52, 0x42, 0x88
	db		0x00, 0x01, 0x04, 0x85, 0x76
	db		0x00, 0x02, 0x09, 0x71, 0x52
	db		0x00, 0x04, 0x19, 0x43, 0x04
	db		0x00, 0x08, 0x38, 0x86, 0x08
	db		0x00, 0x16, 0x77, 0x72, 0x16
	db		0x00, 0x33, 0x55, 0x44, 0x32
	db		0x00, 0x67, 0x10, 0x88, 0x64
	db		0x01, 0x34, 0x21, 0x77, 0x28
	db		0x02, 0x68, 0x43, 0x54, 0x56
	db		0x05, 0x36, 0x87, 0x09, 0x12
	db		0x10, 0x73, 0x74, 0x18, 0x24
	db		0x21, 0x47, 0x48, 0x36, 0x48
		
INIT:

	xor		a
	call	CHGMOD

	ld		hl,0
	ld		de,0
@@LOOP:

	push	de
	push	hl
	call	PRINT_UINT32
	pop		hl
	pop		de
	ld		bc,1
	and		a	
	adc		hl,bc
	ex		hl,de
	ld		bc,0
	adc		hl,bc
	ex		hl,de

	jp		@@LOOP
		
PRINT_UINT32:
	call	BIN32_TO_BCD
	call	PRINT_BCD
	ret
	
PRINT_BCD:
	ld		ix,BCD_RESULT
	ld		b,5
	ld		iy,STR_RESULT
@@LOOP:
	ld		a,[ix]
	inc		ix
	ld		c,a
	srl		a
	srl		a
	srl		a
	srl		a
	add		'0'
	ld		[iy],a
	inc		iy
	ld		a,c
	and		0x0f
	add		'0'
	ld		[iy],a
	inc		iy
	djnz	@@LOOP
	
	ld		hl,STR_RESULT
	ld		de,0
	ld		bc,10
	call	LDIRVM
	
	ret
	
BIN32_TO_BCD:
; DEHL = signed 32 bit value

	ld		bc,5
	ld		iy,BCD_LUT
	ld		ix,BCD_RESULT
	ld		[ix],0
	ld		[ix+1],0
	ld		[ix+2],0
	ld		[ix+3],0
	ld		[ix+4],0
	
@@LOOP:

	sra		d
	rr		e
	rr		h
	rr		l
	jr		nc,@@NEXT_BIT
	
	ld		a,[ix+4]
	add		a,[iy+4]
	daa
	ld		[ix+4],a

	ld		a,[ix+3]
	adc		a,[iy+3]
	daa
	ld		[ix+3],a
	
	ld		a,[ix+2]
	adc		a,[iy+2]
	daa
	ld		[ix+2],a
	
	ld		a,[ix+1]
	adc		a,[iy+1]
	daa
	ld		[ix+1],a

	ld		a,[ix]
	adc		a,[iy]
	daa
	ld		[ix],a
	
@@NEXT_BIT:

	add		iy,bc
	
	ld		a,d
	or		e
	or		h
	or		l
	jr		nz,@@LOOP
	
	ret
	
.page   3
BCD_RESULT:
	ds		5
STR_RESULT:
	ds		10
		

If you have any tip to improve the speed let me know it.

Для того, чтобы оставить комментарий, необходимо регистрация или !login

By PingPong

Prophet (4093)

Аватар пользователя PingPong

14-10-2011, 08:12

ld		a,[ix+4]
	add		a,[iy+4]
	daa
	ld		[ix+4],a

	ld		a,[ix+3]
	adc		a,[iy+3]
	daa
	ld		[ix+3],a
	
	ld		a,[ix+2]
	adc		a,[iy+2]
	daa
	ld		[ix+2],a
	
	ld		a,[ix+1]
	adc		a,[iy+1]
	daa
	ld		[ix+1],a

	ld		a,[ix]
	adc		a,[iy]
	daa
	ld		[ix],a

If you have any tip to improve the speed let me know it.

try if you can use HL' instead of ix/iy registers for the code fragment above.
you access sequentially a block of memory so it should be feasible. at least a push hl pop hl take 23 cycles toghether.

ld a,[ix+....] take 20 or more. so a couple of access with hl compensate the overhead of push pop hl
ix/iy registry are terribly slow compared to HL/BC/DE when used to access memory.
i use them when i need a fast counter and i've already used all a-h registers.
(via undocumented instructions like inc ixh ).

By ARTRAG

Enlighted (6923)

Аватар пользователя ARTRAG

14-10-2011, 09:12

and a
adc hl,bc

can safely change in

add hl,bc

By wouter_

Champion (508)

Аватар пользователя wouter_

14-10-2011, 10:47

If you have any tip to improve the speed let me know it.
Your PRINT_BCD routine can be speed up by using the strange 'RLD' and 'RRD' Z80 instructions (in fact this type of routine is the only use so far I found for these instructions).

Alternative 1:

PRINT_BCD:
	ld	hl,BCD_RESULT+5
	ld	de,STR_RESULT+10
	ld	a,'0'
	ld	b,5
@@LOOP:	dec	hl
	rrd
	ld	(de),a
	dec	de
	rrd
	ld	(de),a
	dec	de
	djnz	@@LOOP
	; at this point DE=STR_RESULT, B=0
	ex	de,hl
	ld	d,b
	ld	e,b
	ld	c,10
	jp	LDIRVM

Alternative 2:

PRINT_BCD:
	di
	xor	a
	out	(#99),a
	ld	a,14+128
	out	(#99),a		; VDP reg#14 = 0
	xor	a
	out	(#99),a
	ld	a,#40
	ei
	out	(#99),a		; set VRAM access pointer

	ld	hl,BCD_RESULT
	ld	b,5
	ld	a,'0'
@@LOOP:	rld
	out	(#98),a
	rld
	out	(#98),a
	inc	hl
	djnz	@@LOOP		; Possibly unroll 5x for more speed
	ret

Note that these two routines change the content of the BCD_RESULT array, but I guess usually that's not a problem.

Though I think it's possible to gain more speed by optimizing the BIN32_TO_BCD routine. I'll try to do that later.

By RetroTechie

Paragon (1563)

Аватар пользователя RetroTechie

14-10-2011, 12:43

You might want to check out this thread first: 32 bits long to ascii

Includes the MOAB2D (Mother-Of-All-Binary-To-Decimal) routine. But if speed is crucial, you might want to take advantage of the following: say you have a number ABCD hexadecimal. Regardless of values of A/B/C/D, the decimal value it represents = 4096 * A + 256 * B + 16 * C + D. You could put decimal values for these numbers in a small set of tables (max 5 digits BCD per hex nibble), and then quickly add the intermediate results using BCD arithmetic. For max speed you could waste some more memory & do this on a per-byte basis.

But for 32 bits you'd still have to do 4 bytes, and BCD add the results. Don't know how much you can optimize that... repeatedly shifting bits looks cumbersome, but bit shifts & co are pretty fast on the Z80.

By wouter_

Champion (508)

Аватар пользователя wouter_

14-10-2011, 12:58

Though I think it's possible to gain more speed by optimizing the BIN32_TO_BCD routine. I'll try to do that later.
Here's my initial attempt to speedup BIN32_TO_BCD. It follows PingPong's suggestion to not use the IX and IY registers. In addition I also kept all intermediate results in registers instead of constantly reading/writing them from/to memory.

BIN32_TO_BCD:
; DEHL = UNsigned 32 bit value
	exx
	ld	hl,LUT
	ld	bc,0	; B = digits 2-1  C = digits 4-3
	ld	d,b	; D = digits 6-5
	ld	e,c	; E = digits 8-7
	exx
	ld	b,0	; B = digits 10-9

Loop:	sra	d
	rr	e
	rr	h
	rr	l	; carry flag is next bit
	exx
	jr	nc,Bit0

Bit1	ld	a,(hl)
	inc	hl
	add	b	; digits 2-1
	daa
	ld	b,a

	ld	a,(hl)
	inc	hl	; important: 'INC HL' doesn't change carry flag
	adc	c	; digits 4-3
	daa
	ld	c,a

	ld	a,(hl)
	inc	hl
	adc	d	; digits 6-5
	daa
	ld	d,a

	ld	a,(hl)
	inc	hl
	adc	e	; digits 8-7
	daa
	ld	e,a

	ld	a,(hl)
	inc	hl
	exx
	adc	b	; digits 10-9
	daa
	ld	b,a
	jr	Loop

Bit0:	inc	hl	; If the LUT array (size 5x32 = 160 bytes) doesn't
	inc	hl	; cross a 256-byte boundary, it's possible to replace
	inc	hl	; these 5 instructions with 'INC L'. This saves 2 cycles
	inc	hl	; per instruction (on Z80). And similarly in the 'Bit1'
	inc	hl	; branch (also 'INC L' doesn't change the carry flag).
	exx
	ld	a,d	; Stop if DE:HL == 0
	or	e	;  Only check this condition in the Bit0 branch. Also
	or	h	;  checking in Bit1 branch may or may not be faster
	or	l	;  depending on the input. I didn't check, but I think
	jr	nz,Loop	;  on average only checking here is faster.

End:	ld	a,b     ; Alternative: don't store in memory, but directly
			; convert BCD->ASCII from registers
	ld	(RESULT+0),a	; digits 10-9
	exx
	ld	(RESULT+1),de	; digits 8-5
	ld	(RESULT+3),bc	; digits 4-1
	ret

RESULT:	ds	5

LUT:	; Note: columns are reversed compared to the original version
	db	#01, #00, #00, #00, #00
	db	#02, #00, #00, #00, #00
	db	#04, #00, #00, #00, #00
	db	#08, #00, #00, #00, #00
	db	#16, #00, #00, #00, #00
	db	#32, #00, #00, #00, #00
	db	#64, #00, #00, #00, #00
	db	#28, #01, #00, #00, #00
	db	#56, #02, #00, #00, #00
	db	#12, #05, #00, #00, #00
	db	#24, #10, #00, #00, #00
	db	#48, #20, #00, #00, #00
	db	#96, #40, #00, #00, #00
	db	#92, #81, #00, #00, #00
	db	#84, #63, #01, #00, #00
	db	#68, #27, #03, #00, #00
	db	#36, #55, #06, #00, #00
	db	#72, #10, #13, #00, #00
	db	#44, #21, #26, #00, #00
	db	#88, #42, #52, #00, #00
	db	#76, #85, #04, #01, #00
	db	#52, #71, #09, #02, #00
	db	#04, #43, #19, #04, #00
	db	#08, #86, #38, #08, #00
	db	#16, #72, #77, #16, #00
	db	#32, #44, #55, #33, #00
	db	#64, #88, #10, #67, #00
	db	#28, #77, #21, #34, #01
	db	#56, #54, #43, #68, #02
	db	#12, #09, #87, #36, #05
	db	#24, #18, #74, #73, #10
	db	#48, #36, #48, #47, #21

Some possible future enhancements:
* In the first 6 iterations digits 10-3 always remain zero, so it's not needed to update them. Similarly for digits 10-5 for the first 13 iterations and so on.
* (less important) Only in the first 8 iterations it's needed to shift the whole 32bit number DE:HL, after that register D remains zero. Similarly for register E and H after 16 and 24 iterations. Or even better: the first 8 iterations only shift register L, next 8 iterations only shift H and so on.

Though taking advantage of these properties requires (at least partly) unrolling this routine. So, depending on your application, the extra speed may not be worth it compared to the increase in code size.

By retrocanada76

Hero (542)

Аватар пользователя retrocanada76

14-10-2011, 15:18

Thanks for all, seems that lot of people here love an assembly riddle.

Wouter_ your version seems very good, the trick to revert the BCD to match memory endianess is very clever. I wonder how fast is this version compared to the post about the mother of all conversions ? I'm interested in the conversion only not the output as it should be implemented differently for each app. What do you think ? Can we do a benchmark ?

I think unrolling the loop would be an overkill at least for almost all applications as it increases memory footprint.

By retrocanada76

Hero (542)

Аватар пользователя retrocanada76

14-10-2011, 15:29

In the extreme unrolling optimization we could use immediates instead of reading the LUT. That would be fast.... and looooong! At least 32 times the Bit1-Bit0 segment.

By bore

Master (161)

Аватар пользователя bore

14-10-2011, 21:26

Perhaps it would be possible to loop each byte separatly in four different loops.
This would give you the benefit of only having to shift one byte at a time and this will allow you to skip out early in all four loops as soon as it finds a byte with the topmost bits set to 0. The zero flag will also be set for free.
It will also be possible to reduce the lookup-table for the early bytes since the last bytes are always zero.
In the first three loops it will be possible to reduce the number of additions since the higher bytes are always 0 at that time. (First byte can at maximum cause the result to be 255 which allows us to skip addition for the top three bytes in the first loop.)

The bad thing is that it gives you almost three times as much code while only reducing the table by 32 bytes.

Now that I think about it it would probably be fast to replace the first loop in the code below with a non-table based function and will let you reduce the table size with another 16 bytes.

BIN32_TO_BCD:
; DEHL = UNsigned 32 bit value
	exx
	ld	hl,LUT1
	ld	bc,0	; B = digits 2-1  C = digits 4-3
	ld	d,b	; D = digits 6-5
	ld	e,c	; E = digits 8-7
	exx
	ld	b,0	; B = digits 10-9

Loop_1:	sra	l	; carry flag is next bit
	exx
	jr	nc,Bit0_1

Bit1_1	ld	a,(hl)
	inc	hl
	add	b	; digits 2-1
	daa
	ld	b,a

	ld	a,(hl)
	inc	hl	; important: 'INC HL' doesn't change carry flag
	adc	c	; digits 4-3
	;daa		; Not really needed since c can't exceed 2 in this loop
	ld	c,a

	exx
	jr	Loop_1

Bit0_1:	inc	hl	; important: 'INC HL' doesn't change zero flag
	inc	hl	; The shift that caused the 0-carry sets the zero flag if
	exx		; there are no more bits to shift
	jr	nz,Loop_1

	exx
	ld	hl,LUT2	; Number of loops is depending on input data so we must reload hl
	exx

Loop_2:	sra	h	; carry flag is next bit
	exx
	jr	nc,Bit0_2

Bit1_2	ld	a,(hl)
	inc	hl
	add	b	; digits 2-1
	daa
	ld	b,a

	ld	a,(hl)
	inc	hl	; important: 'INC HL' doesn't change carry flag
	adc	c	; digits 4-3
	daa
	ld	c,a

	ld	a,(hl)
	inc	hl
	adc	d	; digits 6-5
	;daa		; Not really needed since d can't exceed 6 in this loop
	ld	d,a

	exx
	jr	Loop_2

Bit0_2:	inc	hl	; important: 'INC HL' doesn't change zero flag
	inc	hl	; The shift that caused the 0-carry sets the zero flag if
	inc	hl	; there are no more bits to shift
	exx
	jr	nz,Loop_2

	exx
	ld	hl,LUT3	; Number of loops is depending on input data so we must reload hl
	exx

Loop_3:	sra	e	; carry flag is next bit
	exx
	jr	nc,Bit0_3

Bit1_3	ld	a,(hl)
	inc	hl
	add	b	; digits 2-1
	daa
	ld	b,a

	ld	a,(hl)
	inc	hl	; important: 'INC HL' doesn't change carry flag
	adc	c	; digits 4-3
	daa
	ld	c,a

	ld	a,(hl)
	inc	hl
	adc	d	; digits 6-5
	daa
	ld	d,a

	ld	a,(hl)
	inc	hl
	adc	e	; digits 8-7
	daa
	ld	e,a

	exx
	jr	Loop_3

Bit0_3:	inc	hl	; important: 'INC HL' doesn't change zero flag
	inc	hl	; The shift that caused the 0-carry sets the zero flag if
	inc	hl	; there are no more bits to shift
	inc	hl
	exx
	jr	nz,Loop_3

	exx
	ld	hl,LUT4	; Number of loops is depending on input data so we must reload hl
	exx

Loop_4:	sra	d	; carry flag is next bit
	exx
	jr	nc,Bit0_4

Bit1_4	ld	a,(hl)
	inc	hl
	add	b	; digits 2-1
	daa
	ld	b,a

	ld	a,(hl)
	inc	hl	; important: 'INC HL' doesn't change carry flag
	adc	c	; digits 4-3
	daa
	ld	c,a

	ld	a,(hl)
	inc	hl
	adc	d	; digits 6-5
	daa
	ld	d,a

	ld	a,(hl)
	inc	hl
	adc	e	; digits 8-7
	daa
	ld	e,a

	ld	a,(hl)
	inc	hl
	exx
	adc	b	; digits 10-9
	daa
	ld	b,a
	jr	Loop_4

Bit0_4:	inc	hl	; important: 'INC HL' doesn't change zero flag
	inc	hl	; The shift that caused the 0-carry sets the zero flag if
	inc	hl	; there are no more bits to shift
	inc	hl
	inc	hl
	exx
	jr	nz,Loop_4

End:	ld	a,b     ; Alternative: don't store in memory, but directly
			; convert BCD->ASCII from registers
	ld	(RESULT+0),a	; digits 10-9
	exx
	ld	(RESULT+1),de	; digits 8-5
	ld	(RESULT+3),bc	; digits 4-1
	ret

RESULT:	ds	5

	; Note: columns are reversed compared to the original version
LUT1:	db	#01, #00
	db	#02, #00
	db	#04, #00
	db	#08, #00
	db	#16, #00
	db	#32, #00
	db	#64, #00
	db	#28, #01
LUT2:	db	#56, #02, #00
	db	#12, #05, #00
	db	#24, #10, #00
	db	#48, #20, #00
	db	#96, #40, #00
	db	#92, #81, #00
	db	#84, #63, #01
	db	#68, #27, #03
LUT3:	db	#36, #55, #06, #00
	db	#72, #10, #13, #00
	db	#44, #21, #26, #00
	db	#88, #42, #52, #00
	db	#76, #85, #04, #01
	db	#52, #71, #09, #02
	db	#04, #43, #19, #04
	db	#08, #86, #38, #08
LUT4:	db	#16, #72, #77, #16, #00
	db	#32, #44, #55, #33, #00
	db	#64, #88, #10, #67, #00
	db	#28, #77, #21, #34, #01
	db	#56, #54, #43, #68, #02
	db	#12, #09, #87, #36, #05
	db	#24, #18, #74, #73, #10
	db	#48, #36, #48, #47, #21

DISCLAIMER:
I have neither tested nor assembled this code. It is not unlikely that it doesn't work.

By bore

Master (161)

Аватар пользователя bore

14-10-2011, 22:20

One reason it wouldn't work is for example that the indata should be shifted logically instead of arithmetically

Replace all sra with srl, both in my version and those above to make sure that it doesn't get stuck in an infinite loop for numbers with the most significant bit set.

By wouter_

Champion (508)

Аватар пользователя wouter_

14-10-2011, 22:26

In the extreme unrolling optimization we could use immediates instead of reading the LUT. That would be fast.... and looooong! At least 32 times the Bit1-Bit0 segment.
Interesting idea ... I tried it just for fun. The fully unrolled routine allows for _lots_ of small tweaks. I ended up with a routine that was 'only' 593 bytes in size. For comparison the size of the non-unrolled routine + LUT is 236 bytes. So the fully unrolled version is only about 2.5 times bigger (I expected a lot more). I'm not showing my routine here because it's too big and I don't think anyone should actually use it. (But ask me if you're interested anyway).

Bore's version is also interesting, I didn't measure it's size but it might be somewhere in between. But IMHO again I don't think it's worth using this larger routine. Bin->dec conversion should never be _that_ speed critical in your application. In case bin->dec conversion is not speed critical at all, you may want to avoid this algorithm completely because with the LUT it is always quite large.

Страница 1/2
| 2