Assember: 16bit bubble sort

Страница 3/5
1 | 2 | | 4 | 5

By wouter_

Champion (470)

Аватар пользователя wouter_

27-07-2011, 22:51

@wouter, can you explain the db #01 and the remark that it would be a LD BC,nn ? Where is the nn defined or is it really just to skip the dec l and the ret z, but then why not use just a jp as a jp has the same execution time (10 cycles), or am i missing something?
'LD BC,#xxyy' is a 3 byte instruction, it's encoded as 01 yy xx. By putting a 'db #01' in the code, the next 2 bytes are seen as part of the 'LD BC,#xxyy' instruction and so they won't get executed as individual instructions. Both 'LD BC,#xxyy' and 'JP #xxyy' take 11 cycles to execute (not 10). From a performance point of view there's no advantage in using this trick, but it does save 2 bytes on code size.

Never ever did take a look at it but it seems that a JR actually is taking more cycles then a JP, that's strange ?
JP always takes 11 cycles to execute, whether the (conditional) jump is actually taken or not. JR takes 8 cycles when the jump is not taken, but 13 cycles when the jump is taken. So for often taken jumps (e.g. also unconditional jumps) JP is faster than JR, for often not-taken jumps, JR is faster. When the chance for taken/non-taken is 50/50 JR has (on average) a small advantage over JP.
The situation is different on R800, there JR is always faster than JP (JR = 2 or 3 cycles, JP = 3 or 5 cycles).

By WORP3

Paladin (835)

Аватар пользователя WORP3

28-07-2011, 09:44


JP always takes 11 cycles to execute, whether the (conditional) jump is actually taken or not. JR takes 8 cycles when the jump is not taken, but 13 cycles when the jump is taken. So for often taken jumps (e.g. also unconditional jumps) JP is faster than JR, for often not-taken jumps, JR is faster. When the chance for taken/non-taken is 50/50 JR has (on average) a small advantage over JP.
The situation is different on R800, there JR is always faster than JP (JR = 2 or 3 cycles, JP = 3 or 5 cycles).

According to the Zilog Z80 manual (see below) the JP instruction will take 10 T-states, not 11 ?!?! Where did you find the number of cycles for the JP instruction ?
For some reason their seems to be some deviation between our documentation ?

www.zilog.com/docs/z80/um0080.pdf

Ok "LD BC,#1234" is indeed taking less memory then the JP instruction ;)

By wouter_

Champion (470)

Аватар пользователя wouter_

28-07-2011, 10:52

According to the Zilog Z80 manual (see below) the JP instruction will take 10 T-states, not 11 ?!?! Where did you find the number of cycles for the JP instruction ?
For some reason their seems to be some deviation between our documentation ?

Your documentation is correct: a JP instruction takes 10 cycles on a Z80. However, in an MSX computer, there is a circuit that adds 1 or 2 extra wait cycles per instruction (one per M1-cycle: very roughly this means 1 cycle for most instructions, 2 cycles for instruction that start with a CB, DD, ED or FD prefix). I don't know the details, but I've been told this extra delay is related to RAM refresh.

By wouter_

Champion (470)

Аватар пользователя wouter_

28-07-2011, 19:41

@NYYRIKKI: I believe there's a bug in your 16-bit sort routines: to compare 2 16-bit values you use a sequence like this:

	LD A,D
	CP H
	JR C,.SWAP
	LD A,E
	CP L
	JR NC,.LOOPO

But that's wrong: the lower 8-bits should only be compared when the upper 8-bits are equal (not when they are smaller). A possible fix is to add 'JP NZ,LOOPO' right before 'LD A,E'.

I've implemented an alternative 16-bit comparison in the routine below. Depending on the type of input data the original or the alternative comparison will be faster: if values with an equal high byte are rare, the original comparison is likely faster. If equal high bytes are common, the alternative is likely faster.

In the code below I've applied the optimizations I did in my 8-bit routines to your 16-bit routine. I believe it's quite a bit faster, but it also has more restrictions.

; Restrictions:
; - The to-be-sorted buffer must have a start- and an end-sentinel:
;    - start-sentinel must be aligned at 256-bytes, it can have any value
;      the size of the sentinel itself can be 8 or 16 bit (current code
;      assumes 16 bit).
;    - end-sentinel is 16 bit and must have the value #FFFF (or at least
;      it must be bigger or equal than the biggest to-be-sorted element).
; - The buffer may at most contain 128 (16-bit) elements. But for such
;   big buffers you anyway shouldn't use this O(N^2) algorithm.
;
Sort16:	LD   (StorSP),SP ; Save SP
	LD   HL,DataE-4  ; 2nd-to-last element in buffer
	JP   Start       ;

LoopO:	EXX              ; At this point sub-range [HL, BufE) is sorted
	DEC  L           ; Extend sub-range, the new element is (possibly)
	DEC  L           ;   not yet in the correct position
	JR   Z,Exit      ; If low addr byte == 0 (the start-sentinel), we're done
Start:	LD   SP,HL       ;
	EXX              ;

LoopI	POP  DE          ; DE <- (SP) ; SP += 2    DE = new  element
	POP  HL          ; HL <- (SP) ; SP += 2    HL = next element
	OR   A           ; clear carry flag
	SBC  HL,DE       ; next >= new -> new element is in correct position
	JR   NC,LoopO    ; because of the end-sentinel, this loop always terminates
	ADD  HL,DE       ; restore value of 'next element'
	PUSH DE          ; Store new and
	PUSH HL          ;       next element in different order
	POP  HL          ; SP += 2 (the POP'ed value doesn't matter)
	JP   LoopI       ;

Exit:	LD SP,0          ; restore SP
StorSP: equ $-2          ;
	RET              ;

	.align 256
	dw #DEAD         ; Start-sentinel, can have any value, is only used to
	                 ; align the start address at #xx02. Aligning at #xx01
	                 ; is also possible and even gives a slightly faster
	                 ; exit-test (but 2 byte aligned is more natural).
DataS:	dw 0,9,2,7,6,2,8,1,1,4
DataE:  dw #FFFF         ; End-sentinel, must have the maximum 16-bit value.

By Heca

Rookie (21)

Аватар пользователя Heca

29-07-2011, 00:40

I think I would try to postpone all the push pop operationsbby just registering the number of iterations inside LoopI.
(It would save 2 push, 2 pop, and the ADD HL,DE)
Then I'll update the data in one row (maybe with LDIR)

By wouter_

Champion (470)

Аватар пользователя wouter_

29-07-2011, 22:12

I think I would try to postpone all the push pop operations by just registering the number of iterations inside LoopI (it would save 2 push, 2 pop, and the ADD HL,DE). Then I'll update the data in one row (maybe with LDIR)
Seems like an excellent idea. Below is the best routine I could come up with that uses your idea.

Sort16:	LD   (StorSP),SP ; Save SP
	LD   IX,DataE-4  ; 2nd-to-last element in buffer
	LD   D,IXH       ; prepare the high bytes of the registers
	LD   H,D         ;   BC, DE, HL for the LDIR instruction below, these
	LD   B,0         ;   values don't change for the duration of this routine
	EXX              ;
	LD   B,2         ; value doesn't change during this routine

LoopO:	LD   SP,IX       ;
	XOR  A           ; count 2 x "how many times loop 'LoopI' gets executed"
	POP  DE          ; DE <- (SP) ; SP += 2    DE = new  element
LoopI:	POP  HL          ; HL <- (SP) ; SP += 2    HL = next element
	ADD  A,B         ; A += 2   clear carry flag
	SBC  HL,DE       ; next >= new -> new element is in correct position
	JP   C,LoopI     ; because of the end-sentinel, this loop always terminates
	SUB  B           ; 'LoopI' only executed once? -> new element already
	JR   Z,Next      ;     in correct position, no need for LDIR
	EXX              ;
	LD   C,A         ; BC = number of bytes to copy
	LD   E,IXL       ; DE = destination addr (position of new element)
	LD   L,E         ;
	INC  L           ;
	INC  L           ; HL = DE + 2
	LDIR             ; Shift block 2 bytes backwards
	LD   SP,HL       ; HL points right after last copied element
	EXX              ;   (prev position of SP was 1 element too high)
	PUSH DE          ; store new element at correct position
Next	DEC  IXL         ;
	DEC  IXL         ; extend sub-range with a new element
	JP   NZ,LoopO    ; new element has low-byte = 0 -> we're done

Exit:	LD SP,0          ; restore SP
StorSP: EQU $-2          ;
	RET              ;

	.align 256
	dw #DEAD         ; Start-sentinel, can have any value, is only used to
	                 ; align the start address at #xx02. Aligning at #xx01
	                 ; is also possible and even gives a slightly faster
	                 ; exit-test (but 2 byte aligned is more natural).
DataS:	dw 0,9,2,7,6,2,8,1,1,4
DataE:  dw #ffff         ; End-sentinel, must have the maximum 16-bit value.

However the timing tests I did are a bit disappointing. For small buffer sizes (~10 elements) my first routine was _only_slightly_ faster, for bigger buffer sizes (~30 elements) this new routine was _only_slightly_ faster (I only did a limited amount of tests, and the result of course also depends on the original order of the elements). For still bigger buffer sizes a different algorithm with better big-O complexity is likely faster.

The reason why this new routine is not spectacularly better (as I originally expected) is that LDIR is a relatively slow instruction compared to PUSH/POP: it takes 23(!) cycles to copy 1 byte, so 46 cycles for a 16-bit value. But that's exactly the same amount of time to execute 2 PUSH and 2 POP instructions. So you don't gain much per iteration of the inner loop. And you also loose some cycles in the more complex outer loop. So in the end you don't gain much speed, and even loose some speed when the inner loop doesn't iterate many times (as is the case for small buffers or for 'almost sorted' input).

By Heca

Rookie (21)

Аватар пользователя Heca

29-07-2011, 23:18

Hum...
What if instead LDIR you push back the data ???
Something like that:

LD B,1 ; value doesn't change during this routine

LoopO:
LD SP,IX ;
XOR A ; count 2 x "how many times loop 'LoopI' gets executed"
POP DE ; DE <- (SP) ; SP += 2 DE = new element

LoopI:
POP HL ; Loop while HL < DE
ADD A,B ;
SBC HL,DE ;
JR C,LoopI ;

LD B,A
DJNZ Next ;Element already in position

DEC SP ;
DEC SP ;
EX DE,HL

LoopU:
DEC SP ; while b > 0
DEC SP ;
EX (SP),HL ;
DJNZ LoopU ;
...
Next:

LoopU is not faster than LDIR, but the take-off is faster than what you wrote for LDIR.

By wouter_

Champion (470)

Аватар пользователя wouter_

30-07-2011, 12:05

Hum... What if instead LDIR you push back the data???

	...
LoopU:	
        DEC  SP
	DEC  SP
	EX   (SP),HL
	DJNZ LoopU
        ...	


Good idea, here's a routine with all the details worked out:

Sort16:	LD   (StorSP),SP ; Save SP
	LD   SP,DataE-4  ; 2nd-to-last element in buffer
	LD   BC,NElems-1 ; C = #elems-1   B = 0

Loop1:	XOR  A           ; Count how many times Loop2 gets executed
	INC  B           ; Slightly faster than 'LD B,1'
	POP  DE          ; DE <- (SP) ; SP += 2    DE = new  element
Loop2:	POP  HL          ; HL <- (SP) ; SP += 2    HL = next element
	ADD  A,B         ; A += 1 ('INC A' doesn't clear carry flag)
	SBC  HL,DE       ; Search for an element that's bigger or equal
	JP   C,Loop2     ; Because of the end-sentinel, this loop always terminates

	LD   B,A         ; number of elements to move backwards (incl new elem)
	DEC  SP          ; SP -= 4:  Go back to the element right before the
	DEC  SP          ;  bigger or equal element. This is the new position
	DEC  SP          ;  for the new element. Possibly this is the initial
	DEC  SP          ;  position of the new element.
	EX   DE,HL       ; HL = value of new element
Loop3:	EX   (SP),HL     ; Swap values
	DEC  SP          ;  and move backwards 
	DEC  SP          ;  till we've reached the position right
	DJNZ Loop3       ;  before the new elements initial position

	DEC  C           ; Done sorting one more element
	JP   NZ,Loop1    ; Continue if there are still elements left

Exit:	LD SP,0          ; restore SP
StorSP: equ $-2          ;
	RET              ;

; Note: no need to align buffer, no need for start-sentinel
DataS:	dw 0,9,2,7,6,2,8,1,1,4
DataE:  dw #ffff         ; End-sentinel, must have the maximum 16-bit value.
NElems: equ (DataE-DataS)/2 ; number of elements, must be less than 256

The (limited) speed tests I did show that this new routine is a tiny bit faster than the variant that uses LDIR. For small buffer sizes (~15 elements or less) my original routine seems to be slightly faster. For bigger buffer sizes this routine seems faster. But all timing results are very close, biggest difference was around 10%.

This new routine also doesn't require an aligned buffer (or put another way: it can't benefit from one). It also leaves the shadow and index registers unchanged. Compared to NYYRIKKI's routines it only requires an end-sentinel and the number of to-be-sorted elements must be less than 256 (but again, for sorting that many elements you really should use an algorithm with better complexity).

By Heca

Rookie (21)

Аватар пользователя Heca

30-07-2011, 13:32

Good work and nice comments

By ARTRAG

Enlighted (6681)

Аватар пользователя ARTRAG

05-09-2011, 19:12

what about Comb sort ? It is derived by Boubble sort but faster

http://en.wikipedia.org/wiki/Comb_sort

Страница 3/5
1 | 2 | | 4 | 5