# Assembler: 16-bit bubble sort


@wouter, can you explain the db #01 and the remark that it would be a LD BC,nn? Where is the nn defined, or is it really just there to skip the DEC L and the RET Z? But then why not just use a JP, as a JP has the same execution time (10 cycles), or am I missing something?
'LD BC,#xxyy' is a 3 byte instruction, it's encoded as 01 yy xx. By putting a 'db #01' in the code, the next 2 bytes are seen as part of the 'LD BC,#xxyy' instruction and so they won't get executed as individual instructions. Both 'LD BC,#xxyy' and 'JP #xxyy' take 11 cycles to execute (not 10). From a performance point of view there's no advantage in using this trick, but it does save 2 bytes on code size.
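To make the trick concrete, here is a small Python sketch (my illustration, not real disassembler code; the opcode tables cover only the handful of instructions needed) showing how a stray #01 byte makes the decoder treat the next two bytes as the nn operand of 'LD BC,nn' instead of executing them:

```python
# Hypothetical sketch of the 'db #01' skip trick.  The length/name tables
# below cover only the opcodes needed for this example.

def decode(stream):
    """Return the mnemonics actually executed for a flat byte stream."""
    lengths = {0x01: 3, 0x2D: 1, 0xC8: 1, 0xC9: 1}  # LD BC,nn / DEC L / RET Z / RET
    names   = {0x01: "LD BC,nn", 0x2D: "DEC L", 0xC8: "RET Z", 0xC9: "RET"}
    executed, pc = [], 0
    while pc < len(stream):
        op = stream[pc]
        executed.append(names[op])
        pc += lengths[op]          # a 3-byte opcode swallows the next 2 bytes
    return executed

# Without the stray #01, DEC L and RET Z execute normally:
print(decode([0x2D, 0xC8, 0xC9]))        # ['DEC L', 'RET Z', 'RET']
# With 'db #01' in front they become the nn operand and are skipped:
print(decode([0x01, 0x2D, 0xC8, 0xC9]))  # ['LD BC,nn', 'RET']
```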

I never actually looked into it, but it seems that a JR takes more cycles than a JP, which is strange?
JP always takes 11 cycles to execute, whether the (conditional) jump is actually taken or not. JR takes 8 cycles when the jump is not taken, but 13 cycles when the jump is taken. So for often taken jumps (e.g. also unconditional jumps) JP is faster than JR, for often not-taken jumps, JR is faster. When the chance for taken/non-taken is 50/50 JR has (on average) a small advantage over JP.
The situation is different on R800, there JR is always faster than JP (JR = 2 or 3 cycles, JP = 3 or 5 cycles).
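The break-even point is easy to work out with a little Python, using the MSX-adjusted cycle counts quoted above (JP = 11 always, JR = 8 not taken / 13 taken):

```python
# Average cost per jump as a function of the probability p that the
# (conditional) jump is taken, using the cycle counts quoted above.

def avg_jp(p):
    return 11.0                      # JP: 11 cycles, taken or not

def avg_jr(p):
    return 8 * (1 - p) + 13 * p      # JR: 8 not taken, 13 taken

# Solving 8 + 5p = 11 gives the break-even point p = 0.6:
print(avg_jr(0.5))   # 10.5 -> JR wins slightly at 50/50
print(avg_jr(0.6))   # ~11.0 -> the break-even point
print(avg_jr(1.0))   # 13.0 -> JP wins for always-taken jumps
```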


According to the Zilog Z80 manual (see below) the JP instruction should take 10 T-states, not 11?! Where did you find the number of cycles for the JP instruction?
For some reason there seems to be some deviation between our documentation?

www.zilog.com/docs/z80/um0080.pdf

Ok, "LD BC,#1234" does indeed take less memory than the JP instruction ;)


Your documentation is correct: a JP instruction takes 10 cycles on a Z80. However, in an MSX computer, there is a circuit that adds 1 or 2 extra wait cycles per instruction (one per M1-cycle: very roughly this means 1 cycle for most instructions, 2 cycles for instructions that start with a CB, DD, ED or FD prefix). I don't know the details, but I've been told this extra delay is related to RAM refresh.
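A rough sketch of that rule in Python (my model of the per-M1 wait described above; base T-states are from the Zilog manual, and the sample instructions are just for illustration):

```python
# MSX adds one wait cycle per M1 (opcode fetch) machine cycle.
# Unprefixed instructions have one M1 cycle, prefixed ones have two.

def msx_cycles(base_tstates, m1_cycles):
    return base_tstates + m1_cycles

samples = {
    "JP nn":    (10, 1),   # unprefixed: 1 M1 cycle -> 11 on MSX
    "LD BC,nn": (10, 1),   # also 11 on MSX
    "BIT 0,A":  (8,  2),   # CB-prefixed: 2 M1 cycles -> 10 on MSX
}
for name, (base, m1) in samples.items():
    print(f"{name}: {base} -> {msx_cycles(base, m1)}")
```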

@NYYRIKKI: I believe there's a bug in your 16-bit sort routines: to compare 2 16-bit values you use a sequence like this:

```
	LD   A,D
	CP   H
	JR   C,.SWAP
	LD   A,E
	CP   L
	JR   NC,.LOOPO
```

But that's wrong: the lower 8 bits should only be compared when the upper 8 bits are equal (not when they are smaller). A possible fix is to add 'JP NZ,.LOOPO' right before 'LD A,E'.
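The difference is easy to demonstrate in Python (my model of the flag logic; the test values are chosen only for illustration). Here 'True' means the routine decides the new element DE is smaller than HL and must be swapped:

```python
# Model of the comparison logic above: 'True' means the routine decides
# DE (the new element) is smaller than HL and must be swapped.

def buggy_less_than(de, hl):
    d, e = de >> 8, de & 0xFF
    h, l = hl >> 8, hl & 0xFF
    if d < h:                # CP H -> carry -> .SWAP
        return True
    # Bug: falls through to the low-byte compare even when d > h.
    return e < l             # CP L -> carry -> swap

def fixed_less_than(de, hl):
    d, e = de >> 8, de & 0xFF
    h, l = hl >> 8, hl & 0xFF
    if d != h:               # the suggested 'JP NZ' fix: low bytes only
        return d < h         #   matter when the high bytes are equal
    return e < l

# #0205 > #0109, so no swap should happen, yet the buggy version swaps:
print(buggy_less_than(0x0205, 0x0109))  # True  (wrong)
print(fixed_less_than(0x0205, 0x0109))  # False (correct)
```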

I've implemented an alternative 16-bit comparison in the routine below. Depending on the type of input data the original or the alternative comparison will be faster: if values with an equal high byte are rare, the original comparison is likely faster. If equal high bytes are common, the alternative is likely faster.

In the code below I've applied the optimizations I did in my 8-bit routines to your 16-bit routine. I believe it's quite a bit faster, but it also has more restrictions.

```
; Restrictions:
; - The to-be-sorted buffer must have a start- and an end-sentinel:
;    - start-sentinel must be aligned at 256 bytes, it can have any value
;      the size of the sentinel itself can be 8 or 16 bit (current code
;      assumes 16 bit).
;    - end-sentinel is 16 bit and must have the value #FFFF (or at least
;      it must be bigger than or equal to the biggest to-be-sorted element).
; - The buffer may at most contain 128 (16-bit) elements. But for such
;   big buffers you anyway shouldn't use this O(N^2) algorithm.
;
Sort16:	LD   (StorSP),SP ; Save SP
	LD   HL,DataE-4  ; 2nd-to-last element in buffer
	JP   Start       ;

LoopO:	EXX              ; At this point sub-range [HL, BufE) is sorted
	DEC  L           ; Extend sub-range, the new element is (possibly)
	DEC  L           ;   not yet in the correct position
	JR   Z,Exit      ; If low addr byte == 0 (the start-sentinel), we're done
Start:	LD   SP,HL       ;
	EXX              ;

LoopI:	POP  DE          ; DE <- (SP) ; SP += 2    DE = new  element
	POP  HL          ; HL <- (SP) ; SP += 2    HL = next element
	OR   A           ; clear carry flag
	SBC  HL,DE       ; next >= new -> new element is in correct position
	JR   NC,LoopO    ; because of the end-sentinel, this loop always terminates
	ADD  HL,DE       ; restore value of 'next element'
	PUSH DE          ; Store new and
	PUSH HL          ;       next element in different order
	POP  HL          ; SP += 2 (the POP'ed value doesn't matter)
	JP   LoopI       ;

Exit:	LD   SP,0        ; restore SP
StorSP:	equ  $-2         ;
	RET              ;

	.align 256
	dw   #DEAD       ; Start-sentinel, can have any value, is only used to
	                 ;   align the start address at #xx02. Aligning at #xx01
	                 ;   is also possible and even gives a slightly faster
	                 ;   exit-test (but 2-byte aligned is more natural).
DataS:	dw 0,9,2,7,6,2,8,1,1,4
DataE:	dw #FFFF         ; End-sentinel, must have the maximum 16-bit value.
```
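For reference, here is a rough Python model of what the routine above does (my sketch, not a cycle-accurate translation: the start-sentinel/alignment test is replaced by an ordinary index check, the #FFFF end-sentinel is kept):

```python
# Rough Python model of the stack-based routine above: grow a sorted
# tail and bubble each new element forward by adjacent swaps until a
# bigger-or-equal element is found.  Values must stay below #FFFF.

def sort16(data):
    buf = list(data) + [0xFFFF]           # end-sentinel
    for start in range(len(data) - 2, -1, -1):
        i = start                          # new element enters at 'start'
        while buf[i + 1] < buf[i]:         # carry from SBC HL,DE -> swap
            buf[i], buf[i + 1] = buf[i + 1], buf[i]
            i += 1
    return buf[:-1]                        # drop the sentinel

print(sort16([0, 9, 2, 7, 6, 2, 8, 1, 1, 4]))
```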

I think I would try to postpone all the push/pop operations by just registering the number of iterations inside LoopI.
(It would save 2 push, 2 pop, and the ADD HL,DE)
Then I'll update the data in one row (maybe with LDIR)

Seems like an excellent idea. Below is the best routine I could come up with that uses your idea.

```
Sort16:	LD   (StorSP),SP ; Save SP
	LD   IX,DataE-4  ; 2nd-to-last element in buffer
	LD   D,IXH       ; prepare the high bytes of the registers
	LD   H,D         ;   BC, DE, HL for the LDIR instruction below, these
	LD   B,0         ;   values don't change for the duration of this routine
	EXX              ;
	LD   B,2         ; value doesn't change during this routine

LoopO:	LD   SP,IX       ;
	XOR  A           ; count 2 x "how many times loop 'LoopI' gets executed"
	POP  DE          ; DE <- (SP) ; SP += 2    DE = new  element
LoopI:	POP  HL          ; HL <- (SP) ; SP += 2    HL = next element
	ADD  A,B         ; A += 2   clear carry flag
	SBC  HL,DE       ; next >= new -> new element is in correct position
	JP   C,LoopI     ; because of the end-sentinel, this loop always terminates
	SUB  B           ; 'LoopI' only executed once? -> new element already
	JR   Z,Next      ;     in correct position, no need for LDIR
	EXX              ;
	LD   C,A         ; BC = number of bytes to copy
	LD   E,IXL       ; DE = destination addr (position of new element)
	LD   L,E         ;
	INC  L           ;
	INC  L           ; HL = DE + 2
	LDIR             ; Shift block 2 bytes backwards
	LD   SP,HL       ; HL points right after last copied element
	EXX              ;   (prev position of SP was 1 element too high)
	PUSH DE          ; store new element at correct position
Next:	DEC  IXL         ;
	DEC  IXL         ; extend sub-range with a new element
	JP   NZ,LoopO    ; new element has low byte = 0 -> we're done

Exit:	LD   SP,0        ; restore SP
StorSP:	EQU  $-2         ;
	RET              ;

	.align 256
	dw   #DEAD       ; Start-sentinel, can have any value, is only used to
	                 ;   align the start address at #xx02. Aligning at #xx01
	                 ;   is also possible and even gives a slightly faster
	                 ;   exit-test (but 2-byte aligned is more natural).
DataS:	dw 0,9,2,7,6,2,8,1,1,4
DataE:	dw #FFFF         ; End-sentinel, must have the maximum 16-bit value.
```

However, the timing tests I did are a bit disappointing. For small buffer sizes (~10 elements) my first routine was _only_slightly_ faster; for bigger buffer sizes (~30 elements) this new routine was _only_slightly_ faster (I only did a limited number of tests, and the result of course also depends on the original order of the elements). For still bigger buffer sizes a different algorithm with better big-O complexity is likely faster.

The reason why this new routine is not spectacularly better (as I originally expected) is that LDIR is a relatively slow instruction compared to PUSH/POP: it takes 23(!) cycles to copy 1 byte, so 46 cycles for a 16-bit value. But that's exactly the same amount of time as executing 2 PUSH and 2 POP instructions. So you don't gain much per iteration of the inner loop. And you also lose some cycles in the more complex outer loop. So in the end you don't gain much speed, and even lose some speed when the inner loop doesn't iterate many times (as is the case for small buffers or for 'almost sorted' input).
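The MSX-adjusted cycle counts behind this argument can be checked with a few lines of Python (assuming base Z80 timings plus one wait per M1 cycle, two for the ED prefix, as discussed earlier):

```python
# MSX-adjusted costs for moving one 16-bit element.
LDIR_PER_BYTE = 23          # 21 + 2 waits (ED-prefixed)
PUSH, POP = 12, 11          # 11 + 1 and 10 + 1 wait

ldir_cost    = 2 * LDIR_PER_BYTE     # move one 16-bit element with LDIR
pushpop_cost = 2 * PUSH + 2 * POP    # 2x PUSH + 2x POP per inner iteration

print(ldir_cost, pushpop_cost)       # 46 46 -> no gain per iteration
```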

Hum...
What if instead of LDIR you push back the data???
Something like that:

```
	LD   B,1         ; value doesn't change during this routine

LoopO:	LD   SP,IX       ;
	XOR  A           ; count 2 x "how many times loop 'LoopI' gets executed"
	POP  DE          ; DE <- (SP) ; SP += 2    DE = new element

LoopI:	POP  HL          ; Loop while HL < DE
	SBC  HL,DE       ;
	JR   C,LoopI     ;

	LD   B,A         ;
	DJNZ Next        ; Element already in position

	DEC  SP          ;
	DEC  SP          ;
	EX   DE,HL       ;

LoopU:	DEC  SP          ; while B > 0
	DEC  SP          ;
	EX   (SP),HL     ;
	DJNZ LoopU       ;
	...
Next:
```

LoopU is not faster than LDIR, but the take-off is faster than what you wrote for LDIR.
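With the same MSX-adjusted timings as before (base T-states plus one wait per M1 cycle, my assumption), the per-element arithmetic works out like this:

```python
# Per-element cost of the EX (SP),HL loop versus LDIR.
DEC_SP        = 6 + 1    # 7
EX_SP_HL      = 19 + 1   # 20
DJNZ_TAKEN    = 13 + 1   # 14
LDIR_PER_BYTE = 23       # 21 + 2 waits (ED-prefixed)

loop_u = 2 * DEC_SP + EX_SP_HL + DJNZ_TAKEN   # one 16-bit element: 48
ldir   = 2 * LDIR_PER_BYTE                    # one 16-bit element: 46

print(loop_u, ldir)   # the loop itself is no faster; the win is in
                      #   the cheaper setup before the loop
```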


Good idea, here's a routine with all the details worked out:

```
Sort16:	LD   (StorSP),SP ; Save SP
	LD   SP,DataE-4  ; 2nd-to-last element in buffer
	LD   BC,NElems-1 ; C = #elems-1   B = 0

Loop1:	XOR  A           ; Count how many times Loop2 gets executed
	INC  B           ; Slightly faster than 'LD B,1'
	POP  DE          ; DE <- (SP) ; SP += 2    DE = new  element
Loop2:	POP  HL          ; HL <- (SP) ; SP += 2    HL = next element
	ADD  A,B         ; A += 1 ('INC A' doesn't clear carry flag)
	SBC  HL,DE       ; Search for an element that's bigger or equal
	JP   C,Loop2     ; Because of the end-sentinel, this loop always terminates

	LD   B,A         ; number of elements to move backwards (incl new elem)
	DEC  SP          ; SP -= 4:  Go back to the element right before the
	DEC  SP          ;   bigger-or-equal element. This is the new position
	DEC  SP          ;   for the new element. Possibly this is the initial
	DEC  SP          ;   position of the new element.
	EX   DE,HL       ; HL = value of new element
Loop3:	EX   (SP),HL     ; Swap values
	DEC  SP          ;   and move backwards
	DEC  SP          ;   till we've reached the position right
	DJNZ Loop3       ;   before the new element's initial position

	DEC  C           ; Done sorting one more element
	JP   NZ,Loop1    ; Continue if there are still elements left

Exit:	LD   SP,0        ; restore SP
StorSP:	equ  $-2         ;
	RET              ;

; Note: no need to align buffer, no need for start-sentinel
DataS:	dw 0,9,2,7,6,2,8,1,1,4
DataE:	dw #FFFF         ; End-sentinel, must have the maximum 16-bit value.
NElems:	equ (DataE-DataS)/2 ; number of elements, must be less than 256
```
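A rough Python model of this final routine (my sketch: Loop2 counts how far the new element must travel, Loop3 then rotates it into place):

```python
# Python model of the count-then-rotate routine above.
# Values must stay below the #FFFF end-sentinel.

def sort16_final(data):
    buf = list(data) + [0xFFFF]                # end-sentinel
    for start in range(len(data) - 2, -1, -1):
        new = buf[start]
        n = 0                                   # Loop2: count smaller elements
        while buf[start + 1 + n] < new:
            n += 1
        for i in range(start, start + n):       # Loop3: shift them one slot back
            buf[i] = buf[i + 1]
        buf[start + n] = new                    # drop the new element into place
    return buf[:-1]

print(sort16_final([0, 9, 2, 7, 6, 2, 8, 1, 1, 4]))
```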

The (limited) speed tests I did show that this new routine is a tiny bit faster than the variant that uses LDIR. For small buffer sizes (~15 elements or less) my original routine seems to be slightly faster. For bigger buffer sizes this routine seems faster. But all timing results are very close, biggest difference was around 10%.

This new routine also doesn't require an aligned buffer (or put another way: it can't benefit from one). It also leaves the shadow and index registers unchanged. Compared to NYYRIKKI's routines it only requires an end-sentinel and the number of to-be-sorted elements must be less than 256 (but again, for sorting that many elements you really should use an algorithm with better complexity).