SymbOS MSX multitasking operating system - help needed!

Page 315/398

By PingPong

Prophet (3885)

11-01-2014, 08:13

Prodatron wrote:

Hi PingPong!

PingPong wrote:

Are you sending character byte data via VDP logical pixel-based commands, or via direct VRAM access by reading/masking/writing bytes in VRAM?

I use the latter, so I write complete bytes into the VRAM. More specifically: for each char I first write one 8x? rectangle (? = height of the font) into a non-visible area of the VRAM and then let the VDP copy the char with its individual width to its position on the screen (pixel-based). This should be faster than writing the char pixel-wise directly to its destination, as in that case the Z80 would have to do 2x or 4x as many OUTs.
But I hope there is still room for optimizations...

The worst thing is that you probably need to set the VRAM pointer for EVERY line of the char. This is the true weak point of MSX VRAM access.
Maybe you do not know that while a VDP command is in progress, you can still do direct VRAM access. Maybe you can use two different non-visible areas:
1) write to area (A)
2) issue the VDP command (pixel-based) with area (A) as source
3) while (2) is still processing, write to area (B) (of course another VRAM region!)
4) issue the VDP command (pixel-based), now with area (B) as source
5) GOTO (1)

I also guess that you cannot accept either of these limitations when plotting characters:
a) byte-aligned X coordinate
b) lack of transparency emulation

Otherwise, you could do a box copy by telling the VDP the x*y char size (in bytes) and sending your font bytes to the VDP color register. This should be faster.

By Prodatron

Paragon (1808)

11-01-2014, 14:32

It seems that switching between two hidden rectangles is not necessary, as the VDP is fast enough to copy one rectangle pixel-wise to its destination while the Z80 is preparing the next char.
Btw, for byte-wise VRAM access I use VDP command #F0 (Highspeed Put Bytes, CPU->VRAM). So for an 8x8 char with 16 colours I do 4x8 OUTs + command header. Would direct VRAM access be faster? I was sure it wouldn't be, because of the VRAM line pointer you mentioned.
I will try out another optimization now: as SymbOS uses proportional fonts, many chars have a width of only 4 or 6 pixels, so it's not necessary to always copy 4 bytes/line into the VRAM, but only 2 or 3 bytes/line. I hope this will already speed up the text output of alternative fonts in a noticeable way.
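
To put rough numbers on that optimization, here is a small sketch in Python (it is not SymbOS code). It assumes that 16 colours means 4 bits per pixel, i.e. two pixels per byte, which matches the 4x8 OUTs quoted above for an 8x8 char. The function names are made up for illustration.

```python
BITS_PER_PIXEL = 4  # assumption: 16-colour mode, two pixels per byte


def bytes_per_row(width_px: int) -> int:
    """Bytes needed to hold one row of a char that is width_px pixels wide."""
    return (width_px * BITS_PER_PIXEL + 7) // 8  # round up to whole bytes


def data_outs(width_px: int, height_px: int) -> int:
    """Data OUTs per char, not counting the command header."""
    return bytes_per_row(width_px) * height_px


if __name__ == "__main__":
    for width in (4, 6, 8):
        print(width, bytes_per_row(width), data_outs(width, 8))
```

Under these assumptions a 4-pixel-wide char needs half the data OUTs of an 8-pixel-wide one, while the command header cost stays the same.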

PingPong wrote:

I also guess that you cannot accept either of these limitations when plotting characters:
a) byte-aligned X coordinate
b) lack of transparency emulation

No, neither is an option. I need at least pixel-aligned X coordinates because of the proportional font.

By PingPong

Prophet (3885)

11-01-2014, 17:22

Prodatron wrote:

Btw, for byte-wise VRAM access I use VDP command #F0 (Highspeed Put Bytes, CPU->VRAM). So for an 8x8 char with 16 colours I do 4x8 OUTs + command header. Would direct VRAM access be faster? I was sure it wouldn't be, because of the VRAM line pointer you mentioned.

Umh, I'm not sure my explanation of direct VRAM access was clear. What do I mean by "direct VRAM access"? I mean the ability to access the VRAM by raw address. For example, with HL holding a 14-bit VRAM address and A holding the byte to write, one could write a byte with these instructions (assuming C holds port 0x99):

out (c),l ; output the low address byte
set 6,h ; set bit 6 to indicate a VRAM write
out (c),h ; output the high six bits of the VRAM address, with bit 6 = 1 (requesting a VRAM write)
nop ; give the VDP time
nop
nop
nop
out (0x98),a ; send the byte

So compare the two methods:
(A) Direct VRAM access
From now on, every out (0x98) writes to the next address, so if you need to write 3 consecutive bytes you simply add more writes to port 0x98. The problem arises when you need to write another line of the char font, because you need to reset the VRAM pointer to the beginning of that character line. Your solution (high-speed move) avoids this, because you tell the VDP the box size and then write all the data bytes to the color register. But you also have an overhead: for each character you have to write the command header. Practically, for a char of 3 bytes x 8 rows with direct access you have:
2 x out (c) instructions + set instruction + 4 nops -> 2*13+2*10+4*5=66 t-states (one can do better than my example)
3 x out (0x98) instructions -> 3*12=36 t-states
So every line is 66+36=102 t-states.
If you have 8 rows, you need 102*8 -> 816 t-states for one char.

(B) #F0 command
By contrast, with #F0 you write 8*3 -> 24 bytes, and assuming you need 13 t-states per byte you end up with 312 t-states, plus the overhead of setting 15 registers with OUTI (I guess) -> 15*18 = 270 t-states.

Summarizing, you will need 312+270 = 582 t-states...

I'm not totally sure of the math, but I think your method is better...
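
The comparison above can be replayed for other char widths with a short Python sketch. All per-instruction costs are the figures quoted in this post, taken as-is and not re-measured; the function and parameter names are invented for the example.

```python
def direct_access_tstates(bytes_per_row, rows, nops=4):
    """Method (A): reset the VRAM pointer for every row, then write the bytes."""
    # Per-row pointer setup as counted in the post: 2*13 + 2*10 + nops*5
    pointer_setup = 2 * 13 + 2 * 10 + nops * 5
    data = bytes_per_row * 12          # one out (0x98),a per byte
    return rows * (pointer_setup + data)


def command_f0_tstates(bytes_per_row, rows, per_byte=13, header_regs=15, outi=18):
    """Method (B): one command header, then all bytes go to the color register."""
    return bytes_per_row * rows * per_byte + header_regs * outi


if __name__ == "__main__":
    for width_bytes in (2, 3, 4):
        print(width_bytes,
              direct_access_tstates(width_bytes, 8),
              command_f0_tstates(width_bytes, 8))
```

For 3 bytes x 8 rows this gives the same 816 and 582 t-states as the hand calculation. The `nops` parameter is only there to see how much the delay instructions contribute to method (A).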

By edoz

Prophet (2437)

12-01-2014, 11:41

Thanks for the explanation of how SymbOS works! I think it is sometimes complex to get it all working together!
I like those new fonts in SymbOS! I know I said it before, but I'm still surprised about the speed etc. And the development part in SymStudio is also easy to use.

By hit9918

Prophet (2921)

12-01-2014, 15:00

@Prodatron,
The Z80 doing a plain copy is faster than the blitter doing LMMM.
I would flip between two buffers to be safe.

By hit9918

Prophet (2921)

12-01-2014, 15:08

And the rectangular transfer to VRAM uses the blitter.
For parallel processing, the CPU has to use VRAM access through port 0x98.

By hit9918

Prophet (2921)

12-01-2014, 15:37

A code sketch for writing a rectangle to VRAM:

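	;TODO: set up the 16K VRAM access page first
	;VRAM_DEST, FONT_DATA, CHAR_BYTES and CHAR_ROWS are placeholder names

	exx
	ld hl,VRAM_DEST+0x4000	;VRAM address, bit 6 of H set = VRAM write mode
	ld de,256	;length of one screen line in bytes
	ld c,0x99	;VRAM setup port
	exx
	ld hl,FONT_DATA	;RAM address of the char data
	ld e,CHAR_BYTES	;how many bytes the char is wide
	ld d,CHAR_ROWS	;height of the char
	ld c,0x98	;VRAM write port
loop:
	exx
	out (c),l
	out (c),h	;set up the VRAM address for this row
	add hl,de	;move HL down one screen line
	exx
	ld b,e
	otir	;write one row of bytes
	dec d
	jp nz,loop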
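
To check the address handling of such a loop on paper, here is a small Python model of the two bytes that go to port 0x99 for each row. It assumes 256 bytes per screen line and bit 6 of the high byte as the write flag, as in the sketch; the helper name is hypothetical. It also rejects a rectangle that would run past the 16K page, because the high byte would then no longer hold a valid 14-bit address plus write flag.

```python
LINE_BYTES = 256    # bytes per screen line, as in the sketch above
WRITE_FLAG = 0x40   # bit 6 of the high byte requests a VRAM write
PAGE_SIZE = 0x4000  # the pointer covers one 16K page


def row_setup_bytes(vram_addr, rows):
    """Return the (low, high) byte pairs sent to port 0x99, one pair per row."""
    pairs = []
    for row in range(rows):
        addr = vram_addr + row * LINE_BYTES
        if not 0 <= addr < PAGE_SIZE:
            raise ValueError("row %d leaves the 16K page" % row)
        pairs.append((addr & 0xFF, (addr >> 8) | WRITE_FLAG))
    return pairs


if __name__ == "__main__":
    for low, high in row_setup_bytes(0x1234, 3):
        print(hex(low), hex(high))
```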

By hit9918

Prophet (2921)

12-01-2014, 16:00

@PingPong,
in the code you posted, the NOP delay after the VRAM setup is only needed for VRAM reads,
because there time must pass for the VDP to hit an access slot before the first IN 0x98.
On MSX2, one NOP is enough.

By Prodatron

Paragon (1808)

12-01-2014, 17:32

Thanks for the information and the sample code! I wonder how long (in microseconds) it takes the VDP to copy an 8x8 square pixel-wise. There is some Z80 overhead between the VDP commands, so I am not sure whether the Z80 sometimes has to wait for the VDP or not. If I knew the exact time for the copy process, I could calculate whether it makes sense to switch to two buffers + plain copy.

By hit9918

Prophet (2921)

12-01-2014, 19:03

You don't need to calculate timings.

It goes like this:

1. The CPU draws into a buffer.
At the same time, the blitter may still be busy pasting the other buffer.

2. Wait for the blitter status bit to signal finish.
Launch the blit command.

The pattern is that one checks for blitter finish not at the end of a blit, but at the beginning of the next blit.
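
A toy timing model in Python shows why checking at the start of the next blit pays off. It is only a sketch: the time units and the two cost parameters are arbitrary assumptions, and the single-buffer case is modelled pessimistically (the CPU waits before touching the buffer again, ignoring any preparation work that does not touch it).

```python
def total_time(n_chars, cpu_fill, blit, buffers):
    """Time until the last char is on screen, in arbitrary units.

    cpu_fill: time the CPU needs to write one char into a hidden buffer
    blit:     time the blitter needs to paste one buffer onto the screen
    buffers:  1 = one hidden buffer, 2 = two buffers used alternately
    """
    if buffers not in (1, 2):
        raise ValueError("buffers must be 1 or 2")
    now = 0        # CPU time
    blit_done = 0  # moment the blitter becomes idle again
    for _ in range(n_chars):
        if buffers == 1:
            # The only buffer may still be the source of the running blit.
            now = max(now, blit_done)
        now += cpu_fill            # 1. the CPU draws into a buffer
        now = max(now, blit_done)  # 2. wait for blitter finish ...
        blit_done = now + blit     #    ... then launch the blit
    return max(now, blit_done)


if __name__ == "__main__":
    print(total_time(10, cpu_fill=100, blit=80, buffers=1))
    print(total_time(10, cpu_fill=100, blit=80, buffers=2))
```

With these made-up costs, ten chars take 1800 units with one buffer and 1080 with two, because the blits run while the CPU is already filling the other buffer.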

To have that pattern while bringing in parallel CPU rendering more generally, all rectangles of the GUI would need an ID.

But in this special case of font printing, where every call toggles to the other buffer, the ID comparison is not needed.

But if one were bringing in parallel CPU work more generally, there would be a variable "which ID has the blitter been working on recently".

Then the blitter is dealt with like this:
When you want to start painting with the CPU:
if the ID is not the same as the one the blitter has been working on, you can go ahead without waiting for the blitter.

What exactly gets its own ID is not so clear.
I guess you have plenty of clip regions.
But handing out IDs per window would be enough.
But special buffers, like the two buffers for rendering chars, get an ID too.

And then some graphics functions done with the blitter could have a CPU version.
For example "clear rectangle":
if the blitter is still busy
and if it is working on a different buffer ID,
then use the CPU version.

Then parallel processing makes gfx faster :)
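
The ID idea can be written down as plain decision logic. The following Python sketch is only an illustration of the rules described in this post; the `Blitter` record, the function names and the returned strings are all invented here, not part of SymbOS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Blitter:
    busy: bool = False
    last_id: Optional[int] = None  # ID of the buffer/window it worked on most recently


def cpu_may_start(blitter: Blitter, target_id: int) -> bool:
    """True if the CPU can paint into target_id without waiting for the blitter."""
    return not (blitter.busy and blitter.last_id == target_id)


def plan_clear_rectangle(blitter: Blitter, target_id: int) -> str:
    """Pick the blitter or the CPU version of "clear rectangle"."""
    if not blitter.busy:
        return "blitter"
    if blitter.last_id != target_id:
        return "cpu"  # blitter is busy elsewhere, so work in parallel
    return "wait_then_blitter"  # same target: wait for the running blit first


if __name__ == "__main__":
    print(plan_clear_rectangle(Blitter(busy=True, last_id=2), target_id=1))
```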
