R800 DMA...

By NYYRIKKI

Enlighted (5365)

07-05-2019, 03:05

PingPong wrote:

I've thought this because the only company that made an MSX TR was Panasonic

... but this is not true... NIA-2001 was made by Takaoka.

Quote:

and its name was never a more logical name like MSX3, so in my mind it was only a tweak of the MSX2+ standard made by Panasonic.

If you meant "turbo R" as the name, then you might be right... I do not recall this name in ASCII documentation or inside the system software (BIOS etc).

To me it is clear that after MSX2, MSX3 was the obvious goal that everyone was trying to reach, but that never happened... First, in MSX2+ the VDP got a minor update, but the way I see it now is that Yamaha finally finished the design that they should have had ready already 3 years earlier.

Then the R800 became ready, but Yamaha was struggling with the new VDP design, so the turbo R became just another step in the "right direction". The V9990, a subset of the V9978, was announced in 1992, already 3 years after the R800 had been ready to ship. I guess if the V9978 had been finalized, the year would have been 1993 or 1994...

I bet ASCII didn't really want to publish these models before MSX3, but companies need to make money; they can't wait 10 years for the next model design to become ready. It was this constant fight against time that gave the final shot to the MSX3 standard... The world was already moving toward the 32-bit era, and yet ASCII & Yamaha had not managed to catch up even with the 16-bit machines. It was time to give up... Can't blame them though... There were no real 8-bit -> 16-bit survival stories, not to mention 8-bit -> 32-bit, as there was simply no time to adapt.

By Sandy Brand

Master (153)

08-05-2019, 00:29

PingPong wrote:

You are not seeing the real problem: this is not due to I/O port based access. It is mainly because the VDP stores sprite color information linked to the plane number instead of, more logically, to the pattern number.
This forces you to manipulate 16*32 bytes of VRAM instead of writing only a different pattern number into the SAT.

What I see is that, by only being able to write sprite attributes sequentially into VRAM, I now have to either sort my game objects to update them in the correct order, or use some dummy buffer to first build everything in RAM and then copy it into VRAM. This adds complexity and overhead.

And yes, like you said, the V9938's quirks make it even worse :(

Again, look at the tricks you can pull off on C64 with only 8 hardware sprites, just because you can directly access all sorts of VDP registers and VRAM. There is some amazing stuff being done on that machine.

PingPong wrote:

Even with random access you will be forced to do the same manipulation, and you will probably end up with a separate RAM buffer in order to perform the same operation without incurring tearing or any other sort of glitches,

Double buffering can (and probably should) be done in VRAM.
And, as a coder, I would like to choose where I can afford some space: RAM or VRAM.

PingPong wrote:

... Even if you perform random plane access by plane number (in order to touch only some sprite planes instead of the full SAT) the overhead is moderate, because to change an MSX2 sprite's colors you need 16 OUT operations with an overhead of some 2-4 OUTs to set the VRAM pointer. Again it does not change too much.

If you want to run your game at 60 frames per second, from my experience, all these bits of overhead will definitely add up :)
(btw, on Z80 LDIR and OUTI are equally slow).

PingPong wrote:

Memory copies of VDP registers will be necessary even with a memory-mapped scheme instead of port-based I/O.
Remember registers are not memory locations; they are not required to behave like standard memory locations, and it is common that they are write-only or read-only.

Fair enough, that indeed depends on the hardware. :)
However, with I/O ports this seems to be the most common approach.

PingPong wrote:

...plus disabling interrupts is not required unless you write data in an interrupt handler. The MSX INT 38h handler simply reads data from a register, and the VRAM pointer is not touched by this operation....

That is too much of a simplification. If I want to use both line and V-blank interrupts in my game, I will need to change VDP R15 in order to read either status register S0 or S1. So _ANY_ non-interrupt code accessing the VDP will need to disable interrupts to make it safe to write pairs of bytes to the VDP ports.
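
For illustration, a minimal Z80 sketch of that protection (the usual #99 command port assumed; just a sketch, not taken from any real game code):

di ; the two address OUTs must not be split by an interrupt that also touches #99
ld a,l
out (#99),a ; VRAM address bits 0-7
ld a,h
or #40 ; set the write bit
out (#99),a ; VRAM address bits 8-13 + write flag
ei
; ...now the data bytes can be streamed to port #98 with OUT/OUTI...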

PingPong wrote:

...again, it is not strictly a limit of the I/O or memory-mapped approach; it is instead a limit of the VDP. ...

Well, true, yes. It depends on the implementation of the hardware; theoretically you could expose many more VDP registers through different ports. For directly accessing VRAM though, this is impractical?

PingPong wrote:

Consider that VRAM access tends to be sequential by nature; in those situations the I/O VRAM pointer access does not make a huge difference.

Apart from sprites, which are a big part of games :)

PingPong wrote:

but when you write some sequential bytes (for example 4-8 bytes) for each VRAM pointer setup, the overhead becomes acceptable.

For writing 4 bytes into VRAM on MSX I need to write 8 bytes to output ports (the first 4 for setting the VRAM address).
That's a pretty bad ratio :/
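
To spell that worst case out, a minimal sketch (assuming an MSX2 where the R#14 bank bits also have to be set, ports #98/#99 as usual, and C already holding #98):

ld a,0 ; VRAM address bits 14-16
out (#99),a
ld a,#8E ; #80 + 14: write the value into register R#14
out (#99),a
ld a,l ; VRAM address bits 0-7
out (#99),a
ld a,h ; VRAM address bits 8-13
or #40 ; write access
out (#99),a ; 4 port writes just for the address...
outi ; ...and 4 more for the data
outi
outi
outi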

PingPong wrote:

And the bigger the block you write, the lighter this overhead is.

That is true.

Don't get me wrong, I/O ports also have their advantages :) They probably lead to more modularity in terms of hardware, and are 'cleaner'. Remember the early PCs where there were loads of gaps in the RAM address space because these were used by all sorts of hardware? That was very ugly and got very messy.

By PingPong

Prophet (3435)

08-05-2019, 11:16

Quote:
Sandy Brand wrote:
PingPong wrote:

You are not seeing the real problem: this is not due to I/O port based access. It is mainly because the VDP stores sprite color information linked to the plane number instead of, more logically, to the pattern number.
This forces you to manipulate 16*32 bytes of VRAM instead of writing only a different pattern number into the SAT.

What I see is that, by only being able to write sprite attributes sequentially into VRAM, I now have to either sort my game objects to update them in the correct order, or use some dummy buffer to first build everything in RAM and then copy it into VRAM. This adds complexity and overhead.

If you mean the SAT (not the sprite color table), it is only 128 bytes in length. With such a small amount of data it is very unlikely to be a show stopper. If instead you also mean the sprite color attributes, that's another story: a 512-byte overhead may be a problem, not because of the I/O access mode but because of the raw horsepower of an 8-bit CPU. Even with a Pentium in place of the Z80, nothing would change.
You always forget that there are timing constraints forced by the hardware and RAM bandwidth that are more limiting than an I/O-based scheme vs a memory-mapped one.

Or do you think that a C64 with a Pentium CPU could, thanks to memory-mapped VRAM, do extraordinary things only because of a faster CPU? The limit is the memory bandwidth, not the addressing style. Even with a faster CPU (the C128 proved that) the memory-mapped scheme can't do anything about speed, and it showed itself to be a limiting factor because it ties the CPU speed to the VIC-II speed over the same VRAM chips. The C128 designers had to slow down the 8502 in order to keep the VIC-II active. With a VDP-style approach they could have made the CPU speed independent of the video circuitry. And effectively they did that with the VDC, which had a very similar access scheme to the MSX VDP but allowed a 6502 clone to work at 2 MHz.

Let's assume that there is a way to map the VRAM into the Z80 address space. Do you think you would have full access at any speed?
No, you would have to wait for an access slot exactly as you wait with I/O access.

BTW, this kind of experiment was done with ADVRAM on MSX2 too, but it was never shown to be faster.
There are some examples geared to show the advantage of ADVRAM, like doing a PSET in screen 8.
Maybe it is faster, but the test is built to favour the Z80. Things change radically if you do the same test on screen 5.
In screen 8 you simply do a LD (HL),pixelcolor and the PSET is done, given the fact that 1 px = 1 byte and the resolution is 256x212, so you simply put the Y value into H and the X value into L and voilà.
Look at screen 5 instead (which is more suitable for action games): you need to map x,y to HL (this involves a shift), then you do two I/O operations, a read, a mask or shift operation, and a write, and voilà, the magic of DIRECT VRAM access manipulation disappears!
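
For illustration, here is roughly what such a screen 5 PSET through the ports looks like (just a sketch: B = y, C = x, the colour pre-shifted into D for the high nibble and E for the low nibble, VRAM bits 14-16 assumed already set via R#14, and ignoring DI/EI and exact VDP access timing):

ld h,b ; B = y
ld l,c ; C = x
srl h
rr l ; HL = y*128 + x/2, byte offset of the pixel pair
ld a,l
out (0x99),a ; VRAM address bits 0-7
ld a,h
and 0x3F ; bits 8-13 only; bits 14-16 assumed already in R#14
out (0x99),a ; read mode (bit 6 = 0)
in a,(0x98) ; read the byte holding the pixel pair
bit 0,c ; even x -> high nibble, odd x -> low nibble
jr nz,odd
and 0x0F
or d ; D = colour already shifted to the high nibble
jr back
odd:
and 0xF0
or e ; E = colour in the low nibble
back:
ld b,a ; keep the merged byte
ld a,l ; reading bumped the pointer, so set the address again
out (0x99),a
ld a,h
and 0x3F
or 0x40 ; this time with the write bit
out (0x99),a
ld a,b
out (0x98),a ; write the modified byte back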

Quote:

And yes, like you said, the V9938's quirks make it even worse :(

That's the real point. Everything in the VDP is made to add headaches. Take for example the stupid magic Y value in the MSX2 screen modes. It forces you to check for a specific Y value and adjust it in order to avoid sprites disappearing! What a STUPID THING. And worse, it is not required, EVEN on the TMS VDP!!! Simply because when a sprite is off-screen it does not set the collision or overflow bits, nor is it displayed. So WHY did they keep this stupid thing in MSX2 and even put effort into changing the value (based on screen mode) from 208 to 216? They should simply have DISABLED IT.

Quote:

Again, look at the tricks you can pull off on C64 with only 8 hardware sprites, just because you can directly access all sorts of VDP registers and VRAM. There is some amazing stuff being done on that machine.

I do not agree: the amazing things are not due to the memory-mapped style, they are there because the VIC-II is extremely well designed and flexible to manipulate. The RAM access of the 6502 CPU is much slower; in the best case you need 2 µs to perform a memory access (page 0). OK, the OUTs on the Z80 are slower, but that is not the problem.

C64 sprites do not have to worry about things like "when VDP(24) is set to something different from 0 you need to adjust the sprite Y value to take it into account, but BEWARE of the magic Y VALUE!"
Nor do you have to push 512 bytes of data to change 32 sprite color values; colors are (as they should be) part of the definition pattern. Did you know that with some tricks you can have 9 sprites on a scan line? Try that with the VDP (without flickering).

Everything on the C64 is done to make the CPU's work easier; that's the real difference. By contrast, the VDP is so poorly designed from the beginning, in order to save transistor count and costs, that everything is a difficult task.
The inability to read the VRAM pointer is not because of I/O addressing; it is a matter of cost, only this.
As said, there is nothing stopping you from providing a fully read/write I/O register, as is common on MSX and even on the C64 (example: the scanline counter register). Only a matter of costs.

Quote:
PingPong wrote:

Even with random access you will be forced to do the same manipulation, and you will probably end up with a separate RAM buffer in order to perform the same operation without incurring tearing or any other sort of glitches,

Double buffering can (and probably should) be done in VRAM.
And, as a coder, I would like to choose where I can afford some space: RAM or VRAM.

There is nothing stopping you from doing double buffering in VRAM; just change (for example) the SAT pointer register. However, if you do double buffering it's because you cannot do your work at 60 fps without glitches, so you work at half speed and can probably afford relatively slow VRAM access because of the double buffering.

Quote:
PingPong wrote:

... Even if you perform random plane access by plane number (in order to touch only some sprite planes instead of the full SAT) the overhead is moderate, because to change an MSX2 sprite's colors you need 16 OUT operations with an overhead of some 2-4 OUTs to set the VRAM pointer. Again it does not change too much.

If you want to run your game at 60 frames per second, from my experience, all these bits of overhead will definitely add up :)
(btw, on Z80 LDIR and OUTI are equally slow).

Math is math. (even if my count is approximate)

Assuming you are manipulating a single sprite, with I/O access you need:

SAT:
two register loads + two OUTs + one logical operation (not four: you do not need to specify the full 17-bit address when dealing with the SAT) = 42 cycles
four OUTI = 72
SCT:
two register loads + two OUTs + one logical operation (again, no full 17-bit address needed) = 42 cycles
sixteen OUTI = 288 cycles

roughly 360 cycles per sprite

With memory-mapped access:
two LDIR operations, one for the SAT and one for the SCT:
LDIR 23*4 = 92 cycles
LDIR 23*16 = 368

460 cycles.

*My examples are optimistic in both scenarios; things are even worse.

So with VRAM I/O access you save about 100 cycles.

Quote:
PingPong wrote:

Memory copies of VDP registers will be necessary even with a memory-mapped scheme instead of port-based I/O.
Remember registers are not memory locations; they are not required to behave like standard memory locations, and it is common that they are write-only or read-only.

Fair enough, that indeed depends on the hardware. :)
However, with I/O ports this seems to be the most common approach.

And again, it is not a question of the common approach, only a question of a design that favours costs over simplicity of access and, hence, performance. I repeat, a full I/O register scheme that makes memory-mapped registers behave like RAM locations is absolutely possible. BUT you need to favour good design instead of cost savings (the TMS VDP was created only with cost savings in mind, sacrificing everything else, and unfortunately has a terrible and limited design).
For example, would the ability to specify sprite zoom at the sprite level instead of globally have been a complex task? NO. The hardware that performs the zoom is already there; it is only a matter of linking a single bit to each sprite plane instead of all sprite planes to a single bit! But nooooooooooooooooooooo! My GOD! What an expensive thing: to add 32!!!!!!!! bits of zoom (one per sprite) you need 4 write-only registers. Karl Guttag might get mad at such a waste! So you get a VDP with 32 sprites that are globally zoomable or not. Practically useless.

Quote:
PingPong wrote:

...plus disabling interrupts is not required unless you write data in an interrupt handler. The MSX INT 38h handler simply reads data from a register, and the VRAM pointer is not touched by this operation....

That is too much of a simplification. If I want to use both line and V-blank interrupts in my game, I will need to change VDP R15 in order to read either status register S0 or S1. So _ANY_ non-interrupt code accessing the VDP will need to disable interrupts to make it safe to write pairs of bytes to the VDP ports.

A DI or EI instruction does not change your life. And again, it is not an I/O-mapping problem but bad design. They needed to maintain compatibility with the old TMS, which was badly designed. IMHO the VDP-CPU protocol should have been TOTALLY redesigned with something more clever when working on the MSX2 model, keeping the old one only for the legacy MSX1 modes.

Look at the V9990: they dropped the stupid things, and it resulted in a cleaner CPU-VDP protocol.
The problem is the compatibility with a limited and crappy design devoted only to saving costs.

Quote:
PingPong wrote:

...again, it is not strictly a limit of the I/O or memory-mapped approach; it is instead a limit of the VDP. ...

Well, true, yes. It depends on the implementation of the hardware; theoretically you could expose many more VDP registers through different ports. For directly accessing VRAM though, this is impractical?

Sorry, I cannot understand what you mean. However, there are a lot of ways to expose registers and VRAM to the CPU. I think the TMS approach is one of the worst because of the stupid decisions made to save costs.

Quote:
PingPong wrote:

Consider that VRAM access tends to be sequential by nature; in those situations the I/O VRAM pointer access does not make a huge difference.

Apart from sprites, which are a big part of games :)

I do not agree; I've already compared the timing of the two approaches above.

Quote:

Don't get me wrong, I/O ports also have their advantages :) They probably lead to more modularity in terms of hardware, and are 'cleaner'. Remember the early PCs where there were loads of gaps in the RAM address space because these were used by all sorts of hardware? That was very ugly and got very messy.

If you do an in-depth analysis you will find that most of the VDP's bottlenecks are not strictly based on the fact that the CPU cannot directly see the VRAM. It is only a matter of design. For example, we have an auto-increment access mode to allow fast access. This is a good thing, but it should be improved a bit. For example:
- Add the ability to also work with auto-decrement.
- Add a simple byte counter that, after a set number of bytes from the CPU, adds a constant offset value. This would allow updating a rectangular region of bytes with auto-increment instead of being forced to write an entire screen of data. Useful in games where the active region is 20x24 chars, leaving a 12x24 area for the score bar: you simply say
bytelen = 20, carryoffset = 12, start = 0, and magically you update a region of 20x24 bytes instead of being forced to work with the entire screen.
- Another improvement: separate the VRAM pointers for read and write. This would allow fast copy/move operations without needing to buffer anything in main RAM.

Even with an I/O style, things can be a lot faster if the hardware is well designed. Take for example the SMS: it has a TMS derivative with the same port-based approach, but SEGA (or Yamaha ;-) ) did it better and removed some limitations of the original TMS. And effectively, even though you have more restricted access during the active area with the SMS VDP (not allowed), you get a lot better results.

By Grauw

Ascended (8388)

08-05-2019, 11:40

PingPong wrote:

plus disabling interrupts is not required unless you write data in an interrupt handler. The MSX INT 38h handler simply reads data from a register, and the VRAM pointer is not touched by this operation.

The status register read clears the latch, so disabling interrupts between register value / address writes is mandatory.

Sandy Brand wrote:

Don't get me wrong, I/O ports also have their advantages :)

One being that we don’t lose a 16K block of our 64K memory to VRAM. Worse, for the V9938 MSX2 VDP that would’ve made a memory-mapping scheme mandatory. Additionally, RAM shared between the CPU and VDP is slower to access, and may bottleneck the CPU as a whole even without accessing the VRAM (like on the CPC iirc). These things pose challenges and create overhead of their own.

I agree the TMS9918 and V9938 designs aren’t perfect, but I’m not convinced this is due to the absence of direct memory mapped access to the VRAM and registers. I think PingPong made a good case for this. And other systems are bound to have their quirks as well that create inconvenience (like I believe the NES VDP -or was it the SMS one- isn’t very good at screensplits). And let’s not forget that in terms of resolution and colour depth, the V9938 provides the best graphics of any 8-bit system out there :).

By PingPong

Prophet (3435)

08-05-2019, 13:27

Grauw wrote:
PingPong wrote:

plus disabling interrupts is not required unless you write data in an interrupt handler. The MSX INT 38h handler simply reads data from a register, and the VRAM pointer is not touched by this operation.

The status register read clears the latch, so disabling interrupts between register value / address writes is mandatory.

Sorry, I misunderstood; what I meant was:
if your main program sets the VRAM pointer (protecting it with DI due to the non-atomicity of the two OUTs), then your interrupt code should be able to do any register manipulation, including the S0 read, without disrupting the current VRAM pointer value.

The DI/EI problem mainly comes from the stupid decision to reuse every port address to the maximum, instead of spending a few more port addresses on other things. Of course, having 8 different registers mapped into the Z80 I/O address space could be a bit too much, but even a DI/EI pair does not change the timings a lot.
For example, even worse is the problem of updating the SAT Y position while taking the VDP(24) value into account and TESTING for the stupid MAGIC Y VALUE. You cannot do a simple OUT when you transfer the data to the VDP; you need to (see the sketch below):
load the Y value
add the VDP(24) offset
test whether it is the magic Y value and modify it to avoid problems
OUT

A lot of overhead for what could be a single byte write operation, but even with a memory-mapped scheme nothing changes; the operations are the same. So it is a design problem, not an I/O-based one.
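
A minimal sketch of that per-sprite Y handling (my own illustration, assuming sprite mode 2 where 216 is the magic value, the vertical offset kept in a variable, and the SAT write address already set up):

ld a,(sprite_y) ; logical Y position of this sprite
ld hl,voffset
add a,(hl) ; apply the VDP(24) vertical offset
cp 216 ; 216 would hide this and all lower-priority sprites...
jr nz,send_y
inc a ; ...so nudge it by one line
send_y:
out (0x98),a ; finally, the single byte write to the SAT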

People think that memory-mapped is better, but they forget the disadvantages you mentioned. All systems of that era pay for the memory-mapped scheme:

The ZX Spectrum pays (even if only in one 16k memory bank, due to an uncommon design).
The Amstrad CPC pays: the effective clock speed falls from the nominal 4 MHz to about 3.3 MHz on average.
The C64 pays: the VIC-II is an aggressive memory bandwidth eater.
Worst case: 8 sprites displayed on screen and a bad line occurring:
the VIC needs an extra 40-byte fetch (a character line) to read what in MSX parlance are the name table pointers;
you have 8 sprites, so the VIC-II reads in 8 sprite definition patterns, and every sprite is 24 px wide -> 3 bytes, so an extra read of 24 bytes!

In this situation there are 40+24 extra reads, and the VIC-II steals the CPU for 64 cycles = 64 µs -> an entire scanline.

Plus, if you push the hardware by causing more bad lines to occur or by doing sprite splits, those penalties increase.
But the C64 can do amazing things because the VIC-II is extremely flexible and gives you a lot of trade-offs to choose from. For example, take the sprite system:

- Ability to zoom in X or Y or both FOR EACH SPRITE (the VDP has one global flag, for both X/Y and, worse, for all sprites)
- Ability to choose multicolor or high-res sprites FOR EACH SPRITE (the VDP only has high-res ones)
- Ability to choose whether a sprite is behind or in front of the background, for each sprite (the VDP does not have this)
- Larger sprites...

Sprite data can easily be moved around in RAM.

Combine these things together and we see the source of those amazing things.

By Sandy Brand

Master (153)

08-05-2019, 22:57

PingPong wrote:

Or do you think that a C64 with a Pentium CPU could, thanks to memory-mapped VRAM, do extraordinary things only because of a faster CPU? The limit is the memory bandwidth, not the addressing style.

Please refrain from any kind of straw man arguments; that is really bad form. I never said or suggested anything like that, because it is, of course, ridiculous. I am trying to have a discussion about home computers that were roughly each other's contemporaries (e.g. the C64 and MSX 1/2).

PingPong wrote:

...the 8502 in order to keep the VIC-II active. With a VDP-style approach they could have made the CPU speed independent of the video circuitry. And effectively they did that with the VDC, which had a very similar access scheme to the MSX VDP but allowed a 6502 clone to work at 2 MHz.

Yes, that is a good point.

PingPong wrote:

I do not agree: the amazing things are not due to the memory-mapped style, they are there because the VIC-II is extremely well designed and flexible to manipulate. The RAM access of the 6502 CPU is much slower; in the best case you need 2 µs to perform a memory access (page 0). OK, the OUTs on the Z80 are slower, but that is not the problem.

Yes, exactly: despite the 6502 being much slower, it can utilize the VIC-II's functionality really well because it has direct access to a lot of stuff. Hence the cool C64 demos :)

PingPong wrote:
Quote:

If you want to run your game at 60 frames per second, from my experience, all these bits of overhead will definitely add up :)
(btw, on Z80 LDIR and OUTI are equally slow).

Math is math. (even if my count is approximate)

Assuming you are manipulating a single sprite, with I/O access you need:

SAT:
two register loads + two OUTs + one logical operation (not four: you do not need to specify the full 17-bit address when dealing with the SAT) = 42 cycles
four OUTI = 72
SCT:
two register loads + two OUTs + one logical operation (again, no full 17-bit address needed) = 42 cycles
sixteen OUTI = 288 cycles

roughly 360 cycles per sprite

With memory-mapped access:
two LDIR operations, one for the SAT and one for the SCT:
LDIR 23*4 = 92 cycles
LDIR 23*16 = 368

460 cycles.

Well, I am sorry, but your math is wrong :) Not only did you use the timings for LDIR instead of LDI (probably due to the silly typo on my end, sorry), you also didn't sum the values correctly and omitted the 72 it takes to write into the SAT:

So unless I am missing something, I think these would be the correct timings:

For I/O:
42 (set VRAM address for SAT) +
72 (4 * OUTI) +
42 (set VRAM address for SCT) +
288 (16 * OUTI)
= 444

For direct VRAM access:
72 (4 * LDI to write into SAT) +
288 (16 * LDI to write into SCT)
= 360

So clearly, direct VRAM access would be faster.

I also don't fully see how you get to 42 T-states to set the destination VRAM address? If the destination address is hard-coded with 2 times LD A,n; OUT (#99),A, this would only be 40? But for OUTI you will also need to first set register C, which you left out as well (first to #99 and then to #98). So in the end it is even slower than what you initially computed.
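
For reference, a minimal sketch of what I mean, with made-up address constants (LOW_BYTE/HIGH_BYTE are just placeholders) and MSX T-state counts including the M1 wait:

ld a,LOW_BYTE ; 8
out (#99),a ; 12
ld a,HIGH_BYTE ; 8 (write bit #40 folded into the constant)
out (#99),a ; 12 -> 40 cycles for a hard-coded address
ld c,#98 ; 8, needed before the OUTIs
outi ; 18 each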

And this doesn't even include other speed-ups you can achieve with direct VRAM access, such as writing 2 bytes at once using 16-bit registers, or skipping data that doesn't need changing (while you have included some optimizations by assuming the highest bits of the destination VRAM address are already valid).

So :)

Anyways, let's agree to disagree :) I fully share your point about the unfortunate VDP design. That has made a lot of things hard on MSX.

You made a good point that decoupling the VDP and CPU memories should make it possible to have both units run at full throttle, so to say. The thing is, this is sort of what the MSX2 has: you can send commands to the VDP to copy memory around all by itself and such. And yet, it will not be able to do what can be done on the C64? This puzzles me.

Is it poor VDP design? Maybe.

I'd like to think that the original designers of the C64 made very clever use of what was available at the time. Knowing what kind of tricks coders want to pull off in order to get cool game effects on screen (all the previous computer generations completely relied on this!), they were able to take (and introduce!) shortcuts because they fully 'owned' the design and hardware manufacturing process.

MSX has a completely different philosophy, of course. It is about extensibility and modularity. And hey, that has its advantages as well! I don't think I have seen something as awesome as an SCC or FM-PAC on the C64, for example :)

(btw, thanks for the tip on how to speed up setting the 17-bit VRAM destination address! I never knew that was possible. I guess the VDP just figures out that if bit 7 of the second byte sent to port #99 is a '1', the data is intended for a VDP register?)

By Grauw

Ascended (8388)

09-05-2019, 00:36

Sandy Brand wrote:

Well, I am sorry, but your math is wrong :) Not only did you use the timings for LDIR instead of LDI (probably due to the silly typo on my end, sorry), you also didn't sum the values correctly and omitted the 72 it takes to write into the SAT:

So unless I am missing something, I think these would be the correct timings:

For I/O:
42 (set VRAM address for SAT) +
72 (4 * OUTI) +
42 (set VRAM address for SCT) +
288 (16 * OUTI)
= 444

For direct VRAM access:
72 (4 * LDI to write into SAT) +
288 (16 * LDI to write into SCT)
= 360

So clearly, direct VRAM access would be faster.

I think your math is not particularly realistic either, though :), because it assumes that the CPU can access the VRAM at full speed, and there’s no way that can be true when the VDP also needs to access it. There will be a penalty you’re not accounting for. This example also shows very well the real sprite mode 2 bottleneck: the expensive rewrite of the SCT.

In my work-in-progress game I write the full SAT every frame, the full SCT every other frame. Sprites are ordered every other frame, by changing the order in which I call the objects’ SAT / SCT draw routines. This favours smooth motion over sorting precision & limited flickering abilities. I’m pretty happy with my current approach, the code is quite neat & flexible, also wrt working around y=216. But the SCT update is still expensive (8% CPU).

Sandy Brand wrote:

The thing is, this is sort of what the MSX2 has: you can send commands to the VDP to copy memory around all by itself and such. And yet, it will not be able to do what can be done on the C64? This puzzles me.

That’s really simple to answer isn’t it? The VDP commands can only be used in the MSX2 bitmap modes, which are 256x212 4bpp. A full screen update requires transferring 27.1k bytes. The Commodore 64 bitmap mode has a lower 160x200 resolution and attribute clash. A full screen update requires transferring 10k bytes, or 8k if you leave the colours untouched.
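
For reference, the arithmetic behind those figures (taking screen 5 and the C64 multicolour bitmap mode):

256 x 212 pixels x 4 bpp / 8 = 27,136 bytes (~27.1k)
8,000 bytes bitmap + 1,000 bytes screen matrix + 1,000 bytes colour RAM = 10,000 bytes (~10k), or 8,000 if the colours are left untouched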

For this reason on MSX the big impressive demo effects will be either based on the pattern modes, or on screensplits if they’re in the bitmap modes.

I don’t like to bitch on MSX, and really I find it much more interesting to discuss what we can achieve with our hardware. But if we’re talking about missed opportunities, the memory-mapped VRAM access is one of the least interesting ones I think. I think that’s also PingPong’s point, although he didn’t have to be quite so thorough describing all our VDP’s deficiencies! ;)

I also think that if the VRAM had been memory-mapped, there would’ve been a high pressure to keep memory footprint low, and the V9938 wouldn’t have ended up with our system’s pride, unmatched by other 8-bit systems: high resolution, large palette, many colours. Those C64 demos always really proudly display still images. I’m sure they use a lot of trickery to get around the attribute clash limitations, but it always makes me think “what’s so special about that” because screen 8 :).

By PingPong

Prophet (3435)

09-05-2019, 00:21

Sandy Brand wrote:

Yes, exactly: despite the 6502 being much slower, it can utilize the VIC-II's functionality really well because it has direct access to a lot of stuff. Hence the cool C64 demos :)

No, as I have always pointed out, the cool demos are possible because of the flexible and well-designed VIC-II, not because of the memory-mapped style.

Even the 9th X-position bit (on MSX, the early clock bit) is better designed and more manageable on the VIC-II. Again, not a memory addressing issue, only a design that makes things complicated.

Quote:

(btw, thanks for the tip on how to speed up setting the 17-bit VRAM destination address! I never knew that was possible. I guess the VDP just figures out that if bit 7 of the second byte sent to port #99 is a '1', the data is intended for a VDP register?)

Yes, another stupid choice. Packing control bits into the same byte used for address setup is a way of closing the door to future expansions of the address range (and again, thanks TMS VDP for giving us this gift).
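
For reference, this is how the second byte written to port #99 is interpreted (bit 7 and bit 6 first, then the remaining six bits):

1 0 r5..r0  -> write the byte sent just before into register r (this is also how R#14 gets the upper address bits)
0 1 a13..a8 -> set up the VRAM address for writing
0 0 a13..a8 -> set up the VRAM address for reading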

By Sandy Brand

Master (153)

09-05-2019, 00:47

Grauw wrote:

I think your math is not particularly realistic either, though :), because it assumes that the CPU can access the VRAM at full speed, and there’s no way that can be true when the VDP also needs to access it.

Well hey, if someone tries to convince me of something with an example showing clearly wrong numbers, I am sure I can point that out, right? :)

But yes, timings will differ based on how the units need to synchronize and share a common bus.

Grauw wrote:

That’s really simple to answer isn’t it? The VDP commands can only be used in the MSX2 bitmap modes, which are 256x212 4bpp. A full screen update requires transferring 27.1k bytes. The Commodore 64 bitmap mode has a lower 160x200 resolution and attribute clash. A full screen update requires transferring 10k bytes, or 8k if you leave the colours untouched.

Yes, good point. They can get away with a lot because there is just less stuff to 'draw' on the screen.
Still, if I am not mistaken, an MSX2+ with the V9958 can perform VDP commands in pattern modes? Could you achieve similar results that way, though?

And yes, 'hi-res' graphics on a C64 are sub-par compared to MSX (I personally think the C64 color palette is rather horrible, but I guess that could be a matter of taste).

By PingPong

Prophet (3435)

09-05-2019, 01:03

Quote:
Quote:

Math is math. (even if my count is approximate)

Assuming you are manipulating a single sprite, with I/O access you need:

SAT:
two register loads + two OUTs + one logical operation (not four: you do not need to specify the full 17-bit address when dealing with the SAT) = 42 cycles
four OUTI = 72
SCT:
two register loads + two OUTs + one logical operation (again, no full 17-bit address needed) = 42 cycles
sixteen OUTI = 288 cycles

roughly 360 cycles per sprite

With memory-mapped access:
two LDIR operations, one for the SAT and one for the SCT:
LDIR 23*4 = 92 cycles
LDIR 23*16 = 368

460 cycles.

Well, I am sorry, but your math is wrong :) Not only did you use the timings for LDIR instead of LDI (probably due to the silly typo on my end, sorry), you also didn't sum the values correctly and omitted the 72 it takes to write into the SAT:

So unless I am missing something, I think these would be the correct timings:

For I/O:
42 (set VRAM address for SAT) +
72 (4 * OUTI) +
42 (set VRAM address for SCT) +
288 (16 * OUTI)
= 444

For direct VRAM access:
72 (4 * LDI to write into SAT) +
288 (16 * LDI to write into SCT)
= 360

So clearly, direct VRAM access would be faster.

The LDIR instruction was used in place of OUTI because it is the more 'natural' approach. Of course you can unroll LDI, but the conclusion stays the same (because the OUTI/LDI and OTIR/LDIR timings are the same). Using OUTx with the VDP's auto-increment lets you keep the D and E registers free, something that is not possible with LDIx and that can be useful in some scenarios (like taking the VDP(24) value into account for offsetting the sprite Y position).

Anyway, the pure transfer speed is the same; for each sprite you just pay a relatively small setup overhead (compared to the data writes).

My calculations are exact, as below; where do you see an error? Assuming you need to write 1 SAT & 1 SCT entry in random-access fashion, let's recall:
; write SAT
hl = ptr

ld a,l ; 5
out(0x99),a ; 12
ld a,h ; 5
or 0x40 ; 8
out(0x99),a ; 12
---- 42 cycles
outi
outi
outi ; could just as well be an out(0x98),a
outi
----- 72 cycles
SCT
ld a,l ; 5
out(0x99),a ; 12
ld a,h ; 5
or 0x40 ; 8
out(0x99),a ; 12
---- 42 cycles
outi x 16
----- 288

Data transfer: 288+72 = 360 cycles for each sprite entry

For direct VRAM access, now using unrolled LDI:

hl=ptr
LDI
LDI
LDI
LDI
-------- 72 cycles
16xLDI
-------- 288 cycles.

again 288+72 = 360 cycles.

Because the data write timings are the same (using unrolled LDI also for the direct memory-mapped access, which makes the DE register unavailable), with I/O you pay 42 cycles x 2 VRAM setup operations = 84 cycles. About 25% slower for I/O addressing.
I've omitted the ceremony of setting the full 17-bit address (as if it were an MSX1) because you do not need to set it more than once, and because even with a memory-mapped scheme you would probably need something similar to the I/O style anyway, just because the CPU cannot directly handle 17-bit addressing (the Z80 has only 16 address lines, so a 17-bit address may require some other memory access or I/O write to specify the 17th bit).
