V9938 VRAM access reverse engineered

by wouter_ on 07-04-2013, 10:09
Tags: timing, vdp, openMSX

At the 2013 MSX fair in Nijmegen some of the openMSX members took some measurements of a V9938 using a logic analyzer. After analyzing this data, they figured out when exactly the VDP reads or writes VRAM for which purpose.

This should make it possible to greatly improve the accuracy of the VDP emulation in openMSX. The work on the emulation part is ongoing (but early results look promising). Meanwhile they've written a document describing these findings in detail. This information may be useful for other MSX emulator developers, or for MSX developers in general who want to understand the VDP at a deeper level.

Relevant link: VDP VRAM Timing

Comments (82)

By sd_snatcher

Prophet (3020)

09-04-2013, 05:08

Joost Yervante Damad, Alex Wulms and Wouter Vermaelen: thanks for all this effort on improving the quality of the emulation to a level no one dared to reach before! The openMSX team surely deserves a lot of admiration! When implemented, these features will surely help MSX developers.

Some additional considerations:

- If the "fast CPU accesses actually result in dropped requests" feature is to be implemented, then the MSX2+ turbo VRAM throughput mechanisms need to be implemented too, otherwise the emulation of those turbo machines will instantly break:
- The CIEL Expert Turbo (and AFAIK the Victor HC-95T too) uses the V9958 /WAIT pin-26. So it would be very important to sample this VDP pin's behavior on the logic analyzer. A friend (not an MSX programmer) once did some sampling, and according to him the /WAIT pin was never activated at 3.57MHz. But that deserves further investigation.
- The Panasonic MSX2+ machines don't use the V9958 /WAIT pin. They rely on the 6140140 chipset. This chip outputs the CPU clock, and in turbo mode it seems to slow down the Z80 while doing VDP I/O and on vblank/hblank. But that's a guess based on the service manual and on comments from another friend observing the strange behavior of the Z80 clock line on an oscilloscope.
- The Victor HC-95V uses a waitstate generator, probably built into the HD64B180 CPU.
- The HC-95V seems to have a dual set of BIOS+SubROM: one for the Z80 mode and another for the turbo mode. The current openMSX HC-95 ROM dump seems to contain only the Z80 BIOS+SubROM. I don't know whether the ROM comes from an HC-95V or an HC-95T. It's unknown if the HC-95T has this dual BIOS.
- The MSX Turbo-R VDP throughput is well known and already implemented on openMSX.

- Also, please implement a way to disable the "dropped requests" behavior, just like set cmdtiming broken, for development purposes.

By andete

Expert (96)

09-04-2013, 07:25

All credit goes to Wouter for spending the time doing the analysis. All I did was provide the hardware and software to do the measurements at the fair.

Joost

By Hrothgar

Champion (479)

09-04-2013, 13:02

A very interesting read, although I struggle to understand all the details given.

One question though, which boils down to verifying "whether the VDP limits have been reached". Certain users on this forum propose simultaneously using longish VDP commands (e.g. lazy copying of bands of the screen) and performing blitting, in order to have an entire smooth scrolling screen, combined with the ability to do e.g. some additional software sprite paints. Am I reading this document correctly in concluding that this approach will only have an actual result when having hardware sprites off?

By wouter_

Champion (412)

09-04-2013, 14:18

sd_snatcher wrote:

... If the "fast CPU accesses actually result in dropped requests" feature is to be implemented, then the MSX2+ turbo VRAM throughput mechanisms need to be implemented too, otherwise the emulation of those turbo machines will instantly break: ...

Can you give more details on this 'turbo VRAM throughput' mechanism? I've never heard of it. I don't see how it's technically possible to send data via the CPU to VRAM at a higher (turbo) speed. Is it instead a mechanism that makes sure the data is not sent too fast?

Note that (when using a normal Z80 at 3.5MHz) this too fast VRAM access via port #98 issue _rarely_ triggers. An OTIR instruction or a sequence of OUTI or OUT (C),r instructions is all perfectly fine. The only potential problem is a sequence of OUT (#98),A instructions. And only when both display and sprites are enabled.

By wouter_

Champion (412)

09-04-2013, 14:20

Hrothgar wrote:

... Certain users on this forum propose simultaneously using longish VDP commands (e.g. lazy copying of bands of the screen) and performing blitting, in order to have an entire smooth scrolling screen, combined with the ability to do e.g. some additional software sprite paints. Am I reading this document correctly in concluding that this approach will only have an actual result when having hardware sprites off?

Simultaneously executing VDP commands and reading/writing VRAM via IO port #98 can still be beneficial. This does result in a higher combined VRAM throughput compared to sequentially executing the command and the CPU-VRAM transfer. But it is true that the VDP command itself will execute slightly slower in the combined case compared to the sequential case.

Let me try to put this in perspective. In the most extreme case, that is with sprites and display enabled, executing an HMMV command and simultaneously executing a long series of OUT (#98),A instructions, the command executes at approximately half the original speed. So you still 'gain' the throughput of this 'half command'. In practice you'll likely execute a slower VDP command (like some copy command) and send data to port #98 at a slower rate (because a sequence of OUT (#98),A is not often useful). So usually the command slowdown will be much less than 2x.

By PingPong

Prophet (3339)

09-04-2013, 18:34

Hrothgar wrote:

Am I reading this document correctly in concluding that this approach will only have an actual result when having hardware sprites off?

I'm not sure what you mean, but in the past I did some tests on a plain NMS8245: the test continuously ran a high-speed memory move / logical memory move while the Z80 was performing a bunch of OUTIs on port #98, with 32 unmagnified sprites on screen arranged in 4 rows of 8 sprites each.
The test ran without corruption or slowdowns in the VDP's 60Hz and 50Hz modes, with a normal Z80 @ 3.57MHz.

By PingPong

Prophet (3339)

09-04-2013, 19:27

@Wouter: first, my congratulations on your detailed work. Reading the doc surprised me with the complexity of the access patterns used by the VDP itself. I had guessed a much simpler (and poorer) time slot allocation.

"Very roughly speaking in mode 'screen off' there are about twice as many access slots as in the mode 'sprites off' and about 5 times as many as in the mode 'sprites on'. This does however not mean that in these modes the command engine will execute respectively 2× and 5× as fast. Instead in the mode 'sprites on' the speed of command execution is mostly limited by the amount of available access slots, while in the mode 'screen off', the bottleneck is mostly the speed of the command engine itself."

I had guessed this, even if it wasn't confirmed by data. I always asked myself: 'OK, there is not so much bandwidth available, but it appears that commands during vblank are not as much faster than in the display area as they should be.'
That's a pity, because greater speed could have been achieved with no need for faster chips. It's also a pity that the command engine itself is a limiting factor in command speed, and not only the VRAM bandwidth.

Last, the insane scanline-based sprite color attribute data fetch. It's a true waste of time.
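
The bottleneck described in the quote can be captured in a tiny toy model (all absolute numbers below are invented for illustration; only the 2x/5x slot ratios come from the quote):

```python
# Toy model: the effective command rate is the minimum of the access-slot
# rate and the speed of the command engine itself. Slot counts follow the
# quoted ratios (screen off = 2x sprites off = 5x sprites on); the engine
# limit and the absolute numbers are made up.

def command_rate(slots_available: int, engine_limit: int) -> int:
    """Effective command throughput: whichever resource runs out first wins."""
    return min(slots_available, engine_limit)

ENGINE_LIMIT = 12                                   # hypothetical engine speed
SLOTS = {"sprites on": 4, "sprites off": 10, "screen off": 20}

for mode, slots in SLOTS.items():
    rate = command_rate(slots, ENGINE_LIMIT)
    bottleneck = "access slots" if slots < ENGINE_LIMIT else "command engine"
    print(f"{mode:11s}: rate {rate:2d}, limited by the {bottleneck}")
```

With these made-up numbers, 'sprites on' is starved for slots while 'screen off' saturates the engine, which is exactly the asymmetry the quote describes: 5x the slots does not give 5x the command speed.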

By PingPong

Prophet (3339)

09-04-2013, 19:48

It also appears that the VDP command engine is poorly designed. Some optimizations could have been made to shorten some of the periods between the "calculate internally" and "VRAM access" phases of the state machine.

By wouter_

Champion (412)

09-04-2013, 20:23

@PingPong: Indeed, that was also my impression. There are quite some inefficiencies in the VDP and especially in the command engine part. Though of course you have to see this in context. I can imagine that in the early 1980's designing a chip running at 21MHz was no easy task.

By sd_snatcher

Prophet (3020)

09-04-2013, 20:41

wouter,

Don't be fooled by the neologism. :)
Maybe I didn't choose a good term. It seems that the two following terms would better express the idea:
- CPU-VRAM throughput control (I/O port 98h)
- CPU-VDP_register throughput control (all other I/O ports)

When using a turbo CPU (IOW, anything faster than the standard 3.57MHz Z80), some kind of throughput control must be used to ensure that the I/O stays within the speed limits supported by the V9938/58.

On the V9938 there's no such built-in feature. This VDP wasn't designed with faster CPUs in mind, so all turbo machines containing this VDP have to implement some kind of throughput control externally.
OTOH, the V9958 was clearly designed to support turbo CPU speeds. A 7.14MHz Z80 can access its registers without trouble, and for VRAM access it has a built-in waitstate generator. This CPU-VRAM waitstate generator controls the throughput so the access always happens at the correct speed. But it has to be enabled via one of the V9958 registers (sorry, I don't have the datasheet with me right now).

On the 1st post I listed the throughput mechanism of each native turbo MSX machine. Trying to explain them better:
- The most widely known is the MSX Turbo-R's, done externally by the S1990. It's the heavily criticized "S1990 VDP I/O slowdown". This one reduces the throughput by issuing waitstates to the R800 CPU only. It controls the throughput both for registers and VRAM.
- For the Panasonic MSX2+ machines, there seems to be the "6140140 VDP I/O slowdown". This one *seems* to reduce the throughput by lowering the CPU clock to 3.57MHz, much like some homebrew kits. I don't know if it controls the throughput only for the VRAM or also for the registers.
- The CIEL ExpertTurbo uses the V9958 built-in waitstate generator. Please have a look at the V9958 datasheet for more details. But it would be very important to sample the V9958 pin-26 behavior to emulate it properly, because this pin will be perfectly synced with the free VRAM access slots you described: if the CPU tries to write data at a moment when it would be lost, a WAIT signal is issued by the V9958. The Victor HC-95T is said to have this feature too, but only an owner of such a machine could confirm this by checking the VDP pin-26 connection.
- Last, but not least, is the Victor HC-95V. This one seems to use a combination of the HD64B180 CPU's built-in waitstate generator for all I/O, plus a specific BIOS+SubROM for this CPU. The turbo switch on the front panel also switches the BIOS+SubROM set that the machine will use to boot. A very interesting and unusual design. Only a full dump of its dual-BIOS ROM could reveal its mysteries. :)

By Manuel

Ascended (15552)

09-04-2013, 21:48

There is also a dump of the HC-95 BIOS in turbo mode. This machine is owned by msxholder, and he said it's the HC-95(A). It has 256kB and V9958.

By PingPong

Prophet (3339)

09-04-2013, 22:39

wouter_ wrote:

@PingPong: Indeed, that was also my impression. There are quite some inefficiencies in the VDP and especially in the command engine part. Though of course you have to see this in context. I can imagine that in the early 1980's designing a chip running at 21MHz was no easy task.

By contrast, look at the time slot allocation between video and CPU on the Amiga Original Chip Set. Even in this hardware there is some take-over between the blitter and the CPU. However, the logic is far simpler (no different RAM access modes, fixed timeslot allocation between CPU and blitter, with only the option for the blitter to get priority over CPU accesses). Far simpler, far faster. The key difference: a 16-bit data/address bus. Try to imagine your timings with a 16-bit bus...

By sd_snatcher

Prophet (3020)

10-04-2013, 04:20

@PingPong

Interestingly enough, this picture shows that the Amiga 1000 even used exactly the same DRAM chips (4464-12) that the V9938 used, with the same speed specs. You're right, the only difference is that it's a 16-bit bus. But the Amiga RAM is shared by the entire computer, including PCM sound and the CPU. It's impressive that they could obtain such a huge speed difference. Clearly Commodore's engineers did a far better job than Yamaha's.

@Wouter

Another question just came into my mind right now: How does the use of the superimpose affect the blitter performance? Because on this case, theoretically the clock for the blitter comes from the V99x8 crystal, but the clock for the raster comes from the /DLCLK pin-3 of the V99x8 (usually connected to a Sony V7010 chip). These clocks will not be necessarily synchronized. This has to affect the access slot allocation.

By wouter_

Champion (412)

10-04-2013, 11:30

sd_snatcher wrote:

Another question just came into my mind right now: How does the use of the superimpose affect the blitter performance? Because on this case, theoretically the clock for the blitter comes from the V99x8 crystal, but the clock for the raster comes from the /DLCLK pin-3 of the V99x8 (usually connected to a Sony V7010 chip). These clocks will not be necessarily synchronized. This has to affect the access slot allocation.

I didn't do any measurements in combination with superimpose, so the following is mostly guessing:

I think a very important aspect of superimpose (and digitize) is that both video signals DO get synchronized. Maybe this is why in these modes the length of a display line is only 1365 clocks instead of 1368, so that there is some slack to be able to actively re-synchronize on each line?? (e.g. re-insert some idle cycles so that the line length matches the external line length)

Anyway, in superimpose mode the VDP still has to fetch bitmap/sprite data and schedule CPU/command accesses. This only works if there are no collisions between all these VRAM accesses. (I don't see a reason why these VRAM accesses will be scheduled differently compared to non-superimpose mode). But this implies that both 'raster' and 'blitter' must be driven by the same clock.

So I *guess* that either:
a) (less likely?) The DLCLK is used for everything in superimpose mode
b) (more likely?) The VDP clock is used, but occasionally there are idle cycles inserted/deleted from the schedule to keep both video signals in sync. If you look at that big timing diagram in my document you see that there are some locations where this could be done (e.g. at t=1328 or t=1336). And indeed these inserted/deleted cycles will have a (small) influence on the speed of command execution (of course depending on what you're comparing with).

... and for emulation none of this matters ... we can 'construct' the emulated crystals so that they don't drift compared to each other ;)

By PingPong

Prophet (3339)

11-04-2013, 18:42

@wouter_: Looking more in depth at the diagram, I'm a bit suspicious. For example, there are some specific periods that appear to start shifted by 2 cycles or so when comparing the modes "screen off", "sprites off" and "sprites on". This sounds very strange to me. To have, for example, an access slot shifted by 2 cycles depending on screen/sprites, the internal VDP logic would have to use specific counters to allocate the time slots. In my experience that approach is rarely taken, because it needs more transistors. Normally, to allocate time slots, it is preferred to decode a single counter. That way, however, time slots would always begin at fixed points.

Another thing makes me suspicious: the unevenly spread timeslots. Again, if those timeslots were derived from a single counter, the allocation should be without gaps and at fixed points.
For example, looking at cycle 162: there is the start of a CPU/cmd access slot, but only with sprites on/off. With screen on the start is delayed by 2 cycles (164). Why? This surprises me a lot.

Are you sure there are no subtle things in the logic analyzer that make the results a bit wrong?

By wouter_

Champion (412)

11-04-2013, 19:03

@PingPong: You are right, this is strange. While analyzing the measured data, I found it suspicious myself. So I double- and triple-checked, and it really does match the measurements (lots of different measurements). I'd be very happy to send you the raw data so you can verify it yourself. Also, when this scheme is implemented in openMSX it produces results that very closely match tests on the real hardware, so I'm fairly confident it is correct.

By PingPong

Prophet (3339)

11-04-2013, 21:22

So the conclusion is: the VDP is full of secrets, and a very, very strange beast.
I also suspect that its development wasn't easy, but rather a troubled project.

By hit9918

Prophet (2858)

13-04-2013, 17:37

Wow, great stuff! :D

Glancing quickly through the document, the port #98 influence is shocking.
The test was with pure OUT and HMMV? I hope that with OUTI there is less slowdown, and that HMMM gets less slowdown.

The set adjust is not explored yet, but is it already clear that at least some of the accesses will be shifted?
That gap in the "left erase signal" is very suspicious.
I would like the diagram to say at which point the vsync goes active.

By hit9918

Prophet (2858)

13-04-2013, 19:18

@PingPong, bitness means nothing, the 9938 got same bandwidth as the 16bit Amiga!!! Smile

The Amiga got 640 pixels 16 colors display, and then blitter and cpu is halted, so that is the bandwidth.
The 9938 got 512 pixels 16 colors, and some slow blitter and cpu still available.
So 9938 bandwidth is in same ballpark!

The 9938 got 8 databus pins, no? So its databus pins gotta be double speed.
Theory: databus speed is no issue, the memory timings are the issue.
Two sets of memory are used: the Amiga forms 16-bit, the 9938 loads two times 8-bit.

By wouter_

Champion (412)

13-04-2013, 20:11

hit9918 wrote:

Glancing quickly through the document, the port #98 influence is shocking. The test was with pure OUT and HMMV? I hope that with OUTI there is less slowdown, and that HMMM gets less slowdown.

The test with HMMV and a long sequence of OUT (#98),A instructions slowed down the command approx 2x (in display on, sprites on mode). In all other situations the slowdown will be less. The Z80 instruction OUT(n),A takes 12 cycles, so a long sequence of this instruction sends a request to the VDP every 72 VDP cycles. On the other hand an OTIR instruction takes 23 Z80 cycles (per iteration) or 138 VDP cycles between CPU-VRAM requests. A sequence of OUTI leaves 108 VDP cycles between CPU requests. So this leaves more room for VDP command execution.
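
For reference, these figures follow from the 6:1 ratio between the 21.48MHz VDP clock and the 3.58MHz Z80 clock; a quick sketch to reproduce them:

```python
# VDP cycles between successive CPU-VRAM requests for each way of streaming
# to port #98, assuming the VDP clock is 6x the Z80 clock (21.48 vs 3.58 MHz),
# as in the figures quoted in the post above.

VDP_CYCLES_PER_Z80_CYCLE = 6

Z80_CYCLES = {
    "OUT (n),A": 12,   # repeated OUT (#98),A
    "OUTI":      18,   # sequence of OUTI instructions
    "OTIR":      23,   # per iteration
}

for instr, cycles in Z80_CYCLES.items():
    gap = cycles * VDP_CYCLES_PER_Z80_CYCLE
    print(f"{instr}: one request every {gap} VDP cycles")
# -> 72, 108 and 138 VDP cycles, matching the numbers in the post
```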

hit9918 wrote:

The set adjust is not explored yet, but is it already clear that at least some of the accesses will be shifted?
That gap in the "left erase signal" is very suspicious.

Indeed, my *guess* would also be that this large gap is related to set adjust.

hit9918 wrote:

I would like the diagram to say at which point the vsync goes active.

I'd like to know this as well. But sorry, our measurements didn't include vsync.

By NYYRIKKI

Enlighted (5300)

13-04-2013, 21:28

Just wanted to say: Very interesting stuff, great job!

By PingPong

Prophet (3339)

14-04-2013, 16:10

hit9918 wrote:

@PingPong, bitness means nothing, the 9938 got same bandwidth as the 16bit Amiga!!! Smile
The Amiga got 640 pixels 16 colors display, and then blitter and cpu is halted, so that is the bandwidth.
The 9938 got 512 pixels 16 colors, and some slow blitter and cpu still available.
So 9938 bandwidth is in same ballpark!

Please make the same speculations about the Amiga's 320x200 screen mode. ;-) There is huge potential for the blitter, better synced with fixed "access slots".
Anyway, the VDP's blitter could have been designed more cleverly, allowing for a better use of the access slots.

By PingPong

Prophet (3339)

14-04-2013, 21:49

@Wouter: so, after the reverse engineering, how long is the time needed for the VDP to carry out a complete VRAM write/read issued by the CPU (worst case)? About 3.4us?

By wouter_

Champion (412)

15-04-2013, 10:24

@PingPong: What do you want to know _exactly_? The time between the moment the CPU performs the OUT instruction and the moment that data actually gets written to VRAM? Or the minimum amount of time there should be between OUT instructions coming from the CPU? Both are closely related but not 100% the same. I'll assume you mean the latter.

The largest gap between access slots (appears in the mode screen on, sprites on) is 70 VDP cycles. A CPU (or command) request has to be pending for 16 cycles before the VDP starts handling it. So combined there have to be 70+16=86 VDP cycles (or 14.33 Z80 cycles or about 4.0us) between CPU-VRAM requests.

An OUT (n),A instruction takes 12 Z80 cycles, and I did verify that a long sequence of this instruction causes missed VRAM requests. An OUT (C),r instruction takes 14 cycles; that's still (barely) too fast, so I'd expect that a sequence of this instruction also occasionally goes wrong, but I didn't actually test this. I believe that all other ways to send data to the VDP via port #98 should be fine (_any_ instruction in between the above OUT instructions, or an OUTI or OTIR).
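
Putting the numbers from this post together (the 70-cycle worst-case slot gap, the 16 cycles a request must be pending, and the instruction timings), the safety rule can be sketched like this:

```python
# Safety check for streaming to port #98, using the worst-case figures from
# the post: largest gap between access slots (screen on, sprites on) is 70
# VDP cycles, plus 16 cycles a request must be pending before it is handled.

VDP_HZ = 21_477_270            # nominal 21.48 MHz master clock (assumed)
VDP_CYCLES_PER_Z80_CYCLE = 6
MIN_GAP = 70 + 16              # required VDP cycles between CPU-VRAM requests

def always_safe(z80_cycles_between_outs: int) -> bool:
    """True if OUTs spaced this many Z80 cycles apart can never be dropped."""
    return z80_cycles_between_outs * VDP_CYCLES_PER_Z80_CYCLE >= MIN_GAP

print(round(MIN_GAP / VDP_CYCLES_PER_Z80_CYCLE, 2))  # 14.33 Z80 cycles
print(round(MIN_GAP / VDP_HZ * 1e6, 1))              # ~4.0 microseconds
print(always_safe(12))   # OUT (n),A -> False, can lose writes
print(always_safe(14))   # OUT (C),r -> False, still (barely) too fast
print(always_safe(18))   # OUTI      -> True
```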

Does this answer your question?

By hit9918

Prophet (2858)

15-04-2013, 19:20

@PingPong, what I wanted to say is that the databus pin count is often overrated. It is said to be "8-bit versus 16-bit" when the actual issues lie somewhere else. In 512 pixel mode, the 9938 got similar bandwidth to the Amiga. Looks like in this case 16-bit is not better than 8-bit. But, as you already said, the 9938 lacks the logic for a fast blitter in screen 5.

Picture-google "amiga lorraine": the chipset prototype, boards full of cables and TTL chips or something. The deal with the Amiga is a huge amount of logic, and that actually free of bugs.

And I would describe it as "the Amiga actually is NOT a genius architecture (programming model)".
For example, the DMA address registers are 32-bit. That is not genius architecture. An address is an address, what more is there to say. Well, what one has to say is that the competition lacks register bits.
If the 9918 had full register bits for the pattern address, then increasing the address by 1 would do vertical scroll.
Similarly, on the 9938 increasing the address by 1 would do horizontal scroll.

By PingPong

Prophet (3339)

15-04-2013, 21:59

wouter_ wrote:

Does this answer your question?

Yes, completely. thx!

So practically the V9938 is about twice as fast (in the active area) as the old TMS.
As for other methods of sending data to VRAM that are not an OTIR or OUTI sequence, I think they are rarely used. I can't think of a common use of OUT (C),r or repeated OUT (#98),A; the more common OTIR or OUTI are instead safe.

By PingPong

Prophet (3339)

15-04-2013, 22:16

hit9918 wrote:

If the 9918 had full register bits for the pattern address, then increasing the address by 1 would do vertical scroll.
Similarly, on the 9938 increasing the address by 1 would do horizontal scroll.

? I'm not sure what you mean. Horizontal scroll?????

Anyway, bitness counts. If the Amiga blitter had an 8-bit register, the speed of the memory moves would have been slower. And the SMS VDP is another example.
A point for the V9938 versus the Amiga blitter is the ease of programming. You think in pixels, even in byte move operations. By contrast, the Amiga blitter forces you to think at a lower level, with minterms, and you must handle pixel-aligned operations yourself.

With the V9938 you send the same values for a graphical operation regardless of screen mode, color depth, resolution, or memory arrangement of the video data.

What is unclear to me is why they designed such slow logic for the internal VDP operations, which, working only with VDP registers, should have been literally fast (for the era) on a 21MHz chip.

By hit9918

Prophet (2858)

16-04-2013, 01:25

Theory: change the adjust register by max 8 pixels every scanline, then no blitter wreck.
(in a scroller it often jumps by 15 pixels)

In the timing diagram, left erase signal, sprites on, the gap between cycle 130 - 162: (162 - 130) = 32 vdp cycles = 8 pixels.

Question is at which point the adjust register is read into the delay counter.
And at which point the CPU gets the irq + jitter and then writes the register.

Maybe these questions can be ignored when changing by max 4 pixels per scanline.
Plus above mentioned uncertainties, the cpu will end up changing max 8 pixels per scanline = ok.

Setting the adjust register in 4 steps in 4 rasterlines.
set adjust, wait hblank, set adjust, wait hblank, set adjust, wait hblank, set adjust.

By hit9918

Prophet (2858)

16-04-2013, 02:50

@PingPong:
"? I'm not sure what you mean. Horizontal scroll?????"
The screen 5 display address: if one could increment it by 1 byte, then the blitter copy would not be needed.
The screen 2 pattern address: if one could increment it by 1 byte, you would get a vertically rolling charset.

"Anyway, bitness counts. If the Amiga blitter had an 8-bit register, the speed of the memory moves would have been slower."

Yes, but :)
Yes, but there was the surprise that the 9938 got plenty of DMA slots, yet for funny reasons the blitter misses most of them.
Yes, but I wonder why the Amiga actually didn't get double the bandwidth when it got a double-width bus?
When the 9938 does hires 16 colors, then the Amiga, given the 16 bits, should have hires 16 colors plus a CPU that is not slowed down a bit, no?

About having a bigger blitter register size: the screen 5 LMMM practically goes 4-bit.
But that register is actually 8 bits wide, for the screen 8 LMMM.
If the 9938 had a 16-bit databus/register but an unchanged style, its screen 5 LMMM would still do 4 bits :)

Missing is microcode for an NMMM :) that does two nibbles in one step in screen 5.
That would need sprites stored at two scroll positions, but you'd get transparent draw at twice the speed.
Also missing is microcode for a TMMM, a screen 8 transparent draw without read-modify-write, going as fast as a copy.

By hit9918

Prophet (2858)

16-04-2013, 04:12

The bitness topic is still bugging me ;) The 9938 gets its screen 8 bandwidth by "multi bank", i.e. by accessing two separate memories: CAS0 vs CAS1 in the diagram.

Easy doubling of bandwidth. Except that it smells like needing more wires. Maybe the "board wire cost" is similar to 16-bit.

I would say the 9938 practically is 16-bit, when it comes to display DMA.

But things look different on the blitter side :P
it does not do multi-bank, which makes it 8-bit,
to add insult to injury the LMMM is 4-bit,
to further add insult to injury slow logic misses available memory slots.

By Salamander2

Expert (124)

16-04-2013, 06:04

Is it possible to implement some kind of hardware like Ademir Carchano's ADVRAM to speed something up?

By PingPong

Prophet (3339)

16-04-2013, 22:25

hit9918 wrote:

About having a bigger blitter register size: the screen 5 LMMM practically goes 4-bit.
But that register is actually 8 bits wide, for the screen 8 LMMM.
If the 9938 had a 16-bit databus/register but an unchanged style, its screen 5 LMMM would still do 4 bits :)

The byte/pixel registers are already sized for the maximum number of bytes that a pixel color uses (8 bits), so on this I agree with you.
But:

my guess is that when the VDP, for a copy operation, does:
NX--          (decrement the width counter of the rectangle)
if (NX < 0)
{
    NX = oldNX    (reload the width)
    NY--;
    etc.
}
else
{
    SX++;
    DX++;
}
those operations are performed internally in 8-bit steps, like the Z80 does for an INC HL.
I suspect the internal logic of the VDP is 8-bit: decrements take more than 1 clock, comparisons and decisions even more...
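
PingPong's guessed loop can be written out as a runnable sketch (an illustration of the stepping logic only, not the actual microcode; copy_steps is a made-up helper):

```python
# Runnable sketch of the guessed per-step bookkeeping for a rectangle copy:
# one engine step either decrements the width counter and advances the
# source/destination pointers, or wraps to the next line when the row is done.

def copy_steps(width: int, height: int) -> int:
    """Count the engine steps for a width x height copy under this model."""
    nx, ny = width, height
    sx = dx = 0
    steps = 0
    while ny > 0:
        nx -= 1
        if nx < 0:          # row exhausted: reload the width, go to next line
            nx = width
            ny -= 1
        else:               # inside the row: advance both address pointers
            sx += 1
            dx += 1
        steps += 1
    return steps

print(copy_steps(2, 1))     # 3 steps: 2 pointer advances + 1 line wrap
```

Each row costs width + 1 steps under this model, so even the end-of-row test adds overhead; that is the kind of per-step cost PingPong is speculating about.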

By hit9918

Prophet (2858)

17-04-2013, 03:14

But the worst thing is the situation with 9938 scrolling. There one practically got NO blitter :P

What is bugging me is that one can't do software sprites (with the CPU), because every frame there is a scrollcopy of a 16 pixel wide column to the other buffer.

An idea: clean up the software sprites in both buffers, then it does not hurt when the scrollcopy did "copy a column messed with sprites".

I think of a tiny amount of software sprites removing sprite load in important places. Think of the horizontal player bullets in Nemesis or the swords in Goblins: not many bytes to draw, but lots of sprite load removed.

timing:
18 cycles to save the background to RAM
2x18 cycles to restore both screens
18 cycles to draw (transparent draw with OUTI as in the streetfighter 2 topic).

Just one 16x16 pixel sprite on screen 8 would need 80 rasterlines, oops.
Still, with horizontally moving player bullets, which cause a lot of flicker, it might be worthwhile to draw 3 thin bullets with the CPU.
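
The '80 rasterlines' figure can be checked with a quick calculation, assuming the 18 cycles above are per byte (my assumption; the post doesn't say so explicitly):

```python
# Back-of-envelope check of the 80-rasterlines claim: a 16x16 sprite on
# SCREEN 8 is 256 bytes, moved four times per frame (save background,
# restore both buffers, draw), at an assumed 18 Z80 cycles per byte.
# A rasterline is 1368 VDP cycles = 228 Z80 cycles.

BYTES = 16 * 16                   # 16x16 pixels, 1 byte per pixel in screen 8
CYCLES_PER_BYTE = 18              # assumed per-byte cost of the OUTI loop
PASSES = 1 + 2 + 1                # save + restore both screens + draw
Z80_CYCLES_PER_LINE = 1368 // 6   # 228

lines = BYTES * CYCLES_PER_BYTE * PASSES / Z80_CYCLES_PER_LINE
print(round(lines))               # ~81 rasterlines, matching the "80" above
```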

By wouter_

Champion (412)

17-04-2013, 10:07

PingPong wrote:

...
those operations are performed internally in 8-bit steps, like the Z80 does for an INC HL.
I suspect the internal logic of the VDP is 8-bit: decrements take more than 1 clock, comparisons and decisions even more...

This is only speculation, but I *guess* that internally the VDP command engine runs at 1/8 of the VDP clock. This guess is based on the observation that all command engine operations take an (integer) multiple of 8 cycles. See also the section 'speculation on the slowness of the command engine' in my document.

Assuming this 1/8 clock speed and the overhead of access slot arbitration (again, see the document), the timing values fit better if add/sub only takes a single cycle. After all, these values are only 9 or 10 bits wide. I *think* it's actually cheaper in hardware to use a 10-bit adder than to execute the operation in two 8-bit steps. An alternative is that the add/sub/compare operations do take multiple cycles, but that some of these operations are executed in parallel. But this requires even more hardware, so I think it's less likely.

Anyway, lots of possibilities. But there are no possible tests that can show exactly how the VDP implements this stuff. And because it has no observable effect, it also doesn't matter (for us).

By hit9918

Prophet (2858)

17-04-2013, 17:32

@wouter_ , "16 cycles in advance of an access slot the VDP checks whether there is either a pending CPU or command request"

The 16 cycles are because of dynamic slot handling cpu versus vdp?
Did you find that out with cpu tests?

By hit9918

Prophet (2858)

17-04-2013, 21:18

@Salamander2,
ADVRAM is direct CPU access to VRAM; it doesn't change the blitter speed.
It requires butchering the mainboard and special software, and then you get only a little speedup.
A 9990 cartridge gives very much more performance.

By wouter_

Champion (412)

17-04-2013, 23:18

hit9918 wrote:

The 16 cycles are because of dynamic slot handling cpu versus vdp?
Did you find that out with cpu tests?

Yes, I assume those 16 cycles (or 2 cycles at 1/8 the clock speed?) are somehow required to handle the arbitration between the CPU and the command engine. Possibly this is because the CPU, the command engine and the VRAM subsystems run at different clock speeds??? Maybe someone with more hardware knowledge can comment on this.

In any case this '16 cycles' rule was the simplest rule I could come up with that allowed me to explain all the measurements. For example, in some tests we saw that even though the CPU had already sent a new request, the VDP did NOT yet handle it even though there was an idle access slot available. Or in other tests we saw that the CPU had sent a request, but instead of handling that CPU request the VDP actually executed a command engine request in the next access slot (normally the CPU has priority over commands). This 16-cycles rule could explain all these anomalies.
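
The rule can be sketched as a tiny arbiter (the function and the example times are invented; only the 16-cycle setup time and the CPU-over-command priority come from the post):

```python
# Minimal sketch of the '16 cycles' arbitration rule: a request can only win
# an access slot if it has been pending for at least 16 VDP cycles before
# that slot; otherwise the slot falls through to the command engine (which
# is subject to the same rule) or stays idle.

SETUP_CYCLES = 16

def grant(slot_time, cpu_request=None, cmd_request=None):
    """Return who gets the access slot: 'cpu', 'cmd' or 'idle'."""
    def ready(request_time):
        return request_time is not None and \
               slot_time - request_time >= SETUP_CYCLES
    if ready(cpu_request):       # CPU normally has priority over commands
        return "cpu"
    if ready(cmd_request):
        return "cmd"
    return "idle"

print(grant(100, cpu_request=80, cmd_request=50))  # 'cpu' (pending 20 >= 16)
print(grant(100, cpu_request=90, cmd_request=50))  # 'cmd' (CPU pending only 10)
print(grant(100, cpu_request=90))                  # 'idle' (CPU must wait)
```

The second and third calls reproduce the two anomalies described above: a too-recent CPU request loses a slot to the command engine, or leaves the slot idle entirely.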

By sd_snatcher

Prophet (3020)

sd_snatcher's picture

18-04-2013, 00:13

@wouter_

As was common in most digital circuits of the 80s and early 90s, the crystal frequency is probably divided by two right at the input, as part of a cheap way to adjust the signal to TTL levels. This means that the real internal VDP clock is probably DHCLK (pin-2, 10.74MHz). I don't know if any of this would matter when emulating it.

The R800 seems to be designed exactly in this fashion: The 28.63MHz is divided by 2 on the input, then the real CPU internal clock seems to be the VCLK pin-74, 14.31MHz, described as クロック出力, that google translates to me as "Clock output". Then this clock is again divided by two to provide the 7.14MHz BUS clock on the SYSCLK pin-72, described as システムクロック出力, that google translates to me as "system clock output", where "system" is probably the motherboard.

The instruction clock cycles shown on the MSX magazine documents about the R800 certainly used the SYSCLK to make things easier for programmers to understand, in a world that would only see commonplace 2x CPU clocks much later, on the 486DX2 CPU.

Oh, yes: And I completely agree that the crystal drifting on the previous posts wouldn't matter for emulation. Smile

@Salamander2

As hit9918 said, implementing ADVRAM on existing machines would be very hard. Speculating, another hypothetical way to speed up the VDP could be to overclock it. But that would require:

1) Provide the CPU clock externally (like a turbo kit)
2) Supply a 5.37MHz TTL clock to the VDP pin-3, taking care that this pin can also be an output.
3) Remove the VDP 21MHz clock and connect it to a software controllable clock multiplier, like the ICS570A. Connect the CLK/2 output of the ICS570A to the VDP clock input. Configure the ICS570A to output x1 by default on CLK/2 output.
4) Replace the VRAMs with ones twice as fast (-6 minimum)
5) Use the CPU to configure between turbo (x2) or normal (x1) speeds and of course to set the pin-2 as input on boot.

All this is completely a draft, and nobody even knows if the VDP can really support such a clock. But in theory it could double the blitter speed.

Maybe that would only be feasible for new motherboard designs. But then, if performance is the goal, it would be much better to use the FPGA V9958 code from the OCM. It supports a turbo blitter mode that is AFAICR, 500% faster than the original V9958.

@hit9918

I also agree that the V9990 would have a much better performance than ADVRAM, and both would not benefit any of the existing software. The only way to make the existing blitter intensive games run better is to speed up the V99x8 blitter. And that is only possible by overclocking it or by using a faster implementation in FPGA like the OCM. The OCM runs those games like a dream with CPU & blitter turbos enabled.

By PingPong

Prophet (3339)

PingPong's picture

18-04-2013, 22:08

hit9918 wrote:

@Salamander2,
ADVRAM is direct CPU access to VRAM; it doesn't change blitter speed.
It requires butchering the mainboard and special software, and then you only get a little speedup.
A V9990 cartridge gives very much more performance.

I fully agree, hit. On the site there are speed comparisons between the VDP and CPU-based approaches, but to be honest the gain is somewhat limited, and the tests are written to favor the Z80. For example, setting a pixel in screen 8 uses only one write operation. Compared with the same operation in screen 5, you have skipped:

- The read operation needed to mask the byte (in screen 5)
- The masking operation (even in TIMP mode)
- The write-back operation.
- The pixel->VRAM address calculation, given that in screen 8 you only need to set H=y and L=x to have the VRAM address in HL. By contrast, the same in screen 5 requires shifting the HL contents.

So the same test, apparently faster on the Z80, is not so fast in real situations.

Of course, this is true with a Z80 CPU at 3MHz. With a Pentium-class CPU able to access VDP VRAM, I think the CPU-based approach would be faster.
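The bookkeeping difference above can be written out as a small model (my own illustration; the VRAM operation counts follow the list above, and the address formulas are the standard bitmap layouts of these modes):

```python
def vram_ops_per_pixel(mode):
    """VRAM operations needed to set one pixel from the CPU."""
    if mode == 8:
        return {"reads": 0, "writes": 1}   # one byte per pixel: just write
    if mode == 5:
        return {"reads": 1, "writes": 1}   # two pixels per byte: read, mask, write back
    raise ValueError("sketch only covers screen 5 and 8")

def pixel_address(mode, x, y):
    """Byte address of pixel (x, y) in VRAM."""
    if mode == 8:
        return y * 256 + x          # H=y, L=x: the address is simply HL
    if mode == 5:
        return y * 128 + (x >> 1)   # needs shifting, as noted above
    raise ValueError("sketch only covers screen 5 and 8")

assert vram_ops_per_pixel(8) == {"reads": 0, "writes": 1}
assert vram_ops_per_pixel(5) == {"reads": 1, "writes": 1}
assert pixel_address(8, 10, 3) == 3 * 256 + 10
assert pixel_address(5, 11, 3) == 3 * 128 + 5
```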

By PingPong

Prophet (3339)

PingPong's picture

19-04-2013, 11:17

wouter_ wrote:

This 16-cycles-rule could explain all these anomalies.

@wouter: when you say "the vdp command engine is stalled when a vram request is pending", which of the possibilities below do you mean?
[assuming a byte copy operation]
Case 1:
a) Read src VRAM byte
b) Write dst VRAM byte
c) Internal VDP command register update (increments, decrements, move to next line etc.)
d) While (b) is still executing [stall]
e) Goto (a)

Case 2:
a) Read src VRAM byte
b) Write dst VRAM byte
c) While (b) is still executing [stall]
d) Internal VDP command register update (increments, decrements, move to next line etc.)
e) Goto (a)

So case (2) is worse, because it sacrifices a bit of parallelism. The register update, theoretically, does not need the VRAM to be available in order to complete.

What pattern do you think is used?

By wouter_

Champion (412)

wouter_'s picture

19-04-2013, 12:32

PingPong wrote:

... What pattern do you think is used?

I think the VDP uses your second model (case 2). Let me try to explain why I believe this is the case.

Let's for example look at the HMMM command (same as in your two cases). The only thing we could measure are the actual VRAM accesses. For this purpose only the arrival time of these accesses matters, and more specifically the time difference between the accesses. Let's use the following notation:

R 24 W 64 R 24 W 64 R ..

Here 'R' represents a VRAM read, W represents a VRAM write and the numbers are the number of VDP cycles between (the start of) the accesses. The above sequence is what you measure when there are plenty of (free) access slots available, so when the command engine doesn't have to wait for VRAM. Let's now look at an example where a VRAM write does have to wait for an access slot:

R 32 W 64 R ...

So the time between R->W has increased, but W->R remains exactly the same. In case of your first model you would instead expect to see 'R 32 W 56 R' (so the lost time between R->W is caught-up between W->R). Unfortunately this is not what's happening.
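This measurement argument can be condensed into a tiny model (my own sketch): given a write delayed by 8 cycles while waiting for a slot, the two cases predict different W->R gaps.

```python
def predicted_write_to_read_gap(model, write_delay, nominal_gap=64):
    """W->R gap for an HMMM iteration whose write was delayed.

    'case1': register updates overlap the VRAM wait, so the lost time
             is caught up and the W->R gap shrinks.
    'case2': register updates only start after the write completes,
             so the W->R gap stays fixed.
    """
    if model == "case1":
        return nominal_gap - write_delay
    if model == "case2":
        return nominal_gap
    raise ValueError("unknown model")

# Measured: 'R 32 W 64 R' (write delayed by 8 cycles, W->R still 64)
assert predicted_write_to_read_gap("case1", 8) == 56   # would give 'R 32 W 56 R'
assert predicted_write_to_read_gap("case2", 8) == 64   # matches the measurement
```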

You are correct that the register updates could happen in parallel with the VRAM-waits, but this does have a cost: you now need a separate register that holds the address of the pending write request (currently the address can be constructed from the x and y coordinate registers).

It seems that often, when there's a trade-off between speed and extra hardware cost, the V9938 designers chose the lower hardware cost. Another example is the trade-off in the sprite rendering: the Y-coordinates of the visible sprites are re-fetched (extra time) instead of temporarily storing all Y-coordinates in an internal buffer (extra registers).

By PingPong

Prophet (3339)

PingPong's picture

19-04-2013, 15:00

wouter_ wrote:

So the time between R->W has increased, but W->R remains exactly the same. In case of your first model you would instead expect to see 'R 32 W 56 R' (so the lost time between R->W is caught-up between W->R). Unfortunately this is not what's happening.

I think your assumptions are correct. Unfortunately.

Quote:

You are correct that the register updates could happen in parallel with the VRAM-waits, but this does have a cost: you now need a separate register that holds the address of the pending write request (currently the address can be constructed from the x and y coordinate registers).

It seems that often, when there's a trade-off between speed and extra hardware cost, the V9938 designers chose the lower hardware cost. Another example is the trade-off in the sprite rendering: the Y-coordinates of the visible sprites are re-fetched (extra time) instead of temporarily storing all Y-coordinates in an internal buffer (extra registers).

They are good followers (unfortunately) of the Karl Guttag school Tongue Crying

By hit9918

Prophet (2858)

hit9918's picture

21-04-2013, 00:10

"when there's a trade-off between speed vs extra hardware cost,
the V9938 designers choose for the lower hardware cost"

But in the bigger picture, I see the 9938 as actually over-engineered, with too many transistors.
Prime example: without a blitter but with a scroll register, the 9938 would have been the better gamer.

By PingPong

Prophet (3339)

PingPong's picture

21-04-2013, 08:43

@wouter: I suspect that FPGA implementations of the VDP use a less accurate approach. I'm wondering how your findings could fit into an FPGA implementation of the V9938.

By PingPong

Prophet (3339)

PingPong's picture

21-04-2013, 08:59

hit9918 wrote:

But in the bigger picture, I see the 9938 as actually over-engineered, with too many transistors.
Prime example: without a blitter but with a scroll register, the 9938 would have been the better gamer.

They started well, by using some 'optimized' read modes (burst, page mode).
But some parts are really sloppily designed. For example the sprites:
they could have linked the color attributes to the pattern number, because they had time to calculate the color address from the already fetched pattern number.
A simple counter would have allowed them to stop sprite pre-parsing, leaving a little more bandwidth.

About the command engine: if it were a bit faster in its internal calculations and not stalled on a pending memory access, then even with the same bandwidth available the command speed would have improved considerably, especially in vblank.

By sd_snatcher

Prophet (3020)

sd_snatcher's picture

21-04-2013, 17:09

@wouter_

What do you think about the possibility that /DHCLK is the real internal clock? Nearly all the clock timings shown are even numbers, except for "15" and "79" on the "Sprites enabled" line. But those two readings look like small sampling errors, because they both:

1) Are misaligned with respect to the readings on the previous lines
2) Break the balance of the sprite-read sizes: the first read gets too long, and the consecutive sprite read gets too short.

By PingPong

Prophet (3339)

PingPong's picture

21-04-2013, 20:03

@wouter, is it possible to tweak the openMSX VDP implementation, keeping the same memory slots and memory operation durations (cycles) as on the real thing, but shortening the 32/64 cycles the command engine needs for its various steps by a factor of 8 (so 4/8 cycles)?

I'm curious to see the performance gain resulting from the greater chance of finding an available slot, because of the smaller gaps.

By wouter_

Champion (412)

wouter_'s picture

21-04-2013, 20:33

PingPong wrote:

@wouter: I suspect that FPGA implementations of the VDP use a less accurate approach. I'm wondering how your findings could fit into an FPGA implementation of the V9938.

I already talked about this with Alex a few times (Alex implemented the command engine in the OCM). He confirms that the current FPGA implementation is very different. E.g. it has multiple adders/comparators in parallel and IIRC it has more VRAM bandwidth available. Alex suspected that the FPGA size of the VDP (in logic elements) could be reduced by implementing it more like the real VDP. Though I doubt Alex has concrete plans to work on this. (I'm sure he'd appreciate any help.)

By wouter_

Champion (412)

wouter_'s picture

21-04-2013, 20:37

PingPong wrote:

@wouter, is it possible to tweak the openMSX VDP implementation ...

It's certainly possible and not even that hard .. just replace some values in the src/video/VDPCmdEngine.cc file. But I don't have any plans to make such an option available in an openMSX release. It has a very limited use, and it's easy enough to experiment with the values yourself if you build your own version. Let me know if you need help with this.

By wouter_

Champion (412)

wouter_'s picture

21-04-2013, 20:53

sd_snatcher wrote:

What do you think about the possibility that /DHCLK is the real internal clock? Nearly all the clock timings shown are even numbers, except for "15" and "79" on the "Sprites enabled" line. But those two readings look like small sampling errors, because they both:
1) Are misaligned with respect to the readings on the previous lines
2) Break the balance of the sprite-read sizes: the first read gets too long, and the consecutive sprite read gets too short.

I *think* this is not the case because if you look at e.g. the timing difference between the RAS and the CAS signals, these differ by only one clock cycle.

About that anomaly for sprites=on at cycles 15 and 79: you can see the same thing at 1251 and 1315, but also at 2, 66, 1238 and 1302. At all these moments a burst read of 3 bytes starts, and it takes 13 instead of the expected 14 cycles (I believe I mention this in the document). So it's not the case that the first read takes too long and the second too short; instead _both_ bursts are too short. I was very suspicious of this result myself, but after double checking I can really confirm this is what the measurements show. Note that without this too-short access pattern, the accesses wouldn't fit in the time line.

BTW thanks for looking at these results in so much detail. I do appreciate you verifying my findings.

By hit9918

Prophet (2858)

hit9918's picture

21-04-2013, 21:12

I feel like the detailed woes must be balanced with detailed 9938 praising Smile

That office chip lacks the scroll, but scroll tricks were found out, so now ironically there is a niche where the 9938 gets TOP performance because of its office-chip nature:
a scrolling screen 7 game.

There are nice Bitmap Brothers games in 16-color dithering on the Amiga and ST.
The 9938 could do these backgrounds with finer-grained dithering in screen 7.
When the dithering appears less harsh, you can get a wider color range.

One would like to see more MSX2 titles, but one has got to say that a 16-bit title is more work.
Yes, the 9938-praising article just claimed 16-bit Big smile

p.s.
On the Amiga in the 640 modes with 16 colors, the blitter's order-of-magnitude advantage is GONE, the CPU is crippled in the same way,
and in general-purpose usage the Amiga sprites are hopeless.
Meanwhile the MSX2 has the CPU and TMS sprites available as always.

By hit9918

Prophet (2858)

hit9918's picture

22-04-2013, 21:31

Reading the first sentence of my previous posting again...
I for sure didn't complain about the detailed talk about the timing, that is just great stuff Smile
What I meant is that one always ends up in details, and then one also gets into mentioning the quirks,
so I wanted to improve the mood by adding the happy sides of the VDP.

By PingPong

Prophet (3339)

PingPong's picture

26-04-2013, 20:28

@Wouter:

Scenario: sprites on, active area, the Z80 doing a continuous OUT (0x98),A, the VDP doing a high-speed byte fill.
- You told us that there are 31 access slots per scanline available to the CPU and VDP.
- You also told us that the command speed is cut by about 50% because of CPU accesses.
- I'm assuming a very simple model of VRAM access slots & the command engine: no delays, access slots aligned to when memory needs to be accessed, etc.
- In this situation, the Z80 can eat about 18-19 VRAM accesses per scanline, right?
- If there are only 31 total accesses, the number remaining for the command engine is 31-18 = 13, right? (best case, simplified model, without delays etc.)
Now, you told us that the command runs at 50% speed with those 13 accesses. So without the CPU, full speed requires about 26 accesses. That still leaves 31-26 = 5 accesses per scanline.

So practically the command engine is not extremely starved, even in the worst bandwidth situation.
So the speed gain in less bandwidth-constrained situations is not directly due to more accesses being available, but mainly due to the time between access slots being shorter, so the command engine is stalled more briefly than under high bandwidth pressure.
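For reference, the 18-19 figure follows from assumed Z80 timings (an unrolled OUT (n),A at 12 T-states on MSX including the extra wait state; the VDP clock taken as 6x the Z80 clock):

```python
VDP_CYCLES_PER_LINE = 1368
Z80_CYCLES_PER_LINE = VDP_CYCLES_PER_LINE // 6   # = 228 Z80 cycles per scanline
OUT_T_STATES = 12                                # assumed: unrolled OUT (n),A on MSX

cpu_accesses_per_line = Z80_CYCLES_PER_LINE // OUT_T_STATES
assert cpu_accesses_per_line == 19               # i.e. roughly 18-19 with loop overhead

slots_for_commands = 31 - cpu_accesses_per_line
assert slots_for_commands == 12                  # close to the 13 assumed above
```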

Not sure if I'm clear, but if this is true, then Sad

By wouter_

Champion (412)

wouter_'s picture

27-04-2013, 01:09

@PingPong:
Your 'story' in your previous post is not completely correct. I think there are two main mistakes:

1) You're assuming that all access slots are equal. This would be the case if they were equally spread over time, but they are not. For example, in the mode sprites-on there is an access slot at t=170. This is only 8 cycles after the previous slot, but most other slots in this mode are spaced wider apart. I've often seen scenarios where both the CPU and the command engine were starved for VRAM bandwidth, but still this slot at t=170 was not always used. So even though there are 31 slots per line, usually not all of them can actually be used. (Actually, even if the slots were all equally spaced apart, there might still be scenarios where not all slots can be used, though it would likely occur less frequently than is currently the case for the slot at t=170.)

2) You assume that a HMMV command at _full_speed_ needs 26 access slots per line. That's not completely true: this is the number of slots it manages to use in sprites-on mode (from the calculations in your post, but see below for more accurate values). But in this mode the command is already limited by the available slots, because e.g. in screen-off mode the command runs faster (it manages to use more slots). And even in screen-off mode, the HMMV command occasionally has to wait.

Let's look at it from another angle: ideally the HMMV command wants to access VRAM every 48 cycles. There are 1368 cycles in a line, so that's 28.5 accesses per line. On average in screen-off mode, the HMMV command actually manages to use about 27.9 slots per line. For sprites-off it's about 22.1 slots. And for sprites-on it's about 21.0 slots. All without concurrent CPU access. (That value '50%' I already gave a few times was only a rough estimate; based on these numbers it seems that 60% is closer to reality.)
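These HMMV numbers restated as a quick calculation (the measured slot counts are copied from the paragraph above):

```python
CYCLES_PER_LINE = 1368
HMMV_ACCESS_INTERVAL = 48        # ideal: one VRAM write every 48 cycles

ideal_accesses = CYCLES_PER_LINE / HMMV_ACCESS_INTERVAL
assert ideal_accesses == 28.5

measured = {"screen off": 27.9, "sprites off": 22.1, "sprites on": 21.0}

# Fraction of the ideal rate actually achieved, per mode:
assert round(measured["screen off"] / ideal_accesses, 2) == 0.98
assert round(measured["sprites on"] / ideal_accesses, 2) == 0.74
```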

By PingPong

Prophet (3339)

PingPong's picture

27-04-2013, 01:53

wouter_ wrote:

@PingPong:

1) You're assuming that all access slots are equal. This would be the case if they are equally spread over time, but they are not. For example in the mode sprites on, there is an access slot at t=170.

Yes, I'm assuming an even distribution of access slots (which is not true). This was for simplicity. I'm also assuming the same probability for all access slots to be used.

Quote:

... Let's look at it from another angle: ideally the HMMV command want to access VRAM every 48 cycles. There are 1368 cycles in a line, so that's 28.5 accesses per line. On average in mode screen off, the HMMV command actually manages to use about 27.9 slots per line. For sprites-off it's about 22.1 slots. And for sprites-on it's about 21.0 slots. All without concurrent CPU access. (That value '50%' I already gave a few times was only a rough estimate .. based on these numbers it seems that 60% is closer to reality.)

OK, in screen-off we are close to the 28.5 accesses; all other situations are a bit less, say 25% below full speed.

So, correct me if I'm wrong:
the VDP does not use all the accesses on a scanline, mainly because of the spacing and uneven distribution of the access slots, so it's slowed down. BUT:
even when the 'access slot penalty' is less heavy (screen off), the number of access slots used is not greater than the (theoretical) number of access slots available in the more restrictive situation (31 accesses with screen & sprites on).

This is an indication that memory bandwidth is not the real problem behind the VDP command engine's slowness. The real problem is the command engine's slowness itself. Otherwise, the mentioned command should go slowly in the active area but the speed increase should be *a lot* if you turn off the screen, instead of only 25% faster. Right?
Instead, the 25% increase seems to be more related to the shorter amount of time the command engine has to wait (in screen off) when it is ready to access memory: the shorter that time, the higher the speed of the state machine -> more data processed.

By wouter_

Champion (412)

wouter_'s picture

27-04-2013, 12:20

@PingPong:
Yes and no. You're correct that in this specific case (HMMV) the total amount of available VRAM-bandwidth reserved to the command engine is _on_average_ large enough to keep up with the command engine. Though this is useless if the command engine cannot effectively use this available bandwidth.

Let's take another example, the YMMM command. The timing for this command is '40 R 24 W'. So in the ideal case, this command wants to access VRAM 42.75 times per line. So clearly for YMMM in mode sprites on, VRAM-bandwidth _is_ a bottleneck. The speed difference for YMMM in screen-off mode vs sprites-on is roughly a factor 1.9x.

I only thought about this just now: of all the commands HMMV has the highest pixel rate, but YMMM is the most VRAM-bandwidth demanding. So my prediction that HMMV is the most affected by concurrent CPU access is probably wrong; YMMM is likely slowed down more.

So really the command engine is slowed down both by the speed of the command engine itself and by the distribution of available access slots. It depends on the situation which of these two factors is dominant, but in all situations both factors need to be taken into account to explain the actual execution speed.
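The YMMM bandwidth demand above, written out (timings taken from this post):

```python
CYCLES_PER_LINE = 1368

# YMMM: '40 R 24 W' -> two VRAM accesses per 64-cycle iteration
ymmm_ideal = 2 * CYCLES_PER_LINE / (40 + 24)
assert ymmm_ideal == 42.75

# HMMV: one write every 48 cycles
hmmv_ideal = CYCLES_PER_LINE / 48
assert hmmv_ideal == 28.5

# YMMM demands more raw VRAM bandwidth than HMMV, even though HMMV
# has the higher pixel rate per access:
assert ymmm_ideal > hmmv_ideal
```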

By PingPong

Prophet (3339)

PingPong's picture

27-04-2013, 16:04

Yes, it varies by command. However the most used commands are HMMV/LMMV and HMMM/LMMM.
Do you know of any use of YMMM? [i.e. whether any software used it]. I find it almost useless, so I'd forgotten about it.
Honestly, I'd have traded it for a linear VRAM copy command, or to make the VDP command set more orthogonal (it is not).
Anyway, the internal operations of the VDP (register updates, register compares etc., not memory accesses) are very slow IMHO.

This may be because the internals run at 1/8 of the master clock. But I cannot see any technical limit that would make it hard to realize this kind of logic running at 21MHz. Even YMMM does only ~42 accesses per scanline, yet the VDP in vblank has about 3 times the bandwidth available, so why waste this with a very slow engine?

By hit9918

Prophet (2858)

hit9918's picture

28-04-2013, 10:38

@PingPong,
it is hard to figure out what makes blitter speed; lots of misc stuff is involved.

For example, the 16 cycles for vram slot allocation.
HMMM 64 R 24 W
HMMM (6×8+16) R (1×8+16) W
88 cycles

without the 16 cycles phases, it would be
HMMM 48 R 8 W, it would be 56 cycles, 1.6x faster.

BUT! looking at the typical 32 cycles distance of available slots in display area, rounding things up to multiples of 32,
the result in BOTH cases is
HMMM 64 R 32 W

So now it looks like "the 16 cycles for slot allocation didn't hurt", surprise. The character is "misc".
But in blank area things could be different.

With "no 16 cycles for slot allocation plus fast dream logic", it would be HMMM 32 R 32 W.
Factor 1.5x faster, a bit low for "lots dream logic". Same speed logic with 16bit registers and doublebanked memory use would give 2x speedup.
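The rounding argument above can be checked mechanically (a sketch; it assumes slots spaced exactly 32 cycles apart, which the thread already notes is only roughly true in the display area):

```python
def round_up_to_slot(cycles, slot_spacing=32):
    """Round a phase length up to the next access-slot boundary."""
    return -(-cycles // slot_spacing) * slot_spacing   # ceiling division

hmmm_with_overhead = [64, 24]     # '64 R 24 W', including the 16-cycle phases
hmmm_dream = [48, 8]              # hypothetical HMMM without them

assert [round_up_to_slot(c) for c in hmmm_with_overhead] == [64, 32]
assert [round_up_to_slot(c) for c in hmmm_dream] == [64, 32]
# Both collapse to 'R 64 W 32': the allocation overhead hides in the slot gaps.
```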

By PingPong

Prophet (3339)

PingPong's picture

28-04-2013, 11:22

hit9918 wrote:

@PingPong,

So now it looks like "the 16 cycles for slot allocation didn't hurt", surprise. The character is "misc".

@hit9918:
I suspect that it does matter. In the real case I assume that the access slot cycles are not perfectly aligned with the engine cycles. So it could happen that when the command engine wants to access memory, it misses the window by even a single cycle. In that situation the next possibility is 32 cycles (or more) later.

What I'm pointing out is that while I can accept the 16-cycle delay, I find the internal state changes rather slow.
It's true that faster internal operations would not improve speed much in bandwidth-constrained situations, but in vblank things are very different. So, again, why did they make such a damn snail of a state machine?

By hit9918

Prophet (2858)

hit9918's picture

28-04-2013, 12:00

Speculation about what the blitter spends its cycles on.

YMMM 40 R 24 W
HMMM 64 R 24 W

looks like SX coordinate business needs 24 cycles.

LMMV 72 R 24 W
LMMM 64 R 32 R 24 W

The LMMV needing 8 more cycles for the first R than LMMM:
Like maybe 8 cycles for LD akku,register (8 cycles is one cycle in the 1/8 blitter clock)

LMMM does read the second byte and complete the OP in 32 cycles:
OP akku,(from memory) in 32 cycles.

Then the first 72 cycles of LMMV is:
24 cycles for SX coordinate business
8 cycles for something unknown
8 cycles LD akku,register
32 cycles OP akku,(from memory)
---
72 cycles, "72 R"

LMMM:
24 cycles for SX coordinate business
8 cycles for something unknown
32 cycles LD akku,(from memory)
---
64 cycles, "64 R",
and then
OP akku,(from memory) in "32 R".

YMMM:
8 cycles for something unknown
32 cycles LD akku,(from memory)
--
"40 R"

LD (memory),akku is always "24 W".

mhm, does blitter have undocumented opcodes? Smile

By hit9918

Prophet (2858)

hit9918's picture

28-04-2013, 12:06

The "8 cycles for something unknown" could be opcode decoding.

By hit9918

Prophet (2858)

hit9918's picture

28-04-2013, 12:30

So, the speculation written in blitter cycles:

YMMM:
1 opcode decode
4 ld akku,(memory)
;-- "40 R"
3 ld (memory),akku
;-- "24 W"

LMMV:
1 opcode decode
3 SX coordinate
1 ld akku,register
4 OP akku,(memory)
;-- "72 R"
3 ld (memory),akku
;-- "24 W"

LMMM:
1 opcode decode
3 SX coordinate
4 ld akku,(memory)
;-- "64 R"
4 OP akku,(memory)
;-- "32 R"
3 ld (memory),akku
;-- "24 W"

HMMM:
1 opcode decode
3 SX coordinate
4 ld akku,(memory)
;-- "64 R"
3 ld (memory),akku
;-- "24 W"

"SX coordinate" needs a lot, while for the memory writes there seems to be no such effort.
More guessing: "SX coordinate" includes routing the nibble in question to the low or high nibble of the akku,
while writing is the plainer "write akku to address DX/2".

Now, because of lack of microcode space, HMMM as well as LMMM on screen 8 involve waiting for the same nibble gear, without actually needing the results Tongue
just speculation Wink
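As a cross-check of this speculation (one blitter cycle = 8 VDP cycles, phase costs exactly as guessed above), the per-phase sums do reproduce the measured timings:

```python
# Speculated phase costs in blitter cycles (from the posts above):
DECODE, SX, LD_REG, LD_MEM, OP_MEM, ST_MEM = 1, 3, 1, 4, 4, 3

speculated = {
    "YMMM": [DECODE + LD_MEM, ST_MEM],
    "LMMV": [DECODE + SX + LD_REG + OP_MEM, ST_MEM],
    "LMMM": [DECODE + SX + LD_MEM, OP_MEM, ST_MEM],
    "HMMM": [DECODE + SX + LD_MEM, ST_MEM],
}
measured_vdp_cycles = {
    "YMMM": [40, 24],          # '40 R 24 W'
    "LMMV": [72, 24],          # '72 R 24 W'
    "LMMM": [64, 32, 24],      # '64 R 32 R 24 W'
    "HMMM": [64, 24],          # '64 R 24 W'
}
for cmd, phases in speculated.items():
    assert [p * 8 for p in phases] == measured_vdp_cycles[cmd], cmd
```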

By PingPong

Prophet (3339)

PingPong's picture

01-05-2013, 11:43

@Wouter: for r#9 S0 & S1 the datasheet says "select simultaneous mode". Do you know what that means? What is this "simultaneous mode"?

By wouter_

Champion (412)

wouter_'s picture

01-05-2013, 17:48

I didn't check the VDP datasheet, but isn't that superimpose mode? Where the VDP simultaneously displays the external video source and the normal MSX video.

By PingPong

Prophet (3339)

PingPong's picture

01-05-2013, 23:07

Thanks for the explanation.
Looking at the TMS VDP (MSX1) datasheet in order to compare memory bandwidth, I've found that the MSX1 VDP has only 128 accesses per scanline.

Despite this, I think the TMS VDP uses those accesses in a more efficient way:
the scanline rendering uses only 96 memory operations per scanline: (1 name table + 1 color attribute + 1 pattern) x 32,
the sprite rendering uses 4x6 = 24 accesses,
and the CPU accesses are 8 per scanline.

All this adds up to 128 memory operations per scanline.

I still find it surprising that with 150 accesses the V9938 doesn't get better performance...
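The accounting above, written out:

```python
name_table = 32        # one name-table read per character, 32 characters/line
patterns   = 32        # one pattern read per character
colors     = 32        # one color-attribute read per character
sprites    = 4 * 6     # 4 displayed sprites, 6 fetches each
cpu        = 8         # CPU accesses per scanline

assert name_table + patterns + colors == 96
assert name_table + patterns + colors + sprites + cpu == 128
```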

By hit9918

Prophet (2858)

hit9918's picture

03-05-2013, 16:20

@PingPong,
9918:
32 + 4*5 sprite
3*32 bytes screen 2
~9 cpu (guess)
--
157 bytes

9938:
32 + 8*5 sprite
128 bytes screen 5 without multibanking
~16 cpu (guess)
--
216 bytes

All these go without multibank. Still 9938 reads much more than 9918.
The document seems to say that screen 5 goes bursting, i.e. without multibank still some bandwidth optimization.
With multibank in screen 8, it is another +128 bytes.

By PingPong

Prophet (3339)

PingPong's picture

03-05-2013, 19:48

hit9918 wrote:

@PingPong,
9918:
32 + 4*5 sprite
3*32 bytes screen 2
~9 cpu (guess)
--
157 bytes

Something is not correct. I've read that Guttag told us there are 128 memory accesses in a scanline. How can the TMS VDP read 157 bytes?

Quote:

9938:
32 + 8*5 sprite
128 bytes screen 5 without multibanking
~16 cpu (guess)
--
216 bytes

This is also not clear. Wouter told us 1368 cycles/scanline. If a memory operation is aligned to 8 cycles, the number of accesses is 1368/8 = 171.
Plus the VDP reads 8 x 6 sprite bytes, not 8 x 5 as you pointed out...

By hit9918

Prophet (2858)

hit9918's picture

03-05-2013, 21:14

I just counted roughly the bytes I get in the end; the 9938 gets quite a few more.

I found a quote:
"When we did the DRAM interface, we were really pushing the DRAM cycle performance. In fact at the time (1977) running the 9918 at 5.4MHz was considered very fast (most other chips were 3MHz or slower) and doing a DRAM cycle every two clocks was really pushing it."

The 5.4MHz dot clock is around 342 cycles per scanline; "a DRAM cycle every two clocks" looks like the 9918 got 171 slots.
171 potential slots, while on the 9938 I observe around 216 actually used bytes,
not to speak of the 16-bit screen 8, which is a whole different league,
so what is bugging you? Smile

I think the lowest-hanging fruit for extending the 9938 is 16-bit sprite DMA (double-banking like screen 8).
Double sprite DMA would give 64 sprites, 8 per scanline, 16-color sprites, wow!

32*2 = 64 bytes Y DMA.
8*5*2 = 8*10 bytes other DMA. 10 bytes per sprite for X,P,pixels. 8 pixel bytes, 16 pixels in 4 planes -> 16 colors.

This gives more than having double of a slow blitter.
Aside from that, a 16-bit blitter would not be double speed, because LMMM would still go about its 4-bit business.

The sprites for 16-bit DMA would need the Y bytes in a separate array (I think the Sega Master System has this too).

By hit9918

Prophet (2858)

hit9918's picture

03-05-2013, 22:30

@PingPong,
"If a memory operation is 8 cycles aligned the n. of access is 1368/8=171"

In display area, slots are rather 6 cycles wide.
So 9938 seems to start out 8/6 more dense, maybe 4/3 is the factor of increased requirements on RAM chips.
However, additionally screen 5 DMA does burst.
And then, screen 8 DMA goes 16bit.

And the blitter is an office & BASIC blitter really Tongue

And that target it actually meets very well:
it takes a small amount of code space in the BASIC ROM and gives fast font printing.
LMMC is unique: "blit image from main RAM" or "source data comes from a CPU algorithm",
like converting a monochrome font to a 16-color screen with just some RRCA, jump on carry, OUT.

By PingPong

Prophet (3339)

PingPong's picture

03-05-2013, 22:37

hit9918 wrote:

....
so what is bugging you? Smile

I've read somewhere in Karl Guttag's docs that there were only 128 accesses, but what you found is also from Karl, so probably you are right: there are 171 accesses.

Quote:

... This gives more than having double of a slow blitter.

Consider this, please: the actual blitter is not a speedy beast, but I think it is *almost* enough. What makes the blitter not enough, then? The missing horizontal scroll register. So you need to scroll by brute force.

Assume for example you have 8 16x16 sw sprites that you manage via vdp cmd. the frame rate is not sooo horrible.
there are few msx2 games that uses sw sprites or sw sprites + vert scroll registers. the results are not sooooo worst in terms of fps. You can even mix those sprites with hw ones used for fast moving objects or bullets.
Obviuously if one use all the limited horse power by using cmd engine to do bruce force scroll, there is no vdp cmd engine power left . Think about the V9958, the things are not so dramatic.

I also suspect that the V9938 internally had horizontal scroll, but it was dropped due to time-to-market reasons, then re-introduced with the V9958.

By hit9918

Prophet (2858)

hit9918's picture

03-05-2013, 23:12

The Amiga is a 2D gamer, but in 3D the bitplanes are odd.
On the 9938, screen 8 textured-cube demos are easy.

Single-color polygons: the Amiga's line and fill modes too are "for BASIC"; games use the CPU.
You have a left word mask, a right word mask, and at the top of the triangles the masks meet in the same word, dammit.
I feel like except for huge polygons, LMMV is faster.

Are there 256-color demos?

By wouter_

Champion (412)

wouter_'s picture

04-05-2013, 10:15

PingPong wrote:

...This is also not clear. Wouter told us 1368 cycles/scanline. If a memory operation is aligned to 8 cycles, the number of accesses is 1368/8 = 171...

Note that I never said that ALL V9938 VRAM accesses take 8 cycles. Only the accesses originating from the command engine subsystem are spaced at least 8 cycles apart.

Other VRAM accesses only take 6 cycles for a single access, 4 cycles for every additional access in a burst, and even only 2 cycles if you take multi-bank mode into account. Please reread the "VRAM accesses" section of my document for details.

Also note that you're over-generalizing things. You cannot simply count the number of accesses and deduce a meaningful cycle count from it without taking the access pattern into account. Let's take sprite rendering as an example. This indeed takes 32 + 8x6 VRAM reads, and with the _current_ V9938 access pattern all these reads combined take 424 cycles to execute (again, see my document). But with another access pattern you can get very different results: e.g. naively assuming each access takes 6 cycles would result in a count of 480 cycles, while if the sprite data were fetched in an 'optimal' way (maximally using burst mode, and not re-fetching the y-coordinate) the required cycle count could be as low as 322 cycles (if I counted correctly).

I just wanted to show these theoretical numbers (322 vs 424 vs 480) to indicate that the calculations in your last few posts don't make much sense. At least on V9938 you really need to take the access pattern into account.
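For what it's worth, the 480 and 322 bounds can be reproduced from the per-access costs wouter_ quotes (6 cycles for an isolated access, 4 for each additional access within a burst). This is just arithmetic on those figures, not emulator code; the measured 424 cycles sits between the two bounds, as his document describes:

```python
# Cycle-count bounds for the V9938 sprite fetches, using the per-access
# costs from the document: 6 cycles for an isolated access, 4 cycles for
# each additional access within a burst.

def burst_cycles(accesses, burst_len):
    """Cycles for `accesses` reads grouped into bursts of `burst_len`."""
    full, rest = divmod(accesses, burst_len)
    cost = lambda n: 6 + 4 * (n - 1) if n else 0
    return full * cost(burst_len) + cost(rest)

reads = 32 + 8 * 6                    # y-scan reads + 6 reads per sprite
naive = reads * 6                     # every access isolated -> 480 cycles
optimal = burst_cycles(reads, reads)  # one maximal burst -> 322 cycles
print(reads, naive, optimal)          # 80 480 322
```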

By hit9918

Prophet (2858)

04-05-2013, 14:50

@PingPong,
the 128-slot figure would be 256 pixels; it smells like a figure for the visible area.

96 slots are for display in screen 1 / screen 2.
That leaves 32 slots for sprite Y and the cpu.

Each slot is 8 pixels apart, i.e. 5.33 cpu cycles apart (I assume the dotclock is 1.5x the cpu clock).

The VDP test made 27 cycles. 27 cycles / 5.33 cycles = 5.06. Match. Every 5th "8-pixel slot" is for the cpu.
Taking a distance of precisely 5 slots: 5 * 5.33 cycles = 26.65 cycles.

Now I WONDER about machines where the cpu has a separate crystal, making events drift by +-1 cycle.
So the 27 cycles would only work with cpu and VDP in sync on the same crystal.
Maybe that is the cause of the idea that "clone VDPs are slower", when actually it is separate crystals?

28 / 26.65 = 1.05 . 28 cycles works with crystals 5% within spec.
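hit9918's slot arithmetic checks out; here is a quick sketch of it, assuming (as he does) a dot clock at 1.5x the Z80 clock. The small difference from his 26.65 comes from his truncating 5.33:

```python
# Check of the slot arithmetic above, assuming the dot clock runs at
# 1.5x the Z80 clock, so an 8-pixel slot lasts 8 / 1.5 ~= 5.33 CPU cycles.
cycles_per_slot = 8 / 1.5

measured = 27                       # cycles per OUT found by the VDP test
slots = measured / cycles_per_slot  # ~5.06 -> every 5th slot is the CPU's
exact = 5 * cycles_per_slot         # exactly 5 slots apart ~= 26.67 cycles
margin = 28 / exact                 # 28 cycles tolerates ~5% crystal drift
print(round(slots, 2), round(exact, 2), round(margin, 2))  # 5.06 26.67 1.05
```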

By PingPong

Prophet (3339)

04-05-2013, 15:26

wouter_ wrote:
PingPong wrote:

...Not clear also this. wouter told us 1368 cycles/scanline. If a memory operation is 8 cycles aligned the n. of access is 1368/8=171...

Note that I never said that ALL V9938 VRAM accesses take 8 cycles. Only the accesses originating from the command engine subsystem are spaced at least 8 cycles apart.

Other VRAM accesses only take 6 cycles for a single access, 4 cycles for every additional access in a burst, and even only 2 cycles if you take multi-bank mode into account. Please reread the "VRAM accesses" section of my document for details.

I already know; I also know that my calculations are approximate. I was applying the TMS VDP model (which is more linear) to the V9938. On that chip you have a master clock frequency, then a number of clocks for each access slot, so the total number of access slots is easy to calculate. Plus there are no holes, gaps or tricky access patterns. The V9938 has very complex logic without any reasonable advantage.

Even the TMS VDP is better/more clearly designed, if one takes into account the slower RAM and clock speeds of its era.

By hit9918

Prophet (2858)

04-05-2013, 16:10

Problems.
From the TMS manual:

                    |            |  VDP  | Time waiting for |  Total
   Condition        |    Mode    | Delay | an access window |  time
------------------------------------------------------------------------
Active Display Area |   Text     | 2 us  |   0  -  1.1  us  | 2 - 3.1 us
------------------------------------------------------------------------
Active Display Area |  Graphics  | 2 us  |   0  -  5.95 us  | 2 - 8   us
                    |    I,II    |       |                  |

The total of 8 us, rounded up, is the good old 29 cpu cycles.
While the 5.95 us "wait for an access window" is 21.2 cpu cycles.

Divided by 5.33 cpu cycles (8 pixels): 21.2/5.33 = 3.98. OOPS.
Now it looks like every 4th 8-pixel chunk is a cpu access window.

And the other figure just accidentally matched so nicely. Maybe because the actual VDP delay is 8 pixels, which would be 1.49 us.
When "2 + 5.95 = 8"... the manual's figures are a la "not exact, round up when in doubt".

Surprise: it looks like 9918 port I/O, as well, can't use all the available memory slots.
The 16-cycle delay figure of the 9938 would be 4 pixels (of screen 2), twice as fast as the 9918 "VDP delay".

Summary of theories:
port IO in 40 pixels = 26.66 cpu cycles.
Some machines may make it in 27 cycles, with drifting cpu crystal in 28 cycles.

Maybe leave it at the good old 29 cycles, which is nicely outi + jp nz.
But outi + 2x nop at 28 cycles is tempting.
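The two loop timings mentioned can be tallied from standard Z80 T-state counts, assuming the usual MSX convention of one extra wait cycle per M1 (opcode fetch) machine cycle; a sketch:

```python
# MSX Z80 T-states: every M1 (opcode fetch) cycle costs one extra wait
# cycle on MSX, on top of the standard Z80 timing.
OUTI  = 16 + 2   # ED-prefixed, two M1 fetches -> 18 cycles
JP_CC = 10 + 1   # jp nz,label -> 11 cycles
NOP   =  4 + 1   # -> 5 cycles

classic = OUTI + JP_CC    # outi : jp nz     -> 29 cycles per byte
tight   = OUTI + 2 * NOP  # outi : nop : nop -> 28 cycles per byte
print(classic, tight)     # 29 28
```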

By PingPong

Prophet (3339)

04-05-2013, 18:31

hit9918 wrote:

problems.
...
Surprise, looks like 9918 port IO as well cant use all available memory slots.
The 16 cycles delay figure of 9938 would be 4 pixels (of screen 2), twice as fast as 9918 "VDP delay".

I cannot figure out why all memory slots cannot be used. Karl told us there are 8 slots free for the cpu...
A scanline is 228 Z80 clock ticks. You should write every 27 clock ticks (this almost works on my Sony HB-75P).
228/27 = 8.4
I'm getting more and more confused...
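PingPong's division is right, and the fractional result is exactly the puzzle; a trivial check, assuming 228 Z80 cycles per scanline:

```python
# A scanline is 228 Z80 cycles; one OUT every 27 cycles gives a
# fractional number of writes per line, which is the confusing part.
scanline = 228
writes_per_line = scanline / 27   # not a whole number of slots
print(round(writes_per_line, 2))  # 8.44
```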

By hit9918

Prophet (2858)

04-05-2013, 23:00

"Karl told us there are 8 slots free for the cpu..."

And that's true when interpreting it as "in the end, the cpu manages to catch roughly 8 slots".

But it can catch a non-integer number: 8.4.

And that is possible given the access window spec in the manual, 5.95 us.
Which is 32 pixels, every 4th char.
In a 342-pixel rasterline it fits 10.6875 times.

Maybe it is 10 slots, with the last two at a wider distance.
Then there would be 2 slots spread over a 2.68-slot distance:

2.68 / 2 = 1.34, and 1.34 * 5.95 us = 7.97 us, i.e. the 8 us one gets in the end.

Another time something seems to match.
But those assumed wider slots would be in border area.
Screen 3 also has sprites, so it should be the same in the border area, but screen 3 gets faster access.
Well, at least that's what the manual says.

It is mysterious.
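The 10.6875 figure and the ~8 us match can be re-derived in a few lines, assuming (per the TMS manual figures above) a 342-pixel rasterline and access windows 5.95 us (32 pixels) apart; the tiny difference from hit9918's 7.97 is just his rounding 2.6875 down to 2.68:

```python
# Re-deriving the numbers above: 342-pixel line, access windows 32 pixels
# (5.95 us) apart, 10 evenly spaced slots plus two at a wider distance.
fits = 342 / 32            # 10.6875 windows per rasterline
leftover = fits - 10       # 0.6875 of a window left over
wide = (2 + leftover) / 2  # two slots spread over 2.6875 windows
worst = wide * 5.95        # ~8.0 us, the manual's worst-case total
print(fits, round(worst, 2))  # 10.6875 8.0
```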

By hit9918

Prophet (2858)

04-05-2013, 23:38

p.s.
One can't argue from the amount of slots alone: you can have all slots close together and one wide gap.
That gap is the one for which you have to write the delay code.
The resulting rate can then be a non-integer number of slots per scanline, like 8.4.

By PingPong

Prophet (3339)

05-05-2013, 09:24

@hit9918: one thing is strange for me, considering the era of development: the uneven distribution of access slots.
Take, for example, the Amiga or the C64: their arbitration is very simple, done by statically assigning the access slots.
I think the TMS VDP also does things the same way: just divide the available access slots by extracting the right 'bit' from a counter. Want 1/2^something? OK: available = counter & (2^something). So for me the theory of uneven access slots is hard to believe. If I remember correctly, during the horizontal border time the VDP fetches the Y part of the SAT.
Maybe it is using the bitmap/color/name-table access time to do a fast SAT read?

It's true, from wouter's access timings I see the V9938 is an exception: a bad exception. Accesses are performed very close together (6 cycles); sometimes (command engine) there is a 'modulo 8' window.
wouter suspects that the 8 thing is due to the 1/8 division of the master clock frequency. Maybe. On the other hand I ask myself:
"The V9938 already divides the master clock by 6, for the Z80. Why not use this clock for the command engine too? Simpler, reduced die size (cost), and faster because it could run faster. It might also match the 6 cycles of standard VRAM operations better."
I'm not taking into account the exotic "burst" or "page mode" accesses.
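The static arbitration PingPong describes (grant every 2^k-th slot by masking a free-running counter) can be sketched as follows; this is just an illustration of the idea, not actual VDP logic:

```python
# Static slot arbitration: a free-running slot counter, with every
# 2**k-th slot granted to the CPU simply by masking the low bits.
def cpu_slot(counter: int, k: int) -> bool:
    """True when this slot belongs to the CPU (every 2**k-th slot)."""
    return counter & ((1 << k) - 1) == 0

# With k = 2 the CPU gets every 4th slot:
grants = [c for c in range(16) if cpu_slot(c, 2)]
print(grants)  # [0, 4, 8, 12]
```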

By hit9918

Prophet (2858)

06-05-2013, 19:00

@PingPong,
Maybe they just couldn't clock it any higher, as complex as it is.

Initially, the 9938 felt like "shortcuts, lack of transistors".
Now I feel it is a luxury chip with all design goals met.
Well, the goal being "all transistors for office, no transistors for games".

office commercial:
The TMS philosophy of separate VRAM allows a 16-color hires GUI without cpu slowdown, unlike the Amiga.
Still, with LMMC the cpu RAM can easily be used as image RAM.
The blitter design needs little bios code.

The first performance problem in SymbOS is not the GUI, but the diskette driver disabling interrupts for a long time; things wobble while a disk is loading.

By PingPong

Prophet (3339)

06-05-2013, 20:11

hit9918 wrote:

@PingPong,
Maybe they just couldn't clock it any higher, as complex as it is.

I find that very difficult to believe.
I think there is another, less technical reason.
The VDP internals (for other stuff) already run at a much higher rate...

By Manuel

Ascended (15552)

23-01-2014, 16:42

A superfluous note I hope: this behaviour has been implemented and released in openMSX 0.10.0.