MSX HDMI multimedia card

Page 23/56

By maxis

Champion (512)

23-07-2014, 18:20

PingPong wrote:

As far as I know, the CPC rounds the time needed for Z80 instructions up to the next multiple of 4 T-states, so a 21 T-state OTIR is rounded up to 24. (Things are a little more complicated than I've just described.) So wait states are there regardless of whether VRAM or plain RAM is accessed, like on the Speccy or C64. Did the CPC+ remove this limitation? That sounds a little strange to me...

Putting the SAT and the sprite generator table inside the ASIC completely removes the sprite DMA traffic from the memory timing budget. This is quite often done in arcade HW, BTW. So the sprites can be overloaded on the fly based on the horizontal interrupt.
In other words, the CPC+ sprite HW is seamless as far as memory timing is concerned.
This is what I was comparing when talking about CPC+ vs V9938, i.e. 320x200 at 2 bits per pixel + sprites vs the V9938 screen 5 mode, for example. The color depth is not the same, but the sprites! Also, I assume that the CPC+ palette could easily be overloaded during the retrace of each scanline.

Actually, there is a Prince of Persia port on the CPC!

By PingPong

Prophet (3586)

23-07-2014, 19:32

maxis wrote:

Putting the SAT and the sprite generator table inside the ASIC completely removes the sprite DMA traffic from the memory timing budget. This is quite often done in arcade HW, BTW. So the sprites can be overloaded on the fly based on the horizontal interrupt.

This only applies to sprites. I think normal RAM and VRAM accesses are still slowed down by contention, aren't they?
BTW, there are only 16 sprites on the CPC+, of course, but they are colorful and have advanced zoom, which reduces the number of sprites needed.
For example, a three-coloured sprite takes 2 sprites on the V9938 but only 1 on the CPC+ (and I'm not counting the scanline limit, which does not exist on the CPC+). So the V9938 drops to only 16 effective sprites when we use true multicolor.

By maxis

Champion (512)

23-07-2014, 23:00

Halfaxle wrote:

CPC+ is good, but not enough to be the king of the hill. Only 16 sprites are not enough. etc.

Absolutely; however, with clever programming the sprites can be overloaded on the fly. But maybe this technique wasn't used in game design. The CPC+ is an interesting beast for benchmarking, since it is neither a true shared-memory nor a VRAM-centric architecture.

Halfaxle wrote:

As always (maxis and I are schoolmates), I'd like to say that for a "slow" Z80 with limited bandwidth we need a VDP with a wide palette, some scrollable "backgrounds" and many big colorful sprites. It can be done with a "stack" of 9958s and an external RAMDAC if we prefer to use existing chips. Or it can be a completely new chip design. And I don't think that direct access to VRAM is a good idea.

The "new" chip may act as an upgrade for the existing VDP: logically write-only for the standard 99xx ports/registers and full-blooded beyond them. Maybe we need to occupy extra Z80 ports here.

Right, the V9958 provides one background + one set of sprites. With 4 of them, decent video effects can be achieved, close to the mid-range arcades of the late 80s. Parallax scrolling and priorities have to be solved by the video bus mixer logic (there is a separate chip for this in Konami arcades). Maybe somewhere one of the Arduino or Propeller enthusiasts is already building a multi-VDP arcade, since there is no shortage of chip supply. It wouldn't surprise me if something like this surfaces in the near future.

Direct access to VRAM is only beneficial for a relatively small video buffer (less than 10 Kbytes). The ZX Spectrum is a good example of this approach. Otherwise, VRAM should be separate, I agree entirely. In arcades, part of the video space is even in ROM and not accessible at all (tile patterns).

In general, for 2D arcade games there are either "tiles+sprites" or "sprites for all" approaches. In the past, even shadows, fog and camera blur were simulated with sprites. 128 sprites per line and all the problems are solved ;)

Halfaxle wrote:

Well. We have Full HD HDMI screens everywhere now. So several scalable (from 1x to fullscreen) "viewports" for several virtual VDPs in a new design seem very logical to me. A kind of hardware window support. A single fullscreen viewport for legacy apps and many logical "windowed" VDPs for further use.

Yes, once a single model of 9958 is developed, several instances can be packed into the design. There are 2 limitations, however:
- The physical size of the FPGA;
- The maximum memory bandwidth.

But after all, SW support is needed for such an architecture. As we discussed in the past, it would be cool to have an RTOS with a legacy window (virtual VDP) per task. Then the approach could be compatible with the legacy software. Let's see how it will go.

Greetings ;)

By maxis

Champion (512)

24-07-2014, 00:06

PingPong wrote:

I think normal RAM and VRAM accesses are still slowed down by contention, aren't they?

See, the ZX and CPC/CPC+ are based on a time-shared memory architecture. So, in terms of the average memory access time, it makes no difference whether the Z80 is accessing the video space or the program space.
The video load is quite modest: 80 bytes (say 250ns per access) per line (64us), so around 30% of the CPU bandwidth @ 4MHz is eaten by the screen refresh. And for the sake of simplicity I don't take into account here the vertical blanking interval, which makes things even better for the CPC.

Bottom line: CPC's 2.8 MHz Z80 is on par with unconstrained MSX Z80 @ 3.57MHz (one WS per M1 cycle). Isn't it a new paradigm?

By PingPong

Prophet (3586)

24-07-2014, 08:43

maxis wrote:

Bottom line: CPC's 2.8 MHz Z80 is on par with unconstrained MSX Z80 @ 3.57MHz (one WS per M1 cycle). Isn't it a new paradigm?

The problem here is that the M1 wait is avoidable with a simple measure: faster VRAM chips.
(It was also a stupid choice, probably inherited from other similar HW (Sega SG-1000 or Coleco?).)

The situation is different where the architecture entails the CPU and the video HW sharing memory. I don't have exact info about the CPC/ZX timings, but as far as I know, on the C64 for example this load can be not so modest. Here are two calculations:
The VIC-II video HW fetches 40 bytes for every scanline it renders. On the simpler 6502 this allows uncontended access to RAM only in optimal situations. Now let's see what happens with sprites on the scanline. Every sprite is 24 pixels wide = 3 bytes. If you have 8 sprites on a scanline, the VIC-II needs to fetch 3*8=24 extra bytes of memory. Plus, every 8th scanline the VIC-II also needs to fetch a kind of pointer for every character on screen. There are 40 columns, so 40 extra accesses.

In the worst case, there are scanlines where the VIC-II needs to load 40+24 extra bytes. Each access, if I remember correctly, takes 1us away from the CPU, so a total of 64us per scanline (which is itself 64us long). That means the CPU is frozen for one entire scanline (!)
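
The worst-case estimate above can be restated as a few lines of Python. This is only a sketch of the arithmetic in the post: the constant names are made up for illustration, and the 1us-per-access and 64us-per-line figures are the ones recalled above, not checked against VIC-II documentation.

```python
# All figures come from the post above; the names are illustrative only.
CHAR_POINTER_FETCHES = 40   # one pointer per column, every 8th scanline
SPRITES_ON_LINE = 8         # worst case assumed in the post
BYTES_PER_SPRITE_LINE = 3   # 24 pixels wide = 3 bytes
ACCESS_US = 1.0             # assumed CPU time lost per video access
LINE_US = 64.0              # duration of one scanline

sprite_fetches = SPRITES_ON_LINE * BYTES_PER_SPRITE_LINE
extra_fetches = CHAR_POINTER_FETCHES + sprite_fetches
stolen_us = extra_fetches * ACCESS_US
stolen_share = stolen_us / LINE_US

print(f"extra fetches on the worst line: {extra_fetches}")
print(f"CPU time stolen: {stolen_us:.0f}us of {LINE_US:.0f}us "
      f"({stolen_share:.0%})")
```

Changing SPRITES_ON_LINE or ACCESS_US shows how quickly the estimate moves if the remembered numbers are off.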

C64 games tend to load sprites on the fly to get more sprites on screen, so more cycles are stolen than one might expect.

Plus, you said that on the CPC there is a modest penalty on the CPU (30%). I don't know, but 30% is, for me, more than *modest*.

INFO: I've heard that on the ZX the slowdown occurs only when the CPU addresses the first 16K of RAM (where the VRAM is). I really cannot understand how this can be done: the address lines and data bus are the same. In the case of simultaneous access (CPU + ULA) there should be a conflict even if the CPU is pointing at a byte outside the first 16K.

By maxis

Champion (512)

24-07-2014, 13:05

PingPong wrote:

The problem here is that the M1 wait is avoidable with a simple measure: faster VRAM chips.
(It was also a stupid choice, probably inherited from other similar HW (Sega SG-1000 or Coleco?).)

The MSX has a very elaborate address translation mechanism, which results in an increased propagation delay. This requires 1 wait state per M1 cycle. This wait state is unavoidable. I wasn't yet talking about the VDP implications. This is how the MSX Z80 loses approx. 20% of its performance (5 T-states instead of 4 T-states for a single-byte instruction). So the equivalent clock speed is about 2.8MHz.

On the CPC side, the Z80 runs @ 4MHz. Taking into account the 50Hz vertical scanning frequency and 200 active lines, we have:
For a single 64us scanline, the refresh time is 80*250ns = 20us.
The screen has 200 lines per 20ms (50Hz), so (200*20us)/20ms => 20% of the CPU time.

See the point? The CPC alone, with its Z80 @ 4MHz and screen refresh, is at least as fast as the MSX.
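
For anyone who wants to replay the CPC numbers, here is a minimal Python sketch of the same arithmetic. The constant names are illustrative, and the 250ns access time is the rough "say 250ns" figure used in this thread, not a measured value.

```python
# Figures taken from the posts in this thread; names are illustrative.
BYTES_PER_LINE = 80     # video bytes fetched per scanline
ACCESS_NS = 250         # assumed time per video memory access
LINE_US = 64            # duration of one scanline
ACTIVE_LINES = 200      # displayed lines per frame
FRAME_MS = 20           # one frame at 50Hz

# Time spent on video fetches within one active scanline.
fetch_us_per_line = BYTES_PER_LINE * ACCESS_NS / 1000

# Share of the bus taken during an active line only.
share_active_line = fetch_us_per_line / LINE_US

# Share over the whole frame, blanking lines included.
share_whole_frame = ACTIVE_LINES * fetch_us_per_line / (FRAME_MS * 1000)

print(f"fetch time per line: {fetch_us_per_line:.0f}us")
print(f"share during an active line: {share_active_line:.1%}")
print(f"share over a whole frame: {share_whole_frame:.1%}")
```

The per-line figure is the "around 30%" quoted earlier, and the per-frame figure is the 20% once the lines without display fetches are counted.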

PingPong wrote:

INFO: I've heard that on the ZX the slowdown occurs only when the CPU addresses the first 16K of RAM (where the VRAM is). I really cannot understand how this can be done: the address lines and data bus are the same. In the case of simultaneous access (CPU + ULA) there should be a conflict even if the CPU is pointing at a byte outside the first 16K.

Inside the ZX 48K, the 16K RAM and the 32K RAM don't share the same data/address/control buses. On the data bus the trick is ultra-clever: instead of an octal buffer, series resistors are in place to isolate the 32K from the 16K during video DMA accesses from the ULA to the lower 16K. So running code from the upper 32K is beneficial, since it is free of wait states as long as it doesn't hit the lower 16K at a "bad" moment.
What you have heard is indeed right ;)

By hit9918

Prophet (2897)

24-07-2014, 17:43

"Would this arrangement improve the gaming experience and make the famous titles better (Zanac EX, Aleste, ...)?"
To make fatpixels or to make software multiplexer zero gameplay? :P
This is not robocop, it is genre where TMS is at home.

Another thing. The MSX already has a VDP babel. 9918 9938 9958 9990 are asking for games (9958 with software sprites + scroll is actually much different from 9938).
The best you can do is to be compatible.
I know fpga maker likes to add some special feature. But the software situation is opposite.

The 9990 is the one that works as VDP cartridge without problems.
Does the konami compile stuff run on VDP cartridge? Then 9938 still has more software.
Boy the VDP babel going forth and back... sega and cpc plus can make things only worse.

There has been the idea of sega jukebox.
Which is something massively different from "MSX feature enhancement".
It is about code written for sega master system and not for some MSX babel.
This is how things look on the software side.

By PingPong

Prophet (3586)

24-07-2014, 19:59

maxis wrote:

The MSX has a very elaborate address translation mechanism, which results in an increased propagation delay. This requires 1 wait state per M1 cycle. This wait state is unavoidable. I wasn't yet talking about the VDP implications. This is how the MSX Z80 loses approx. 20% of its performance (5 T-states instead of 4 T-states for a single-byte instruction). So the equivalent clock speed is about 2.8MHz.

This is a little exaggerated, I think. The 20% slowdown only applies when you execute NOP or LD A,B instructions. Unfortunately (or fortunately), the majority of instructions take from 7 to 16 T-states. If the instruction has a 1-byte opcode, you merely lose 1 T-state over 7-16 T-states. I think the loss is approximately 10-15% on average.
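
To put numbers on the per-instruction cost, here is a small Python sketch. It assumes the simple model discussed in this thread, one extra T-state per M1 cycle; the helper name `m1_wait_loss` and its `m1_cycles` parameter are made up for illustration.

```python
def m1_wait_loss(t_states, m1_cycles=1):
    """Fraction of throughput lost when every M1 cycle gains one wait state.

    t_states  -- nominal Z80 T-states of the instruction
    m1_cycles -- opcode fetches in the instruction (assumed 1 by default;
                 prefixed opcodes would need more)
    """
    stretched = t_states + m1_cycles
    return 1 - t_states / stretched


# Same cases as in the discussion: the 4 T-state worst case and the
# 7..16 T-state range that covers most instructions.
for t in (4, 7, 11, 16):
    print(f"{t:2d} T-states -> {m1_wait_loss(t):.1%} slower")

# Equivalent clock for a stream of 4 T-state instructions on a 3.57MHz MSX.
print(f"equivalent clock: {3.57 * (1 - m1_wait_loss(4)):.2f}MHz")
```

A real average would need the instruction mix of an actual program, which this sketch does not try to model.
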
About the reason for the M1 wait state: I'm a bit surprised that it is due to external logic. Even in those early days chip logic was fast enough. Some MSX users, by cutting (physically, sacrilege!) the M1 connection, were able to gain speed with only one problem: VDP access. As you know, there is no HW handshaking, so the access delay loops are SW-based, and the faster execution caused glitches in the VDP interaction. However, for SW that uses BIOS calls this is not an issue, because the VDP BIOS is soooooooooooooo slow that making it a little speedier did not cause any problem ;-)

About the CPC: can you point me to schematics / tech docs that show how things are arranged, please? I'm curious.

By hit9918

Prophet (2897)

24-07-2014, 20:55

The M1 thing is that the Z80 generates refresh signals in every opcode fetch, so the actual fetch gets tight,
and with the M1 wait you get it back to the usual 3-cycle access, like a data access.
Because of this 2/3-cycle thing, I guess with the M1 wait you can run at 150% of the MHz.

The CPC has a cycle table stretched to multiples of 4.
Since it goes without an M1 wait, and in shared RAM at that, I guess refresh is done somewhere else.
http://k1.spdns.de/Vintage/Schneider%20CPC/Das%20Schneider%2...
The numbers in the table are actually divided by 4.
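
A minimal Python sketch of that rounding, under the simplified whole-instruction model mentioned earlier in the thread (the real stretching is more complicated). The function names are made up for illustration.

```python
def cpc_t_states(z80_t_states):
    """Round an instruction time up to the next multiple of 4 T-states.

    Simplified model: the whole instruction is rounded at once.
    """
    return -(-z80_t_states // 4) * 4


def cpc_table_units(z80_t_states):
    """Same time in units of 4 T-states, i.e. 'divided by 4' as in the table."""
    return cpc_t_states(z80_t_states) // 4


# The 21 T-state OTIR case from the thread, plus a few other durations.
for t in (4, 7, 11, 21):
    print(f"{t:2d} T-states -> {cpc_t_states(t):2d} on CPC "
          f"({cpc_table_units(t)} units)")
```

With a 4MHz clock one unit of 4 T-states is 1us, so the table values read directly as microseconds.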

By maxis

Champion (512)

25-07-2014, 18:07

hit9918 wrote:

The best you can do is to be compatible.
I know fpga maker likes to add some special feature. But the software situation is opposite.

I agree that the goal to be compatible with the legacy 99x8 is important.

But in the latest off-topic discussion, PingPong, Halfaxle, you and I were comparing different approaches and the architectures known to date, with their pros and cons.
As Halfaxle said, a multi-V9938/9958 gaming platform config is yet to be built, and that would be beneficial.
On the other hand, SEGA released their arcade version with two SMS VDPs stacked together...

Also, IMHO, if more platforms can be made compatible within the MSX universe, we will have more legacy software to run. The Playsoniq's success justifies such a development vector.

It would be great to have PoP on MSX, wouldn't it? On one hand, SW development from scratch is less likely to happen than a CPC compatibility plug. On the other, the FPGA is flexible enough to run the CPC heart, .....

The Procyon card is much more than just a V9958 substitute, at least from my perspective ....
