VDPX: The Quest for High-Performance Retro Graphics
I haven't been following the discussion at all for time management reasons. But I'm wondering if you're already implementing it, and if so, what you are using for it. Naturally it would be very helpful to stick it in the 1chipMSX.
@MagicBox: Do you think the card can fit in a standard MSX cartridge?
@Edwin: I'm already designing it. The target platform is an Altera Cyclone III, and VDPX will use quite a lot of FPGA resources because of its sheer speed. (Memory blocks are used to cache sprite patterns so that 128 sprites can be processed well within a pixel clock.) I doubt it will fit in the OCM unless that was designed with a large-capacity, high-speed FPGA to begin with.
@Pingpong: I'm aiming for the size of an FMPAC, a bit taller most likely.
Good. I always felt limited by the V9990's very sluggish copy speed and its poor sprite performance.
A cyclone 3 for an MSX vdp seems like somewhat beyond retro to me. There's probably more power in there than every msx hardware extension combined.
I'm also wondering about using the memory bits for sprite patterns. Are you planning on retrieving all sprite patterns at some point during vblank?
I'd like some extras in the sprite attribute table that define how the sprite is displayed, if possible...
OK, there are 128 sprites of 16x16... I imagine they are MULTICOLORED, pixel per pixel? That is not mentioned.
but......
(First request) Would it be too hard if, along with the color value of each sprite pixel in the pattern, there were also a transparency/solid value, at least on a scale of 16 levels?
(Second request)
By setting 2 bits in the SPRITE MODE register... can you do this?
For example:
BIT A B
0 0 = normal 128 sprites of 16x16
0 1 = 64 sprites BUT 32x16
1 0 = 64 sprites BUT 16x32
1 1 = 32 sprites BUT 32x32
I mean that the VDP does the following automatically:
When the mode is 32x16, the VDP uses the X/Y coordinates of sprite 0 to position both sprite 0 AND sprite 1, but sprite 1 with an offset of +16 px in X.
So if you move sprite 0, sprite 1 automatically moves with it.
This way it is THE SAME SPRITE ENGINE, just handling the attribute table differently.
So if you JOIN 2 sprites horizontally, or 2 sprites vertically, or 4 sprites in a 32x32 block, you get bigger sprites while handling just ONE X/Y position. LESS work for the CPU.
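The grouping requested above can be sketched in a few lines of Python. This is only an illustration of the idea, not the VDPX implementation: the function names, the row-major tile order inside a 32x32 block, and the rule that a group starts at its first hardware sprite are all assumptions. The mode bits follow the table proposed in this post.

```python
# Hypothetical sketch of the proposed sprite-mode grouping.
# Mode bits (A, B) map to (tiles wide, tiles tall), per the table in the post.
MODES = {
    (0, 0): (1, 1),  # 128 sprites of 16x16
    (0, 1): (2, 1),  # 64 sprites of 32x16
    (1, 0): (1, 2),  # 64 sprites of 16x32
    (1, 1): (2, 2),  # 32 sprites of 32x32
}
TILE = 16            # size of one hardware sprite in pixels
TOTAL_SPRITES = 128  # hardware sprites available


def logical_sprite_count(mode):
    """Number of big sprites the CPU has to position in this mode."""
    cols, rows = MODES[mode]
    return TOTAL_SPRITES // (cols * rows)


def member_positions(mode, group, x, y):
    """Return [(hardware_sprite_index, x, y), ...] for one big sprite.

    Only the X/Y of the first sprite in the group is supplied; the other
    members get fixed +16 px offsets (row-major order is an assumption).
    """
    cols, rows = MODES[mode]
    per_group = cols * rows
    base = group * per_group
    positions = []
    for i in range(per_group):
        col, row = i % cols, i // cols
        positions.append((base + i, x + col * TILE, y + row * TILE))
    return positions


print(member_positions((0, 1), 0, 100, 50))
```

Moving the big sprite then means changing one X/Y pair; the offsets for the other members never change.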
(Third request) Sprite collision could return a list of collisions, like this: set up a hardware collision buffer that covers one frame. While rendering the sprites, the engine detects collisions and adds a new record to the list for each one, then sets a flag to signal that there is something in the buffer. If a new frame starts before the CPU has inspected the buffer, the sprite collision system is disabled until the CPU reads and empties the buffer.
Of course the buffer will not contain collision information for every pixel of every sprite collision. I want the table to hold something like:
SPRITE X COLLIDES with sprite Y
Sprite Z collides with sprite E
.....
and registers to control the buffer
<ThereIsColisionFlag>
<CpuReadedIt>
<BufferLen>
(Or maybe use simple buffer pointers in VRAM, like a STACK... but a STOP flag that prevents the engine from filling it again is really recommended.)
If one pair of sprites collides in a lot of pixels, only the first detection counts. That is easy: the buffer just doesn't accept an entry that repeats the last stacked one.
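To make the request concrete, here is a minimal software model of such a buffer. It is a sketch of the behaviour described above, not a hardware design: the class name, method names and the capacity of 16 entries are assumptions. The flag and length correspond to <ThereIsColisionFlag> and <BufferLen>, and reading the buffer plays the role of <CpuReadedIt>.

```python
class CollisionBuffer:
    """Software model of the proposed per-frame collision list (hypothetical)."""

    def __init__(self, capacity=16):  # capacity is an assumed value
        self.capacity = capacity
        self.entries = []
        self.has_collision = False  # <ThereIsColisionFlag>
        self.stopped = False        # STOP flag: engine may not fill the buffer

    @property
    def length(self):               # <BufferLen>
        return len(self.entries)

    def record(self, sprite_a, sprite_b):
        """Called by the engine for each colliding pixel pair it detects."""
        if self.stopped or len(self.entries) >= self.capacity:
            return False
        pair = (sprite_a, sprite_b)
        # Reject an entry that repeats the last stacked one, so a pair that
        # overlaps in many pixels is stored only once.
        if self.entries and self.entries[-1] == pair:
            return False
        self.entries.append(pair)
        self.has_collision = True
        return True

    def start_frame(self):
        """A new frame begins; unread data disables further detection."""
        if self.entries:
            self.stopped = True

    def cpu_read(self):
        """CPU reads and empties the buffer (<CpuReadedIt>), re-enabling it."""
        entries, self.entries = self.entries, []
        self.has_collision = False
        self.stopped = False
        return entries
```

Comparing only against the last entry keeps the check cheap, but it will not catch a repeated pair if another collision was recorded in between.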
A cyclone 3 for an MSX vdp seems like somewhat beyond retro to me. There's probably more power in there than every msx hardware extension combined.
I'm also wondering about using the memory bits for sprite patterns. Are you planning on retrieving all sprite patterns at some point during vblank?
Well, the Cyclone 3s come in different capacity grades. I'm using the 3C16, the third grade, mainly because it offers enough M9K blocks to do what I want, in addition to having enough pins to connect the video DAC, RAM and CPU bus. For the selected package (240 QFP), only the slowest speed grade, 8, is available, but it's fast enough for VDPX. The internal system frequency will be 200MHz, using the PLL and an external 50MHz oscillator. The Fmax when using the M9K blocks is 238MHz.
To maintain the speed required to fully process 128 sprites, the sprite engine is 'multi-core'. There will be 4 processing cores, each of which can process a whole sprite in one clock tick (that is, determine whether and which pixel to render for a given screen X/Y, processing the sprite's x, y, pattern and flip bits). To do this, the sprite engine is an 8-stage pipeline.
There will be plenty of pure logic capacity left in the FPGA, which I could use for integrating things like an SCC at no additional cost. Of course, the FPGA isn't retro, it's a modern high-performance part. But that's the beauty of it all.
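As a rough software picture of what one core decides for one sprite, the per-pixel test might look like the sketch below. It is not the actual pipeline: the dictionary layout, the function name and the row-major pattern format are assumptions, and color 0 is treated as transparent as stated further down in this thread.

```python
SIZE = 16  # sprite width and height in pixels


def sprite_pixel(sprite, pattern, sx, sy):
    """Return the color index this sprite contributes at screen (sx, sy).

    `sprite` is a dict with keys x, y, flip_x, flip_y (hypothetical layout);
    `pattern` is a 16x16 list of rows of color indices.
    Returns None when the sprite does not cover the pixel or is transparent there.
    """
    dx = sx - sprite["x"]
    dy = sy - sprite["y"]
    if not (0 <= dx < SIZE and 0 <= dy < SIZE):
        return None                 # screen position lies outside the sprite
    if sprite["flip_x"]:
        dx = SIZE - 1 - dx          # mirror horizontally
    if sprite["flip_y"]:
        dy = SIZE - 1 - dy          # mirror vertically
    color = pattern[dy][dx]
    return color if color != 0 else None  # color 0 is transparent
```

With four cores each finishing one such test per tick, 128 sprites would take 32 ticks to issue, plus the latency of filling the pipeline.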
As for the sprite patterns: yes, they are retrieved from VRAM once per frame, 3 border scanlines before the first content scanline (Y=0). 128 256-color 16x16 sprites are 32KB worth of RAM. Since the VRAM has a 16-bit data bus, only 16K read operations are needed. It's done as a burst read. The entire cache is filled in about one full scanline. Whenever the CPU accesses VRAM, the CPU access is 'inserted'. However, since the CPU is so slow compared to VDPX, the additional CPU access cycles hardly extend the loading time.
Because all sprite patterns are cached, during normal screen rendering VDPX only needs to access VRAM for the 8 layer nametables and the 8 corresponding pattern pixels. The sprite engine processes the sprites from the cache while the layer engine processes the layers concurrently. When both are done, the final merge is done: the sprite pixel is shown if it is valid and lies on top of the resulting layer pixel, if there is one. About 60% of the time between pixel clocks is used for rendering. The other 40% goes unused to allow CPU access without delays, as well as blitter time. CPU access is synchronized with the VRAM arbiter; I'm using 100MHz SRAM, meaning a read only takes two 200MHz cycles. There will be about 60 cycles per pixel clock, of which 36 cycles are needed for rendering. Synchronized CPU access only takes 2 cycles every 4 pixel clocks. The CPU will never be able to choke VDPX in its operation.
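The cache size and read count quoted above follow from simple arithmetic, reproduced here as a quick check. The variable names are only illustrative.

```python
SPRITES = 128
SPRITE_W = SPRITE_H = 16
BYTES_PER_PIXEL = 1   # 256 colors fit in one byte per pixel
BUS_BYTES = 2         # 16-bit VRAM data bus moves 2 bytes per read

cache_bytes = SPRITES * SPRITE_W * SPRITE_H * BYTES_PER_PIXEL
reads = cache_bytes // BUS_BYTES

print(cache_bytes, cache_bytes // 1024)  # 32768 32  (32KB)
print(reads, reads // 1024)              # 16384 16  (16K reads)
```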
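The cycle budget in this post can be checked with a few lines of arithmetic. The figures are the approximate ones quoted above; the variable names are illustrative, and how the spare cycles are actually split between CPU and blitter is not specified here.

```python
CYCLES_PER_PIXEL = 60     # approx. 200MHz cycles per pixel clock
RENDER_CYCLES = 36        # cycles needed for rendering one pixel
CPU_CYCLES = 2            # one synchronized CPU access (one SRAM read)
CPU_EVERY_N_PIXELS = 4    # at most one CPU access per 4 pixel clocks

render_share = RENDER_CYCLES / CYCLES_PER_PIXEL       # about 60%
free_cycles = CYCLES_PER_PIXEL - RENDER_CYCLES        # 24 spare cycles per pixel
free_share = free_cycles / CYCLES_PER_PIXEL           # about 40%

window = CPU_EVERY_N_PIXELS * CYCLES_PER_PIXEL        # 240 cycles per 4 pixels
cpu_share = CPU_CYCLES / window                       # under 1% of all cycles
spare_after_cpu = free_cycles * CPU_EVERY_N_PIXELS - CPU_CYCLES  # 94 cycles

print(f"render {render_share:.0%}, free {free_share:.0%}, cpu {cpu_share:.2%}")
print(f"spare cycles per 4 pixels after CPU access: {spare_after_cpu}")
```

On these numbers the CPU takes 2 of every 240 cycles, which is why it cannot starve the renderer.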
@flyguille:
Each sprite pixel can be set to one of the 256 available colors. Color 0 is always the transparent color. As for collision detection, that's already been worked out using a category system (see the thread in general discussion). A cross-collision table is too complex to build and would hurt performance, as it would require multiple passes over the attributes. The CPU would still have to examine all the collision entries. With the category detection system, no additional passes are needed and the CPU knows instantly whether any collision is worth checking out.
Sprite modes I've been thinking about though, for up to 32x32 sprites. I'll see if/how I can implement this in the current engine. It certainly is useful.
Alpha blending I won't be supporting at this time; maybe in a later design revision, once the initial design is finished. The FPGA has plenty of multiplier blocks to do blending calculations. If alpha blending is implemented, it certainly will not be on a per-pixel basis.
A suggestion, message posted on May 06 2008, 15:13
Sprite Mode has been implemented in the Sprite Cores at no performance penalty ^^. As described, the following modes are now supported:
00: 128 16x16
01: 64 16x32
10: 64 32x16
11: 32 32x32
Sprite X/Y flip continues to work on all sprite modes.
@MicroTech: Yes, it's easy to upgrade VRAM sizes. I've been thinking of upgrading to 1MB, or even 2MB if the memory chips aren't exponentially more expensive.

By MagicBox
Master (198)
02-10-2008, 20:19