How do you handle visibility? I check the object AABB against the camera frustum for overlap (that is, the rectangle from the camera position (x, y) to (x+resX, y+resY)).
Ah sorry for answering late, I missed your message.
Firstly, of course I only do culling when it’s needed. E.g. the player sprite is always on screen so it needs no culling, and the visual effects sprites all wrap around so they don’t need it either.
What I do cull is map objects, like (currently only) tile animations. For those, if one is outside the viewport I cull it, and otherwise I clamp the tile that’s drawn.
Currently I always check all objects against the viewport, however I think if there are many objects it makes sense to try to optimise that. One approach would be to group them into larger “zones” (e.g. 256×256 areas) and only check visibility on objects whose zone overlaps with the viewport. These zones could also be organised in a quadtree rather than a grid.
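To make that a bit more concrete, this is roughly the kind of lookup I have in mind (a C-style sketch only; my actual code is Z80 assembly and all the names and sizes here are made up):

```c
#include <stdint.h>

#define ZONE_SHIFT   8    /* 256x256 pixel zones       */
#define ZONES_X      8    /* made-up map size in zones */
#define ZONES_Y      8
#define MAX_PER_ZONE 16   /* made-up per-zone limit    */

typedef struct { int16_t x, y, w, h; } Rect;

/* Hypothetical static data: per zone, the indices of the objects inside it. */
extern const uint8_t zone_count[ZONES_Y][ZONES_X];
extern const uint8_t zone_objects[ZONES_Y][ZONES_X][MAX_PER_ZONE];
extern const Rect    object_bounds[];
extern void draw_object(uint8_t index);   /* hypothetical draw call */

/* The per-object AABB vs viewport test. */
static int overlaps(const Rect *a, const Rect *b)
{
    return a->x < b->x + b->w && b->x < a->x + a->w &&
           a->y < b->y + b->h && b->y < a->y + a->h;
}

/* Only walk the zones the viewport touches, and only test their objects.
 * Assumes the viewport stays within the map. */
void draw_visible(const Rect *view)
{
    int zx0 = view->x >> ZONE_SHIFT, zx1 = (view->x + view->w - 1) >> ZONE_SHIFT;
    int zy0 = view->y >> ZONE_SHIFT, zy1 = (view->y + view->h - 1) >> ZONE_SHIFT;
    for (int zy = zy0; zy <= zy1; zy++)
        for (int zx = zx0; zx <= zx1; zx++)
            for (int i = 0; i < zone_count[zy][zx]; i++) {
                uint8_t obj = zone_objects[zy][zx][i];
                if (overlaps(&object_bounds[obj], view))
                    draw_object(obj);
            }
}
```

With a quadtree the zone lookup would become a tree descent instead of a grid index, but the overall shape stays the same.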
Maybe someone else has some other clever optimisation suggestion?
A little update:
So as I mentioned earlier I am considering using screen 11. Especially the 15-bit RGB colours are appealing to get more SNES-like graphics, rather than the 9-bit palette colours which are more MegaDrive-like, and it would show off the MSX2+ better too. I haven’t decided yet but I’m making it easier to experiment and flip the switch if I do.
The last couple of days I took the step to render the background tiles directly from ROM with the CPU, rather than VDP copies from a tile map in page 2. Let’s reiterate the benefits:
- I can use as many tiles as fit into my ROM, rather than 256 max.
- I free up page 2 because the tile map does not need to be in VRAM anymore.
- The VDP has more free time and it can do something else like copy animations.
- Performance is identical to VDP copies, probably faster in 8bpp modes.
- It’s easier to switch to screen 11 if I decide to.
For the 256×4 row draw when scrolling vertically I use CPU VRAM access, whereas for the 4×256 column draw when scrolling horizontally I use HMMC.
For the row draw I output a series of 16×4 blocks, which is best done with CPU VRAM access because the VRAM pointer only changes once every 16 pixels. Were I to use HMMC I would have to render line-by-line, and looking up each tile’s pixel data 4 times is more expensive. Also it would keep the VDP occupied.
I set the VRAM pointer bits 15-17 (r#14) once, and after that only manipulate the lower 14 bits. I always write the tiles starting at x-coordinate 0, because if I don’t write the tile at (240, 252)-(255,255) last the VRAM pointer can move to the next 16K bank.
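In outline the row draw does something like this (a C-style sketch of the structure only; my actual code is Z80 assembly, and the vdp_* helpers are made-up stand-ins for the port #99 address set-up and the OUTIs to port #98):

```c
#include <stdint.h>

/* Made-up helpers standing in for the real VDP access code. */
extern void vdp_set_vram_write(uint32_t addr);              /* port #99 set-up     */
extern void vdp_write_bytes(const uint8_t *src, uint8_t n); /* n OUTIs to port #98 */

#define LINE_BYTES 128   /* screen 5: 256 px per line at 4bpp */
#define TILE_BYTES 8     /* one 16 px tile line               */

/* Draw one 256x4 row strip at page line y. tiles[] holds, for each of the 16
 * tile columns, a pointer to the first of the 4 needed tile lines, which lie
 * contiguously in ROM (4 x 8 = 32 bytes per tile for this strip). */
void draw_row(const uint8_t *tiles[16], uint32_t vram_base, uint8_t y)
{
    for (uint8_t column = 0; column < 16; column++) {  /* x = 0 first, 240 last */
        const uint8_t *src = tiles[column];
        for (uint8_t line = 0; line < 4; line++) {
            /* One VRAM pointer change per 16 pixels; in the real code r#14 is
             * set once up front and only the lower address bits are re-sent. */
            vdp_set_vram_write(vram_base + (uint32_t)((y + line) * LINE_BYTES)
                                         + column * TILE_BYTES);
            vdp_write_bytes(src, TILE_BYTES);
            src += TILE_BYTES;                          /* next line of the tile */
        }
    }
}
```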
For the column draw I output a series of 4×16 blocks, which is best done with HMMC where I can simply output the pixel data in order. If I were to use CPU VRAM access I would have to change the VRAM pointer every 4 pixels, and also would need to update register 14 regularly, which is slower.
After every 4 pixels I increase the tile pixel data index by a stride of 12 pixels. I could avoid that by rearranging the tile pixel data in ROM to be in vertical strips, gaining performance at the cost of doubling the tile memory. For the HMMC I use the trick described here to avoid having to specify the first byte already when executing the command.
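The column draw is shaped roughly like this (again a sketch with made-up helper names; vdp_hmmc_begin stands for the real command register set-up and vdp_hmmc_write for the byte output to the colour register, r#44):

```c
#include <stdint.h>

/* Made-up helpers: start a single HMMC, then feed it bytes one at a time. */
extern void vdp_hmmc_begin(uint16_t dx, uint16_t dy, uint16_t nx, uint16_t ny);
extern void vdp_hmmc_write(uint8_t value);

#define TILE_LINE_BYTES 8   /* 16 px tile line, screen 5 */

/* Draw one 4x256 column strip at page x-coordinate x. tiles[] holds, for each
 * of the 16 tiles in the column, a pointer to the wanted 4 px slice of its
 * first line; the remaining 12 px (6 bytes) of each line are skipped over. */
void draw_column(const uint8_t *tiles[16], uint16_t x)
{
    vdp_hmmc_begin(x, 0, 4, 256);                   /* one command for the lot  */
    for (uint8_t t = 0; t < 16; t++) {              /* 16 tiles of 16 lines     */
        const uint8_t *src = tiles[t];
        for (uint8_t line = 0; line < 16; line++) {
            vdp_hmmc_write(src[0]);                 /* 4 px = 2 bytes, i.e.     */
            vdp_hmmc_write(src[1]);                 /* 2 OUTIs in the real code */
            src += TILE_LINE_BYTES;                 /* stride to the next line  */
        }
    }
}
```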
If I were to use this in an 8bpp mode I would have to transfer twice as many bytes, but the overhead of setting VRAM pointer addresses and increasing the stride would be the same; only the number of outis would double. A quick test shows that the background drawing for 8bpp goes up from 11 ms to 16.5 ms. A 50% time increase for double the bandwidth, not bad.
---
Lastly, an update about the sprite update refactor I discussed earlier and made a let’s code video about: I decided to park that idea (@DarkSchneider can say I told you so ;)). The final speed benefit was not as great as I hoped it would be, and it imposed some limitations on the sprites. I might dig it up later, but for now I’ll stick with the old system.
Good news: you can get a nice result in Screen 11. Only one game has been attempted in this mode, MKid. It was very promising but it was never finished; making beautiful graphics takes a lot of time. How do you intend to create the graphics? Graphics editors for Screen 11 are not very advanced.
How do you handle visibility? I check the object AABB against the camera frustum for overlap (that is, the rectangle from the camera position (x, y) to (x+resX, y+resY)).
Ah sorry for answering late, I missed your message.
Firstly, of course I only do culling when it’s needed. E.g. the player sprite is always on screen so it needs no culling, and the visual effects sprites all wrap around so they don’t need it either.
What I do cull is map objects, like (currently only) tile animations. For those, if one is outside the viewport I cull it, and otherwise I clamp the tile that’s drawn.
Currently I always check all objects against the viewport, however I think if there are many objects it makes sense to try to optimise that. One approach would be to group them into larger “zones” (e.g. 256×256 areas) and only check visibility on objects whose zone overlaps with the viewport. These zones could also be organised in a quadtree rather than a grid.
Maybe someone else has some other clever optimisation suggestion?
Yes, I already thought about the typical area-based organisation. My problem is that most objects are dynamic, so they can move anywhere. In some cases it could be a good solution, e.g. if you have chests, fixed (non-moving) NPCs, and things like that, you could split between movable and non-movable objects. But if you already have all your elements as Objects, then you have to add that extra distinction, as an NPC is going to be a dynamic object because you can talk to it etc. (it has actions) even though it doesn't move.
- Performance is identical to VDP copies, probably faster in 8bpp modes.
Is the VDP that slow?
For the row draw I output a series of 16×4 blocks, which is best done with CPU VRAM access because the VRAM pointer only changes once every 16 pixels. Were I to use HMMC I would have to render line-by-line, and looking up each tile’s pixel data 4 times is more expensive. Also it would keep the VDP occupied.
I speak from memory, but I think the virtual pages are organised vertically, so when you access the rightmost pixel, the next one is the first of the next line, at least in 1-page scrolling mode as is the case here. In that case, the 256×4 block could be copied by only modifying the source HL, using a scanline drawing method.
Reading the method used, you probably already have the tiles in ROM in a linear rather than a bitmap layout. Set the first values for HL and the VRAM pointer, copy 16 px, then set HL to the first value of the 2nd tile, copy 16 px, etc. After copying tile 16 (the rightmost one), set HL to the same value as the first one plus the tile width (8 bytes for SC5 or 16 for SC11).
You probably have to compute these values already: the source for each tile, considering the offset. So you already have something like:
[T0, T1, ..., T15], where TX is the first source position to read from each tile for the line.
Then set the first VRAM position values, set the offset to 0, read T0 into HL, add the offset (initially 0), copy 16 px, read T1 into HL, add the offset, copy 16 px, and so on until the last one, then add tile-width-bytes to the offset. On the next iteration (the next scanline) repeat, but now the offset will be tile-width-bytes (8 or 16 depending on the mode) x the number of lines copied, so it will copy the next tile line into the corresponding position. Repeat until finished.
As the offset value to add is never going to be greater than an 8-bit value, you can track it in the A register, so on each line simply ADD A, tile-width-bytes; then for the addition use an unused register pair (say DE): load A into E, keep D always 0, and do ADD HL, DE on the "add offset" step.
Please consider that I am saying all this from memory as quick thoughts, so the idea could require some more work.
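In C-like pseudocode, to show what I mean (names are invented; the real thing would be the A / E / HL juggling described above):

```c
#include <stdint.h>

/* Invented helpers: set the VRAM write pointer, then copy bytes with OUTI. */
extern void vdp_set_vram_write(uint32_t addr);
extern void vdp_write_bytes(const uint8_t *src, uint8_t n);

#define TILE_LINE_BYTES 8   /* 8 bytes for SC5, 16 for SC11 */

/* T[16]: the precomputed first source position of each tile for this strip. */
void draw_row_scanlines(const uint8_t *T[16], uint32_t vram_dest)
{
    uint8_t offset = 0;             /* tracked in A in the real code       */
    vdp_set_vram_write(vram_dest);  /* set once; auto-increment carries it */
                                    /* into the next line (1-page mode)    */
    for (uint8_t line = 0; line < 4; line++) {
        for (uint8_t t = 0; t < 16; t++)
            /* only the source changes: LD HL,Tt / LD E,A / ADD HL,DE */
            vdp_write_bytes(T[t] + offset, TILE_LINE_BYTES);
        offset += TILE_LINE_BYTES;  /* ADD A,tile-width-bytes, once per line */
    }
}
```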
For the column draw I output a series of 4×16 blocks, which is best done with HMMC where I can simply output the pixel data in order. If I were to use CPU VRAM access I would have to change the VRAM pointer every 4 pixels, and also would need to update register 14 regularly, which is slower.
Those small commands are a pain. I can only think about how to use the CPU while the VDP is busy. I'm not sure, but you probably precompute all operations and then issue them in a row, waiting for the current command to finish before putting in the next one. In that case I can only think of not doing that precomputation, and instead computing the next command while the current one executes, saving the CPU time used for precomputing the values.
If I were to use this in an 8bpp mode I would have to transfer twice as many bytes, but the overhead of setting VRAM pointer addresses and increasing the stride would be the same; only the number of outis would double. A quick test shows that the background drawing for 8bpp goes up from 11 ms to 16.5 ms. A 50% time increase for double the bandwidth, not bad.
In this case I would look more at latency than at the % factor, as the framerate in the end depends on that. More latency also means less time for computing logic. Then it's more about the software design itself: if most elements are static (they don't execute logic until you interact with them), there's no real-time combat, etc., then it's probably fine.
Good news: you can get a nice result in Screen 11. Only one game has been attempted in this mode, MKid. It was very promising but it was never finished; making beautiful graphics takes a lot of time. How do you intend to create the graphics? Graphics editors for Screen 11 are not very advanced.
I make graphics with Aseprite on Mac so I need an approach that works in that environment. I posted some thoughts a few years ago, but essentially I’m thinking of the following approach:
Conceptually, I consider screen 11 as two layers on top of each other: the front one with a 16-colour palette (9-bit) and transparency, like screen 5, and the back one YJK with 12499 colours (~15-bit). Then I’ll primarily draw the image as a normal screen 5 image, but in places where I want more colours I will "punch a hole" in it with transparency to paint on the YJK layer.
Initially I’d treat the YJK layer as one-quarter the horizontal resolution, giving each 4×1 block a single colour, just to simplify the drawing process. In the image editor this means using palette colours as much as possible, and when picking colours not in the palette, keeping different colours 4 pixels apart. It won’t get the most out of screen 11, but hopefully it has a pretty easy learning curve.
After getting more accustomed to YJK I can hopefully get enough of a feel for the system to start using multiple colours within the 4×1 areas, using combinations that fit the YJK restrictions, and gradually make the transparent palette-colour holes larger, making it more "screen 11" and less "screen 5".
I have an Aseprite script which quantises my colours to the screen 5 palette; I would extend it to also convert the non-palette colours to YJK limits (best match). This way I can see the image as it would look on the MSX2+ while I'm editing. Maybe also show the colour ramps with the selected colour in them.
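The conversion itself is simple enough, roughly this in C. It's only a sketch of the idea: my actual script runs inside Aseprite, the function names here are made up, the YJK formulas are the usual V9958 ones (Y = B/2 + R/4 + G/8, J = R - Y, K = G - Y), and it ignores the per-byte screen 11 layout and its slightly reduced Y precision.

```c
#include <stdint.h>

/* Snap an 8-bit channel to the MSX2 3-bit palette range (0-7) and back,
 * for the screen 5 preview. */
static uint8_t to3bit(uint8_t c)   { return (uint8_t)((c * 7 + 127) / 255); }
static uint8_t from3bit(uint8_t c) { return (uint8_t)(c * 255 / 7); }

void quantise_screen5(uint8_t *r, uint8_t *g, uint8_t *b)
{
    *r = from3bit(to3bit(*r));
    *g = from3bit(to3bit(*g));
    *b = from3bit(to3bit(*b));
}

/* Round-trip an RGB colour through YJK to preview what the YJK layer can
 * show. With per-pixel J/K, R and G come back exactly and only B is
 * approximated; the real 4x1 J/K sharing is what the "keep different colours
 * 4 pixels apart" drawing rule above works around. */
void quantise_yjk(uint8_t *r8, uint8_t *g8, uint8_t *b8)
{
    int r = *r8 >> 3, g = *g8 >> 3, b = *b8 >> 3;  /* 5-bit components         */
    int y = (4 * b + 2 * r + g + 4) / 8;           /* Y = B/2+R/4+G/8, rounded */
    int j = r - y;
    int k = g - y;
    int bd = (5 * y - 2 * j - k + 2) / 4;          /* decoded B, rounded       */
    if (bd < 0)  bd = 0;
    if (bd > 31) bd = 31;
    *r8 = (uint8_t)((y + j) * 255 / 31);
    *g8 = (uint8_t)((y + k) * 255 / 31);
    *b8 = (uint8_t)(bd * 255 / 31);
}
```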
- Performance is identical to VDP copies, probably faster in 8bpp modes.
Is the VDP that slow?
Or the CPU is just that fast.
In-between commands the VDP is idling for about 300 cycles while the CPU submits the next command. This slows down small copies. And the time between CE polls was too short to do any other meaningful calculations.
For the row draw I output a series of 16×4 blocks, which is best done with CPU VRAM access because the VRAM pointer only changes once every 16 pixels.
I speak from memory, but I think the virtual pages are organised vertically, so when you access the rightmost pixel, the next one is the first of the next line, at least in 1-page scrolling mode as is the case here. In that case, the 256×4 block could be copied by only modifying the source HL, using a scanline drawing method.
For me changing the source bank and address (86 cycles) is more expensive than changing the destination VRAM address (56 cycles). Either approach is valid though, and depending on your tile and tile map layout it may be different.
For the column draw I output a series of 4×16 blocks, which is best done with HMMC where I can simply output the pixel data in order.
Those small commands are a pain.
Ah, no, I start a single 4×256 HMMC, and then just send bytes for the series of blocks to it in sequence from the CPU.
A quick test shows that the background drawing for 8bpp goes up from 11 ms to 16.5 ms. A 50% time increase for double the bandwidth, not bad.
In this case I would look more at latency than at the % factor, as the framerate in the end depends on that. More latency also means less time for computing logic. Then it's more about the software design itself: if most elements are static (they don't execute logic until you interact with them), there's no real-time combat, etc., then it's probably fine.
Since the background is drawn in the borders of the screen, the screen can scroll immediately; the newly drawn pixels only scroll into the active display area a few frames later.
About latency in general, I spread things out between 2 frames so the latency is a bit more due to that. That’s OK for an RPG, probably not for a shmup. I scroll diagonally at 2x speed so I draw 1024 pixels / frame (one 256×4 row plus one 4×256 column every two frames). A unidirectional shmup at 1x speed only needs to draw 256 pixels / frame. So more CPU time naturally becomes available for all the dynamic action on screen.
OK, I was confused about the HMMC command.
as an NPC is going to be a dynamic object because you can talk to it etc. (it has actions) even though it doesn't move.
Not sure how it will be in Grauw's RPG, but I know plenty of RPGs where NPCs walk around in town.
Yes, in that case it's even worse. So I'm less sure whether maintaining all that structure is worth it compared to checking directly.
Most NPCs will be static position objects with tile animations.
Maybe some NPCs will walk around; they will then be dynamic position objects using sprites, and their areas / paths will ensure no two are ever on the same vertical position, to avoid sprite flicker. There won't be too many though, since their overhead is quite high: path finding, collision detection, etc. Also it's additional work for not so much gain.
I'm more concerned about tile animations (flowing water, blowing grass, torches, chimney smoke, etc). I will try to avoid a partitioning structure, but if I model them as static objects as I do now then there may be so many of them in a scene that some sort of optimisation may be needed. But then just for static position objects. Moving ones will not be optimised, I should just reduce their number if necessary.