3D rasterisation

Page 2/4

By Grauw

Ascended (9398)


21-06-2017, 21:05

As for current performance, rendering 12 polygons during the slowest frames:

  1. Transforming matrices and vertices: 0.6 frames (7 ms)
  2. Preparing polygons for scanline rendering: 0.2 frames (3 ms)
  3. Scanline rendering: 3.4 frames (57 ms)
  4. Span buffer insertion: 0.8 frames (13 ms)
  5. Updating screen: 1.6 frames (27 ms)

First some short notes:

  * The transformation phase can have some easy optimisations to reduce the number of multiplications.
  * Span buffer insertion can be sped up if the polygons are inserted in z- or x-sorted order.
  * Updating the screen is pretty optimised, but dealing with screen 5 is annoying, and screen 7 / 8 is just as fast.
  * No culling is currently implemented.

But as you can see, the scanline rendering algorithm is the slowest part of the code. This is mostly because every scanline, the state of each active edge needs to be retrieved from memory, advanced, and stored back before moving on to the next edge. Do this × polygon count × line count, and you get a pretty expensive piece of code. The current implementation can be optimised of course, but only so far (~40%).
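In outline, that loop looks something like this (a Python sketch with assumed data structures, not the actual Z80 code):

```python
# Hypothetical model of a classic active-edge-table scanline rasteriser.
# Every scanline, every active edge is loaded, stepped, and stored back;
# on a Z80 each of those touches is a memory round-trip, which is exactly
# the "x polygon count x line count" cost described above.

def scanline_spans(edges, height):
    """edges: list of dicts with y_top, y_bottom, x, dx (x step per line).
    Returns, per scanline, the sorted crossing xs (pairs form spans)."""
    result = []
    for y in range(height):
        xs = []
        for e in edges:                      # load each edge's state...
            if e["y_top"] <= y < e["y_bottom"]:
                xs.append(round(e["x"]))
                e["x"] += e["dx"]            # ...advance it, and store it back
                                             # (an explicit round-trip on Z80)
        xs.sort()
        result.append(xs)
    return result
```

With one rasteriser state per edge kept in memory, the inner loop cannot keep anything in registers across edges, which is the bottleneck being discussed.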

Instead I’m considering ditching the scanline algorithm in favour of 212 span buffers, one per scanline. Then I can render one polygon at a time with a simple loop, keeping the complete state in registers.

Should be a lot faster (maybe ~75%?), but the downside is that 212 span buffers take 212K of memory. Running on an ST will require a (slower) external memory mapper, which I hope people own. I still think I should go this way though, what do you think?
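The per-line span buffer idea can be sketched like this (a hypothetical Python model with an assumed span layout and nearer-wins depth rule; the real buffers would be flat byte arrays in mapper RAM):

```python
def _subtract(a, b):
    """Parts of half-open interval a not covered by interval b."""
    parts = []
    if a[0] < b[0]:
        parts.append((a[0], min(a[1], b[0])))
    if a[1] > b[1]:
        parts.append((max(a[0], b[1]), a[1]))
    return parts

def insert_span(buffer, new):
    """buffer: sorted list of (x0, x1, z, colour) for one scanline.
    Insert the new span; smaller z is assumed nearer and wins overlaps."""
    x0, x1, z, colour = new
    pieces = [(x0, x1)]            # visible parts of the new span
    out = []
    for sx0, sx1, sz, sc in buffer:
        if sz <= z:
            # existing span is at least as near: it stays whole,
            # and the new span loses the overlapping part
            out.append((sx0, sx1, sz, sc))
            pieces = [p for seg in pieces for p in _subtract(seg, (sx0, sx1))]
        else:
            # new span is nearer: existing keeps only its uncovered parts
            out.extend((p0, p1, sz, sc)
                       for p0, p1 in _subtract((sx0, sx1), (x0, x1)))
    out.extend((p0, p1, z, colour) for p0, p1 in pieces)
    out.sort()
    return out
```

With one polygon rasterised at a time, only this per-line insert touches memory; the edge-stepping state stays in registers.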

(Ditching the scanline algorithm is a shame in a way though, because the current algorithm is pretty elegant I think: nicely implemented with a singly linked list and a sentinel, and it deals well with polygons with any number of edges, even complex ones.)

By edoz

Prophet (2302)


21-06-2017, 21:13

Wow! Those are so cool! Would be a dream for a screensaver in SymbOS Big smile

By hit9918

Prophet (2897)


21-06-2017, 22:04

A plain VRAM fill on MSX2 takes 18 cycles per byte; 256x192 makes 247 milliseconds.
57+13+27 is 97 ms, which is 40% of that. If it’s R800, then I’d dream of more punch.
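The arithmetic checks out, assuming the standard 3.58 MHz MSX Z80 clock:

```python
# Verify the figures above: an 18 cycles/byte VRAM fill over 256x192 bytes
# on a 3.579545 MHz Z80, versus the 57+13+27 ms of rendering phases.
Z80_HZ = 3_579_545                       # standard MSX Z80 clock
fill_cycles = 256 * 192 * 18             # 18 cycles per byte, full bitmap
fill_ms = fill_cycles / Z80_HZ * 1000    # ~247 ms
render_ms = 57 + 13 + 27                 # scanline render + insertion + update
fraction = render_ms / fill_ms           # ~0.39, i.e. roughly 40%
```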

I don’t understand the details, but the thing with the next pointer being the x sounds very good:
you can insert things without copying the tails of lists.

Quote:

After each scanline, the span buffer is flushed into the RLE image buffer

Is it possible without the conversion?
Or does it maybe make things faster compared to the previous buffer?

By hit9918

Prophet (2897)


21-06-2017, 22:07

I’ve got no grasp of the details; scanline rendering vs span buffer insertion, I thought rendering = inserting into the list.

By Grauw

Ascended (9398)


22-06-2017, 01:38

hit9918 wrote:

A plain VRAM fill on MSX2 takes 18 cycles per byte; 256x192 makes 247 milliseconds.
57+13+27 is 97 ms, which is 40% of that. If it’s R800, then I’d dream of more punch.

Yeah, it’s still spending a lot of time in the rendering stage. It’s not very optimised yet, and I hope there are also many optimisation tricks possible that I don’t know about yet.

By the way, the R800 waits 54 cycles per byte @ 7.16 MHz, so a complete 256x212 fill takes 409 ms in screen 8 and 205 ms in screen 5. And on the Z80, OUT (n),A + DJNZ takes 26 cycles; unrolled OUTI from pre-filled memory can be faster, but only for larger byte counts, and it won’t benefit the turboR.
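Those figures follow directly from the stated wait time (a quick check, using the 7.16 MHz and 54 cycles/byte from the text above):

```python
# Check the R800 fill times: 54 wait cycles per VRAM byte at ~7.16 MHz.
R800_HZ = 7_160_000          # ~7.16 MHz, as stated above
WAIT_CYCLES = 54             # R800 wait per VRAM byte
screen8 = 256 * 212          # screen 8: 1 byte per pixel
screen5 = 256 * 212 // 2     # screen 5: 2 pixels per byte
ms8 = screen8 * WAIT_CYCLES / R800_HZ * 1000   # ~409 ms
ms5 = screen5 * WAIT_CYCLES / R800_HZ * 1000   # ~205 ms
```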

hit9918 wrote:
Quote:

After each scanline, the span buffer is flushed into the RLE image buffer

Is it possible without the conversion?
Or does it maybe make things faster compared to the previous buffer?

Yes it is possible, I could delta-update the screen from span buffers directly and skip the whole RLE representation.

The spans-to-RLE conversion takes 0.24 frames (4 ms) to complete, would be nice to save it.

I use the RLE representation because it is much more compact, as little as 2 bytes per line, 512 worst case. A 16K RLE buffer will be more than enough to fit nearly every type of image. A span buffer takes 1024 bytes per line (w/ 16-bit z). That’s 212K of memory per full image.

Fitting just one of those 212K buffers in memory is already challenging the ST, if I need one more for the previous frame then even the GT won’t have enough.

And there are uses for having even more than two image buffers; e.g. to have v-synced page flipping to remove screen tearing, you need two buffers for the old state, one for each page. Also it may be useful to render a couple of frames ahead into even more buffers, to be able to balance out the framerates. Also I could use the memory for other things like a background image. So using the RLE buffers gives a lot of memory breathing room and flexibility.
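The compactness argument can be illustrated with a toy spans-to-RLE conversion (a hypothetical representation; colour indices and the background value are made up, and the real span buffers also carry depth up to this point):

```python
def spans_to_rle(spans, width, background=0):
    """spans: sorted, non-overlapping [(x0, x1, colour), ...] for one line.
    Returns RLE runs [(length, colour), ...] covering the full width.
    Depth info is dropped: once the line is resolved, z is no longer needed."""
    runs = []
    x = 0
    for x0, x1, colour in spans:
        if x0 > x:
            runs.append((x0 - x, background))   # gap becomes a background run
        runs.append((x1 - x0, colour))
        x = x1
    if x < width:
        runs.append((width - x, background))    # trailing background run
    return runs
```

An empty line collapses to a single run (2 bytes in a compact encoding), which is where the "as little as 2 bytes per line" figure comes from.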

What I think might be an idea though, is to do the spans-to-RLE conversion while delta-updating the screen at the same time. This way, a lot of the cost for the conversion will be absorbed by the 54 ms wait for the VDP.

hit9918 wrote:

I’ve got no grasp of the details; scanline rendering vs span buffer insertion, I thought rendering = inserting into the list.

It is, your thought is correct. I just measured the insert separately and subtracted it from the rendering measurement (the total was 4.2 frames). The span buffer code is quite optimised whereas the scanline rendering is not, and code-wise it is a rather separate task, so I thought it was useful to split them.

By turbor

Champion (451)


21-06-2017, 23:39

As the author of the 3D routines in Calculus and SandStone, I would like to tell how I did some things in my SandStone engine, since that was the most advanced one.

I didn't have time (yet) to look at the code of Grauw's engine, but the engine I made was based on triangles. I stored a list of vertices, and then each triangle consisted of listing which 3 vertices were part of it. I also stored the normal vector of the triangle surface and the middle of the triangle, as well as a desired detail level (simply draw the triangle outline, outline + medians, filled, filled + flat shading, texture-mapped, texture-mapped + shading).

One of the optimizations was that I started with only calculating the Z-component of the normal vector to determine if I needed to calculate the other points. If it is pointing away, you do not need to draw the triangle since you are looking at its backside... On MSX if you do not need to calculate something, simply don't :-)
As a bonus: the z-component of the normal vector determines the shading of the triangle (good old default flat-shading with the light shining along the z-axis...)
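The trick can be sketched like this (illustrative Python; the winding and axis conventions are assumptions, and in turbor's engine the normal was stored precomputed rather than derived from the vertices each frame):

```python
def normal_z(v0, v1, v2):
    """z-component of the cross product (v1 - v0) x (v2 - v0).
    Vertices are (x, y, z) tuples; only x and y are needed, so this
    costs just two multiplications, not a full normal."""
    ax, ay = v1[0] - v0[0], v1[1] - v0[1]
    bx, by = v2[0] - v0[0], v2[1] - v0[1]
    return ax * by - ay * bx

def visible(v0, v1, v2):
    # assuming counter-clockwise front faces with z pointing at the viewer:
    # nz <= 0 means we are looking at the backside, so skip the triangle
    return normal_z(v0, v1, v2) > 0
```

When the stored normal is unit length, the same z-component also doubles as the flat-shading intensity for a light shining along the z-axis, as described above.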

As far as rendering is concerned: when drawing the filled triangles I needed to calculate two Bresenham lines to track the left and right borders of the triangle. Once I had the coordinates, I simply gave the VDP the command to draw the line.
In most cases (depending on the on-screen size of the triangle), by the time I had determined the two points of the next line, the VDP had finished drawing the previous one.
I also optimized the VDP line drawing command by not setting up all registers each time, only the ones that actually need to be altered. Of course I decided whether I was going to draw horizontal or vertical lines depending on the maximum width vs maximum height of the triangle.

When drawing texture-mapped triangles I always used horizontal lines, so that I could simply OUT color codes in screen 8. There was no depth correction for the texture map, just a simple linear offsetting, so I could calculate the offset list once for the maximum-width line...

I stored the maximum and minimum x and y coordinates used when drawing the previous frame, so that I could use the VDP block fill to clean the previous drawing. While this block fill was being performed by the VDP, I calculated all the z-components of the normals as mentioned above, and if I determined that the triangle was visible, I calculated the z-component of its center. These were then all sorted using a bucket sort algorithm, so that I could use a simple painter's algorithm to draw the triangles from back to front...
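The bucket sort plus painter's ordering might look like this in outline (the bucket count and z range here are made-up parameters, not values from the engine):

```python
# Bucket-sort triangles by the z of their center, then emit them
# back to front for a painter's algorithm.

def painter_order(triangles, z_buckets=256, z_max=1024):
    """triangles: list of (z_center, triangle_id), with 0 <= z < z_max.
    Returns triangle ids ordered from farthest to nearest."""
    buckets = [[] for _ in range(z_buckets)]
    for z, tri in triangles:
        b = min(z * z_buckets // z_max, z_buckets - 1)
        buckets[b].append(tri)
    order = []
    for b in reversed(range(z_buckets)):   # farthest bucket first
        order.extend(buckets[b])
    return order
```

A bucket sort is a good fit here because it runs in linear time and the z values only need to be ordered coarsely, not exactly.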

Of course there were a lot of other optimizations to reduce the number of multiplications (no R800 opcodes were harmed during the making of my engine), since multiplication is rather expensive on a Z80. I even used the x²/4 lookup list trick (explained at http://www.msxcomputermagazine.nl/mccw/92/Multiplication/en....)
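The lookup trick referred to is quarter-square multiplication: since (a+b) and (a-b) always have the same parity, a*b = floor((a+b)^2/4) - floor((a-b)^2/4) holds exactly, so a small table of x*x/4 replaces the multiply with two lookups, an add and two subtracts. A sketch:

```python
# Quarter-square multiplication: two table lookups instead of a multiply.
SQ4 = [x * x // 4 for x in range(511)]   # indices 0..510 cover 8-bit a + b

def mul8(a, b):
    """Unsigned 8-bit multiply via the quarter-square table.
    The floor errors in the two table entries cancel exactly."""
    if a < b:
        a, b = b, a                      # keep a - b non-negative
    return SQ4[a + b] - SQ4[a - b]
```

On the Z80 the table indexing can be done with simple 8-bit register arithmetic, which is why this beats a shift-and-add multiply loop.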

By Grauw

Ascended (9398)


22-06-2017, 01:34

Awesome David, very interesting.

Storing the face normal is a nice one; I was already thinking about how to tackle backface culling. In screen space it involves a cross product I think, but this seems simpler because you can do it in model space, much earlier in the process. It costs three multiplications rather than two, but I’m sure that can be won back, for example by skipping the transform of culled vertices.

The middle point is also nice. My depth calculation currently goes as follows: average z of each vertex pair is stored on the edges, and for each scanline the z is the average of the two edges. My way to avoid an annoying divide by 3 Smile. It’s still a bit flaky though (see z-fighting in the video), but I hope backface culling will help, or maybe 16-bit depth… Anyway, doing it once upfront may be cheaper, perhaps also more exact. Although I saw from the Sandstone beta code that you were also struggling with z-fighting.

To update spans which changed between frames, currently I do everything with the CPU but parallelising with the VDP could be interesting; if the span exceeds a certain minimum length I could issue a VDP command for it, and just let it run while I continue. Drawing vertical lines will be difficult though I think with my horizontal scanline-based spans approach.

Painter’s is effectively what I’d be doing if I would z-sort before inserting into the span buffer, I could skip the depth comparisons as well as omit storing depth info in the span buffer… Hmm.

The blockfill I don’t need because I don’t clear the previous frame, rather I use it. This means I have a ton of spare VDP command engine time… what to do :D.

The multiplication you mentioned, I was looking at it earlier, but is there a way to do something like that for 16-bit multiplications without needing a 256K lookup-table? And would perchance a similar trick exist for division or reciprocal? :)

Oh, did you use any divisions at all? Maybe you used orthogonal projection so no need for perspective divide?

Some questions on tracing the edges;

Did you use Bresenham for interpolating the x, or fixed-point additions? I’m currently using the former, but have some code shelved which does the latter; the lines still look nice despite the rounding error, but that 16-bit divide is a bit scary. I think the simplified tracing might not compensate for the extra overhead of the division (need to measure).

Also, how did you deal with x-major Bresenham stepping? I currently do an 8-bit division to determine the step size, and then use Bresenham with the remainder to decide when I need to add 1 more (Bresenham’s Run-Slice algorithm). However if I could avoid that division… Also the lines I get are a bit misaligned (acceptable, but still), because the spans are rounded up, if you understand what I mean.
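The run-slice stepping described here might look like this in outline (illustrative Python, not the actual implementation; it assumes an x-major edge running left to right and top to bottom):

```python
# Run-slice edge stepping: one division up front gives the whole-pixel
# step per scanline, and a Bresenham-style error term on the remainder
# decides when to add 1 more.

def runslice_edge(x0, y0, x1, y1):
    """Trace an edge from (x0, y0) to (x1, y1); yields x at each scanline.
    Assumes y1 > y0 and x1 >= x0."""
    dx, dy = x1 - x0, y1 - y0
    step, rem = divmod(dx, dy)    # the one (8-bit) division
    x, err = x0, 0
    for _ in range(dy):
        yield x
        x += step                 # fixed whole-pixel run
        err += rem
        if err >= dy:             # remainder accumulated a full pixel
            err -= dy
            x += 1
```

Per scanline this is just an add and a compare, regardless of how long the horizontal run is, which is where the win over per-pixel Bresenham comes from.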

By turbor

Champion (451)


22-06-2017, 22:50

About division: yes, I have a few of them, but if possible I try to avoid them, since the multiplication routine is faster.
In Calculus I draw wireframes with depth correction, so there I use the z of the vertex to look up the multiplication factor in a nice table. That factor is applied to both the x and y components. Using an extra offset with the z-component in this lookup table creates a simple way to implement the zoom in that demo part. So the division for depth correction is mitigated by using the lookup and a fast multiplication.

In SandStone there is no depth correction, since this would screw up using the z-component of the normal vector as the visible vs non-visible indicator. Think of a default wireframe cube:

+------+
|\    /|
| +--+ |
| |  | |
| +--+ |
|/    \|
+------+

If you twist it a few degrees the side panels will still not be visible if you apply depth correction, while the z-component would indicate that you need to draw it...

+------+
||    /|
|+--+  |
||  |  |
|+--+  |
||    \|
+------+

The lookup table is still used for the zoom but the z-component of the vertex is ignored in that engine.

Bresenham.
No fixed-point additions, and no run-slice algorithm. There is some self-modifying code in my engine, and there are 4 routines depending on the angle at which the line is drawn (0-45, 45-90, 90-135, 135-180); the first and last simply loop a few times until a new y is calculated. This keeps the code simpler and is still fast enough. If this loop is needed, it also implies that the VDP is drawing longer lines, so a little more time can be spent anyway; otherwise the code would simply be waiting for the VDP command to end anyway Wink

By Grauw

Ascended (9398)


23-06-2017, 16:13

turbor wrote:

In SandStone there is no depth correction, since this would screw up using the z-component of the normal vector as the visible vs non-visible indicator.

Ah, right... that’s very good to know, I didn’t realise that.

I think there’s a way that does work with perspective projection; take the dot product of the face normal with the camera-to-face vector (any face vertex will do). If the result is positive or zero, then the camera is looking at the rear or the side of the polygon.
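In sketch form (plain Python with illustrative names; the post only describes the test, not this code):

```python
# Perspective-safe backface test: dot the outward face normal with the
# vector from the camera to any vertex of the face. A positive or zero
# result means the camera sees the rear or the edge of the polygon.

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def backfacing(normal, vertex, camera):
    """normal: outward face normal; vertex: any vertex of the face."""
    to_face = tuple(v - c for v, c in zip(vertex, camera))
    return dot(normal, to_face) >= 0
```

Unlike the pure z-component test, this stays correct under perspective projection, because it uses the actual viewing direction towards the face rather than the screen z-axis.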

turbor wrote:

Bresenham. There are 4 routines depending on the angle at which the line is drawn (0-45, 45-90, 90-135, 135-180); the first and last simply loop a few times until a new y is calculated. This keeps the code simpler and is still fast enough. If this loop is needed, it also implies that the VDP is drawing longer lines, so a little more time can be spent anyway; otherwise the code would simply be waiting for the VDP command to end anyway Wink

I definitely get the simplicity argument. It was my initial implementation before I read about run-slice.

I figured that if a near-horizontal line goes right across the screen, that’s a lot of add-and-compare loops Smile. The loop of the 8-bit divide is pretty similar in size, but has a fixed 8 iterations. Long horizontal lines should benefit, but short or sloped lines become slower. I wonder which performs better for an average scene.

Btw, I realised I can skip the division in the second and third octant, because it’ll always be 0. D’oh.

By Grauw

Ascended (9398)


15-07-2017, 01:31

Grauw wrote:

Instead I’m considering ditching the scanline algorithm in favour of 212 span buffers, one per scanline. Then I can render one polygon at a time with a simple loop, keeping the complete state in registers.

Should be a lot faster (maybe ~75%?), but the downside is that 212 span buffers take 212K of memory. Running on an ST will require a (slower) external memory mapper, which I hope people own.

Some additional thoughts on this;

By rasterizing polygons one by one instead of all at once, I can have a single “polygon rasterizer unit” which gets reused. This saves a lot of data because I don’t need to keep edge interpolation values for all polygons in memory. It’s also easier to work with, and should simplify implementing things like clipping etc.

About the buffer size concerns: by employing the painter’s algorithm like David does in Calculus and SandStone, I could halve the buffer size to 106K because I wouldn’t need to store depth information anymore. Span buffer insertion would also be faster. A shame to lose the z-buffer though.
