Thursday, May 14, 2020

RayCasting engine on the ZXSpectrumNext - Part4

So I'm going to skip sprites for now. They're basically the same as walls, but with transparency, though there are some fancy optimisations on them that I'll go into later. This one will be big, as some of the generated functions are huge... be warned!

For now, I want to talk about upscaling everything to full Layer 2. Lores rendering at 128x96 needs only one quarter of the fill rate of full 256x192 Layer 2, so if we just scaled up, each game cycle would take about 8 to 12 frames - or about 4 fps. Obviously too slow. So we needed to seriously up our rendering game. Before doing anything else, I actually had to scale everything up so it was rendering in Layer 2 - albeit slowly. Then I could start the hard work.

The first thing I wanted to do was get rid of all the divides, and there were enough in there that it hurt a lot. So I created a very large one-over (reciprocal) table, so that I could convert something like 100/4 into 100*0.25. Multiplies are much quicker than divides, and using a table basically means the table holds the answer to the 1/x part - more or less. This got a 1000 T-state function down to about 150 T-states. That's a big win, especially given that this is done several times a vertical scan. In all, this shaved off about a frame - a good start!
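
To make the idea concrete, here's a rough sketch in Python (not the actual Z80 routine). The table size, the 16-bit fixed-point format and the names RECIP and fast_div are all assumptions for illustration; the post doesn't say how the real table is laid out.

```python
# Sketch of a "one-over" table: RECIP[d] holds 1/d as a 0.16 fixed-point value,
# so n / d becomes one lookup, one multiply and a shift.
FRAC_BITS = 16
TABLE_SIZE = 256  # hypothetical: divisors 1..255

# Entry 0 is unused (there is no 1/0). Rounding each entry up keeps results
# exact for the small operands used here.
RECIP = [0] + [-(-(1 << FRAC_BITS) // d) for d in range(1, TABLE_SIZE)]


def fast_div(n: int, d: int) -> int:
    """Approximate n // d with a table lookup and a multiply."""
    return (n * RECIP[d]) >> FRAC_BITS
```

With these sizes the result matches a real integer divide for every n and d up to 255; a bigger table or bigger operands would need more fraction bits.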

Next I took a look at the rendering. That Texture_Vertical() function is just way too slow. I tried several shortcuts, but what I really needed was custom vertical code - for each span height. What I did was write a C# program to auto-generate Z80 code that, given the base address of a screen line and the base of a texture, would render a span exactly - without the need for loops or calculations, or anything in fact.

Here's a bit of code that's been generated to render a 16 pixel high span. In these functions DE points to the TOP of the span on the screen, and HL points to the base of the TILE column. The 48K screen is mapped into the lower memory - overlapping the ROM in Write Only mode, meaning D=0, and E=the Layer 2 column to render. In fact, we can ignore D, and this can be set up by the function, as the span height indicates the Y coordinate it'll start on.

; Render height of 16
; T-States = 443
; Bytes    = 84
Render_16:
 ld d,68
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 ret

What about a smaller one - 4 pixels?
; Render height of 4
; T-States = 134
; Bytes    = 25
Render_4:
 ld d,74
 ldws
 ld a,15
 add hl,a
 ldws
 add hl,a
 ldws
 dec a
 add hl,a
 ldws
 inc a
 add hl,a
 ldws
 ret

As you can see, the first instruction moves to the correct Y, and you can see we scale nice and evenly. A gets loaded with the texel delta (less the one step that LDWS already makes), and is added to HL. The code generator keeps track of registers, and if a value doesn't need to change, it can generate more optimal code. What about 11 pixels? This is a little dirtier.
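
The stepping itself is easy to sketch. This Python version (names hypothetical, and the floor rounding is an assumption that may not match the real C# generator) works out how far L has to move between pixels for a given span height, and how much of that is left over after LDWS has done its own increment.

```python
TEXELS = 64  # texture column height, as in the article


def texel_steps(height: int, texels: int = TEXELS) -> list[int]:
    """Texel advance between consecutive screen pixels of a span."""
    # Source texel sampled by each screen pixel, using integer (floor) maths.
    rows = [(y * texels) // height for y in range(height)]
    return [b - a for a, b in zip(rows, rows[1:])]


def extra_after_ldws(height: int, texels: int = TEXELS) -> list[int]:
    """LDWS already bumps L by one, so the emitted code only adds step - 1."""
    if height > texels:
        raise ValueError("this sketch only covers spans no taller than the texture")
    return [step - 1 for step in texel_steps(height, texels)]
```

For a height of 16 every extra step is 3 (the three "inc l"s above), and for a height of 4 it is 15 (the "ld a,15").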

; Render height of 11
; T-States = 278
; Bytes    = 52
Render_11:
 ld d,71
 ldws
 ld a,5
 add hl,a
 ldws
 dec a
 add hl,a
 ldws
 inc a
 add hl,a
 ldws
 add hl,a
 ldws
 add hl,a
 ldws
 dec a
 add hl,a
 ldws
 inc a
 add hl,a
 ldws
 add hl,a
 ldws
 add hl,a
 ldws
 dec a
 add hl,a
 ldws
 ret
You can see here that A moves from 5 to 4, then back to 5, and the generator knows that a DEC and an INC is the best way to get it there, keeping the code as fast as possible. Auto-generated code is cool, because every time you think of a new "trick", you can simply add it and regenerate everything! No need to manually put the improvement into every function, it's all automatic.
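
Here's a minimal Python sketch of that register tracking. It is not the real C# generator: emit_span and its input format are made up for illustration, and it leaves out the initial "ld d,N". The input is the extra distance L has to move after each LDWS.

```python
def emit_span(extras: list[int]) -> list[str]:
    """Emit assembly text for one span, given the extra L advance per pixel."""
    asm = ["ldws"]   # first pixel
    a_reg = None     # what the generator believes A currently holds
    for extra in extras:
        if extra < 4:
            asm += ["inc l"] * extra             # short hops: 4 T-states each
        else:
            if a_reg is None or abs(a_reg - extra) > 1:
                asm.append(f"ld a,{extra}")      # full reload: 7 T-states
            elif a_reg == extra + 1:
                asm.append("dec a")              # one off: 4 T-states
            elif a_reg == extra - 1:
                asm.append("inc a")              # one off: 4 T-states
            # if A already holds the value, nothing needs emitting
            a_reg = extra
            asm.append("add hl,a")               # 8 T-states
        asm.append("ldws")
    asm.append("ret")
    return asm
```

Feeding it the steps read off the Render_4 listing, emit_span([15, 15, 14, 15]), gives back the same instruction sequence as that listing, minus the first line.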

So, here's a long one.... 96 pixels.

; Render height of 96
; T-States = 1507
; Bytes    = 230
Render_96:
 ld d,28
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ret
You can see there are combinations of Store, Store, INC, Store, Store. This is done with a mixture of LD A,(HL) / LD (DE),A / INC D sequences and LDWS, a new ZX Next opcode which copies (HL) to (DE) and then INCs D and L. As tiles are 256-byte aligned, a column of 64 pixels is always inside a page, so LDWS is a very quick way to load, store and INC both D and L (14Ts). You just have to make sure the tiles are rotated correctly in memory.
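
A rough Python sketch of how a generator might pick between the two forms when scaling up. The function name and the floor rounding are assumptions, so the exact pattern can differ from the listing above: each texel is fetched once, stored as many times as it repeats, and the last copy is always an LDWS so that L moves on.

```python
TEXELS = 64  # texture column height, as in the article


def emit_upscaled(height: int, texels: int = TEXELS) -> list[str]:
    """Assembly text for a span at least as tall as the texture column."""
    if height < texels:
        raise ValueError("this sketch only covers upscaling")
    # Which texel each screen pixel samples.
    rows = [(y * texels) // height for y in range(height)]
    asm = []
    for texel in range(texels):
        repeats = rows.count(texel)
        if repeats > 1:
            asm.append("ld a,(hl)")                        # fetch the texel once (7T)
            asm += ["ld (de),a", "inc d"] * (repeats - 1)  # 7T + 4T per extra copy
        asm.append("ldws")                                 # final copy, then next texel (14T)
    asm.append("ret")
    return asm
```

For a height of 96 this emits 64 LDWS and 32 extra stores, one pixel per screen row.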

The even cooler thing about this, is that it can "pre-clip" as well. Simply jump to a function that can draw a span scaled to 512 pixels, and the auto generated code will PRE-CLIP it. Here's a section of a 413 pixel scaled line...

; Render height of 413
; T-States = 1944
; Bytes    = 335
Render_413:
 ld d,0
 ld a,19
 add hl,a
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ldws
In this case as we are scaling up, a pixel is loaded, then stored over and over again, hence the LD A,(HL) followed by the multiple LD (DE),A's. And if you look at the top, you can see this...
 ld d,0
 ld a,19
 add hl,a
We set the start of the line, already clipped to 0 (instead of a negative row), and HL is clipped too, with 19 texels being outside the screen. The generator also keeps track of fractions, meaning it'll track sub-texel coordinates cleanly, allowing for hires sub-pixel accuracy in clipping and rendering - something you couldn't do if the Spectrum had to calculate everything. It also means a vertical clip is just these 3 instructions. That's very cool, and quick.
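
The clip arithmetic can be sketched in a few lines of Python. Everything here is an assumption rather than the author's code: the centre row of 76 is only inferred from the "ld d,N" values in the listings above, and the rounding is plain integer maths with no sub-texel tracking.

```python
def preclip(height: int, centre_row: int = 76, texels: int = 64) -> tuple[int, int]:
    """Return (start_row, texels_skipped) for a span centred on centre_row."""
    top = centre_row - height // 2
    if top >= 0:
        return top, 0          # fully on screen at the top: nothing to skip
    hidden_pixels = -top       # screen rows that fall above row 0
    return 0, (hidden_pixels * texels) // height
```

This reproduces the start rows of the 4, 11, 16 and 96 pixel listings (74, 71, 68 and 28). For 413 it skips 20 texels where the real listing skips 19, so the actual generator evidently rounds differently.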

This completely replaces the Texture_Vertical() function, meaning I just page the correct function into ROM space (in read mode), while WRITE mode is the Layer 2 screen, and jump to it. That's it... this obviously gives a massive speed boost.

Sprites are similar, but use LDIX for sprite rendering, and it has to compensate DE and HL appropriately. Here's a simple sprite render...

; Sprite32 Render height of 43
; T-States = 567
; Bytes    = 94
SPRender32_43:
 ld d,77
 ldix
 dec e
 inc d
 ldix
 dec e
 inc d
 ldix
 dec e
 inc d
 inc hl
 ldix
 dec e
 inc d
 ldix
 dec e
 inc d
 inc hl
 ldix
 dec e
 inc d
 ldix
 dec e
 inc d
 inc hl
 ldix
 dec e
 inc d
 ldix
 dec e
 inc d
 inc hl
 ldix
 dec e
 inc d
 ldix
 dec e
 inc d
 inc hl
 ldix
 dec e
 inc d
 ldix
 dec e
 inc d
 inc hl
 ldix
 dec e
 inc d
 ldix
 dec e
 inc d
 inc hl
 ldix
 dec e
 inc d
 ldix
 dec e
 inc d
 inc hl
 ldix
 dec e
 inc d
 ldix
 dec e
 inc d
 inc hl
 ldix
 dec e
 inc d
 ldix
 ret
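
To show why the "dec e" and "inc d" are there, here's a small Python model of one sprite pixel. LDIX copies (HL) to (DE) unless the byte equals A, then increments HL and DE and decrements BC. The sketch assumes the caller has loaded A with the transparent colour, which the post doesn't show, and the function names are made up.

```python
def ldix(mem: bytearray, hl: int, de: int, bc: int, a: int) -> tuple[int, int, int]:
    """Model of LDIX: copy (HL) to (DE) unless it equals A; then HL++, DE++, BC--."""
    if mem[hl] != a:
        mem[de] = mem[hl]
    return (hl + 1) & 0xFFFF, (de + 1) & 0xFFFF, (bc - 1) & 0xFFFF


def sprite_pixel(mem: bytearray, hl: int, de: int, bc: int, a: int) -> tuple[int, int, int]:
    """LDIX followed by 'dec e' and 'inc d', as in the listing above."""
    hl, de, bc = ldix(mem, hl, de, bc, a)
    d, e = de >> 8, de & 0xFF
    e = (e - 1) & 0xFF   # dec e: undo the sideways move LDIX made
    d = (d + 1) & 0xFF   # inc d: step down to the next screen row
    return hl, (d << 8) | e, bc
```

The net effect is one texel consumed and DE moved one row down in the same column, with transparent texels leaving the screen untouched.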

We also use compressed vertical columns in sprite rendering, so if a full column is empty, it's skipped. This allows smaller sprites to be rendered much more quickly.
But... imagine we had a 64x64 sprite, and it had an 8x8 pixel sprite at the bottom - ammo you'd pick up or something. That's a LOT of space to "try" and render, especially when you're very close to it. This would in fact fill the screen with "empty" pixels, slowing everything down.

So, what we do is detect different types of small sprites. 8 pixels off the bottom, 16, 24, 32, 48. This means ONLY the lower pixels get rendered, and the rest are "skipped". So an 8x8 sprite in the bottom of a 64x64, ONLY renders 8x8 scaled pixels, and as you get closer, the pixels go off the bottom of the screen, and we don't spend all our CPU time trying to render empty pixels.
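
Detecting the sprite type could look something like this Python sketch. The data layout (a list of 64 rows of colour indices, row 0 at the top), the transparent index of 0 and the function name are all assumptions; only the band heights come from the text above.

```python
BANDS = (8, 16, 24, 32, 48, 64)  # heights measured from the bottom; 64 = full sprite


def sprite_band(pixels: list[list[int]], transparent: int = 0) -> int:
    """Smallest band, measured from the bottom, that holds every opaque pixel."""
    size = len(pixels)
    # First row from the top that contains any opaque pixel.
    top = next(
        (y for y, row in enumerate(pixels) if any(p != transparent for p in row)),
        size,
    )
    used = size - top  # rows actually needed, counted from the bottom
    for band in BANDS:
        if used <= band:
            return band
    return size
```

An 8x8 pickup sitting at the bottom of a 64x64 frame comes back as band 8, so only those rows would ever be scaled and drawn.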

The C# function is pretty large, but here's a small section of it to give you the "gist". This is how it decides to skip texels....

                // de = screen address
                // hl = texture column address
                // ldws used. (de)=(hl): inc d: inc l
                // if clipped off the top, then skip down...
                byte clip = (byte)Math.Floor(V);
                switch (clip)
                {
                    case 0:
                        break;                  // moving 1. already done in ldws
                    case 1:
                        buffer.Add(0x2C);       // moving 2. One already done... do another "inc l"
                        asm += "\tinc\tl\n";
                        TCount += 4;
                        break;
                    case 2:
                        buffer.Add(0x2C);       // moving 3. One already done... do 2 more "inc l"s
                        buffer.Add(0x2C);
                        asm += "\tinc\tl\n";
                        asm += "\tinc\tl\n";
                        TCount += 8;
                        break;
                    case 3:
                        buffer.Add(0x2C);       // moving 4. One already done... do 3 more "inc l"s  (4Ts each)
                        buffer.Add(0x2C);
                        buffer.Add(0x2C);
                        asm += "\tinc\tl\n";
                        asm += "\tinc\tl\n";
                        asm += "\tinc\tl\n";
                        TCount += 12;
                        break;
                    default:
                        buffer.Add(0x3e);               // Anything over 4, means we use an "ADD"
                        buffer.Add((byte)clip);         // "LD A,$XX"  -  7Ts       
                        AReg = clip;
                        TCount += 7;
                        buffer.Add(0xed);               // "ADD HL,A"  -  8Ts             (add hl,$xxxx = 16Ts)
                        buffer.Add(0x31);
                        TCount += 8;
                        asm += "\tld\ta," + clip.ToString() + "\n";
                        asm += "\tadd\thl,a\n";
                        break;
                }


There's one final cheat I do, which I'll talk about next time....

