Friday, February 28, 2020

RayCasting engine on the ZXSpectrumNext - Part3

SO now that I had the basic ray casting going, it was time to get some textures in. I toyed with a few ways of doing this, from normal code, to using the new instructions. I'd gotten it down to about 56 T-States per pixel when Jim Bagley started asking about the engine. He'd been thinking of doing one, and had a 48T-state rendering loop.... We chatted about it, and I noticed his wasn't quite right, but I used my rendering, and some of his methods to get mine down to the 48Ts anyway.

texture_macro macro
        exx             ; 4
        ld      a,(de)  ; 7 read texel
        add     hl,bc   ; 11
        ld      e,h     ; 4
        exx             ; 4 = 30

        ld      (hl),a  ; 7
        add     hl,de   ; 11 = 48

The idea here is to make a macro then unroll the loop 76 times - the height of the lowres render area. We then work out how many pixels we've to render, then we jump into the middle of this code. This works "okay" for the most part...

The biggest problem with this method, is the setup. The TextureVertical() call takes a load of variables to work out everything

; *************************************************************************************
; Fill a textured vertical line in lowres
; e = X pixel on screen - where column lives
; d = start (Y pos)
; l = end (Y pos)
; a = line height (number of pixels to render)
; h = full pixel height (FULL height - including clipped area)
; b = Tex X (column to render)
; c = tile to render
; *************************************************************************************
From this list of things, we then work out texture scaling values, tile+texture address, bank it's in, screen address+column offset, not to mention the point we need to "jump" to render all the pixels. Lastly, we also need to "clip" off the top of the screen, although this ends up being a very tight, simply loop adding the deltas until it's "on" screen. This code looks pretty ugly... but here it is...

        ld   (tmp+1),a
        ld   a,h
        ld   (tmp),a
        ld   a,b
        ld   (texel_X+1),a
        ; work out tile bank, and offset
        ld   a,c
        and  1
        ld   (tile_id+1),a
        ld   a,c
        srl  a
        add  a,TilesSeg
        NextReg $52,a

        ld   a,l
        sub  d
        push af  ; store length

        ; work out lowres screen address (y*128)+x
        ld   a,e
        ld   e,0
        sra  d  ; *128
        rr   e
        add  a,e  ; +x
        ld   e,a
        ld   hl,ScreenAddress
        add  hl,de
        ld   de,$0080
        exx   ; screen rendering in "alternate" registers

        ld   a,(tmp)
        ld   de,Scales
        add  de,a

        pop  af
        add  a,a
        ld   hl,ScaleLookup
        add  hl,a

        ld   a,(de)  ; get texel scale
        ld   c,a
        inc  d
        ld   a,(de)
        ld   b,a

        ld   e,(hl)  ; get jump table
        inc  hl
        ld   d,(hl)
        ld   hl,VTLoop
        add  hl,de
        ld   (Jmp+1),hl

        ld   a,(tile_id+1)
        add  a,a
        add  a,a
        add  a,a
        add  a,a  ; = $1000
        ld   h,a
        ld   l,0
        add  hl,Tiles
        ex   de,hl
        ld   hl,(texel_X)
        srl  h  ; /2 = *128
        rr   l
        srl  h  ; /4 = *64
        rr   l
        add  hl,de
        ex   de,hl
        ld   h,e
        ld   l,0

        ; clip top?
        ld   a,(tmp+1)
        and  a
        jr   z,@SkipCLip
@Clip:  add  hl,bc
        dec  a
        jp   nz,@Clip
        ld   e,h
Jmp     jp   $0000

So the engine will call this 128 times, once per column of the screen. So this setup time is very expensive, and something I'd need to try and optimise at some point. I did have a large scaler divide table in here, to help speed up working out texture scaling. It took up a huge amount of space, but got rid of 128 ugly divides - which take about 1000Ts per divide(ish), so it was well worth it.

Now that this was in, I had the very basics of a running engine.... it was working, it showed it was very possible. Yes, I still had to add 3D sprites in, but remembering how slow 3D games were on the original machine, this was very workable.

If this had come out back in the day, I'd have wet my pants!

So back to optimisation. I unrolled the 16bit divide I was using which shaved a little off, and did some other work to save Tstates here and there. But all these were just tiny amounts. What I needed, was a big saving....
I change the start/end line calculation with a simple table look up, which did save a chunk of clipping code, which was a good start - a couple hundred Tstates per column (so 128*200-ish), which is pretty good, and then I added to this by optimising the tile address calculation, but i was back to saving a few Tstates here and there.

I decided to look at the DDA stepping code, and I shuffled some of the instructions around, saving a memory read inside the map look up code, which happens every step. From here, I decided I'd like to make the MULs and DIVs macros, so I didn't have to call them. There's about 20 or so calls each vertical, so 20-ish*128*Call+ret Tstates. This was again a fair saving, but it doesn't half balloon the code!! This made debugging tricky, so I stuck it on a compile flag to make testing easier.

I added in some more textures, so that I could get a little depth to the walls, and this really did make a huge difference. This was done by simply adding darker textures, and all texture numbers were then multiplied by 2, and the "side" it hit, added on. All these changes made the engine run in 2 to 3 frames, which was definitely in the ballpark I was aiming for!

You can see from the video, that this is definetly usable. Yes, we still have sprites to add.... but it's pretty fast and fun to run around in.

Next up was adding in some 3D sprites....

Tuesday, February 11, 2020

RayCasting engine on the ZXSpectrumNext - Part2

Now we had the prototype, the first thing to do was a direct port to Z80. To do this, I went through each line of the C# like this...

// calculate ray position and direction
    // Int16 cameraX = (Int16)(2 * ((xx << 8) / screen_width) - 0X100); //x-coordinate in camera space
    short cameraX = (short)(xx <<<< 2);
    cameraX = (short)(cameraX - 0x100);
and wrote an untested Z80 version  - like this
ld a,0
     ld  (xx),a

     ; Int16 cameraX = (Int16)(2 * ((xx << 8) / screen_width) - 0X100);
     ; cameraX = (short)(xx << 2);
     ; cameraX = (short)(cameraX - 0x100);
     ld   l,a  
     ld   h,0
     add  hl,hl  ; x/128
     add  hl,hl  ; *2
     ld   de,$100
     xor  a
     sbc  hl,de
     ld   (cameraX),hl

I added the C# code in as comments in order to keep track - as there's a LOT of code to put in, it's also a great reference when doing the actual port and looking for bugs.

The next thing I needed was a fast, signed 16bit x 16bit multiply. I got an unsigned one from the Z80 C library, and I then needed to make it a signed version. Signed multiples are easy enough, you simply XOR the top 2 bits of each value, and remember if it's a 1 or not. You then take the ABS() of these values and multiply them using the "unsigned" 16x16 multiple... then on exit, if the xor answer from the start was 1, you negate the answer. Job done.

; ****************************************************************************************
; multiplication of two 16-bit numbers into a 32-bit product
; enter : de = 16-bit multiplicand = y
;         hl = 16-bit multiplicand = x
; exit  : hlde = 32-bit product
;         carry reset
; uses  : af, bc, de, hl
; ****************************************************************************************
     ld   b,l                  ; x0
     ld   c,e                  ; y0
     ld   e,l                  ; x0
     ld   l,d
     push hl                   ; x1 y1
     ld   l,c                  ; y0

     ; bc = x0 y0
     ; de = y1 x0
     ; hl = x1 y0
     ; stack = x1 y1

     mul                       ; y1*x0
     ex   de,hl
     mul                       ; x1*y0

     xor  a                    ; zero A
     add  hl,de                ; sum cross products p2 p1
     adc  a,a                  ; capture carry p3

     ld   e,c                  ; x0
     ld   d,b                  ; y0
     mul                       ; y0*x0

     ld   b,a                  ; carry from cross products
     ld   c,h                  ; LSB of MSW from cross products

     ld   a,d
     add  a,l
     ld   h,a
     ld   l,e                  ; LSW in HL p1 p0

     pop  de
     mul                       ; x1*y1

     ex   de,hl
     adc  hl,bc

With this done, I could now do the basic 16 bit maths I needed like this...

; var rayDirX = dirX + ((planeX * cameraX)>>8);
     ld   hl,(cameraX)
     ld   de,(planeX)
     call SMul_16x16           ; exit  : hlde = 32-bit product
     ld   h,l
     ld   l,d                  ;>>8
     ld   de,(dirX)
     add  hl,de
     ld   (rayDirX),hl

You can see, that once it's fit into 8.8 maths, a lot of of complexity falls away. Aside from the 16x16 multiply, you can see the shift 8 is actually just taking the whole byte from one register to another. This basic process is relatively quick, however you have to do hundreds of them - which is the real speed issue we'll need to tackle later.

There's a few of these "blocks" to convert, but the biggest target was the delta stepping. It's important to get that as fast as possible. There are 3 different stepping functions, X axis, Y axis, and a general that moves across both axis at once - this is the one that'll be hardest to optimise and keep the speed up with. It's important to get this one as fast as possible, because stepping across the map until you hit a block will be executed hundreds if not thousands of times, especially in large open rooms.

So here's the C# code I need to port.....
while (true)
          //jump to next map square, OR in x-direction, OR in y-direction
          if (sideDistX < sideDistY)
               sideDistX += deltaDistX;
               mapX += stepX;
               side = 0;
               sideDistY += deltaDistY;
               mapY += stepY;
               side = 1;

           //Check if ray has hit a wall 
           int map_index = (mapY * MAP_WIDTH) + mapX;
           last_tile = map.worldMap[map_index];                    
           if (last_tile != 0) break;
I spent some time fiddling with register layouts and the rest, trying to keep it all in registers as memory access is painful.
; --------------------------------- General ---------------------------
       ; while (true)
       ; jump to next map square, OR in x-direction, OR in y-direction
       ld   a,(mapX)
       ld   c,a
       ld   a,(mapY)
       ld   b,a
       ld   a,(stepX)
       ld   d,a
       ld   a,(stepY)
       ld   e,a
       ld   hl,(sideDistX)   ; 16
       ld   iy,(sideDistY)   ; 20
       ld   de,(deltaDistX)  ; 20
       ld   bc,(deltaDistY)  ; 20
       ld   ixl,$30          ; side
       xor  a                ; and at the end of the loop clears carry
       ; if (sideDistX < sideDistY)
       ld   a,l              ; 4
       sbc  a,iyl            ; 8
       ld   a,h              ; 4
       sbc  a,iyh            ; 8

       jr   nc,@ix_greaterthan 
       ; sideDistX += deltaDistX;
       add  hl,de            ; 11
       ;mapX += stepX;
       exx                   ; 4
       ld   a,c              ; get mapX
       add  a,d              ; add stepX
       ld   c,a

       ; side = 0;
       ld   ixl,$30          ; 9Ts  ($30 for $3000 base address)
       jp   @skip_branch

       ;sideDistY += deltaDistY;
       add  iy,bc            ; 15Ts
       ;mapY += stepY;
       ld   a,b              ; get mapY
       add  a,e              ; add stepY
       ld   b,a

       ; side = 1;
       ld   ixl,$20          ; 9Ts  ($20 for $2000 base address)
       ld   a,c              ; mapX

       ld   h,b              ; mapY
       ld   l,0
       srl  h                ; *64
       rr   l
       srl  h
       rr   l
       add  hl,Map           ; 16
       add  hl,a             ; A already mapX
       ld   a,(hl)           ; get map entry
       and  a
       jp   z,@KeepLooping

       ld   (lastblock),a  
       ld   a,ixl
       ld   (side),a
       ld   a,b
       ld   (mapY),a
       ld   a,c
       ld   (mapX),a
So you can see I've managed to keep it all in registers - even though I had to use the alt set, and ix and iy. But that's still much faster than saving values, and reloading others from memory. The X and Y axis ones are similar, but without the branches and doesn't need as many registers. The last part is simply working out how how to draw the column and drawing it. This is simply a case of working out the screen address and plotting a vertical line, clipping to the top and bottom of the screen - simple compared to the rest of the stuff we've just done! Once that's done - and once I spent a day or so debugging it and getting it all working, I was left with this....

This was the first publicly shown version, and took about a month and a half of my spare time to get running (more or less).

Monday, February 03, 2020

RayCasting engine on the ZXSpectrumNext - Part1

So I've been making some good progress on my Ray Casting Engine - which is technically what a Wolfenstein engine is, and I thought it'd be fun to write how I did it. It's a large, complex engine and didn't come about over night. In fact, as I'd never written one of these before, I spent about a month (in the evenings) working out how to do it in the first place, before even starting on any Z80.
Believe it or not, I actually wrote 3 different versions before touching Z80, and then a further 2 to help debug it. but more on that later....

I did buy the Wolfenstein engine book, but actually, didn't really need it. What I ended up using was this website.

This was a great place to start, as it gives a workable demo, but I needed a framework to put it into. I went with GameMaker: Studio 2, as I'm intimately familiar with it, and it meant i could jump right in. This gave me a screen size of 640x480 (same as the demo)

Once I'd cut and paste (pretty much), the example into GameMaker, I could start to try and figure out how it was working. The goal was to get the maths down to 8.8 fixed point, so that it would fit in Z80's 16bit registers, and I could handle the maths quickly. But before doing that, I needed to do a 16.16 fixed point version. Doing this meant I could verify that it worked, without getting too close to the maths limits, as 8.8 would be cutting pretty close. In fact these were both 15.16 and 7.8 "signed", as vectors etc would be in all directions, and so could be negative.

Converting to fixed point is pretty straight forward, basically for all 16.16 numbers,  you have to multiply all numbers by 65536, or use a <<16. So 15.453 would be 15.453<<16 which is 1,012,727 or $F73F7 in HEX. I personally think of all numbers in hex, as $F73F7 ( $000F_73F7 ), where $000F is the whole number (which is 15), and $73F7 is the fraction. This is perfect, because it means to get the whole number, you just have to take the upper 16bits, usually by doing a >>16.

There's a heap of pages on fixed point maths, so I'll just say here that the basics are when you multiply a 16.16 number by another 16.16 number, you then >>16 to get the final answer. So...

var a = $F73F7;            // 15.453
var b = $83687;            // 8.213
var ans = (a*b)>>16;    // == $7EEA54  (126.9153)

And that's basically it. So after converting to 16.16 (as shown below), I was happy to try and get it into 8.8

This simply meant copying the above code and doing shifts of 8 rather than 16. I did also have to reduce the screen size to 128x76, down from 640x480 that the demo used. The original Wolfie used 304x152, we're a little under quarter the size. But since I can only hold a number from 0 to 127 in 7.8, and as I'm targeting the Spectrum Next's "lores" mode (127x96), then this all fits pretty snugly.

Once I swapped everything over to use >>8 and <<8 type maths, I started noticing the odd missing line on the screen. This turned out to be when DX or DY was 0 exactly - stepping on the axis exactly. This is due to the maths no longer being accurate enough. There are times when you do a 1/X so that you can avoid lots of divisions (i.e. 10*0.5, is faster than 10/2 as 1/2 = 0.5). In 7.8 fixed point you really need to do $10000/X, but that's out of range, so I'm stuck with doing $7fff/X. This has knock on effects, but you can cope with them later. Note: technically it's $100, but as you need to shift up by 8 before doing an actual divide in fixed point, that makes it $10000, which is what is out of range, so you're stuck with $7FFF.

After porting to 7.8 I did hit another issue, rotating the player's view was going "nuts". This was because the original demo rotated a vector constantly, and while floating point could handle the accuracy, 7.8 just "drifted" and these vectors stretched and went bizarre. To combat this, I created a table of 256 angles using floating point, and then taking it down into 7.8 fixed point for storing. This means the player now has an "angle" he's facing, and I simply look up a perfect vector for that angle.

Once I had these issues fixed up, I started to do a Z80 port, only to discover GameMaker doesn't really do the job of fixed point-ing properly. This is due to the typeless nature of GameMaker, and that many calculations are either done in doubles, or 64bit. This isn't useful for my needs, I need something very strongly typed, so that I can make sure it all "fits", before taking the leap into Z80.

So..... I needed a C/C++ or C# framework, where I could use Int16's directly. I decided to rip the guts out of my #CSpect framework, and use that to give me a basic bitmap and keyboard input. I then ported ALL GameMaker code (bot 8.8 and 16.16 versions!) over to C# so I had a good debugging framework.

One extra bit I needed to do in C#, was to deal with signed >>8. C/C++ and C# doesn't do signed shifts the way Z80 does, so I wrote a small signed shift right function to use instead of a simple >>8. This will be replaced in Z80 with actual shifts if needed, although usually  you just take the top 16 bits directly and need no shifts at all.

The signed shift function isn't fancy, it's just designed to work as I'd expect....

Once all THAT was done.... Actually, I want to just pause here. It may seem like I'm just through stuff together at a great rate of knots, but saying "I'll just do this.... there, done that", actually this all took some time. In all, I spent about a month, reading about the engine, and getting these prototypes up and running. It's important to know there is effort involved in this like this, no matter how experienced you are, you still need to do the grunt work. And this is all BEFORE doing any Z80 at all really!!

Now that all this "prep" work was done..... I'd figured out how the engine worked - mostly, I have a prototype that I could step through along with the Z80 one, so that I could see what answers the Z80 should be giving - this is invaluable on any complex bit of code (and if you don't fully understand the engine). My goal is to almost line for line port the C#, so that should mean the answers and variables should get exactly the same answers. So stepping the Z80 and C# will give exactly the same answers, and if they don't, then the Z80 is wrong and I'll be able to figure out why.

Now comes the hard bit.... writing the Z80 port!