Sunday, April 05, 2020

#CSpect V2.12.13

So this includes a few fixes users have found in the last version, including an important OS one - hence the quick release.

#CSpect V2.12.13 changes
  • Fixed GetRegister() callback
  • Fixed NextReg 0x05 reading
  • Setup some of the Nextreg on reset (also helps reset a bit better)
  • Fixed bit 3 of new NextReg 0x8E, which fixes dot commands (and sprite editor exiting)
  • Changed number of scanlines to 262 (HDMI) for 60Hz
  • Fixed a crash in plugin creation
  • Added plugin "eAccess.NextReg_Read" type (untested)

Friday, April 03, 2020

#CSpect V2.12.9

Been a while since the last update. This one adds the 2 new Next Registers that are required for the very latest NextOS - which is the main reason for doing this. There are a lot of other fixes as well....

#CSpect V2.12.9 changes
  • 50hz/60hz bit now set in NextReg 5
  • -debug added to the command line. You can now "start" in the debugger.
  • Minor Fix to 60Hz audio
  • Fixed AY partial port decoding
  • Fixed a minor reset stall when it was waiting on a HALT before the reset
  • NextReg 0x8c added
  • NextReg 0x8e added
  • Added AltRom1 support
  • Fixed a serial port "send" bug. Now sends the whole byte
  • Fixed tile attribute byte bit 7 - top bit of palette offset was being ignored
  • Fixed esxDOS date/time function (M_GETDATE $8E)
  • NEX file 1.3 now parsed... should load more things
  • NEX files get IRQs switched off when loaded (as per machine)
  • Fixed esxDOS date is wrong
  • Fixed Border colour now sets the palette entry as well
  • Fixed GetRegister() in plugins not working
  • Fixed Copper run then loop not working
  • Fixed Top line (in 256 pixel high view) is missing
  • Fixed DI followed by HALT should stop the CPU
  • Fixed Hires tilemap has clipping on far right
  • F8 now steps over conditional jumps and branches that go backwards. Branches forwards are taken
  • Fixed ADD HL/DE/BC,A no longer affects flags
  • Fixed OUTI so that B is decremented first. (OTIR was already like that)
  • Minor change to 60hz audio
  • NextReg 0x69 can now be read
  • DMA Continue mode added
  • Fixed a timing issue with DMA, so timing much better

Friday, February 28, 2020

RayCasting engine on the ZXSpectrumNext - Part3

So now that I had the basic ray casting going, it was time to get some textures in. I toyed with a few ways of doing this, from normal code, to using the new instructions. I'd gotten it down to about 56 T-States per pixel when Jim Bagley started asking about the engine. He'd been thinking of doing one, and had a 48 T-state rendering loop.... We chatted about it, and I noticed his wasn't quite right, but I used my rendering, and some of his methods, to get mine down to the 48Ts anyway.

texture_macro macro
        exx             ; 4
        ld      a,(de)  ; 7 read texel
        add     hl,bc   ; 11
        ld      e,h     ; 4
        exx             ; 4 = 30

        ld      (hl),a  ; 7
        add     hl,de   ; 11 = 48

The idea here is to make a macro then unroll the loop 76 times - the height of the lowres render area. We then work out how many pixels we've to render, then we jump into the middle of this code. This works "okay" for the most part...
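To make the "jump into the middle" idea concrete, here is a rough Python model of how an entry offset could be worked out. It is only a sketch: it assumes each unrolled copy of the macro is 7 bytes (the seven instructions shown are all single-byte Z80 opcodes) and that the lookup table stores a byte offset from the start of the unrolled block. The names `entry_offset` and `jump_table` are made up for illustration.

```python
# Hypothetical model of jumping into the middle of an unrolled loop.
MACRO_BYTES = 7      # assumed: seven 1-byte Z80 instructions per copy
UNROLL_COUNT = 76    # height of the lowres render area

def entry_offset(pixels):
    """Byte offset from the start of the unrolled block that leaves
    exactly `pixels` copies of the macro still to execute."""
    if not 0 <= pixels <= UNROLL_COUNT:
        raise ValueError("pixel count out of range")
    return (UNROLL_COUNT - pixels) * MACRO_BYTES

# One 16-bit offset per possible line length (0..76 pixels).
jump_table = [entry_offset(n) for n in range(UNROLL_COUNT + 1)]
```

A full-height column enters at offset 0 and runs all 76 copies; a shorter column enters further in and simply falls off the end of the block.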

The biggest problem with this method is the setup. The TextureVertical() call takes a load of variables to work out everything:

; *************************************************************************************
; Fill a textured vertical line in lowres
; e = X pixel on screen - where column lives
; d = start (Y pos)
; l = end (Y pos)
; a = line height (number of pixels to render)
; h = full pixel height (FULL height - including clipped area)
; b = Tex X (column to render)
; c = tile to render
; *************************************************************************************
From this list of things, we then work out texture scaling values, tile+texture address, bank it's in, screen address+column offset, not to mention the point we need to "jump" to render all the pixels. Lastly, we also need to "clip" off the top of the screen, although this ends up being a very tight, simple loop adding the deltas until it's "on" screen. This code looks pretty ugly... but here it is...

        ld   (tmp+1),a
        ld   a,h
        ld   (tmp),a
        ld   a,b
        ld   (texel_X+1),a
        ; work out tile bank, and offset
        ld   a,c
        and  1
        ld   (tile_id+1),a
        ld   a,c
        srl  a
        add  a,TilesSeg
        NextReg $52,a

        ld   a,l
        sub  d
        push af  ; store length

        ; work out lowres screen address (y*128)+x
        ld   a,e
        ld   e,0
        sra  d  ; *128
        rr   e
        add  a,e  ; +x
        ld   e,a
        ld   hl,ScreenAddress
        add  hl,de
        ld   de,$0080
        exx   ; screen rendering in "alternate" registers

        ld   a,(tmp)
        ld   de,Scales
        add  de,a

        pop  af
        add  a,a
        ld   hl,ScaleLookup
        add  hl,a

        ld   a,(de)  ; get texel scale
        ld   c,a
        inc  d
        ld   a,(de)
        ld   b,a

        ld   e,(hl)  ; get jump table
        inc  hl
        ld   d,(hl)
        ld   hl,VTLoop
        add  hl,de
        ld   (Jmp+1),hl

        ld   a,(tile_id+1)
        add  a,a
        add  a,a
        add  a,a
        add  a,a  ; = $1000
        ld   h,a
        ld   l,0
        add  hl,Tiles
        ex   de,hl
        ld   hl,(texel_X)
        srl  h  ; /2 = *128
        rr   l
        srl  h  ; /4 = *64
        rr   l
        add  hl,de
        ex   de,hl
        ld   h,e
        ld   l,0

        ; clip top?
        ld   a,(tmp+1)
        and  a
        jr   z,@SkipCLip
@Clip:  add  hl,bc
        dec  a
        jp   nz,@Clip
@SkipCLip:
        ld   e,h
Jmp     jp   $0000

So the engine will call this 128 times, once per column of the screen. So this setup time is very expensive, and something I'd need to try and optimise at some point. I did have a large scaler divide table in here, to help speed up working out texture scaling. It took up a huge amount of space, but got rid of 128 ugly divides - which take about 1000Ts per divide(ish), so it was well worth it.
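As an illustration of trading a divide for a table read, here is a small Python sketch of a texture scale table. It is not the engine's table: it assumes 64 texels per texture column (suggested by the "*64" comment in the setup code), one entry per 8-bit wall height, and an 8.8 fixed-point step. `build_scale_table` is a hypothetical name.

```python
TEX_HEIGHT = 64    # assumed texels per texture column
TABLE_SIZE = 256   # assumed: one entry per possible 8-bit wall height

def build_scale_table():
    """8.8 fixed-point texel step for each wall height, so the
    per-column code does a table read instead of a divide."""
    table = [0] * TABLE_SIZE          # entry 0 is unused (nothing to draw)
    for height in range(1, TABLE_SIZE):
        table[height] = (TEX_HEIGHT << 8) // height
    return table

scale_table = build_scale_table()
```

A wall drawn 64 pixels high steps exactly 1.0 texel per pixel ($0100), one drawn 128 pixels high steps 0.5 ($0080), and so on. Every entry fits in 16 bits, so it can be read as a low/high byte pair.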

Now that this was in, I had the very basics of a running engine.... it was working, it showed it was very possible. Yes, I still had to add 3D sprites in, but remembering how slow 3D games were on the original machine, this was very workable.

If this had come out back in the day, I'd have wet my pants!

So back to optimisation. I unrolled the 16bit divide I was using which shaved a little off, and did some other work to save Tstates here and there. But all these were just tiny amounts. What I needed, was a big saving....
I changed the start/end line calculation to a simple table look up, which did save a chunk of clipping code, which was a good start - a couple hundred Tstates per column (so 128*200-ish), which is pretty good, and then I added to this by optimising the tile address calculation, but I was back to saving a few Tstates here and there.

I decided to look at the DDA stepping code, and I shuffled some of the instructions around, saving a memory read inside the map look up code, which happens every step. From here, I decided I'd like to make the MULs and DIVs macros, so I didn't have to call them. There's about 20 or so calls each vertical, so 20-ish*128*Call+ret Tstates. This was again a fair saving, but it doesn't half balloon the code!! This made debugging tricky, so I stuck it on a compile flag to make testing easier.

I added in some more textures, so that I could get a little depth to the walls, and this really did make a huge difference. This was done by simply adding darker textures, and all texture numbers were then multiplied by 2, and the "side" it hit, added on. All these changes made the engine run in 2 to 3 frames, which was definitely in the ballpark I was aiming for!
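The texture numbering described here is just "texture times two, plus side". A tiny Python sketch, with the hypothetical helper name `shaded_texture_index`, and assuming `side` is 0 or 1 as in the stepping code:

```python
def shaded_texture_index(texture, side):
    """Pick the normal or darker copy of a wall texture.

    texture: base wall texture number
    side:    0 or 1, depending on which face of the block the ray hit
    """
    if side not in (0, 1):
        raise ValueError("side must be 0 or 1")
    return texture * 2 + side
```

So texture 3 occupies slots 6 and 7, one slot for each side.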

You can see from the video, that this is definitely usable. Yes, we still have sprites to add.... but it's pretty fast and fun to run around in.

Next up was adding in some 3D sprites....

Tuesday, February 11, 2020

RayCasting engine on the ZXSpectrumNext - Part2

Now we had the prototype, the first thing to do was a direct port to Z80. To do this, I went through each line of the C# like this...

// calculate ray position and direction
    // Int16 cameraX = (Int16)(2 * ((xx << 8) / screen_width) - 0X100); //x-coordinate in camera space
    short cameraX = (short)(xx << 2);
    cameraX = (short)(cameraX - 0x100);
and wrote an untested Z80 version  - like this
     ld   a,0
     ld  (xx),a

     ; Int16 cameraX = (Int16)(2 * ((xx << 8) / screen_width) - 0X100);
     ; cameraX = (short)(xx << 2);
     ; cameraX = (short)(cameraX - 0x100);
     ld   l,a  
     ld   h,0
     add  hl,hl  ; x/128
     add  hl,hl  ; *2
     ld   de,$100
     xor  a
     sbc  hl,de
     ld   (cameraX),hl

I added the C# code in as comments in order to keep track - as there's a LOT of code to put in, it's also a great reference when doing the actual port and looking for bugs.
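The simplification from the commented-out line to `xx << 2` only holds because the screen is 128 pixels wide. A quick Python check of that, assuming `screen_width` is 128 (the function names are made up for this sketch):

```python
SCREEN_WIDTH = 128   # assumed lowres render width

def camera_x_full(xx):
    """The original expression, in 8.8 fixed point."""
    return 2 * ((xx << 8) // SCREEN_WIDTH) - 0x100

def camera_x_fast(xx):
    """The simplified form used in the port: two shifts and a subtract."""
    return (xx << 2) - 0x100
```

Both give -1.0 ($FF00 as a 16-bit value) at the left edge, 0 at column 64 and just under +1.0 at column 127.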

The next thing I needed was a fast, signed 16bit x 16bit multiply. I got an unsigned one from the Z80 C library, and I then needed to make it a signed version. Signed multiplies are easy enough: you simply XOR the top (sign) bits of the two values, and remember if the result is a 1 or not. You then take the ABS() of these values and multiply them using the "unsigned" 16x16 multiply... then on exit, if the XOR answer from the start was 1, you negate the answer. Job done.

; ****************************************************************************************
; multiplication of two 16-bit numbers into a 32-bit product
; enter : de = 16-bit multiplicand = y
;         hl = 16-bit multiplicand = x
; exit  : hlde = 32-bit product
;         carry reset
; uses  : af, bc, de, hl
; ****************************************************************************************
     ld   b,l                  ; x0
     ld   c,e                  ; y0
     ld   e,l                  ; x0
     ld   l,d
     push hl                   ; x1 y1
     ld   l,c                  ; y0

     ; bc = x0 y0
     ; de = y1 x0
     ; hl = x1 y0
     ; stack = x1 y1

     mul                       ; y1*x0
     ex   de,hl
     mul                       ; x1*y0

     xor  a                    ; zero A
     add  hl,de                ; sum cross products p2 p1
     adc  a,a                  ; capture carry p3

     ld   e,c                  ; x0
     ld   d,b                  ; y0
     mul                       ; y0*x0

     ld   b,a                  ; carry from cross products
     ld   c,h                  ; LSB of MSW from cross products

     ld   a,d
     add  a,l
     ld   h,a
     ld   l,e                  ; LSW in HL p1 p0

     pop  de
     mul                       ; x1*y1

     ex   de,hl
     adc  hl,bc
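The sign handling wrapped around this unsigned routine can be modelled in a few lines of Python. This is a sketch of the technique described above, not the actual SMul_16x16 source. Inputs and outputs are raw two's complement bit patterns, and `smul_16x16` is a name chosen for the example.

```python
def smul_16x16(x, y):
    """Signed 16x16 -> 32 multiply built on an unsigned multiply.

    x, y: 16-bit two's complement patterns (0..0xFFFF)
    returns: 32-bit two's complement pattern (0..0xFFFFFFFF)
    """
    negate = ((x ^ y) & 0x8000) != 0           # XOR of the two sign bits
    ax = (-x & 0xFFFF) if x & 0x8000 else x    # ABS(x)
    ay = (-y & 0xFFFF) if y & 0x8000 else y    # ABS(y)
    product = ax * ay                          # the "unsigned" 16x16 multiply
    if negate:
        product = -product & 0xFFFFFFFF        # negate the 32-bit answer
    return product
```

For example, -1 * 2 comes back as $FFFFFFFE (-2), and -1 * -1 comes back as 1.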

With this done, I could now do the basic 16 bit maths I needed like this...

; var rayDirX = dirX + ((planeX * cameraX)>>8);
     ld   hl,(cameraX)
     ld   de,(planeX)
     call SMul_16x16           ; exit  : hlde = 32-bit product
     ld   h,l
     ld   l,d                  ;>>8
     ld   de,(dirX)
     add  hl,de
     ld   (rayDirX),hl

You can see, that once it's fit into 8.8 maths, a lot of complexity falls away. Aside from the 16x16 multiply, you can see the shift 8 is actually just taking the whole byte from one register to another. This basic process is relatively quick, however you have to do hundreds of them - which is the real speed issue we'll need to tackle later.
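The "shift by moving bytes" trick is easy to check in Python. The sketch below treats the 32-bit product as the four registers H, L, D, E and picks out L and D, which is what the `ld h,l` / `ld l,d` pair does. The function name is invented for the example.

```python
def product_shift_right_8(hlde):
    """Return bits 8..23 of a 32-bit product held in HL:DE.

    Equivalent to (product >> 8) truncated to 16 bits, but done
    purely by selecting bytes: new H = old L, new L = old D.
    """
    old_l = (hlde >> 16) & 0xFF   # low byte of HL
    old_d = (hlde >> 8) & 0xFF    # high byte of DE
    return (old_l << 8) | old_d
```

Because the result is taken from the middle of the product, the same byte selection works for negative products too: $FFFFFE00 (-2.0 before the shift) becomes $FFFE.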

There's a few of these "blocks" to convert, but the biggest target was the delta stepping. There are 3 different stepping functions: X axis, Y axis, and a general one that moves across both axes at once - this is the one that'll be hardest to optimise and keep the speed up with. It's important to get this one as fast as possible, because stepping across the map until you hit a block will be executed hundreds if not thousands of times, especially in large open rooms.

So here's the C# code I need to port.....
while (true)
{
     //jump to next map square, OR in x-direction, OR in y-direction
     if (sideDistX < sideDistY)
     {
          sideDistX += deltaDistX;
          mapX += stepX;
          side = 0;
     }
     else
     {
          sideDistY += deltaDistY;
          mapY += stepY;
          side = 1;
     }

     //Check if ray has hit a wall
     int map_index = (mapY * MAP_WIDTH) + mapX;
     last_tile = map.worldMap[map_index];
     if (last_tile != 0) break;
}
I spent some time fiddling with register layouts and the rest, trying to keep it all in registers as memory access is painful.
; --------------------------------- General ---------------------------
       ; while (true)
       ; jump to next map square, OR in x-direction, OR in y-direction
       ld   a,(mapX)
       ld   c,a
       ld   a,(mapY)
       ld   b,a
       ld   a,(stepX)
       ld   d,a
       ld   a,(stepY)
       ld   e,a
       ld   hl,(sideDistX)   ; 16
       ld   iy,(sideDistY)   ; 20
       ld   de,(deltaDistX)  ; 20
       ld   bc,(deltaDistY)  ; 20
       ld   ixl,$30          ; side
       xor  a                ; and at the end of the loop clears carry
@KeepLooping:
       ; if (sideDistX < sideDistY)
       ld   a,l              ; 4
       sbc  a,iyl            ; 8
       ld   a,h              ; 4
       sbc  a,iyh            ; 8

       jr   nc,@ix_greaterthan 
       ; sideDistX += deltaDistX;
       add  hl,de            ; 11
       ;mapX += stepX;
       exx                   ; 4
       ld   a,c              ; get mapX
       add  a,d              ; add stepX
       ld   c,a

       ; side = 0;
       ld   ixl,$30          ; 9Ts  ($30 for $3000 base address)
       jp   @skip_branch

@ix_greaterthan:
       ;sideDistY += deltaDistY;
       add  iy,bc            ; 15Ts
       ;mapY += stepY;
       ld   a,b              ; get mapY
       add  a,e              ; add stepY
       ld   b,a

       ; side = 1;
       ld   ixl,$20          ; 9Ts  ($20 for $2000 base address)
       ld   a,c              ; mapX

@skip_branch:
       ld   h,b              ; mapY
       ld   l,0
       srl  h                ; *64
       rr   l
       srl  h
       rr   l
       add  hl,Map           ; 16
       add  hl,a             ; A already mapX
       ld   a,(hl)           ; get map entry
       and  a
       jp   z,@KeepLooping

       ld   (lastblock),a  
       ld   a,ixl
       ld   (side),a
       ld   a,b
       ld   (mapY),a
       ld   a,c
       ld   (mapX),a
So you can see I've managed to keep it all in registers - even though I had to use the alt set, and ix and iy. But that's still much faster than saving values, and reloading others from memory. The X and Y axis ones are similar, but without the branches, and they don't need as many registers. The last part is simply working out how to draw the column and drawing it. This is simply a case of working out the screen address and plotting a vertical line, clipping to the top and bottom of the screen - simple compared to the rest of the stuff we've just done! Once that's done - and once I spent a day or so debugging it and getting it all working, I was left with this....

This was the first publicly shown version, and took about a month and a half of my spare time to get running (more or less).

Monday, February 03, 2020

RayCasting engine on the ZXSpectrumNext - Part1

So I've been making some good progress on my Ray Casting Engine - which is technically what a Wolfenstein engine is, and I thought it'd be fun to write how I did it. It's a large, complex engine and didn't come about overnight. In fact, as I'd never written one of these before, I spent about a month (in the evenings) working out how to do it in the first place, before even starting on any Z80.
Believe it or not, I actually wrote 3 different versions before touching Z80, and then a further 2 to help debug it, but more on that later....

I did buy the Wolfenstein engine book, but actually, didn't really need it. What I ended up using was this website.

This was a great place to start, as it gives a workable demo, but I needed a framework to put it into. I went with GameMaker: Studio 2, as I'm intimately familiar with it, and it meant I could jump right in. This gave me a screen size of 640x480 (same as the demo)

Once I'd cut and paste (pretty much), the example into GameMaker, I could start to try and figure out how it was working. The goal was to get the maths down to 8.8 fixed point, so that it would fit in Z80's 16bit registers, and I could handle the maths quickly. But before doing that, I needed to do a 16.16 fixed point version. Doing this meant I could verify that it worked, without getting too close to the maths limits, as 8.8 would be cutting pretty close. In fact these were both 15.16 and 7.8 "signed", as vectors etc would be in all directions, and so could be negative.

Converting to fixed point is pretty straightforward: for 16.16, you multiply every number by 65536, or use a <<16. So 15.453 would be 15.453<<16 which is 1,012,727 or $F73F7 in HEX. I personally think of all numbers in hex, as $F73F7 ( $000F_73F7 ), where $000F is the whole number (which is 15), and $73F7 is the fraction. This is perfect, because it means to get the whole number, you just have to take the upper 16bits, usually by doing a >>16.

There's a heap of pages on fixed point maths, so I'll just say here that the basics are when you multiply a 16.16 number by another 16.16 number, you then >>16 to get the final answer. So...

var a = $F73F7;            // 15.453
var b = $83687;            // 8.213
var ans = (a*b)>>16;    // == $7EEA54  (126.9153)
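For anyone who wants to check these numbers, here is the same arithmetic as a small Python sketch. The helper names are made up, and the conversion truncates rather than rounds, which is what gives the hex values quoted above.

```python
FRAC_BITS = 16   # 16.16 fixed point

def to_fixed(value):
    """Convert a float to 16.16 by scaling and truncating."""
    return int(value * (1 << FRAC_BITS))

def fixed_mul(a, b):
    """Multiply two 16.16 numbers, shifting the result back down."""
    return (a * b) >> FRAC_BITS

def to_float(fixed):
    """Convert a 16.16 number back to a float for display."""
    return fixed / (1 << FRAC_BITS)

a = to_fixed(15.453)     # $F73F7
b = to_fixed(8.213)      # $83687
ans = fixed_mul(a, b)    # $7EEA54
```

`to_float(ans)` comes out at roughly 126.9153, and `ans >> 16` gives the whole part, 126.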

And that's basically it. So after converting to 16.16 (as shown below), I was happy to try and get it into 8.8

This simply meant copying the above code and doing shifts of 8 rather than 16. I did also have to reduce the screen size to 128x76, down from 640x480 that the demo used. The original Wolfie used 304x152, so we're a little under a quarter the size. But since I can only hold a number from 0 to 127 in 7.8, and as I'm targeting the Spectrum Next's "lores" mode (128x96), then this all fits pretty snugly.

Once I swapped everything over to use >>8 and <<8 type maths, I started noticing the odd missing line on the screen. This turned out to be when DX or DY was 0 exactly - stepping on the axis exactly. This is due to the maths no longer being accurate enough. There are times when you do a 1/X so that you can avoid lots of divisions (i.e. 10*0.5, is faster than 10/2 as 1/2 = 0.5). In 7.8 fixed point you really need to do $10000/X, but that's out of range, so I'm stuck with doing $7fff/X. This has knock on effects, but you can cope with them later. Note: technically it's $100, but as you need to shift up by 8 before doing an actual divide in fixed point, that makes it $10000, which is what is out of range, so you're stuck with $7FFF.
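To see the size of the effect, here is a Python sketch comparing the ideal $10000/X with the clamped $7FFF/X for a value X held in 8.8. It only illustrates the arithmetic. How the engine copes with the difference isn't covered here, and the function names are invented.

```python
ONE_8_8 = 0x100   # 1.0 in 8.8 fixed point

def reciprocal_ideal(x):
    """1/x using the ideal numerator $10000 (too big for signed 16 bits)."""
    return (ONE_8_8 << 8) // x

def reciprocal_clamped(x):
    """1/x using $7FFF, the largest positive signed 16-bit numerator."""
    return 0x7FFF // x
```

For X = 2.0 ($0200) the ideal answer is $80 (0.5), while the clamped version gives $3F, roughly half of that.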

After porting to 7.8 I did hit another issue: rotating the player's view was going "nuts". This was because the original demo rotated a vector constantly, and while floating point could handle the accuracy, 7.8 just "drifted" and these vectors stretched and went bizarre. To combat this, I created a table of 256 angles using floating point, and then converted it down into 7.8 fixed point for storing. This means the player now has an "angle" he's facing, and I simply look up a perfect vector for that angle.
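A table like that can be generated offline in a few lines. The Python sketch below assumes the 256 entries are evenly spaced around the full circle and that each entry is a unit direction vector rounded to 8.8. The real table layout may differ, and `build_direction_table` is a hypothetical name.

```python
import math

def build_direction_table(entries=256):
    """Unit direction vectors for each angle step, as (dx, dy) in 8.8."""
    table = []
    for i in range(entries):
        angle = 2.0 * math.pi * i / entries
        dx = int(round(math.cos(angle) * 256))
        dy = int(round(math.sin(angle) * 256))
        table.append((dx, dy))
    return table

direction_table = build_direction_table()
```

Because every entry is computed fresh from floating point, the vectors never accumulate error. Turning the player is just adding to an 8-bit angle and doing a look up.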

Once I had these issues fixed up, I started to do a Z80 port, only to discover GameMaker doesn't really do the job of fixed point-ing properly. This is due to the typeless nature of GameMaker, and that many calculations are either done in doubles, or 64bit. This isn't useful for my needs, I need something very strongly typed, so that I can make sure it all "fits", before taking the leap into Z80.

So..... I needed a C/C++ or C# framework, where I could use Int16's directly. I decided to rip the guts out of my #CSpect framework, and use that to give me a basic bitmap and keyboard input. I then ported ALL GameMaker code (both 8.8 and 16.16 versions!) over to C# so I had a good debugging framework.

One extra bit I needed to do in C#, was to deal with signed >>8. C/C++ and C# don't do signed shifts the way Z80 does, so I wrote a small signed shift right function to use instead of a simple >>8. This will be replaced in Z80 with actual shifts if needed, although usually you just take the top 16 bits directly and need no shifts at all.

The signed shift function isn't fancy, it's just designed to work as I'd expect....
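The original function isn't reproduced here, so as a stand-in, this is a Python sketch of what a sign-preserving shift right on a fixed-width value can look like. It assumes the value is a raw two's complement bit pattern, with the width passed in as `bits`. The name and signature are invented for the example.

```python
def signed_shift_right(value, shift, bits=32):
    """Arithmetic shift right on a fixed-width two's complement pattern.

    The sign bit is copied into the vacated top bits, the way a
    Z80 SRA does for a single byte.
    """
    mask = (1 << bits) - 1
    value &= mask
    if value & (1 << (bits - 1)):    # negative: convert to a Python int
        value -= 1 << bits
    return (value >> shift) & mask   # Python's >> keeps the sign
```

So -2.0 in 16.16 shifted down by 8 stays negative ($FFFFFE00 becomes $FFFFFFFE), while positive values behave exactly like a plain >>8.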

Once all THAT was done.... Actually, I want to just pause here. It may seem like I'm just throwing stuff together at a great rate of knots, saying "I'll just do this.... there, done that", but actually this all took some time. In all, I spent about a month, reading about the engine, and getting these prototypes up and running. It's important to know there is effort involved in things like this; no matter how experienced you are, you still need to do the grunt work. And this is all BEFORE doing any Z80 at all really!!

Now that all this "prep" work was done..... I'd figured out how the engine worked - mostly - and I had a prototype that I could step through alongside the Z80 one, so that I could see what answers the Z80 should be giving - this is invaluable on any complex bit of code (and if you don't fully understand the engine). My goal is an almost line for line port of the C#, so the variables should hold exactly the same values in both. Stepping the Z80 and C# side by side will then give exactly the same answers, and if they don't, then the Z80 is wrong and I'll be able to figure out why.

Now comes the hard bit.... writing the Z80 port!

Monday, January 06, 2020

#CSpect V2.12.5

Fixed plugin loading. This was due to the new loading/reset system that was nuking loaded plugin mappings.
I've also added a "Tick" to the plugin interface which gets called once per emulated frame, along with a debugger call which allows you to tell CSpect to enter the debugger. This is handy if you need to debug the operation that "just" happened.
Lastly... I've included a very simple plugin example, along with the interface .CS files to look at.

#CSpect V2.12.5 changes
  • Border now comes from fallback colour if paper/ink mode is 255 (as per hardware)
  • Fixed Plugin loading. Was broken due to new system loading+resetting.
  • Added new "Tick" to iPlugin interface, called once per emulated frame
  • Added Debugger() call to CSpect interface, allowing you to enter the debugger from the emulator
  • Added Plugin example (and current interface)

Friday, January 03, 2020

#CSpect V2.12.4

Another minor fix to fix a memory access bug, and a new MMC folder issue that appeared in the last couple of versions.

SNasm 2.0.21
  • Added “SLL” instruction

#CSpect V2.12.4 changes
  • Fixed memory reading of Layer2 mapping in $2000-$3fff
  • Fixed MMC path, it was being reset when loading SNA/NEX from the command line.

Wednesday, December 25, 2019

#CSpect V2.12.3

This is a minor update to fix Lowres and ULA scrolling. ULA X scrolling now uses the 2 new NextRegs - $26+$27.
It also fixes an annoying startup issue where if you didn't start in the EXE path, it just wouldn't start properly. This was due to the new plugin system, but should now be fixed (fingers crossed). I've also added the missing LDWS instruction to SNasm.

SNasm 2.0.20
  • Added “LDWS” instruction

#CSpect V2.12.3 changes
  • Lowres scrolling fixed.
  • Added ULA Scrolling registers $26 and $27.
  • Fixed HL being set to $10000 after reading a file at the end of memory...
  • Fixed a command line startup issue when not started in the EXE path

Sunday, December 22, 2019

#CSpect V2.12.2

This update is a little different, as it adds the ability for users to write their own "plugins". These plugins can take over memory read/write actions, port in/out's and Next Register access - or all of the above!
They can also query the Next's memory, ports, Next registers and Z80 registers, making it pretty useful for folk to add new toys, or add things like logging or profiling etc.

#CSpect V2.12.2 changes
  • Changed 320x256 mode to use Y/X orientation (0,0)=$0000, (1,0)=$0100, (2,0)=$0200. (0,1)=$0001,(0,2)=$0002 etc...
  • Plugin system can now get/set Z80 registers
  • A better reset on load of an SNA/Nex file. Should be a complete system reset now...
  • 320x256 Layer 2 screen mode added
  • NextReg $70 added ($10=320x256 mode)
  • NextReg $71 added. MSB of Layer 2 XScroll added
  • Added Port 0x123B extended memory mapping mode - using bit 4 to select (untested)
  • A new plugin system added, so that folk can add their own toys or support custom hardware. See CSpectReadme.txt
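The Y/X orientation of the 320x256 mode listed above means pixels run down each column first. A small Python sketch of the offset calculation, based only on the examples given in that changelog entry (the function name is made up):

```python
def layer2_320x256_offset(x, y):
    """Byte offset of pixel (x, y) in the 320x256 Layer 2 mode.

    Each column is 256 bytes, so X selects the 256-byte page
    and Y is the offset within it.
    """
    if not (0 <= x < 320 and 0 <= y < 256):
        raise ValueError("coordinate out of range")
    return (x << 8) | y
```

This matches the listed values: (1,0) is $0100, (2,0) is $0200, (0,1) is $0001 and (0,2) is $0002.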

Friday, December 06, 2019

#CSpect V2.11.9

EDIT: Updated to V2.11.10 - Couple of new command line options for the next dev team

This update is really just to release fixes the Next team have had for a while. With all the changes to NextOS, and the hardware changes, I figured I should release this even though 60Hz mode isn't working right. So most things should work fine, but 60hz mode certainly has issues....

SNasm V2.0.19 changes
  • You can now reference a local label. i.e. LD HL,LABEL@LOCAL

#CSpect V2.11.10 changes
  • added -major and -minor to let you set the CORE version number
  • added -emu to force the setting of the emulation bit in nextreg 0
  • added debugger command "nextreg ," so you can set a next register from the debugger

#CSpect V2.11.9 changes
  • Updated audio to be 16bit so DAC isn't squished to oblivion. (should just sound better in general)

#CSpect V2.11.8 changes
  • Fixed "all RAM mode". 2 configurations weren't working right.
  • Core version number updated to 3.0
  • -60 now does proper 60hz, including reduced lines and proper 60hz audio *not working fully yet*
  • -com2=??? added. You can now have a second UART that can be used to send to a real Pi
  • port 0x153b (bit 6) added to switch UARTs from Wifi to Pi
  • Reg $7f now defaults to $FF on power up
  • When no UART2, now returns $FF
  • .ROM files (for OS loading) now come from the same path as the IMG file, rather than program folder
  • Mono "textmode" added
  • CP/M mode now works
  • Proper 60Hz Layer2 and Sprites now working
  • Coms with the Pi (and it not being there), no longer crashes CSpect.
  • null comport now returns 0, as per hardware.
  • Tilemap clip window right edge fix - I think.
  • 1 Bit Tilemaps should now work when tiles are pre-defined in bank 10/11