Tuesday, August 25, 2020

#CSpect V2.12.36

#CSpect V2.12.36/35 changes
  • Fixed F_RENAME to use IX instead of HL (now working)
  • Fixed sprite clipping (I think)
  • First pass at Tiles and Tilemaps in bank 7
  • Faked esxDOS loading now returns an "access denied" error if it tries to open a file that is write protected
  • Transparent border now working (ULA off)
  • Added F_RENAME (untested)


Saturday, July 11, 2020

#CSpect V2.12.34

Quickly added in the last missing screen mode...

Edit: Fixed a minor issue where hex editing was case sensitive... so you needed to press SHIFT+letter... doh.

Edit: Updated for a couple of new dev features

#CSpect V2.12.34 changes
  • Fixed HEX editing in the memory window so that it's case insensitive. (don't have to press SHIFT)
  • ReadMe.txt now holds the "manual" part of this text file, and this is now "just" the version info.


#CSpect V2.12.33 changes
  • You can now click on the memory window to edit bytes directly by typing HEX values into it. Press ENTER or click off to stop
  • You can now press KEYPAD_ENTER to display the game screen while in the debugger

#CSpect V2.12.32 changes
  • Fixed reg $8E bit 3 usage - don't change RAM mapping
  • Fixed Layer 2 palette colour offsets (NextReg $70)
  • NextRegs $6E and $6F now return top 2 bits as 0


Wednesday, July 01, 2020

#CSpect V2.12.31

Quickly added in the last missing screen mode...

#CSpect V2.12.31 changes
  • Added DACs B, C and D
  • Added SAVE to the debugger: SAVE "NAME",add,len, SAVE "NAME",BANK:OFFSET,length, SAVE "NAME",BANK:OFFSET,BANK:OFFSET
  • You can now read DMA Registers
  • CTRL+F7 now executes to cursor in the debugger
  • Added the demo "mod_player.nex" to the #CSpect package


Tuesday, June 09, 2020

#CSpect V2.12.30

Quickly added in the last missing screen mode...

#CSpect V2.12.30 changes
  • Layer 2 palette offset added to 256,320 and 640 modes.
  • Fixed a crash when a write to E3 happens, and the roms haven't been loaded.
  • Fixed a crash in the debugger when you entered a bank:offset and the offset was an invalid symbol
  • Tilereg $6E set to $2C on power up - same as hardware. (don't assume though, set it)
  • Tilereg $6F set to $0C on power up - same as hardware. (don't assume though, set it)

SNasm V2.0.23 changes
  • Fixed a crash when an expression has a divide by 0


Sunday, May 17, 2020

#CSpect V2.12.29

Quickly added in the last missing screen mode...

#CSpect V2.12.29 changes
  • 640x256x4 Layer 2 mode added


#CSpect V2.12.28

Fixed DMA Audio playback using the pre-scaler

#CSpect V2.12.28 changes
  • Fixed DMA prescaler (usually used for digital audio)


Friday, May 15, 2020

#CSpect V2.12.27

Fixed a crash in AY Audio due to the extra wait states added to memory read/write when using faked esxDOS.
I've also updated the link in the side banner on the right - keep forgetting to do that.....

#CSpect V2.12.27 changes
  • Fixed a crash in AY Audio when loading in large files via fake esxDOS.


Thursday, May 14, 2020

RayCasting engine on the ZXSpectrumNext - Part4

So I'm going to skip sprites for now. They're basically the same as walls, but with transparency, though there are some fancy optimisations on it later that I will go into. This one will be big, as some of the code functions are huge.... be warned!

For now, I want to talk about upscaling everything to full Layer 2. Lores rendering at 128x96 is one quarter the fill rate we need for full Layer 2, and if we just scaled up, this would result in a frame rate of about 8 to 12 frames per game cycle - or about 4 fps. Obviously too slow. So we needed to seriously up our rendering game. Before doing anything, I actually had to scale everything up, so it was rendering in Layer 2 - albeit slowly. Then I could start the hard work.

The first thing I wanted to do was get rid of all the divides, and there were enough in there that it hurt a lot. So I created a very large "one over" (reciprocal) table, so that I could convert something like 100/4 into 100*0.25. Multiplies are much quicker, and using a table basically means the table holds the answer - more or less. This got a 1000 T-State function down to about 150 T-States. That's a big win, especially given that this is done several times per vertical scan. In all, this shaved off about a frame, a good start!
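
As a rough illustration of the one-over idea (not the actual table from the engine, whose size and format aren't shown here), this C# sketch assumes a hypothetical 0.16 fixed-point table for divisors 1 to 255. Each entry is rounded up, which keeps the result exact for dividends 0 to 255.

```csharp
using System;

static class ReciprocalTable
{
    // Hypothetical layout: OneOver[d] = ceil(65536 / d), a 0.16 fixed-point reciprocal.
    public static readonly uint[] OneOver = Build();

    static uint[] Build()
    {
        var table = new uint[256];
        for (uint d = 1; d < 256; d++)
            table[d] = (65536u + d - 1) / d;   // round up so small dividends stay exact
        return table;
    }

    // x / d becomes a table read, one multiply and a shift.
    // Intended for x in 0..255 and d in 1..255.
    public static int Divide(int x, int d)
    {
        return (int)(((uint)x * OneOver[d]) >> 16);
    }
}
```

So `ReciprocalTable.Divide(100, 4)` reads the entry for 4 (16384, i.e. 0.25) and multiplies, rather than dividing.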

Next I took a look at the rendering. That Texture_Vertical() function is just way too slow. I tried several short cuts but what I really needed was custom vertical code - for each span. What I did was to write a C# program to auto generate Z80 code, that given the base address of a screen line, and the base of a texture, would render a span exactly - without the need for loops, calculations - or anything in fact.

Here's a bit of code that's been generated to render a 16 pixel high span. In these functions DE points to the TOP of the span on the screen, and HL points to the base of the TILE column. The 48K screen is mapped into the lower memory - overlapping the ROM in Write Only mode, meaning D=0, and E=the layer 2 column to render. In fact, we can ignore D, and this can be set up by the function, as the span height indicates the Y coordinate it'll start on.

; Render height of 16
; T-States = 443
; Bytes    = 84
Render_16:
 ld d,68
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 inc l
 inc l
 inc l
 ldws
 ret

What about a smaller one - 4 pixels?
; Render height of 4
; T-States = 134
; Bytes    = 25
Render_4:
 ld d,74
 ldws
 ld a,15
 add hl,a
 ldws
 add hl,a
 ldws
 dec a
 add hl,a
 ldws
 inc a
 add hl,a
 ldws
 ret

As you can see, the first instruction moves to the correct Y, and you can see we scale nice and evenly. A gets loaded with the texel delta, and is added to HL. The code generator keeps track of registers, and if a register doesn't need to change, it can generate more optimal code. What about 11 pixels? This is a little dirtier.

; Render height of 11
; T-States = 278
; Bytes    = 52
Render_11:
 ld d,71
 ldws
 ld a,5
 add hl,a
 ldws
 dec a
 add hl,a
 ldws
 inc a
 add hl,a
 ldws
 add hl,a
 ldws
 add hl,a
 ldws
 dec a
 add hl,a
 ldws
 inc a
 add hl,a
 ldws
 add hl,a
 ldws
 add hl,a
 ldws
 dec a
 add hl,a
 ldws
 ret
You can see here, A moves from 5 to 4, then back to 5, yet the auto generation code knows that an INC and DEC is the best way to get it there, keeping the code as fast as possible. Auto generated code is cool, because every time you think of a new "trick", you can simply add it, and regenerate everything! No need to manually put the improvement into every function, it's all automatic.
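
To make the register tracking concrete, here's a cut-down C# sketch of that part of a generator. It is not the real tool: `SpanEmitter`, `Emit` and the way the A values are passed in are all made up for illustration, and it only covers the path where every step goes through ADD HL,A (the INC L path for small steps is left out). Timings and sizes are the standard ones: LD r,n 7Ts, INC/DEC r 4Ts, RET 10Ts, plus the 14Ts LDWS and 8Ts ADD HL,A.

```csharp
using System;
using System.Text;

static class SpanEmitter
{
    // startRow: value loaded into D. aValues: the value A must hold for each ADD HL,A.
    public static (string Asm, int TStates, int Bytes) Emit(int startRow, int[] aValues)
    {
        var asm = new StringBuilder();
        int t = 0, bytes = 0;
        int aReg = -1;                       // -1 = contents of A unknown

        void Op(string text, int cost, int size)
        {
            asm.Append(' ').Append(text).Append('\n');
            t += cost;
            bytes += size;
        }

        Op("ld d," + startRow, 7, 2);
        Op("ldws", 14, 2);                   // first pixel
        foreach (int want in aValues)
        {
            // Pick the cheapest way to get A to the wanted value.
            if (aReg >= 0 && want == aReg + 1) Op("inc a", 4, 1);
            else if (aReg >= 0 && want == aReg - 1) Op("dec a", 4, 1);
            else if (want != aReg) Op("ld a," + want, 7, 2);
            aReg = want;

            Op("add hl,a", 8, 2);            // step through the texture column
            Op("ldws", 14, 2);               // next pixel
        }
        Op("ret", 10, 1);
        return (asm.ToString(), t, bytes);
    }
}
```

Feeding it the A values from Render_11, `Emit(71, new[] { 5, 4, 5, 5, 5, 4, 5, 5, 5, 4 })`, gives the same 278 T-States and 52 bytes shown in that function's header.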

So, here's a long one.... 96 pixels.

; Render height of 96
; T-States = 1507
; Bytes    = 230
Render_96:
 ld d,28
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ldws
 ret
You can see there are combinations of Store, Store, INC, Store, Store. This is done by a mixture of load A, store A, INC D's and LDWS, a new ZX Next opcode which INC's D and L. As tiles are 256 byte aligned, a column of 64 pixels is always inside a page, so LDWS is a very quick way to load, store and INC both D and L (14Ts). You just have to make sure the tiles are rotated correctly in memory.
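
If it helps to picture what a run of LDWS instructions does, here's a tiny C# model of it. This is only a sketch: it uses one flat 64K array for both the texture and the screen (on the real machine the reads and writes go to different memory), and `LdwsModel` and `CopyColumn` are names invented for the example.

```csharp
static class LdwsModel
{
    // Runs `count` LDWS steps: (de) = (hl), then inc l and inc d.
    // de = screen address, hl = texture column address.
    public static void CopyColumn(byte[] mem, ushort de, ushort hl, int count)
    {
        byte d = (byte)(de >> 8), e = (byte)de;
        byte h = (byte)(hl >> 8), l = (byte)hl;
        for (int i = 0; i < count; i++)
        {
            mem[(d << 8) | e] = mem[(h << 8) | l];   // (de) = (hl)
            l++;                                     // inc l: next texel in the column
            d++;                                     // inc d: +256 bytes = one screen row down
        }
    }
}
```

Because only L and D change, the texels have to sit one after another inside a 256 byte page, and each write lands exactly one screen row below the last.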

The even cooler thing about this, is that it can "pre-clip" as well. Simply jump to a function that can draw a span scaled to 512 pixels, and the auto generated code will PRE-CLIP it. Here's a section of a 413 pixel scaled line...

; Render height of 413
; T-States = 1944
; Bytes    = 335
Render_413:
 ld d,0
 ld a,19
 add hl,a
 ld a,(hl)
 ld (de),a
 inc d
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ldws
 ld a,(hl)
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ld (de),a
 inc d
 ldws
In this case as we are scaling up, a pixel is loaded, then stored over and over again, hence the LD A,(HL) followed by the multiple LD (DE),A's. And if you look at the top, you can see this...
 ld d,0
 ld a,19
 add hl,a
We set the start of the line, already clipped to 0 (instead of a negative value), and HL is clipped, with 19 texels being outside the screen. The generator also keeps track of fractions, meaning it'll track sub texel coordinates cleanly, allowing for hires sub pixel accuracy in clipping and rendering - something you couldn't do if the spectrum had to calculate everything. It also means a vertical clip is just these 3 instructions. That's very cool, and quick.
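
The pre-clip itself is simple arithmetic at generation time. The C# below is a simplified sketch with assumed parameters (a span centred on the screen, whole texels only, and hypothetical names `PreClip` and `Top`); the real generator also carries the fractional part forward, which this leaves out.

```csharp
static class PreClip
{
    // Returns the first on-screen row for the span and how many whole texels
    // are hidden above the top of the screen.
    public static (int StartRow, int SkipTexels) Top(int spanHeight, int screenHeight, int textureHeight)
    {
        int top = (screenHeight - spanHeight) / 2;         // negative when the span is clipped
        if (top >= 0) return (top, 0);                     // fits: nothing to skip

        int hiddenPixels = -top;
        int skip = hiddenPixels * textureHeight / spanHeight;   // whole texels covered by hidden pixels
        return (0, skip);
    }
}
```

The two results map straight onto the generated `ld d,<row>` and `ld a,<skip>` / `add hl,a` at the top of a clipped function.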

This completely replaces the Texture_Vertical() function, meaning I just page the correct function into ROM space (in read mode), while WRITE mode is the layer 2 screen, and we jump to it. That's it.... this obviously gives a massive speed boost.

Sprites are similar, but use LDIX for sprite rendering, and it has to compensate DE and HL appropriately. Here's a simple sprite render...

; Sprite32 Render height of 43
; T-States = 567
; Bytes    = 94
SPRender32_43:
 ld d,77
 ldix
 dec e
 inc d
 ldix
 dec e
 inc d
 ldix
 dec e
 inc d
 inc hl
 ldix
 dec e
 inc d
 ldix
 dec e
 inc d
 inc hl
 ldix
 dec e
 inc d
 ldix
 dec e
 inc d
 inc hl
 ldix
 dec e
 inc d
 ldix
 dec e
 inc d
 inc hl
 ldix
 dec e
 inc d
 ldix
 dec e
 inc d
 inc hl
 ldix
 dec e
 inc d
 ldix
 dec e
 inc d
 inc hl
 ldix
 dec e
 inc d
 ldix
 dec e
 inc d
 inc hl
 ldix
 dec e
 inc d
 ldix
 dec e
 inc d
 inc hl
 ldix
 dec e
 inc d
 ldix
 dec e
 inc d
 inc hl
 ldix
 dec e
 inc d
 ldix
 ret

We also use compressed vertical columns in sprite rendering, so if a full column is empty, it's skipped. This allows smaller sprites to be rendered much more quickly.
But... imagine we had a 64x64 sprite, and it had an 8x8 pixel sprite at the bottom - ammo you'd pick up or something. That's a LOT of space to "try" and render, especially when you're very close to it. This would in fact fill the screen with "empty" pixels, slowing everything down.

So, what we do is detect different types of small sprites. 8 pixels off the bottom, 16, 24, 32, 48. This means ONLY the lower pixels get rendered, and the rest are "skipped". So an 8x8 sprite in the bottom of a 64x64, ONLY renders 8x8 scaled pixels, and as you get closer, the pixels go off the bottom of the screen, and we don't spend all our CPU time trying to render empty pixels.
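
A classification pass along these lines could be done offline when the sprites are converted. This C# sketch is only a guess at how that might look: the 64x64 row-major layout, colour index 0 as the transparent one, and the names `SpriteClass` and `RenderHeight` are all assumptions made for the example.

```csharp
static class SpriteClass
{
    // Height buckets measured up from the bottom of the sprite, plus the full 64.
    static readonly int[] Heights = { 8, 16, 24, 32, 48, 64 };

    // pixels: 64x64, row-major, 0 = transparent (assumed).
    // Returns the smallest bucket that covers every non-transparent row.
    public static int RenderHeight(byte[] pixels)
    {
        int firstRow = 64;
        for (int i = 0; i < 64 * 64; i++)
        {
            if (pixels[i] != 0) { firstRow = i / 64; break; }   // topmost used row
        }

        int used = 64 - firstRow;            // rows from the first used row to the bottom
        foreach (int h in Heights)
            if (used <= h) return h;
        return 64;
    }
}
```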

The C# function is pretty large, but here's a small section of it to give you the "gist". This is how it decides to skip texels....

                // de = screen address
                // hl = texture column address
                // ldws used. (de)=(hl): inc d: inc l
                // if clipped off the top, then skip down...
                byte clip = (byte)Math.Floor(V);
                switch (clip)
                {
                    case 0:
                        break;                  // moving 1. already done in ldws
                    case 1:
                        buffer.Add(0x2C);       // moving 2. One already done... do another "inc l"
                        asm += "\tinc\tl\n";
                        TCount += 4;
                        break;
                    case 2:
                        buffer.Add(0x2C);       // moving 3. One already done... do 2 more "inc l"s
                        buffer.Add(0x2C);
                        asm += "\tinc\tl\n";
                        asm += "\tinc\tl\n";
                        TCount += 8;
                        break;
                    case 3:
                        buffer.Add(0x2C);       // moving 4. One already done... do 3 more "inc l"s  (4Ts each)
                        buffer.Add(0x2C);
                        buffer.Add(0x2C);
                        asm += "\tinc\tl\n";
                        asm += "\tinc\tl\n";
                        asm += "\tinc\tl\n";
                        TCount += 12;
                        break;
                    default:
                        buffer.Add(0x3e);               // 4 or more means we use an "ADD"
                        buffer.Add((byte)clip);         // "LD A,$XX"  -  7Ts       
                        AReg = clip;
                        TCount += 7;
                        buffer.Add(0xed);               // "ADD HL,A"  -  8Ts             (add hl,$xxxx = 16Ts)
                        buffer.Add(0x31);
                        TCount += 8;
                        asm += "\tld\ta," + clip.ToString() + "\n";
                        asm += "\tadd\thl,a\n";
                        break;
                }


There's one final cheat I do, which I'll talk about next time....


Wednesday, May 13, 2020

#CSpect V2.12.26

Fix for a crash in the debugger

#CSpect V2.12.26 changes
  • Fixed a crash inside the debugger
  • Added a new eDebugCommand.ClearAllBreakpoints to the iCSpect plugin interface


Tuesday, May 12, 2020

#CSpect V2.12.25

Another quickie...

#CSpect V2.12.25 changes
  • Tiles over a transparent border now fixed


Sunday, May 10, 2020

#CSpect V2.12.24

Another quick addition/fix for some devs.

#CSpect V2.12.24 changes
  • Lower border colour fixed (I think)
  • Added some wait states to slow down 28MHz mode to more closely match the real machine


Wednesday, April 29, 2020

#CSpect V2.12.23

Quick addition for some devs.

#CSpect V2.12.23 changes
  • CTRL+F3 will now save a screenshot into the current folder (without shader effects)


Monday, April 20, 2020

#CSpect V2.12.22

A few more fixes, mainly while I've been doing some wifi work, but some other dev finds.

#CSpect V2.12.22 changes
  • Fixed a stupid typo on iCSpect interface to get sprites.
  • NextReg 0x1c can now be read
  • NextRegs 0x61 and 0x62 can now be read correctly
  • Reading/Setting NextReg 255 (which isn't a real one) was crashing the emulator due to plugins
  • Fixed the fallback colour, was only 8 bit instead of 9
  • Fixed reading of the raster line for lower in the screen (upper 64 pixels usually)
  • Fixed some threading issues in the serial coms (on USB hardware)
  • Fixed and changed baud rates, and added in all normal ones (HDMI timing = 27000000/baud )
    • 2000000 = 13
    • 1152000 = 23
    • 921600 = 29
    • 576000 = 46
    • 460800 = 58
    • 256000 = 105
    • 230400 = 117
    • 128000 = 210
    • 115200 = 234
    • 57600 = 468
    • 38400 = 703
    • 31250 = 864
    • 19200 = 1406
    • 9600 = 2812
    • 2400 = 11250
    • 1200 = 22500
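
The values in this list all match 27000000/baud rounded down, so the calculation can be sketched in C# like this (`UartTiming` and `Prescaler` are just illustrative names, not part of CSpect):

```csharp
static class UartTiming
{
    public const int HdmiClock = 27000000;   // HDMI timing, from the list above

    // Integer division rounds down, e.g. 27000000 / 115200 = 234.375 -> 234.
    public static int Prescaler(int baud) => HdmiClock / baud;
}
```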

SNASM V2.0.22
  • Fixed “align” for use with segments
  • Fixed “org” so it works with segments
  • You can now use _ in binary (ie. %11_00_1111 )

Tuesday, April 14, 2020

#CSpect V2.12.20

Another couple of dev fixes.

#CSpect V2.12.20 changes
  • Fixed a bug in 48k Layer 2 offset mapping for 320 and 640 modes(untested)
  • Changed set/clear breakpoints to be a proper set/clear
  • Port FF is now readable if bit 2 of nextreg 8 is set
  • Removed some debug code that could cause a crash, or a random file appearing


Monday, April 13, 2020

#CSpect V2.12.18

Couple of bug fixes for devs....

#CSpect V2.12.18 changes
  • Fixed Layer 2,320x256 clip window
  • Fixed a crash in Layer 2, 320x256 rendering
  • Fixed NextReg0 (machine type) initialisation bug


Saturday, April 11, 2020

#CSpect V2.12.17

Another day, another version. This one for some devs who are trying to use the 1bit tilemaps, and another minor update for plugin writers.

#CSpect V2.12.17 changes
  • 1Bit tilemaps now use Global Transparency for its transparency, not the tile index
  • Have added GetSprite() and SetSprite() to iCSpect interface for plugin writers


Friday, April 10, 2020

#CSpect V2.12.16

Another very minor update for plugin writers. This gives access to sprite image memory, and undoes a "fix" that wasn't right....

#CSpect V2.12.16 changes
  • Reverted .NEX border colour palette change, was just wrong.
  • Added PeekSprite(address) and PokeSprite(address,value) to iCSpect interface for plugins


Tuesday, April 07, 2020

#CSpect V2.12.15

This release is specifically to add to the plugin API, allowing plugin devs to control the debugger including single stepping and break points.

#CSpect V2.12.15 changes
  • Plugin API TICK() now called while in debugger mode
  • Plugin API can now control the debugger using the new extended eDebugCommand enum
  • -remote command line switch added, to disable the "visible" part of the debugger

public enum eDebugCommand
    {
        none = 0,
        ///Set remote mode - enables/disables debugger screen (0 for enable, 1 for disable)
        SetRemote,
        /// Get the debug state (in debugger or not - returns 0 or 1)
        GetState,
        /// Enter debugger
        Enter,
        /// Exit debugger and run (only in debug mode)
        Run,
        /// Single step (only in debug mode)
        Step,
        /// Step over (only in debug mode)
        StepOver,
        /// Set a breakpoint [0 to 65535]
        SetBreakpoint,
        /// Clear a breakpoint [0 to 65535]
        ClearBreakpoint,
        /// Get a breakpoint [0 to 65535], returns 0 or 1 for being set
        GetBreakpoint,
        /// Set a breakpoint in physical memory[0 to 0x1FFFFF]
        SetPhysicalBreakpoint,
        /// Clear a breakpoint in physical memory[0 to 0x1FFFFF]
        ClearPhysicalBreakpoint,
        /// Get a breakpoint in physical memory[0 to 0x1FFFFF], returns 0 or 1 for being set
        GetPhysicalBreakpoint,
        /// Set a breakpoint on CPU READING of a memory location[0 to 65535]
        SetReadBreakpoint,
        /// Clear a READ breakpoint [0 to 65535]
        ClearReadBreakpoint,
        /// Get a READ breakpoint [0 to 65535], returns 0 or 1 for being set
        GetReadBreakpoint,
        /// Set a breakpoint on CPU WRITING of a memory location[0 to 65535]
        SetWriteBreakpoint,
        /// Clear a WRITE breakpoint [0 to 65535]
        ClearWriteBreakpoint,
        /// Get a WRITE breakpoint [0 to 65535], returns 0 or 1 for being set
        GetWriteBreakpoint,
    };


Sunday, April 05, 2020

#CSpect V2.12.13

So this includes a few fixes users have found in the last version, including an important OS one - hence the quick release.

#CSpect V2.12.13 changes
  • Fixed GetRegister() callback
  • Fixed NextReg 0x05 reading
  • Setup some of the Nextreg on reset (also helps reset a bit better)
  • Fixed bit 3 of new NextReg 0x8E, which fixes dot commands (and sprite editor exiting)
  • Changed number of scanlines to 262 (HDMI) for 60Hz
  • Fixed a crash in plugin creation
  • Added plugin "eAccess.NextReg_Read" type (untested)


Friday, April 03, 2020

#CSpect V2.12.9

Been a while since the last update. This one adds the 2 new Next Registers that are required for the very latest NextOS - which is the main reason for doing this. There are a lot of other fixes as well....


#CSpect V2.12.9 changes
  • 50hz/60hz bit now set in NextReg 5
  • -debug added to the command line. You can now "start" in the debugger.
  • Minor Fix to 60Hz audio
  • Fixed AY partial port decoding
  • Fixed a minor reset stall when it was waiting on a HALT before the reset
  • NextReg 0x8c added
  • NextReg 0x8e added
  • Added AltRom1 support
  • Fixed a serial port "send" bug. Now sends the whole byte
  • Fixed tile attribute byte bit 7 - top bit of palette offset was being ignored
  • Fixed esxDOS date/time function (M_GETDATE $8E)
  • NEX file 1.3 now parsed... should load more things
  • NEX files get IRQs switched off when loaded (as per machine)
  • Fixed esxDOS date is wrong
  • Fixed Border colour now sets the palette entry as well
  • Fixed GetRegister() in plugins not working
  • Fixed Copper run then loop not working
  • Fixed Top line (in 256 pixel high view) is missing
  • Fixed DI followed by HALT should stop the CPU
  • Fixed Hires tilemap has clipping on far right
  • F8 now steps over conditional jumps and branches that go backwards. Branches forwards are taken
  • Fixed ADD HL/DE/BC,A no longer affects flags
  • Fixed OUTI so that B is decremented first. (OTIR was already like that)
  • Minor change to 60hz audio
  • NextReg 0x69 can now be read
  • DMA Continue mode added
  • Fixed a timing issue with DMA, so timing much better


Friday, February 28, 2020

RayCasting engine on the ZXSpectrumNext - Part3

So now that I had the basic ray casting going, it was time to get some textures in. I toyed with a few ways of doing this, from normal code, to using the new instructions. I'd gotten it down to about 56 T-States per pixel when Jim Bagley started asking about the engine. He'd been thinking of doing one, and had a 48T-state rendering loop.... We chatted about it, and I noticed his wasn't quite right, but I used my rendering, and some of his methods to get mine down to the 48Ts anyway.

texture_macro macro
        exx             ; 4
        ld      a,(de)  ; 7 read texel
        add     hl,bc   ; 11
        ld      e,h     ; 4
        exx             ; 4 = 30

        ld      (hl),a  ; 7
        add     hl,de   ; 11 = 48
        endm

The idea here is to make a macro then unroll the loop 76 times - the height of the lowres render area. We then work out how many pixels we've to render, then we jump into the middle of this code. This works "okay" for the most part...
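
Working out where to jump is just a multiply by the size of one unrolled copy. The macro above is seven single-byte opcodes, so each copy is 7 bytes. The C# sketch below builds a table of entry offsets on that basis; the names (`UnrolledJump`, `EntryOffset`, `BuildTable`) are invented for the example, and the table actually used by the engine may be laid out differently.

```csharp
static class UnrolledJump
{
    public const int Copies = 76;         // unrolled copies of texture_macro
    public const int BytesPerCopy = 7;    // exx, ld a,(de), add hl,bc, ld e,h, exx, ld (hl),a, add hl,de

    // Offset from the start of the unrolled code at which to enter so that
    // exactly `pixels` copies run before reaching the end.
    public static int EntryOffset(int pixels) => (Copies - pixels) * BytesPerCopy;

    // One 16-bit entry per possible span length, 0..76 pixels.
    public static ushort[] BuildTable()
    {
        var table = new ushort[Copies + 1];
        for (int n = 0; n <= Copies; n++)
            table[n] = (ushort)EntryOffset(n);
        return table;
    }
}
```

A full height span enters at offset 0, while a single pixel enters 525 bytes in and runs only the last copy.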

The biggest problem with this method is the setup. The TextureVertical() call takes a load of variables to work out everything:

; *************************************************************************************
;
; Fill a textured vertical line in lowres
;
; e = X pixel on screen - where column lives
; d = start (Y pos)
; l = end (Y pos)
; a = line height (number of pixels to render)
; h = full pixel height (FULL height - including clipped area)
; b = Tex X (column to render)
; c = tile to render
;
; *************************************************************************************
From this list of things, we then work out texture scaling values, tile+texture address, bank it's in, screen address+column offset, not to mention the point we need to "jump" to render all the pixels. Lastly, we also need to "clip" off the top of the screen, although this ends up being a very tight, simple loop adding the deltas until it's "on" screen. This code looks pretty ugly... but here it is...

Texture_Vertical:
        ld   (tmp+1),a
        ld   a,h
        ld   (tmp),a
        ld   a,b
        ld   (texel_X+1),a
 
        ; work out tile bank, and offset
        ld   a,c
        and  1
        ld   (tile_id+1),a
        ld   a,c
        srl  a
        add  a,TilesSeg
        NextReg $52,a

        ld   a,l
        sub  d
        push af  ; store length

        ; work out lowres screen address (y*128)+x
        ld   a,e
        ld   e,0
        sra  d  ; *128
        rr   e
        add  a,e  ; +x
        ld   e,a
        ld   hl,ScreenAddress
        add  hl,de
        ld   de,$0080
        exx   ; screen rendering in "alternate" registers

        ld   a,(tmp)
        ld   de,Scales
        add  de,a

        pop  af
        add  a,a
        ld   hl,ScaleLookup
        add  hl,a

        ld   a,(de)  ; get texel scale
        ld   c,a
        inc  d
        ld   a,(de)
        ld   b,a
 

        ld   e,(hl)  ; get jump table
        inc  hl
        ld   d,(hl)
        ld   hl,VTLoop
        add  hl,de
        ld   (Jmp+1),hl

        ld   a,(tile_id+1)
        add  a,a
        add  a,a
        add  a,a
        add  a,a  ; = $1000
        ld   h,a
        ld   l,0
        add  hl,Tiles
        ex   de,hl
        ld   hl,(texel_X)
        srl  h  ; /2 = *128
        rr   l
        srl  h  ; /4 = *64
        rr   l
        add  hl,de
        ex   de,hl
        ld   h,e
        ld   l,0


        ; clip top?
        ld   a,(tmp+1)
        and  a
        jr   z,@SkipCLip
@Clip:  add  hl,bc
        dec  a
        jp   nz,@Clip
        ld   e,h
@SkipCLip:
        exx
Jmp     jp   $0000

So the engine will call this 128 times, once per column of the screen. So this setup time is very expensive, and something I'd need to try and optimise at some point. I did have a large scaler divide table in here, to help speed up working out texture scaling. It took up a huge amount of space, but got rid of 128 ugly divides - which take about 1000Ts per divide(ish), so it was well worth it.

Now that this was in, I had the very basics of a running engine.... it was working, it showed it was very possible. Yes, I still had to add 3D sprites in, but remembering how slow 3D games were on the original machine, this was very workable.


If this had come out back in the day, I'd have wet my pants!

So back to optimisation. I unrolled the 16bit divide I was using which shaved a little off, and did some other work to save Tstates here and there. But all these were just tiny amounts. What I needed, was a big saving....
I replaced the start/end line calculation with a simple table look up, which did save a chunk of clipping code, which was a good start - a couple hundred Tstates per column (so 128*200-ish), which is pretty good, and then I added to this by optimising the tile address calculation, but I was back to saving a few Tstates here and there.

I decided to look at the DDA stepping code, and I shuffled some of the instructions around, saving a memory read inside the map look up code, which happens every step. From here, I decided I'd like to make the MULs and DIVs macros, so I didn't have to call them. There's about 20 or so calls each vertical, so 20-ish*128*Call+ret Tstates. This was again a fair saving, but it doesn't half balloon the code!! This made debugging tricky, so I stuck it on a compile flag to make testing easier.

I added in some more textures, so that I could get a little depth to the walls, and this really did make a huge difference. This was done by simply adding darker textures, and all texture numbers were then multiplied by 2, and the "side" it hit, added on. All these changes made the engine run in 2 to 3 frames, which was definitely in the ballpark I was aiming for!

You can see from the video, that this is definitely usable. Yes, we still have sprites to add.... but it's pretty fast and fun to run around in.

Next up was adding in some 3D sprites....

Tuesday, February 11, 2020

RayCasting engine on the ZXSpectrumNext - Part2

Now we had the prototype, the first thing to do was a direct port to Z80. To do this, I went through each line of the C# like this...

// calculate ray position and direction
    // Int16 cameraX = (Int16)(2 * ((xx << 8) / screen_width) - 0X100); //x-coordinate in camera space
    short cameraX = (short)(xx << 2);
    cameraX = (short)(cameraX - 0x100);
and wrote an untested Z80 version  - like this
ld a,0
AllLines:
     ld  (xx),a

     ; Int16 cameraX = (Int16)(2 * ((xx << 8) / screen_width) - 0X100);
     ; cameraX = (short)(xx << 2);
     ; cameraX = (short)(cameraX - 0x100);
     ld   l,a  
     ld   h,0
     add  hl,hl  ; x/128
     add  hl,hl  ; *2
     ld   de,$100
     xor  a
     sbc  hl,de
     ld   (cameraX),hl
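
The reason the divide disappears is that, with a 128 pixel wide screen, (xx << 8) / 128 is just xx * 2, and doubling that gives xx << 2. A quick C# check of the two forms (the class and method names are only for this example, and the 128 width is taken from the lowres screen and the x/128 comment above):

```csharp
static class CameraX
{
    const int ScreenWidth = 128;   // lowres screen width

    // The commented-out form, with the divide.
    public static short Original(int xx) => (short)(2 * ((xx << 8) / ScreenWidth) - 0x100);

    // The form that was ported: a shift and a subtract.
    public static short Simplified(int xx) => (short)((xx << 2) - 0x100);
}
```

Both give -256 (-1.0 in 8.8) for column 0 and 0 for column 64.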

I added the C# code in as comments in order to keep track - as there's a LOT of code to put in, it's also a great reference when doing the actual port and looking for bugs.

The next thing I needed was a fast, signed 16bit x 16bit multiply. I got an unsigned one from the Z80 C library, and I then needed to make it a signed version. Signed multiplies are easy enough: you simply XOR the top (sign) bits of the two values, and remember if the result is a 1 or not. You then take the ABS() of these values and multiply them using the "unsigned" 16x16 multiply... then on exit, if the XOR answer from the start was 1, you negate the answer. Job done.

; ****************************************************************************************
; multiplication of two 16-bit numbers into a 32-bit product
;
; enter : de = 16-bit multiplicand = y
;         hl = 16-bit multiplicand = x
;
; exit  : hlde = 32-bit product
;         carry reset
;
; uses  : af, bc, de, hl
; ****************************************************************************************
Mul_16x16:
     ld   b,l                  ; x0
     ld   c,e                  ; y0
     ld   e,l                  ; x0
     ld   l,d
     push hl                   ; x1 y1
     ld   l,c                  ; y0

     ; bc = x0 y0
     ; de = y1 x0
     ; hl = x1 y0
     ; stack = x1 y1

     mul                       ; y1*x0
     ex   de,hl
     mul                       ; x1*y0

     xor  a                    ; zero A
     add  hl,de                ; sum cross products p2 p1
     adc  a,a                  ; capture carry p3

     ld   e,c                  ; x0
     ld   d,b                  ; y0
     mul                       ; y0*x0

     ld   b,a                  ; carry from cross products
     ld   c,h                  ; LSB of MSW from cross products

     ld   a,d
     add  a,l
     ld   h,a
     ld   l,e                  ; LSW in HL p1 p0

     pop  de
     mul                       ; x1*y1

     ex   de,hl
     adc  hl,bc
     ret
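
For reference, here's the same idea written out in C#: the unsigned multiply built from four 8x8 products (which is what the MUL instruction gives you), and a signed wrapper using the XOR-the-sign-bits trick. This is a sketch for checking the maths, not a translation of SMul_16x16, which isn't listed here; `Mul16`, `Unsigned` and `Signed` are names made up for the example.

```csharp
using System;

static class Mul16
{
    // Unsigned 16x16 -> 32 from four 8x8 partial products.
    public static uint Unsigned(ushort x, ushort y)
    {
        uint x0 = (uint)(x & 0xFF), x1 = (uint)(x >> 8);
        uint y0 = (uint)(y & 0xFF), y1 = (uint)(y >> 8);

        uint cross = x1 * y0 + y1 * x0;                  // the two cross products (can carry into bit 16)
        return x0 * y0 + (cross << 8) + ((x1 * y1) << 16);
    }

    // Signed version: remember the XOR of the sign bits, multiply the absolute
    // values, then negate the answer if the signs differed.
    public static int Signed(short x, short y)
    {
        bool negate = ((x ^ y) & 0x8000) != 0;
        ushort ax = (ushort)Math.Abs((int)x);
        ushort ay = (ushort)Math.Abs((int)y);

        uint product = Unsigned(ax, ay);                 // at most 32768 * 32768, fits in an int
        return negate ? -(int)product : (int)product;
    }
}
```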


With this done, I could now do the basic 16 bit maths I needed like this...

; var rayDirX = dirX + ((planeX * cameraX)>>8);
     ld   hl,(cameraX)
     ld   de,(planeX)
     call SMul_16x16           ; exit  : hlde = 32-bit product
     ld   h,l
     ld   l,d                  ;>>8
     ld   de,(dirX)
     add  hl,de
     ld   (rayDirX),hl

You can see, that once it's fit into 8.8 maths, a lot of complexity falls away. Aside from the 16x16 multiply, you can see the shift 8 is actually just taking the whole byte from one register to another. This basic process is relatively quick, however you have to do hundreds of them - which is the real speed issue we'll need to tackle later.
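
If 8.8 fixed point is new to you, here's the same rayDirX line as a small C# sketch. Values are stored as 16-bit integers scaled by 256, so 1.0 is 0x100. `Fixed88`, `FromDouble` and `RayDir` are illustrative names, not part of the engine.

```csharp
using System;

static class Fixed88
{
    // Convert a real number to 8.8 (value * 256).
    public static short FromDouble(double v) => (short)Math.Round(v * 256.0);

    // rayDir = dir + ((plane * camera) >> 8)
    public static short RayDir(short dir, short plane, short camera)
    {
        int product = plane * camera;          // 8.8 * 8.8 = 16.16, held in 32 bits
        return (short)(dir + (product >> 8));  // >> 8 drops back to 8.8
    }
}
```

For example, 1.0 + (0.5 * 0.5) is `RayDir(0x100, 0x80, 0x80)`, which gives 0x140, or 1.25.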


There's a few of these "blocks" to convert, but the biggest target was the delta stepping. It's important to get that as fast as possible. There are 3 different stepping functions, X axis, Y axis, and a general one that moves across both axes at once - this is the one that'll be hardest to optimise and keep the speed up with. It's important to get this one as fast as possible, because stepping across the map until you hit a block will be executed hundreds if not thousands of times, especially in large open rooms.


So here's the C# code I need to port.....
while (true)
{
    //jump to next map square, OR in x-direction, OR in y-direction
    if (sideDistX < sideDistY)
    {
        sideDistX += deltaDistX;
        mapX += stepX;
        side = 0;
    }
    else
    {
        sideDistY += deltaDistY;
        mapY += stepY;
        side = 1;
    }

    //Check if ray has hit a wall
    int map_index = (mapY * MAP_WIDTH) + mapX;
    last_tile = map.worldMap[map_index];
    if (last_tile != 0) break;
}

I spent some time fiddling with register layouts and the rest, trying to keep it all in registers as memory access is painful.
; --------------------------------- General ---------------------------
       ; while (true)
       ; jump to next map square, OR in x-direction, OR in y-direction
       ld   a,(mapX)
       ld   c,a
       ld   a,(mapY)
       ld   b,a
       ld   a,(stepX)
       ld   d,a
       ld   a,(stepY)
       ld   e,a
       exx
 
       ld   hl,(sideDistX)   ; 16
       ld   iy,(sideDistY)   ; 20
       ld   de,(deltaDistX)  ; 20
       ld   bc,(deltaDistY)  ; 20
       ld   ixl,$30          ; side
       xor  a                ; and at the end of the loop clears carry
@KeepLooping:
       ; if (sideDistX < sideDistY)
       ld   a,l              ; 4
       sbc  a,iyl            ; 8
       ld   a,h              ; 4
       sbc  a,iyh            ; 8

       jr   nc,@ix_greaterthan 
       ; sideDistX += deltaDistX;
       add  hl,de            ; 11
       ;mapX += stepX;
       exx                   ; 4
       ld   a,c              ; get mapX
       add  a,d              ; add stepX
       ld   c,a

       ; side = 0;
       ld   ixl,$30          ; 9Ts  ($30 for $3000 base address)
       jp   @skip_branch
@ix_greaterthan:

       ;sideDistY += deltaDistY;
       add  iy,bc            ; 15Ts
       ;mapY += stepY;
       exx
       ld   a,b              ; get mapY
       add  a,e              ; add stepY
       ld   b,a

       ; side = 1;
       ld   ixl,$20          ; 9Ts  ($20 for $2000 base address)
       ld   a,c              ; mapX
@skip_branch:

       ld   h,b              ; mapY
       ld   l,0
       srl  h                ; *64
       rr   l
       srl  h
       rr   l
       add  hl,Map           ; 16
       add  hl,a             ; A already mapX
       ld   a,(hl)           ; get map entry
       exx
       and  a
       jp   z,@KeepLooping

       ld   (lastblock),a  
       ld   a,ixl
       ld   (side),a
       exx
       ld   a,b
       ld   (mapY),a
       ld   a,c
       ld   (mapX),a
@FoundBlock:
So you can see I've managed to keep it all in registers - even though I had to use the alt set, and ix and iy. But that's still much faster than saving values and reloading others from memory. The X and Y axis ones are similar, but without the branches, and they don't need as many registers.

The last part is simply working out how to draw the column and drawing it. This is just a case of working out the screen address and plotting a vertical line, clipping to the top and bottom of the screen - simple compared to the rest of the stuff we've just done! Once that was done - and once I'd spent a day or so debugging it and getting it all working - I was left with this....


This was the first publicly shown version, and took about a month and a half of my spare time to get running (more or less).

Monday, February 03, 2020

RayCasting engine on the ZXSpectrumNext - Part1

So I've been making some good progress on my Ray Casting Engine - which is technically what a Wolfenstein engine is, and I thought it'd be fun to write up how I did it. It's a large, complex engine and didn't come about overnight. In fact, as I'd never written one of these before, I spent about a month (in the evenings) working out how to do it in the first place, before even starting on any Z80.
Believe it or not, I actually wrote 3 different versions before touching Z80, and then a further 2 to help debug it, but more on that later....

I did buy the Wolfenstein engine book, but actually, didn't really need it. What I ended up using was this website.


This was a great place to start, as it gives a workable demo, but I needed a framework to put it into. I went with GameMaker: Studio 2, as I'm intimately familiar with it, and it meant I could jump right in. This gave me a screen size of 640x480 (same as the demo).


Once I'd cut and pasted the example (pretty much) into GameMaker, I could start to try and figure out how it was working. The goal was to get the maths down to 8.8 fixed point, so that it would fit in Z80's 16bit registers, and I could handle the maths quickly. But before doing that, I needed to do a 16.16 fixed point version. Doing this meant I could verify that it worked, without getting too close to the maths limits, as 8.8 would be cutting pretty close. In fact these were both "signed" (15.16 and 7.8), as vectors etc would be in all directions, and so could be negative.

Converting to fixed point is pretty straightforward: for 16.16, you multiply every number by 65536, or use a <<16. So 15.453 would be 15.453<<16, which is 1,012,727 or $F73F7 in HEX. I personally think of all numbers in hex, as $F73F7 ( $000F_73F7 ), where $000F is the whole number (which is 15), and $73F7 is the fraction. This is perfect, because it means to get the whole number, you just have to take the upper 16 bits, usually by doing a >>16.

There's a heap of pages on fixed point maths, so I'll just say here that the basics are when you multiply a 16.16 number by another 16.16 number, you then >>16 to get the final answer. So...

var a = $F73F7;            // 15.453
var b = $83687;            // 8.213
var ans = (a*b)>>16;    // == $7EEA54  (126.9153)
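
If you want to check those numbers yourself, the same arithmetic can be sketched in a few lines of Python (just for illustration, the function names are mine and not from the prototype). It assumes the conversion truncates rather than rounds, which is what gives the hex values above.

```python
# 16.16 fixed point: convert, multiply, convert back.
# Function names are hypothetical, used only for this example.
FRACTION_BITS = 16

def to_fixed(value):
    """Float -> 16.16 fixed point (truncating, assumed to match the values above)."""
    return int(value * (1 << FRACTION_BITS))

def fixed_mul(a, b):
    """Multiply two 16.16 numbers, then >>16 to get back to 16.16."""
    return (a * b) >> FRACTION_BITS

def to_float(fixed):
    """16.16 fixed point -> float, handy for printing."""
    return fixed / (1 << FRACTION_BITS)

a = to_fixed(15.453)   # $F73F7
b = to_fixed(8.213)    # $83687
ans = fixed_mul(a, b)  # $7EEA54
print(hex(a), hex(b), hex(ans), to_float(ans))
```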

And that's basically it. So after converting to 16.16 (as shown below), I was happy to try and get it into 8.8

This simply meant copying the above code and doing shifts of 8 rather than 16. I also had to reduce the screen size to 128x76, down from the 640x480 that the demo used. The original Wolfie used 304x152, so we're a little under a quarter of the size. But since I can only hold a number from 0 to 127 in 7.8, and as I'm targeting the Spectrum Next's "lores" mode (128x96), this all fits pretty snugly.

Once I swapped everything over to use >>8 and <<8 type maths, I started noticing the odd missing line on the screen. This turned out to be when DX or DY was exactly 0 - stepping exactly along an axis. This is due to the maths no longer being accurate enough. There are times when you do a 1/X so that you can avoid lots of divisions (e.g. 10*0.5 is faster than 10/2, as 1/2 = 0.5). In 7.8 fixed point you really need to do $10000/X, but that's out of range, so I'm stuck with doing $7FFF/X. This has knock-on effects, but you can cope with them later. Note: technically 1.0 is $100, but as you need to shift up by 8 before doing an actual divide in fixed point, that makes it $10000, which is what is out of range, so you're stuck with $7FFF.
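
Here's a small Python sketch of that range problem, just to show the numbers (it isn't the engine's code, the names are made up, and it only deals with positive X). It doesn't show how the knock-on effects get dealt with, only where they come from.

```python
# 1/X in 7.8 fixed point: the ideal numerator versus the one that fits.
# Names are hypothetical; positive X only, to keep the sketch simple.
IDEAL_NUMERATOR = 0x10000   # 1.0 ($100) shifted up by 8, needs 17 bits
CLAMPED_NUMERATOR = 0x7FFF  # largest value a signed 16-bit register can hold

def reciprocal_ideal(x):
    """What you'd like to compute: $10000 / X."""
    return IDEAL_NUMERATOR // x

def reciprocal_clamped(x):
    """What fits in signed 16 bits: $7FFF / X."""
    return CLAMPED_NUMERATOR // x

x = 0x0200  # 2.0 in 7.8
print(hex(reciprocal_ideal(x)))    # 0x80 = 0.5
print(hex(reciprocal_clamped(x)))  # 0x3f, roughly 0.246
```

So the clamped version comes out at a little under half the true reciprocal, which is the sort of thing the rest of the maths has to allow for.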

After porting to 7.8 I did hit another issue: rotating the player's view was going "nuts". This was because the original demo rotated a vector constantly, and while floating point could handle the accuracy, 7.8 just "drifted" and these vectors stretched and went bizarre. To combat this, I created a table of 256 angles using floating point, and then took it down into 7.8 fixed point for storing. This means the player now has an "angle" he's facing, and I simply look up a perfect vector for that angle.
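
As a rough idea of what building that kind of table looks like, here's a Python sketch (for illustration only). The names, the rounding, and the choice of angle 0 pointing along +X are all assumptions on my part; the real table may well be laid out differently.

```python
import math

# Precompute 256 unit direction vectors in 7.8 fixed point.
# Names, rounding and the angle-0 direction are assumptions for this sketch.
ANGLE_COUNT = 256
ONE = 1 << 8  # 1.0 in 7.8

def build_direction_table():
    """Do the trig once in floating point, store the results as 7.8."""
    table = []
    for index in range(ANGLE_COUNT):
        radians = index * 2.0 * math.pi / ANGLE_COUNT
        dir_x = round(math.cos(radians) * ONE)
        dir_y = round(math.sin(radians) * ONE)
        table.append((dir_x, dir_y))
    return table

DIRECTIONS = build_direction_table()

def facing_vector(angle):
    """Look up the vector for an 8-bit angle; wrapping is just an AND."""
    return DIRECTIONS[angle & 0xFF]

print(facing_vector(0), facing_vector(64), facing_vector(128))
```

Because every vector comes straight from the table, nothing accumulates from one frame to the next, so there's nothing to drift.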

Once I had these issues fixed up, I started to do a Z80 port, only to discover GameMaker doesn't really do the job of fixed point-ing properly. This is due to the typeless nature of GameMaker, and that many calculations are either done in doubles, or 64bit. This isn't useful for my needs, I need something very strongly typed, so that I can make sure it all "fits", before taking the leap into Z80.

So..... I needed a C/C++ or C# framework, where I could use Int16's directly. I decided to rip the guts out of my #CSpect framework, and use that to give me a basic bitmap and keyboard input. I then ported ALL the GameMaker code (both 8.8 and 16.16 versions!) over to C# so I had a good debugging framework.

One extra bit I needed to do in C# was to deal with signed >>8. C/C++ and C# don't do signed shifts the way Z80 does, so I wrote a small signed shift right function to use instead of a simple >>8. This will be replaced in Z80 with actual shifts if needed, although usually you just take the top 16 bits directly and need no shifts at all.


The signed shift function isn't fancy, it's just designed to work as I'd expect....
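
The original function isn't reproduced here, so as a stand-in, this is a minimal Python sketch of the idea: treat the raw bits as a two's complement number, shift, and keep the sign. The name and the `bits` parameter are made up for the example, and it's not the C# code itself.

```python
# Signed (arithmetic) shift right on a raw two's complement value.
# Hypothetical helper for illustration, not the original C# function.

def signed_shift_right(value, shift, bits=32):
    """Shift a raw 'bits'-wide word right, filling from the sign bit."""
    mask = (1 << bits) - 1
    value &= mask
    if value & (1 << (bits - 1)):   # top bit set, so the number is negative
        value -= 1 << bits          # convert to a real negative number
    return (value >> shift) & mask  # Python's >> keeps the sign

print(hex(signed_shift_right(0xFFFF5700, 8)))  # 0xffffff57
print(hex(signed_shift_right(0x00030000, 8)))  # 0x300
```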


Once all THAT was done.... Actually, I want to just pause here. It may seem like I'm just throwing stuff together at a great rate of knots, saying "I'll just do this.... there, done that", but actually this all took some time. In all, I spent about a month reading about the engine and getting these prototypes up and running. It's important to know there is effort involved in things like this; no matter how experienced you are, you still need to do the grunt work. And this is all BEFORE doing any Z80 at all really!!

Now that all this "prep" work was done..... I'd figured out how the engine worked (mostly), and I had a prototype that I could step through alongside the Z80 one, so that I could see what answers the Z80 should be giving - this is invaluable on any complex bit of code (especially if you don't fully understand the engine). My goal is an almost line-for-line port of the C#, which means the variables should hold exactly the same values at every step. So stepping the Z80 and C# should give exactly the same answers, and if they don't, then the Z80 is wrong and I'll be able to figure out why.

Now comes the hard bit.... writing the Z80 port!

Monday, January 06, 2020

#CSpect V2.12.5

Fixed plugin loading. This was due to the new loading/reset system that was nuking loaded plugin mappings.
I've also added a "Tick" to the plugin interface which gets called once per emulated frame, along with a debugger call which allows you to tell CSpect to enter the debugger. This is handy if you need to debug the operation that "just" happened.
Lastly... I've included a very simple plugin example, along with the interface .CS files to look at.


#CSpect V2.12.4 changes
  • Border now comes from fallback colour if paper/ink mode is 255 (as per hardware)
  • Fixed Plugin loading. Was broken due to new system loading+resetting.
  • Added new "Tick" to iPlugin interface, called once per emulated frame
  • Added Debugger() call to CSpect interface, allowing you to enter the debugger from the emulator
  • Added Plugin example (and current interface)


Friday, January 03, 2020

#CSpect V2.12.4

Another minor fix to fix a memory access bug, and a new MMC folder issue that appeared in the last couple of versions.


SNasm 2.0.21
  • Added “SLL” instruction


#CSpect V2.12.4 changes
  • Fixed memory reading of Layer2 mapping in $2000-$3fff
  • Fixed MMC path, it was being reset when loading SNA/NEX from the command line.