Saturday, May 26, 2018

Making NEXT Lemmings: Part 4

I decided to get back to work and fix my sprite clipping for my object rendering. It turned out to be a simple fix and was just messing up when crossing a bank.



Once I got this done I went looking for more levels with lots of objects so I could test out the level rendering and get a good idea of overall performance. When I was doing levels, I used to love adding in lots of water for decoration - usually right across the level, so I went looking for one of them.
However, it looked like I wasn't converting the levels properly, as there was no water to be seen - or at least very little. I tried several of my old levels, but they were all the same, loads of water removed. Had I missed something?


After much investigation, it turns out that Windows Lemmings doesn't have all the water that the Amiga one did! What the hell!?!? I quizzed Russell Kay (who wrote Windows Lemmings), and he told me they'd removed a lot of the decorative items for performance reasons. Damn...

This was a mixed blessing: sure, it wouldn't look 100% like the Amiga one, but at the same time it meant I'd be able to keep performance up quite a bit. Oh well.... it wasn't like I could do anything about it.

Speaking of performance.... I'd been using the new ZXNext instructions a lot in my rendering code, so I suddenly started to wonder how I'd fare if I used only the original Z80 instruction set. I was in for a shock that's for sure, as the extra code required would push rendering times up massively.


You can see from the image above the huge speed boost the new instructions, LDIX in particular, give the 256 colour (Layer 2) rendering code. The image on the left uses a very standard rendering loop: load a value from (HL) into A, test to see if it's 0, branch if it is, otherwise store it in (DE), then INC HL and INC DE. LDIX does pretty much all of this in one instruction, with the added advantage that you can compare against any value in A, not just 0.
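
To make the behaviour concrete, here is a small Python model (Python rather than Z80, purely for illustration) of what an LDIRX-style copy does, going only by the description in the opcode table below: both pointers always advance, but the store is skipped when the source byte equals A. The function name `ldirx_copy` and the sample data are made up for this sketch, and it models behaviour only, not timing.

```python
def ldirx_copy(src, dst, dst_start, transparent):
    """Copy src into dst from dst_start, skipping bytes equal to `transparent`.

    Source and destination positions always advance together; only the
    store is conditional, which is what gives "free" transparency.
    """
    for offset, value in enumerate(src):
        if value != transparent:
            dst[dst_start + offset] = value
    return dst


background = [1] * 8                        # made-up background row
sprite_row = [0xE3, 7, 7, 0xE3, 9, 0xE3]    # 0xE3 used here as the transparent value
result = ldirx_copy(sprite_row, background, 1, transparent=0xE3)
print(result)
```

The transparent value is whatever happens to be in A, which is the "compare against any value" advantage mentioned above.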

There are several new instructions aimed at giving game devs more tools to speed up their code, and some of them are real beauties.

Final new Z80 opcodes on the NEXT (V1.10.06 core)
======================================================================================
   swapnib           ED 23           8Ts   A bits 7-4 swap with A bits 3-0
   mul               ED 30           8Ts   multiply D*E = DE (no flags set)
   add  hl,a         ED 31           8Ts   Add A to HL (no flags set)
   add  de,a         ED 32           8Ts   Add A to DE (no flags set)
   add  bc,a         ED 33           8Ts   Add A to BC (no flags set)
   add  hl,$0000     ED 34 LO HI     16Ts  Add $0000 to HL (no flags set)
   add  de,$0000     ED 35 LO HI     16Ts  Add $0000 to DE (no flags set)
   add  bc,$0000     ED 36 LO HI     16Ts  Add $0000 to BC (no flags set)
   outinb            ED 90           16Ts  out (c),(hl), hl++
   ldix              ED A4           16Ts  As LDI,  but if byte==A does not copy
   ldirx             ED B4           21Ts  As LDIR, but if byte==A does not copy
   lddx              ED AC           16Ts  As LDD,  but if byte==A does not copy, and DE is incremented
   lddrx             ED BC           21Ts  As LDDR,  but if byte==A does not copy
   ldpirx            ED B7           16Ts  (de) = ( (hl&$fff8)+(E&7) ) when != A
   ldirscale         ED B6           21Ts  As LDIRX,  if(hl)!=A then (de)=(hl); HL_E'+=BC'; DE+=DE'; dec BC; Loop.
   mirror a          ED 24           8Ts   mirror the bits in A     
   mirror de         ED 26           8Ts   mirror the bits in DE     
   push $0000        ED 8A LO HI     19Ts  push 16bit immediate value
   nextreg reg,val   ED 91 reg,val   16Ts  Set a NEXT register (like doing out($243b),reg then out($253b),val )
   nextreg reg,a     ED 92 reg       12Ts  Set a NEXT register using A (like doing out($243b),reg then out($253b),A )
   pixeldn           ED 93           8Ts   Move down a line on the ULA screen
   pixelad           ED 94           8Ts   using D,E (as Y,X) calculate the ULA screen address and store in HL
   setae             ED 95           8Ts   Using the lower 3 bits of E (X coordinate), set the correct bit value in A
   test $00          ED 27           11Ts  And A with $XX and set all flags. A is not affected.

New instructions like MUL, MIRROR, PIXELAD and PIXELDN are ones lots of game devs would have killed for back in the day. With the Spectrum screen being so tricky, the new instructions like pixelad and pixeldn are a godsend for developers, taking away one of the major pains and slowdowns they had in rendering.
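
For anyone who hasn't had the pleasure, this is roughly the calculation that PIXELAD replaces. It is a Python sketch of the standard ULA screen layout (256x192 pixels, bitmap at $4000), not of the Next's actual implementation; the function name `ula_pixel_address` is made up for illustration.

```python
def ula_pixel_address(y, x):
    """Address of the byte holding pixel (x, y) on the standard ULA screen."""
    if not (0 <= y < 192 and 0 <= x < 256):
        raise ValueError("coordinate outside the 256x192 ULA screen")
    return (0x4000
            | ((y & 0xC0) << 5)   # which third of the screen
            | ((y & 0x07) << 8)   # pixel row within a character cell
            | ((y & 0x38) << 2)   # character row within the third
            | (x >> 3))           # byte column (8 pixels per byte)


for y in (0, 1, 8, 64):
    print(y, hex(ula_pixel_address(y, 0)))
```

Moving down one pixel line is not a simple add because the Y bits are shuffled like this, which is why a single-instruction PIXELDN is so welcome.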

So after getting a warm fuzzy feeling at my rendering speed, I decided to try and get the SID chip working. This was before we lost it obviously. I decided to use the reSID library and loaded the DLL on startup. But I just could not get it working....


This is an image of a single channel playing a pulse wave - so it should be a simple square layout, but as you can see, the waves are not only very thin, but have odd little bumps on the top, and that odd block missing. I fought with this for a while, quickly getting nowhere, so eventually gave up and decided to stick with my own SID code from my C64 emulator. It's not great, but does sound okay, and does work - which is always a plus.

All this was working towards a new major CSpect release, to try and get as close to the actual machine as I could. This would also include the new 3xAY chip, and DMA.


DMA (Direct Memory Access controller) was something I was really wanting, as it would speed up my Lemmings rendering code hugely. When I copy the screen each game cycle, it can take 2-3 frames just for that copy as it needs to copy 38K each game tick, which for a spectrum, is a hell of a lot. DMA runs at the same speed as the CPU clock, and at 4T-States per byte copied, is a massive boost in performance. But first, I needed to get it into CSpect, and that meant understanding how it worked - beyond what most coders would care about.
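
A quick back-of-the-envelope check of those numbers, as a Python sketch. The assumptions are mine: the CPU copy is a plain LDIR at 21 T-States per byte, a scanline is the 224*4 = 896 T-States border figure for 14Mhz, a frame is 312 lines, and contention is ignored.

```python
COPY_BYTES = 38 * 1024          # the "38K" screen copy
T_PER_SCANLINE = 224 * 4        # 896 T-States per line at 14Mhz (border figure)
LINES_PER_FRAME = 312           # assumption: 48K-style frame length


def scanlines(t_per_byte):
    """Scanlines needed to copy COPY_BYTES at the given cost per byte."""
    return COPY_BYTES * t_per_byte / T_PER_SCANLINE


ldir_lines = scanlines(21)      # LDIR: 21 T-States per byte while repeating
dma_lines = scanlines(4)        # DMA: the 4 T-States per byte quoted above

print(round(ldir_lines), round(ldir_lines / LINES_PER_FRAME, 1))
print(round(dma_lines), round(dma_lines / LINES_PER_FRAME, 2))
```

Under those assumptions the LDIR copy comes out at just under three frames, which matches the 2-3 frames above, while the DMA copy fits in a little over half a frame.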

I spent a while hunting for more info on the DMA chip, and finally found the datasheet for it, which you can find on an earlier blog post ( DMA Datasheet ). It's a little confusing, but with the help of Victor I stumbled through creating the state machine inside CSpect. The DMA is basically a set of registers that you set by doing a stack of OUTs, with the first byte of the instruction telling the DMA controller what registers follow. Once I had this in place, Victor gave my little DMA sample code a once over, testing it on the real hardware, and I was then able to also get something running locally.

DMA has a few modes, it can either increment, decrement or not move the source or destination, and it can go to RAM or a PORT. So I started off by trying to DMA a stack of data to the border and see what happens...


After a bit of fiddling around, I finally got the DMA working. I had to rearrange my CSpect processing loop as the DMA locks out the CPU, but I still needed the screen to render each scanline based on the number of T-States the DMA was taking from the machine overall. It's certainly not perfect, but it doesn't have to be. CSpect is all about making it easy to code for the Next, not about making it pixel perfect.

Next I wanted to do a memory to memory copy, so I grabbed a screen shot and DMA'd it up and got the image below - this was at 28Mhz...


It's a shame we've lost the 28Mhz, as it's ballistically quick. Here you can see I can copy a normal spectrum screen in about 16 scan lines - although this is probably without the old memory contention in there, but no matter what, it's still incredibly quick. That's not to say 14Mhz isn't quick as well mind, and the speed up it gives me for my Lemmings screen copy code is well worth the effort. Here's the little DMA program that copies the screen above (which is included in the CSpect archive)...

DMA     db $C3          ;R6-RESET DMA
        db $C7          ;R6-RESET PORT A Timing
        db $CB          ;R6-SET PORT B Timing same as PORT A

        db $7D          ;R0-Transfer mode, A -> B
        dw ScreenDump   ;R0-Port A, Start address    (source address)
        dw 6912         ;R0-Block length     (length in bytes)

        db $54          ;R1-Port A address incrementing, variable timing
        db $02          ;R1-Cycle length port A

        db $50          ;R2-Port B address incrementing, variable timing
        db $02          ;R2-Cycle length port B

        db $C0          ;R3-DMA Enabled, Interrupt disabled

        db $AD          ;R4-Continuous mode  (use this for block transfer)
        dw $4000        ;R4-Dest address     (destination address)

        db $82          ;R5-Stop on end of block, RDY active LOW

        db $CF          ;R6-Load
        db $B3          ;R6-Force Ready
        db $87          ;R6-Enable DMA
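
To show what "the first byte tells the DMA controller what registers follow" means in practice, here is a rough Python sketch of a decoder for that byte stream. It follows my reading of the register layout in the Zilog Z80 DMA datasheet and is not CSpect's actual state machine; it deliberately ignores the optional WR4 interrupt control bytes, and the `ScreenDump` address of $8000 is a placeholder chosen for the example.

```python
def follow_count(base):
    """Return (register name, number of parameter bytes that follow)."""
    if not base & 0x80:
        if base & 0x03:                           # WR0: address/length bytes
            return "WR0", bin(base & 0x78).count("1")
        name = "WR1" if base & 0x04 else "WR2"    # port A / port B setup
        return name, 1 if base & 0x40 else 0      # optional timing byte
    group = base & 0x03
    if group == 0:                                # WR3: mask / match bytes
        return "WR3", bin(base & 0x18).count("1")
    if group == 1:                                # WR4: port B address low/high
        return "WR4", bin(base & 0x0C).count("1")
    if group == 2:
        return "WR5", 0
    return "WR6", 1 if base == 0xBB else 0        # only "read mask" has a parameter


def decode(stream):
    """Split a DMA program into (register, base byte, parameter bytes) groups."""
    out, i = [], 0
    while i < len(stream):
        name, n = follow_count(stream[i])
        out.append((name, stream[i], stream[i + 1:i + 1 + n]))
        i += 1 + n
    return out


SCREEN_DUMP = 0x8000   # placeholder source address
program = [0xC3, 0xC7, 0xCB,
           0x7D, SCREEN_DUMP & 0xFF, SCREEN_DUMP >> 8, 6912 & 0xFF, 6912 >> 8,
           0x54, 0x02, 0x50, 0x02, 0xC0,
           0xAD, 0x00, 0x40, 0x82, 0xCF, 0xB3, 0x87]

groups = decode(program)
for name, base, params in groups:
    print(name, hex(base), [hex(p) for p in params])
```

Each group is one "instruction" in the sense used above: a base byte whose bits say how many parameter bytes to expect before the next base byte arrives.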

With the DMA now running in CSpect, I thought I'd give some of the old DMA demos a go, see how compatible I am.


It was pretty cool seeing these demos "just work", and showed my DMA code was working well.

I was about to take a break as I headed out to Orlando with the family, but that didn't stop me having a little fun on the plane as we headed out...


It would be a few months before I picked up any of this again, as work got busy and deadlines loomed...



Wednesday, May 16, 2018

CSpect V1.11

Minor update to CSpect

CSpect changes
  • Fixed Lowres window (right edge)
  • You can now use long filenames in RST $08 commands (as per NextOS), can be set back to 8.3 via command line
  • Layer 2 now defaults to banks 9 and 12 as per NextOS
  • Added command line option to return $FF from port $1f
  • Fixed a possible issue in loading 128K SNA files. Last entry in stack (SP) was being wiped - this may have been pointing to ROM....
  • Fixed mouse buttons return value (bit 0 for button 1, bit 1 for button 2)





Monday, May 14, 2018

Making NEXT Lemmings: Part 3

Now that I knew things were actually possible, it was time to start thinking about the levels themselves. I started reading up on the level format, a text file by someone called "rt"... (Thank you whoever that is!!)

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
          LEMMINGS .LVL FILE FORMAT
                    BY rt
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

document revision 0.0

thanks to TalveSturges for the original alt.lemmings posting which got me
started on decoding the .lvl format

if you liked lemmings you should give CLONES a try. go to www.tomkorp.com
for more information. 

this document will explain how to interpret the lemmings .lvl level
file for the windows version of lemmings (directly saved levels in LemEdit).

This document is also on my Github repository.

As it happens, I was flying off to LA for a holiday with my daughter, so an 11 hour flight was just the thing I needed to blitz this chunk of code...


I first converted and exported the level brush .SPR files into a large set of sprite banks, with a table at the start that held start addresses, banks, widths and heights, followed by bitmap data. Once I had this, I was able to start writing a "level bob" routine, something that would draw these into my 1600x160 level bitmap (which is the size of a level in Lemmings - 5x320x160 screens).

This immediately turned nasty, as working out the "Y" coordinate was horrible - this was before the MUL instruction appeared. I created a large "line" table that held the address of each line and the bank it started in, and started to use that to index down into the level.

This was "okay", but it was ugly... and the table wasn't exactly small either. It also had a problem that a single line could cross a bank, which wouldn't do at all. I could get round this by not filling the last 384 bytes of each 16K bank, but still....
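
The 384 byte figure falls straight out of the arithmetic, shown here as a short Python sketch. It assumes the 1600 byte lines are packed back to back from the start of the first bank, which is my reading of the layout rather than something stated above.

```python
BANK_SIZE = 16 * 1024
LINE_BYTES = 1600

whole_lines = BANK_SIZE // LINE_BYTES             # complete lines per bank
unused = BANK_SIZE - whole_lines * LINE_BYTES     # bytes left over at the end
print(whole_lines, unused)


def crosses_bank(y):
    """True if line y straddles a 16K boundary when lines are tightly packed."""
    start = y * LINE_BYTES
    end = start + LINE_BYTES - 1
    return start // BANK_SIZE != end // BANK_SIZE


straddling = [y for y in range(160) if crosses_bank(y)]
print(straddling[:3])
```

Leaving the last 384 bytes of each bank empty means every bank holds exactly 10 whole lines, so no line straddles a boundary.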

So, I decided to scrap all this, and burn some of this lovely new memory the NEXT had. I changed the background level bitmap into a 2048x160 screen, burning an extra 70K+ of RAM. However... this now meant that the line address was just the Y coordinate shifted a couple of times and stuck in H (of HL), with the base of the bank ($C000) added on. It also meant that by banking in 16K at a time, if bit 6 of H reset to 0, I needed to swap bank. This was much nicer...
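
Here is how that works out, as a Python sketch. With 2048 bytes per line there are exactly 8 lines per 16K bank, so I'm assuming the bank is Y divided by 8 and the low three bits of Y go into H; the names `line_address` and `needs_bank_swap` are made up for illustration.

```python
LINE_BYTES = 2048
LINES_PER_BANK = (16 * 1024) // LINE_BYTES       # 8 lines per 16K bank


def line_address(y):
    """Return (bank index, address) for the start of level line y."""
    bank = y // LINES_PER_BANK                   # relative to the first level bank
    h = 0xC0 | ((y % LINES_PER_BANK) << 3)       # $C0 base plus line * 8 in H
    return bank, h << 8


def needs_bank_swap(h):
    """Bit 6 of H is set everywhere in $C000-$FFFF and clears on wrap to $00xx."""
    return (h & 0x40) == 0


extra_ram = (2048 - 1600) * 160                  # cost of the wider bitmap
print(extra_ram // 1024)
print(line_address(0), line_address(7), line_address(8))
print(needs_bank_swap((0xF8 + 8) & 0xFF))        # stepping past the last line in a bank
```

No line table and no multiply: just a few shifts, plus a single bit test to spot the bank boundary.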

Doing the basic draw routine was fairly simple, I just had to clip top and bottom. I decided (once again) to cheat like mad and not actually clip things. As this was just being used at level creation time, I simply looked at the Y coordinate, and if it was off screen I didn't draw it, and moved onto the next one. Ideally you'd work out how much you need to clip and only draw that, but for this case, performance wasn't the number one goal, so I thought "Sod it".

I also realised that with my longer level, I didn't actually have to clip on X either, I could just start drawing 16 pixels further on. That was fortunate.

With a simple draw blob in place, I looped through all the brushes to draw, and built the screen.

For a first version, this was pretty good I thought. You can see what it's supposed to be, and it looked pretty close to the actual level. It wasn't perfect though. Lemmings background brushes can have a few special draw modes: Flip on Y, Behind and Remove. Each of these modes can be combined, giving 8 different draw modes in total.

While on the Las Vegas leg of our trip I finished off the level loading and added the missing modes...




It took me a while to figure out what was wrong with the image that has the corruption, especially as it was the correct shape - very odd. But turns out the brush I was drawing from was overflowing the bank it was in, so I'd need to do a check on exporting to make sure each line of the sprite is inside the same bank. I can't make the whole sprite fit, as the brushes can get really quite large and won't fit. I've yet to fix this bug... but I'll get around to it one day.

As you can see, I'd gotten most levels now loading fine. After this, I set about actually finishing off the export of all the lemmings as code that I could call, and actually use.


This took a little longer than I expected. I had to make sure the generated code wasn't going to cross a bank, and make the offset table that went before it all. But once I finally got it all together, I was able to flick through the different graphics inside the game (above).

It was getting to the point where I wanted to have the objects in the level. This was really the last major item missing, not just in terms of level components, but it would also show me how fast things would run. Lots of the levels have a load of objects, and they would take up a lot of rendering time, so I really needed to see if it would still be fast enough once I added them.

I managed to get the sprites exported okay, but the colours just weren't right. I ended up having to load new palette files, one for each style, and convert them into the NEXT RRRGGGBB format. Once I had that, exporting the objects was just the same as the level brushes.
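
The palette conversion itself is tiny. This Python sketch assumes the source palette is 8 bits per channel and simply truncates each channel, which may not be exactly what the real converter does (it could round, or start from a different bit depth); `to_rrrgggbb` is a made-up name.

```python
def to_rrrgggbb(r, g, b):
    """Pack 8-bit-per-channel RGB into one RRRGGGBB byte by truncation."""
    for channel in (r, g, b):
        if not 0 <= channel <= 255:
            raise ValueError("channels must be 0-255")
    return ((r >> 5) << 5) | ((g >> 5) << 2) | (b >> 6)


for rgb in ((255, 255, 255), (255, 0, 0), (0, 255, 0), (0, 0, 255)):
    print(rgb, hex(to_rrrgggbb(*rgb)))
```

Blue only gets two bits, so blues and greys are where the conversion loses the most.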

Once I'd exported this, I suddenly remembered those arrows.... damn it. They appear "inside" the background, so this meant I wouldn't be able to use the LDIX instruction, but would have to do it long hand. Worse yet, there were always lots of these on a level, meaning I'd have to draw loads of these objects using very slow code.... Bugger.  Well, that was one I would look to address later.

With objects I needed "standard" sprite rendering code as the objects must be clipped right/left and top/bottom, and they have similar drawing modes to the background - normal, behind and inside.
In this case, I opted for a tower of LDIX instructions, and then jumped into the middle of them depending on how many pixels on X I was drawing. I then looped around this on Y, checking for bank swapping as I went. This gave me the image below....
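
The "jump into the middle" part is just address arithmetic, sketched here in Python. LDIX is two bytes (ED A4) and 16 T-States per the opcode table; the tower start address, tower length and function name are hypothetical.

```python
LDIX_BYTES = 2       # ED A4
LDIX_TSTATES = 16


def tower_entry(tower_start, tower_len, width):
    """Address to jump to so that exactly `width` LDIX instructions run."""
    if not 0 <= width <= tower_len:
        raise ValueError("width must fit inside the tower")
    return tower_start + LDIX_BYTES * (tower_len - width)


TOWER = 0x8000       # hypothetical address of a 16-entry tower
for width in (16, 10, 1):
    print(width, hex(tower_entry(TOWER, 16, width)), width * LDIX_TSTATES)
```

A clipped sprite just enters the tower later, so there is no per-pixel loop counter to maintain.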


After how fast I got the Lemming rendering, I was a little disappointed at the speed of this, especially as I know there are levels with water all over the place. Damn. It wasn't the end of the world, but it may well be the beginning of the end.... Each sprite in this case takes about its own height in scanlines to render, and if I have several on screen at once, that's easily going to chew up a frame or so.

Still, I knew I had some tricks up my sleeves so I cracked on. To give me a little break and some thinking time, I decided to start drawing the Lemming font. Time away from actual code is important, as it gives your mind time to chew over some problems. Most coders I know will come up with solutions to things at the funniest times: in the shower, having a poo, out for a walk - all of them away from the computer.

I used the image editor in GameMaker Studio 2, as it's a great pixel editor, and very like D-Paint ( can't for the life of me think why! ).


The Amiga font was 16x8, but due to the smaller screen, I decided to reduce it down to just 8x8. This works pretty well, and as you can see I've "mostly" managed to keep the feel of the original.


I've still to do the numbers, but this gave me a welcome break and let me know I was on a good track for my screen layouts.









Sunday, May 06, 2018

Making NEXT Lemmings: Part 2

Before getting onto Plan B, I decided to get a nice smooth cursor. Reading the mouse every 3 or 5 frames will make the whole thing feel sluggish. Perception is everything in games, and if you feel like input is responsive, you'll ignore a lot of a game's flaws. So even though the game will only check clicks and positions every 3 to 5 frames, having the cursor move every frame makes it feel like the game's moving and playing much more smoothly. Oh... and Hardware sprites are a godsend for this kind of thing, it makes it basically "free".
I used the same trick in Blood Money on the C64. The game ran in 3 frames as well as it was a slow scrolling game, but the player sprites moved every frame, and so it all felt quite responsive.
I started by moving the mouse reading into my interrupt routine, and it all promptly fell flat on its face. Well... that wasn't the plan. Turns out, I needed to "save" the current NEXT register (this was before the NextReg instruction appeared). So on entry into the IRQ, I now save the current register, restoring it on exit, and suddenly everything was looking much nicer. In fact it was looking like a "proper" lemmings level, cursor and everything!
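
Here is a toy Python model of why that save/restore matters. Setting a NEXT register is a two-step job (select with out $243B, then data with out $253B), so an interrupt landing between the two steps can change the selection under the main code. The register numbers and the class are invented for the sketch, and it assumes the current selection can be read back on entry to the IRQ.

```python
class NextRegs:
    """Toy model of the select/data port pair."""

    def __init__(self):
        self.selected = 0
        self.values = {}

    def select(self, reg):
        self.selected = reg

    def write(self, value):
        self.values[self.selected] = value


def irq(regs, preserve):
    saved = regs.selected          # read back the current selection on entry
    regs.select(0x34)              # hypothetical register touched by the IRQ
    regs.write(1)
    if preserve:
        regs.select(saved)         # restore on exit


def main_code(preserve):
    regs = NextRegs()
    regs.select(0x12)              # main code selects a register...
    irq(regs, preserve)            # ...and the interrupt fires before the data write
    regs.write(9)
    return regs.values


print(main_code(preserve=False))   # the 9 lands in the wrong register
print(main_code(preserve=True))    # the 9 lands where the main code intended
```

The later NEXTREG instruction does the select and the write in one go, which avoids this particular trap for code that uses it.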


Also, now that I had managed to extract the .SPR format graphics, I was also able to extract the level brushes that are used to build the level. This was great, as it meant I could use the original level files, and not the 320K "bitmaps" I had been using.


Not having to convert the level files is a bonus to be sure. The level files are much smaller, and also well documented. They not only include the level bitmap data, but also objects and collision. So using the level format was going to be far more useful.

About this time, I also realised I'd need a debugger in my emulator.... bleh. More tools work, but again one that would pay dividends later. So yet again, I spent a few weeks writing a good debugger, one that is actually usable for development, not just hacking. This also involved loading the symbol file from SNasm so I could display labels. I've always had a very specific view of how assembly debuggers should work, and the "view" you should use. I've used stacks of truly terrible debuggers over the years, simple scrolling registers and "current" opcode ones stand high on my hate list...

The debugger has been updated over time like the rest of the tool, but it's pretty useful now. I've used the same layout for almost 30 years.... I wrote my first useful debugger for the PC Engine back in '91, and used almost exactly the same kind of layout. It was the one I used on the PDS (Personal Development System) kit I used to make C64 games in the late 80's, and it works incredibly well.


A free moving, scrolling disassembly window is vital, it just makes life MUCH simpler, and a static place for registers is just as important, as your eyes will always drift to that fixed location when actually debugging. If it moves around, you'll waste time trying to find it. A second or two each "step" adds up.
The rest of the space just depends on the machine, and what hardware it has. On the PC Engine, I displayed VRam and some hardware info, on the Next I display the screen and some hardware registers - I'll show them all eventually.
The column of numbers on the right is my latest innovation.... a CPU execution list. This is a godsend when the machine crashes and resets. I can just flick back over this (massive!) list until it gets back into the game, and I can see where it went nuts. I've used this a few times already.

Anyway... back to Lemmings and Plan B! So when you have any kind of performance issue, there's 2 ways to approach it.
  1. Try and optimise the hell out of the slow function.
  2. Try a totally different method.
Now it was clear that unrolling loops, making towers of code etc. just wouldn't work in this case. Even taking the loops away would only remove a scanline or 2 at most. What I needed was the fastest way possible to stick pixels on the screen. So....what is that exactly?

Well, if you think about how a sprite is drawn, you see you have a source address - the sprite, the destination address - the screen, and you need to copy the data from one to the other. This normally consists of loading A with a value and sticking it into the destination. So how about.... we just write a sprite as a series of load/store instructions? Kind of like this...

LD (IX+0),COLOUR1      ; 19 T-States
LD (IX+1),COLOUR1
LD (IX+2),COLOUR2
LD (IX+3),COLOUR1      ; = 76 T-States

And so on... Now... IX would have been ideal, as you could point it to the top of the graphic, and it could offset along a line, but storing via IX takes 19 T-States. That's pretty damn slow - in fact, it's even slower than our LDIX which is just 16. However.... LD (HL),A is only 7.... that's a fair speed up. And Lemmings by their nature have "runs" of colour: green hair, white face, blue body and so on. That means we'd only need to reload A when the colour changes. This turns the above code into this...

LD A,COLOUR1           ; 7 T-States
LD (HL),A              ; 7
INC HL                 ; 6
LD (HL),A
INC HL
LD A,COLOUR2
LD (HL),A
INC HL
LD A,COLOUR1
LD (HL),A              ; = 67 TStates

Also.... LD (HL),$XX is only 10 T-States, so if we need to change "A" for only a pixel or so, we could just store the value directly, which would save reloading A.

LD A,COLOUR1          ; 7 T-States
LD (HL),A             ; 7
INC HL                ; 6
LD (HL),A             ; 7
INC HL                ; 6
LD (HL),COLOUR2       ; 10
INC HL                ; 6
LD (HL),A             ; 7 = 56 TStates

Getting there.... Now, as the Layer 2 screen is only 256 wide, we also know on a single line, we'll never cross a 256 byte boundary. So we don't need to increment HL, only L.

LD A,COLOUR1          ; 7 T-States
LD (HL),A             ; 7
INC L                 ; 4
LD (HL),A             ; 7
INC L                 ; 4
LD (HL),COLOUR2       ; 10
INC L                 ; 4
LD (HL),A             ; 7 = 50 TStates
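
For anyone wanting to double-check the running totals, here is the same tally as a small Python sketch. The per-instruction costs are the ones quoted in the listings above; the variable names are just labels for the four versions.

```python
T_STATES = {
    "LD (IX+d),n": 19,
    "LD A,n": 7,
    "LD (HL),A": 7,
    "LD (HL),n": 10,
    "INC HL": 6,
    "INC L": 4,
}


def cost(ops):
    """Total T-States for a straight-line instruction sequence."""
    return sum(T_STATES[op] for op in ops)


via_ix = ["LD (IX+d),n"] * 4
via_a = ["LD A,n", "LD (HL),A", "INC HL", "LD (HL),A", "INC HL",
         "LD A,n", "LD (HL),A", "INC HL", "LD A,n", "LD (HL),A"]
with_immediate = ["LD A,n", "LD (HL),A", "INC HL", "LD (HL),A", "INC HL",
                  "LD (HL),n", "INC HL", "LD (HL),A"]
with_inc_l = [op.replace("INC HL", "INC L") for op in with_immediate]

totals = [cost(v) for v in (via_ix, via_a, with_immediate, with_inc_l)]
print(totals)
```

Four pixels go from 76 T-States down to 50, and that is before skipping the pixels that don't need plotting at all.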

Now we're talking! Lastly, not every pixel in a (say) 6x9 sprite is plotted, so unlike a load of LDIXs which is basically 6*9*16 = 864, just for the load/store parts never mind the management and line changing, we can get by with "JUST" the pixels we need to draw. This might mean doing 2 INC Ls, or an ADD HL,$0000 (another new NEXT opcode) to skip pixels, but that's still faster than storing. A normal Lemming is probably half, to 3 quarters actual pixels to plot, so that's a big saving.

I decided to just sit down and manually write a Z80 function for drawing a single lemming - which while it took a bit of time, meant I could hand tune this code to make it as fast as possible.

A good tip for when doing any kind of R&D like this, is that you should always use the machine to its maximum and then work back from there. Sure, you might manage to do something using 1Mb of code or tables, and that may not work in a game situation, but at least you know it IS possible. There's normally a middle ground where you lose a little speed, but maintain most of the benefits. I've done a lot of R&D code in my time that's turned into real code, and this has always been the best approach.


This is a snippet of the hand-built Lemming draw function. You can see I try and reduce reloading as much as possible, but you do have to do the "newline" section after each line, as you could be bank swapping. This did work out fairly well though, and I managed to reduce the Lemming drawing down to just over a scanline!

Here you can see the new Lemming speed (timing bar in white), and the old one (in red). This was definitely looking more promising. This means I could now draw a screen full in under a frame, which is certainly a requirement if I have any hope of achieving my target frame rate of 17fps. (same as the Amiga version).

Now that I saw this was possible, it was time to take the next step.... That step, is to automatically generate Z80 code based on a graphic so I didn't have to hand build hundreds of Lemming graphics. So inside my Lemmings graphics converter, I started to scan each lemming sprite, and generate a large, optimised, Z80 function that would draw it.
As I progressed I came up with more rules and optimisations that would help. This was pretty cool, because each time I sped things up, I could just regenerate ALL my graphics and everything gets faster! How cool is that!
For example, I wasn't using the DE register pair, so I was able to pre-load D and E with the 2 most common colours. This meant I didn't have to continually reload them, in fact only if there was a run of 3 pixels of the same colour would I need to load A at all.
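
The "run of 3" threshold can be checked with a couple of lines of Python. This is my own working, using the costs quoted earlier (LD A,n = 7, LD (HL),A = 7, LD (HL),n = 10) for a colour that isn't already sitting in D or E; the INC L between pixels is the same either way, so it is left out.

```python
def immediate_cost(run):
    """Store each pixel of the run with LD (HL),n."""
    return 10 * run


def via_a_cost(run):
    """Load A once with LD A,n, then store each pixel with LD (HL),A."""
    return 7 + 7 * run


for run in range(1, 5):
    print(run, immediate_cost(run), via_a_cost(run))

break_even = next(r for r in range(1, 10) if via_a_cost(r) < immediate_cost(r))
print(break_even)
```

For one or two pixels the immediate store wins, and from three pixels onwards loading A pays for itself.
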
I use the special "write" only mode of Layer 2, in location $0000 to $3FFF, and this allows me to write some simple, and fast, bank swapping code.


You can see the fruits of my labour above. As you can also see in the video, the next major issue I would have, is clipping. If you draw sprites normally, you can simply reduce the loop size to draw less, but with code drawing things, you can no longer do that. But, as the sprites are less than 16 pixels wide I decided to reduce the screen size by 8 pixels on either side, meaning I no longer had to clip left/right - sweet! However.... vertical clipping is a real issue as it has the potential to run off into memory, or come back in the bottom.
Clipping the top is by far the most difficult, as the panel at the bottom gives me a way to "replace" corrupted graphics. At the moment I simply copy the panel each frame, "fixing" the overdraw, but this is only a temporary solution. Ideally I'd use a raster IRQ and flip buffers so I can free up that copy - or a simple copper list (which would be better, but we didn't have that yet!).
The top of the screen however, is an issue.... how can you simply clip the top of the sprite? I could build a table of jumps and jump into the code at the right point, but that would be slow, and a nightmare to generate. What I decided to do was to draw the sprite "backwards". This means I could ignore the lower clipping as I planned, then as I draw upwards, detect going off the top of the screen (by looking to see the bank "loop"), and simply RET from the function. This turned out to work just fine...  here's a snippet of the auto generating code function...

{
   switch (bc)
   {
       case 0: break;
       case 1: sb.AppendLine("\t\tinc\tl"); tstates += 4; bytes += 1; buff.Write(0x2C, 1); break;
       case 2:
               sb.AppendLine("\t\tinc\tl"); tstates += 4; bytes += 1; buff.Write(0x2C, 1);
               sb.AppendLine("\t\tinc\tl"); tstates += 4; bytes += 1; buff.Write(0x2C, 1);
               break;
       case 3:
               sb.AppendLine("\t\tinc\tl"); tstates += 4; bytes += 1; buff.Write(0x2C, 1);
               sb.AppendLine("\t\tinc\tl"); tstates += 4; bytes += 1; buff.Write(0x2C, 1);
               sb.AppendLine("\t\tinc\tl"); tstates += 4; bytes += 1; buff.Write(0x2C, 1);
               break;
       default:
               // "HL" never crosses a 256 byte boundary here, so just add to L. 15 Tstates, 4 bytes (1 T-State faster than add hl,$0000)
               sb.AppendLine("\t\tld\ta," + (bc & 0xff).ToString()); tstates += 7; bytes += 2; buff.Write(0x3E | (((uint)(bc & 0xff)) << 8), 2);
               sb.AppendLine("\t\tadd\ta,l"); tstates += 4; bytes += 1; buff.Write(0x85, 1);
               sb.AppendLine("\t\tld\tl,a"); tstates += 4; bytes += 1; buff.Write(0x6f, 1);
               a = bc & 0xff;
               break;
    }
}

In this section, each time I move along the line to the next pixel, I check to see what way is fastest. For anything under 4 bytes, you just INC L = 12 T-States (max). For any more, you add using A, which takes 15 T-States, one faster than the new ADD HL,$0000 instruction. Lots of these little tricks help drive the count down and make the function faster. You'll also notice I generate ASM source and byte code. This was so I could not only debug it, but also view a complete function and see if there were any little tricks I was missing.
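
The same decision, boiled down to a Python sketch that returns the instructions and their cost for a given skip. It mirrors the switch above but leaves out the byte-code output and the tracking of A; `cheapest_skip` is a made-up name.

```python
ADD_HL_IMMEDIATE_TSTATES = 16    # add hl,$0000 from the opcode table


def cheapest_skip(count):
    """Return (instructions, T-States) for moving L along by `count` bytes."""
    if count < 0:
        raise ValueError("count must not be negative")
    if count <= 3:
        return ["inc l"] * count, 4 * count
    # Clobbers A, so the caller must treat A's old contents as lost.
    return [f"ld a,{count & 0xFF}", "add a,l", "ld l,a"], 7 + 4 + 4


for count in (0, 1, 3, 4, 20):
    print(count, cheapest_skip(count))
```

Four INC Ls would cost 16 T-States, the same as ADD HL,$0000, which is why the switch changes strategy at exactly that point.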

So to round off this entry, here's an example of the auto-generated code in its final state for drawing the sprite below....


Remember, this draws from the BOTTOM-LEFT most pixel, across to the right, then jumps back up one line and back to the start of the line again...

At 14Mhz in the border we have 224*4 = 896 T-States (while the screen area is more complicated as it drops to 7Mhz over the screen itself). As you can see from below, a typical lemming is now drawing in well under a scanline. Sure, there is surrounding management code, but this is pretty good going, and means I can draw all my lemmings in well under a frame - probably around 150-200 scan lines (taking the CPU speed drop into account).
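
Putting numbers on that, here is a quick Python check using the two T-State totals printed at the end of the listing (714 and 762) and the 59 T-States of common code noted at the top of it. It only considers the 896 T-State border scanline, so the 7Mhz slowdown over the screen area is ignored.

```python
T_PER_SCANLINE = 224 * 4        # 896 T-States at 14Mhz in the border
SPRITE_TOTALS = (714, 762)      # totals quoted at the end of the listing
COMMON_CODE = 59                # common code outside the function

for total in SPRITE_TOTALS:
    alone = total / T_PER_SCANLINE
    with_common = (total + COMMON_CODE) / T_PER_SCANLINE
    print(total, round(alone, 2), round(with_common, 2))
```

Even with the common code added on, both totals stay inside a single border scanline.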

If you happen to spot a possible speed-up, please let me know. I do know I could use JP instead of JR for my next line test, but working out the address instead of being relative would be a major pain in the butt, so I've left it for now. I may go back and change it later if I'm feeling adventurous.
I could also move more out of the common code section, but for the rest, feel free to point anything out. (*both these have now been done)

Lastly.... remember there will only ever be one case where it could bank swap. No graphics are large enough to swap 2 banks, and usually it won't swap at all.
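
The "New line" block that repeats through the listing can be modelled in a few lines of Python. This is my reading of the generated code rather than anything from the generator itself: HL lives in the $0000-$3FFF write window, stepping up past $0000 wraps H to $FFxx (setting bit 6), and the value held in A' is assumed to select the bank with its top two bits so that SUB $40 moves up one bank and goes negative when there is nothing above.

```python
def new_line(hl, bank_value, step):
    """One 'move up a line' step; returns (hl, bank_value, still_on_screen)."""
    hl = (hl + step) & 0xFFFF                      # add hl,-258 and friends
    if hl & 0x4000:                                # bit 6,h set: wrapped below $0000
        hl &= 0x3FFF                               # and a,$3f on H
        bank_value = (bank_value - 0x40) & 0xFF    # sub $40
        if bank_value & 0x80:                      # ret m: off the top of the screen
            return hl, bank_value, False
        # the real code does out (c),a here to page in the new bank
    return hl, bank_value, True


print(new_line(0x0105, 0x43, -258))   # stays inside the current bank
print(new_line(0x0003, 0x43, -258))   # wraps into the bank above
print(new_line(0x0003, 0x03, -258))   # wraps off the top: stop drawing
```

Drawing upwards means the only clipping test needed is this bank check, which the code has to do anyway.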

EDIT: Code below has been updated.


; Sprite Number 22
  ; Common code = 59 T-States (outside function)
  ; HL = screen address [y,x]
  ld de,20479  ;Most common 2 colours

  ld a,5
  add a,l
  ld l,a
  ld (hl),e
  add hl,-258   ;move back to start of line, and up one line

  ;New line
  bit 6,h
  jp z,@NoBankChange9
  ld a,h
  and a,$3f
  ld h,a
  ex af,af'
  sub $40
  ret m
  out (c),a
  ex af,af'
@NoBankChange9:

  ld (hl),d
  inc l
  ld (hl),e
  add hl,-259   ;move back to start of line, and up one line

  ;New line
  bit 6,h
  jp z,@NoBankChange8
  ld a,h
  and a,$3f
  ld h,a
  ex af,af'
  sub $40
  ret m
  out (c),a
  ex af,af'
@NoBankChange8:

  ld (hl),e
  inc l
  ld (hl),d
  inc l
  ld (hl),d
  add hl,-259   ;move back to start of line, and up one line

  ;New line
  bit 6,h
  jp z,@NoBankChange7
  ld a,h
  and a,$3f
  ld h,a
  ex af,af'
  sub $40
  ret m
  out (c),a
  ex af,af'
@NoBankChange7:

  ld (hl),e
  inc l
  inc l
  ld (hl),d
  inc l
  ld (hl),d
  add hl,-259   ;move back to start of line, and up one line

  ;New line
  bit 6,h
  jp z,@NoBankChange6
  ld a,h
  and a,$3f
  ld h,a
  ex af,af'
  sub $40
  ret m
  out (c),a
  ex af,af'
@NoBankChange6:

  ld (hl),e
  inc l
  inc l
  ld (hl),d
  inc l
  ld (hl),d
  inc l
  inc l
  ld (hl),e
  add hl,-260   ;move back to start of line, and up one line

  ;New line
  bit 6,h
  jp z,@NoBankChange5
  ld a,h
  and a,$3f
  ld h,a
  ex af,af'
  sub $40
  ret m
  out (c),a
  ex af,af'
@NoBankChange5:

  ld (hl),e
  inc l
  ld (hl),d
  inc l
  ld (hl),d
  inc l
  ld (hl),e
  add hl,-258   ;move back to start of line, and up one line

  ;New line
  bit 6,h
  jp z,@NoBankChange4
  ld a,h
  and a,$3f
  ld h,a
  ex af,af'
  sub $40
  ret m
  out (c),a
  ex af,af'
@NoBankChange4:

  ld (hl),d
  inc l
  ld (hl),d
  add hl,-258   ;move back to start of line, and up one line

  ;New line
  bit 6,h
  jp z,@NoBankChange3
  ld a,h
  and a,$3f
  ld h,a
  ex af,af'
  sub $40
  ret m
  out (c),a
  ex af,af'
@NoBankChange3:

  ld (hl),e
  inc l
  ld (hl),e
  inc l
  ld (hl),e
  inc l
  ld (hl),20
  add hl,-258   ;move back to start of line, and up one line

  ;New line
  bit 6,h
  jp z,@NoBankChange2
  ld a,h
  and a,$3f
  ld h,a
  ex af,af'
  sub $40
  ret m
  out (c),a
  ex af,af'
@NoBankChange2:

  ld (hl),e
  inc l
  ld (hl),20
  inc l
  ld (hl),20
  add hl,-258   ;move back to start of line, and up one line

  ;New line
  bit 6,h
  jp z,@NoBankChange1
  ld a,h
  and a,$3f
  ld h,a
  ex af,af'
  sub $40
  ret m
  out (c),a
  ex af,af'
@NoBankChange1:

  ld (hl),20
  inc l
  ld (hl),20

  ld a,2
  out (c),a
  ret
  ; T-States=714/762     bytes =246