Saturday, July 07, 2007

65816: The power of 16bit.

I've been wondering just how much faster the SuperCPU actually is than a stock C64. Aside from the x20 jump you get from the raw clock speed, the new instructions and 16-bit registers give you an even bigger boost: almost another x2! Here's a little example....

The scrolling in XeO3 takes a long time. Every game cycle I do this:


ldx #39
ScrollLoop
lda BackBuffer,x
sta HWScreen2+$400+(40*00),x
lda BackBuffer,x
sta HWScreen2+$400+(40*01),x
lda BackBuffer,x
sta HWScreen2+$400+(40*02),x
lda BackBuffer,x
sta HWScreen2+$400+(40*03),x
lda BackBuffer,x
sta HWScreen2+$400+(40*04),x
lda BackBuffer,x
sta HWScreen2+$400+(40*05),x
lda BackBuffer,x
sta HWScreen2+$400+(40*06),x
lda BackBuffer,x
sta HWScreen2+$400+(40*07),x
lda BackBuffer,x
sta HWScreen2+$400+(40*08),x
lda BackBuffer,x
sta HWScreen2+$400+(40*09),x
lda BackBuffer,x
sta HWScreen2+$400+(40*10),x
lda BackBuffer,x
sta HWScreen2+$400+(40*11),x
lda BackBuffer,x
sta HWScreen2+$400+(40*12),x
lda BackBuffer,x
sta HWScreen2+$400+(40*13),x
lda BackBuffer,x
sta HWScreen2+$400+(40*14),x
lda BackBuffer,x
sta HWScreen2+$400+(40*15),x
lda BackBuffer,x
sta HWScreen2+$400+(40*16),x
lda BackBuffer,x
sta HWScreen2+$400+(40*17),x
lda BackBuffer,x
sta HWScreen2+$400+(40*18),x
lda BackBuffer,x
sta HWScreen2+$400+(40*19),x
lda BackBuffer,x
sta HWScreen2+$400+(40*20),x
dex
jpl ScrollLoop
rts


This code is self-modified to address the new location of the back buffer, and I have to use a jpl (a macro) since a normal branch is just out of reach. So this takes (40*21*9)+(40*7) = 7840 cycles. (This is approximate, as there are also page boundary crossings hidden in here.)
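The arithmetic above can be checked with a small Python sketch. The cycle costs are the standard 6502 figures for absolute,X addressing (4 for LDA, 5 for STA), the 7-cycle loop overhead is the figure used in the post, and the function name is only illustrative.

```python
# Rough cycle model of the unrolled 6502 scroll loop (page crossings ignored).
LDA_ABS_X = 4      # lda addr,x
STA_ABS_X = 5      # sta addr,x
LOOP_OVERHEAD = 7  # dex + the jpl macro, as counted in the post

COLUMNS = 40       # loop iterations (one per screen column)
ROWS = 21          # unrolled lda/sta pairs per iteration


def scroll_cycles_6502():
    """Return the approximate cycles for one full scroll pass."""
    copy = COLUMNS * ROWS * (LDA_ABS_X + STA_ABS_X)
    overhead = COLUMNS * LOOP_OVERHEAD
    return copy + overhead


print(scroll_cycles_6502())
```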

Now on the 65816 I can do exactly the same, but being 16-bit, the loop runs half as many times, and although each LDA/STA costs an extra cycle, it's still much quicker. So the loop is now (20*21*11)+(40*7) = 4900 cycles.
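As a sketch of the same estimate in Python: with a 16-bit accumulator, LDA and STA absolute,X each cost one cycle more (5 and 6), and each iteration moves two columns. The overhead term is kept as 40*7 exactly as in the formula above; the function name and default arguments are illustrative assumptions.

```python
# Rough cycle model of the 16-bit 65816 version of the scroll loop.
def scroll_cycles_65816(columns=40, rows=21):
    """Approximate cycles when each lda/sta pair moves two bytes."""
    iterations = columns // 2         # 16-bit accesses halve the loop count
    copy = iterations * rows * (5 + 6)  # 16-bit lda abs,x + sta abs,x
    overhead = columns * 7            # kept as in the post's formula
    return copy + overhead


cycles = scroll_cycles_65816()
print(cycles)          # 16-bit estimate
print(7840 / cycles)   # speed-up over the 6502 estimate, clock speed aside
```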

And lastly, the 65816 has block transfer instructions, MVN and MVP, which are like the Z80's LDIR instruction, which means (BEST case) it's now (20*21*7) = 2940 cycles. Now, although the block transfer would be broken up a little more (to do lines, mainly), it's still only going to be around 3000. So not only is it more than twice the speed of the 6502 version, but we have the new 20MHz clock as well.
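A quick sketch of the block-move estimate. The best case above charges 7 cycles per 16-bit word; the WDC 65C816 datasheet lists MVN/MVP at 7 cycles per byte moved, so the per-byte figure is printed alongside for comparison. Treat both as rough numbers; the names here are illustrative only.

```python
# Block-move cost under two assumptions about what 7 cycles buys.
BYTES = 40 * 21  # 840 bytes copied per scroll pass


def block_move_cycles(units, cycles_per_unit=7):
    """Cycles for a block move charged at a flat rate per unit moved."""
    return units * cycles_per_unit


per_word = block_move_cycles(BYTES // 2)  # the post's best case
per_byte = block_move_cycles(BYTES)       # datasheet rate of 7 cycles/byte

print(per_word)
print(per_byte)
```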

..............Bitmap blitting suddenly becomes REALLY interesting!!

6 comments:

TNT said...

Remember that accessing C64 memory is slow, so you have to copy the full bitmap from SuperCPU RAM to the C64 every frame. That still leaves you plenty of time for logic, but not as much as you think.

Because you need at least 20 cycles between each C64 memory write you can do bitmap scroll while copying for free, I think. Reminds me of 68040/68060 chunky-to-planar routines on Amiga which did the conversion between chip-mem writes, achieving same speed as plain fast > chip copy.

Mike said...

I thought it was only the hardware CHIPS, and not memory that was slow. If it was, then my colour bars wouldn't shrink nearly as much as they do currently.

I have a full screen character map copy routine that takes the whole border to copy, and if I flick the switch it drops down to a few scanlines. Since the copy routine I've shown is 99.9% memory related, there would only be a few places that could get sped up, so it wouldn't have gotten as quick as it did - I think.......?

TNT said...

Ok, then I've understood this page wrong (or the info on it is wrong).

"- ... SuperCPU uses the C64/128 as essentially an I/O adaptor and video RAM.
- The SuperCPU runs at 20 MHz unless software instructs it to communicate with video RAM or I/O devices...
- The SuperCPU employs a one-byte write cache, allowing it to disconnect from the C64 entirely. This allows the programmer to perform a write operation and then immediately go do something else, without waiting, while the cache handles the actual write operation to I/O or video RAM. This applies so long as the programmer spaces his/her writes out so that they occur at least 20 cycles apart."

Someone send me a freebie so I can test it myself ;)

OTOH charmap copy is only 1000 bytes, ~16 raster lines. Bitmap copy is 8000 bytes, + additional 1000 if you want to copy colors too.
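The raster-line estimate can be sketched as follows, under two assumptions that are not stated in the thread: a PAL C64 with 63 cycles per raster line, and one byte reaching C64 memory per 1 MHz cycle (20 SuperCPU cycles at 20 MHz is 1 microsecond). Bad lines and sprite DMA are ignored.

```python
# Rough raster-line cost of copying into C64 memory at one byte per C64 cycle.
CYCLES_PER_RASTER_LINE = 63  # assumption: PAL C64


def raster_lines(byte_count):
    """Raster lines needed to write byte_count bytes, one per C64 cycle."""
    return byte_count / CYCLES_PER_RASTER_LINE


for label, size in (("charmap", 1000), ("bitmap", 8000), ("bitmap + colors", 9000)):
    print(label, round(raster_lines(size), 1))
```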

Mike said...

"This applies so long as the programmer spaces his/her writes out so that they occur at least 20 cycles apart"

There must be something else going on because...

lda $8000,x
sta $0400,x
lda $8028,x
sta $0428,x
lda $8050,x
sta $0450,x

etc. obviously isn't 20 cycles apart. It might be mirroring the lower 64K, which may mean reads are from fast memory and writes go to both. I still wouldn't have thought it would have gone as fast as it was going though - but it may mean that I get 7 cycles free, and all I'm paying for is the actual WRITE of the byte. This means you would get a pretty big boost regardless of what memory you're working in - just not as much as with the pure 20MHz RAM...

TNT said...

Yes, you are only waiting for the STA part. That's why I said you can get bitmap scroll free: "lda (src),y; rol; sta (src),y; sta (dest),y" is 19/20 cycles which means it's no slower than plain copy. With plenty of zeropages it's no problem to use 80 bytes for pointers :)

(src) is in SCPU memory, of course.
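The 19/20 figure can be reproduced from the standard 8-bit cycle counts for each instruction in that sequence; this sketch assumes 8-bit accumulator mode and a direct page aligned to a page boundary, and the table and function names are illustrative.

```python
# Standard 6502 / 8-bit 65816 cycle counts for the scroll-while-copying sequence.
SEQUENCE = [
    ("lda (src),y", 5),   # +1 if the indexed address crosses a page
    ("rol", 2),
    ("sta (src),y", 6),
    ("sta (dest),y", 6),
]


def sequence_cycles(page_cross=False):
    """Total cycles for one byte, with an optional page-cross penalty on the load."""
    total = sum(cycles for _, cycles in SEQUENCE)
    return total + (1 if page_cross else 0)


print(sequence_cycles(), sequence_cycles(page_cross=True))
```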

Mike said...

Actually, I think it's more than that; I think you're only waiting on the 1 cycle of the store command - the actual write cycle of STA.