Wednesday, January 01, 2014

Improving the runner

So, I thought it might be fun to look back at some of the changes we've made to the runner, why, and what impact it's had on performance. Back when we started, the main complaint with Game Maker was that it was too slow, and that for more complex games, it just didn't cut it. This is clearly not the case any more, so what's really changed? Well, this only really effects the c++ runner, but lets take a look see what we did...

WAD

When we started on the PSP port, the 1st thing we did was to remove all the compression from the output. The game was in fact doubly compressed (which is nuts BTW - it's not like it's going to get magically smaller again!), and on the PSP this lead to loading times of around 40 seconds to a minute is even very simple games. So, we changed the whole output, altering it into an IFF based WAD file, and this meant it could be loaded incredibly quickly, and was ready to go as soon as it loaded. This was a huge change code wise, and touched every element of Game Maker. However, this change alone meant games would load either instantly, on in a few seconds.

Virtual Machine

Game Maker was originally split into 2 parts, scripting, and D&D (Drag and Drop), and believe it or not, there were to totally separate code paths for each. On top of this, everything was compiled on load, meaning there was another noticeable pause when games loaded. So, first thing that was done, was we changed the script engine to use a virtual machine, this meant we could optimise small paths of code rather than huge chunks - twice. This gave a modest boost - not as much as we hoped, but this was due to the way GameMaker dealt with it's variables, and how things were cast, and we were stuck with that. The next thing we did was to remove the D&D path, and then compile all D&D actions into script so they could then be compiled in the same manner. Lastly, we pre-compiled everything. This removed the script from the final game - and finally removing the worry of simple decompilation, but it also sped the loading of a game up yet again. It was not at the point, where a windows game would load and run almost instantly.


Rendering

There's quite a number of changes here, so I'll break it down a little further. But basically, the old Delphi runner had some major CPU->GPU stalls, and so we had to look into fixing them.

Hardware T&L

This was introduced on GM8.1, and it's was the first real boost for performance. All later versions of Studio also use this "Hardware Transform and Lighting" mode. By using this, the GPU does all the transformations, and leaves the CPU free for running the game. Up until we switched it, after you submitted a sprite, DirectX then used the CPU to transform all the vertices into screen space before passing them onto the GPU - this places severe limits on how much you can push though the graphics pipeline, so it was changed on the first version of GM8.1 we released.

DirectX 9

The first thing we did was to upgrade to DX9, this meant we were on a version which Microsoft still maintained properly, and gave us access to more interesting tools down the line - like better shaders. While doing this, we also introduced a proper fullscreen mode, and while this may not seem like much, when in fullscreen mode, things do render slightly quicker. Also...DX9 is just a little faster than DX8, and the upgrade is a pretty simple process (takes time, but its simple enough).


Texture pages

One of the biggest problems with Game Maker, was that every sprite, every background was given it's own texture page, and while this might seem like a good idea as it means you can easily load things in, and throw them away, it utterly destroys any real performance. So, we set about putting all images on texture pages (TPages), but this was more complex as Game Maker allows non-power of 2 images.

Standard methods involved sub-dividing images down, splitting them in 2 each time until you allocate the sprite, and fill up the whole sheet. This is very fast, and very efficient. However... we needed something more. So what we did was create a multi-stream, virtual rectangle system which would clip to everything as you added any sprite of any size. I came up with this years ago, and it was incredibly powerful, and very fast - but it was recursive, and complex. Still, having written it before, I did it again and we had a very nice texture packer for any size of image.

Next we noticed that current Game Maker developers were incredibly wasteful of images. They would have huge sprites with nothing but space in them, so they could be aligned easily with screen elements. A cool fix for this, was to crop out anything which had a 0 alpha value. You basically search the sprite for a bounding box that isn't 0 alpha, and then crop it, and remember the all the offsets to enable you to draw it again as normal. This is a cool trick, that helps us pack even more onto a page,which then helps with batching as well.


Batching

Speaking of batching.... Since the old version changed texture all the time, it could never hope to batch more than a couple of triangles at once, and this meant the CPU was forever getting in the way. Modern graphics hardware wants you to send as much as you can in one shot, then to go away and leave it alone!

So, the first thing we did was go through and change all the primitive types. The old way (TriFan) is unbatchable, and so you would always be stuck with 2 triangles per batch, and that's useless. So we changed them to the most batchable prim we could use, TriList. I wrote a new vertex manager that you could allocate vertices from, and it detected if there was any changes, or breaking in batches, so all the engine had to do, was draw, and the underlying API would auto-batch it for us.

Couple this with the new texture pages, and if you were just drawing sprites, you had a good chance of actually drawing multiple sprites in one render call. In fact, many games (like They Need to be Fed) can render virtually the whole game in one render call depending on the size of the texture page.

There are still some things which can break the batching (obviously), but with careful set up, rendering sprites is no longer the limiting factor of your game.


Blend States

Blend states are always tricky, and in the past I've tended to using a layering system - something that might appear in later versions of Studio or more likely GM: Next, as it helps manage state changes much better. However, we got a shock early in the year when we discovered how GM users actually manage state, and so we worked hard to hide the pain of management from our users.

Now a user can simply set state and reset state as they see fit, and aside from the GML call overhead, we handle all the blend state batching behind the scenes. For example, if you set a blend state, draw something, reset it, then set it again, draw something and then reset it, we recognise this, and both items will be drawn together in a single batch, making the rendering code much more efficient without the developer ever having to care.

It'll also handle more extreme cases, where you set blends several times, then set a different one, and then set more blends several times. The engine will also recognise this, and will only submit 3 batches - or however many blend state batches are actually needed. This is incredibly powerful, and a major playing in rendering performance.


Static Models

In the past, whenever you created a D3D model, it didn't actually create anything, it just remembered the prims and commands you sent, then replayed them back dynamically. This meant 3D rendering was terrible, and you would struggle to get above a few thousand polys.

We actually introduce this new concept in GM8.1 although it was more of a pet project for me at the time. Now its a staple of how you should work. Studio will now let you build fully static models, meaning that when your finished creating them, you'll be left with a single vertex buffer (or multiple depending on lines and points), that will simply be submitted to the hardware when you want to draw something.

This is basically as fast as it's possible to draw on modern hardware, you create a buffer full of vertices and submit them but simply pointing to the buffer. This is what all modern games do, its what engines like Unity does. With this addition, you can quite literally draw models with hundreds of thousands of polys at hundreds of frames per second (depending on your graphics card of course!)


Shaders

Although a recent addition, this is a powerful performance tool. Up until this point, special effects would be done on surfaces, then processed by the CPU, most of the time, painfully so! But with all the previous optimisations in place, shaders were finally able to get their place and show you what the GPU was really able to do.

From effects such as shadows, to environment mapping, to more 2D effects such as outlines, colour remapping or even full shader tilemaps all meant that the CPU got more time to work on the game, not spend time trying to make things look pretty, especially when there was a much better way of doing it sitting there doing nothing.


string_execute()

Yeah... this old one. Well, this was a popular command, mainly because it let you be really lazy in the way you coded, and yeah.... it's not a terrible thing, but it was one of the worse performance hogs, and folk using Game Maker had come to rely on it, meaning it was pulling performance down for everyone. So why was it so bad? Well, unlike code when you executed it it had to compile the string every time before running it, and this was incredibly bad. Lets have a quick look at what's involved when compiling.

Lexical Analysis. First it had to scan the string, break it down into tokens allocate memory for it and store a parse tree. Even a very simple string like "room_goto(MenuScreen)" means we had to parse the string, looking at each character, and breaking it into several tokens. room_goto command token, find and verify the syntax of the command, and evaluate the contents of the brackets - in this case; MenuScreen, which is also stored as a token. These are then stored as a parse tree, so that equations and evaluations are calculated correctly. So if you gave a calculation like "A = instance3.x+((instance.B*12)/instance2.C)", it would be evaluated correctly when executed.

Compilation. This then transforms the parse tree into byte code. Previously, it was transfered into a stream of tokens, and the engine just run these, but with the virtual machine, we now compile into byte code. This memory also has to be allocated.

Running. Now the new script can be run. First it sets up a code environment (again, more code allocation), complete with the current instance so it can store things correctly, then the VM takes over and runs the code.

Freeing. This is the real crime. Once all this work is done, and the script has been run.... it's all thrown away. All the different memory blocks are freed, and when you try and run it again, it has to do this all again.

Knowing this... it's no wonder this command was such a hog. And file_execute was even worse, as it had to load the thing first, before doing any of this - and disk access is ALWAYS slow.  So this is why this command was removed. You should never need it if you code properly anyway. It can be great for tools, but when everyone just sees it as a command they can use....well, they just blame Game Maker for not being quick enough so it had to go.



Conclusion

Now, there are obviously countless other optimisations in the runner, from ds_grid get/set being made much quicker, to the total rewrite of the ds_map, allowing for huge, lightning fast dictionary lookups, or from the way variables are now handled so that the VM can access them much quicker, or how YYC now translates everything into C++ so it can be natively compiled and thereby lifting the what little lid there was left.

And don't forget, we're a multi-platform engine, and each platform gets the same kind of treatment as the windows one does. Each platform has it's own quirks, it's own set of performance enhancing toolset, like for example the total rewrite of the HTML5 WebGL engine, that does things in a totally different way just so it can push things a little better for that platform, or the windows RT platform, that has a custom DX11 engine design and optimised just for RT devices. Each platform is taken on its own merits, and enhanced to best fit that device.

But no matter how you look at it, GameMaker: Studio has move well beyond the mere learning tool it was once designed to be, it's no longer only use by beginners, or bedroom coders looking for some fun, it's now used by professional developers around the world because it delivers, fast, cross platform development, with a truly fast engine, that's always being improved.