This week at work, there was a discussion about games that had great vertigo inducing moments. The conversation was dominated by AAA titles from the last couple of years, such as Assassin’s Creed, Fallout 3, and especially Uncharted 2. For me though, none of them match some of the levels from Tomb Raider. Not the any of the recent games (although I do love Crystal Dynamics’ take on the series), but the original, from the now sadly defunct Core Design. You know, back when Lara steered like a tank and advertised Lucozade.

My favourite moment from the game occurs around a third of the way into St Francis’ Folly, where you run up some stairs, turn a corner and are confronted by this:

Holy shit! You can’t see the floor!

I very nearly soiled myself as a young lad when I saw that; it made such an impression on me that I still to this day dig out my copy of Tomb Raider every couple of years and play it through.

Thoughts of St. Francis’ Folly prompted me to try knocking together a simple Tomb Raider level viewer of my own. After all, the assets couldn’t be that complicated, so how hard could it be? Not hard at all as it turns out, thanks to some excellent documentation of the file format put together years ago by dedicated fans.

Parsing the first part of the level pack, extracting a mesh and getting it up on screen took surprisingly little time. Here it is. It certainly looks like something. Perhaps it’s a rock?

Skipping through a few more meshes that didn’t really look like anything much, I finally found something recognisable, an upside-down pistol.

It was at this point I realised that Tomb Raider uses a slightly unusual coordinate system; X and Z form the horizontal plane, but positive Y points down. After a bit of Y flipping and winding order reversal, things looked a little better. Here’s some shotgun ammo.

It’s interesting to see how differently meshes were put together 15 years ago; they’re made up of a mixture of textured and untextured quads and triangles, with flat shaded quads being used wherever possible. I did similar things back when I used to play with my Net Yaroze in my spare time (the embarrassing fruits of my labours have been thoughtfully uploaded by someone for posterity).

In order to make more sense out the meshes, I moved onto texture extraction. In the first Tomb Raider, one 256-entry colour palette is used for all textures in a particular level. Each level uses around ten 256 by 256 texture atlases. Here’s texture atlas #7 from level one. I’d like to draw your attention to the bottom right corner, where if you look closely, you’ll find a couple of pixelated nipples.

Anyway, after hooking up the textures and taking another look at the meshes, it turned out that the first mesh was in fact Lara’s bum.

The next few meshes in the level pack are all the bits and pieces that make up Lara’s body. Skinning still wasn’t commonplace in those days, so each body part is a separate mesh. Incidentally, Lara’s forehead is much bigger than I remember.


Once textured meshes were rendering correctly, it was time to move onto the world geometry. Each level in Tomb Raider is made up of a number of rooms, connected via portals. Rooms are made up of square sectors, 1024 world units in size (since everything was fixed point back then). Each sector stores only one floor height and one ceiling height, which means that if a level designer wanted to put overhangs into a level, they had to stack multiple rooms on top of each other. Given this limitation, the complexity of some of the levels in the game they managed to put together is amazing.

My first attempts at rendering a whole world were somewhat less that successful.

World geometry vertex positions are stored as 16bit XYZ triples, which are defined relative to a per-room origin. According to the documentation, the room positions are defined in world space, but even taking that into account I couldn’t get the rooms to fit together properly. Instead, since rooms are all connected via portals, it was easier to traverse the portal graph and stitch rooms together based on the vertex positions of the connecting portals. For the most part, this worked very well.

Update: I was two bytes off when reading the room position, the rooms line up just fine now.

After fixing up the baked vertex lighting and adding a little depth-based fog, I had something resembling a Tomb Raider level.

Here’s The Lost Valley. Any minute now, a T-Rex is going to come stomping round the corner.

And here’s the first stage of Natla’s Mines. It’s huge!

There’s still a whole bunch of features that I could add in: sprites, static meshes, dynamic meshes, animated textures to name just a few, but I’m pretty pleased with how far it’s come with just a couple of evening’s work. So huge thanks go to those folks who figured all this out over a decade ago.

I’ve had an idea floating around in my head for several months now, but evening classes, a hard drive failure, then a GPU failure prevented me from doing much about it until this weekend. First, a couple of screenshots of what I’ll be talking about.

Cornell box, I choose you!

The above screenshots are of real-time single-bounce GI in a static scene with fully dynamic lighting. There are 7182 patches, and the lighting calculations take 36ms per frame on one thread of an Intel Core 2 @2.4 GHz. The code is not optimized.

The basic idea is simple and is split into two phases.

A one-time scene compilation phase:

  • Split all surfaces in the scene into roughly equal sized patches.
  • For each patch i, build a list of all other patches visible from i along with the form factor between patches i and j:
    F_{ij} = \frac{cos \phi_i \, cos \phi_j}{\pi \, |r|^2} \, A_j
    Fij is the form factor between patches i and j
    Aj is the area of patch j
    r is the vector from i to j
    Φi is the angle between r and the normal of patch i
    Φj is the angle between r and the normal of patch j

And a per-frame lighting phase:

  • For each patch i, calculate the direct illumination.
  • For each patch i, calculate single-bounce indirect illumination from all visible patches:
  • I_i = \sum_j D_j \, F_{ij}
    Ij is the single-bounce indirect illumination for patch i
    Dj is the direct illumination for patch j

So far, so radiosity. If I understand Michal Iwanicki’s GDC presentation correctly, this is similar to the lighting tech on Milo and Kate, only they additionally project the bounce lighting into SH.

The problem with this approach is that the running time is O(N2) with the number of patches. We could work around this by making the patches quite large, running on the GPU, or both. Alternatively, we can bring the running time down to O(N.log(N)) by borrowing from Michael Bunnell’s work on dynamic AO and cluster patches into a hierarchy. I chose to perform bottom-up patch clustering similarly to the method that Miles Macklin describes in his (Almost) realtime GI blog post.

Scene compilation is now:

  • Split all surfaces in the scene into roughly equal sized patches.
  • Build a hierarchy of patches using k-means clustering.
  • For each patch i, build a list of all other patches visible from i along with the form factor between patches i and j. If a visible patch j is too far from patch i look further up the hierarchy.

And the lighting phase:

  • For each leaf patch i in the hierarchy, calculate the direct illumination.
  • Propagate the direct lighting up the hierarchy.
  • For each patch i, calculate single-bounce indirect illumination from all visible patches clusters.

Although this technique is really simple, it supports a feature set similar to that of Enlighten:

  • Global illumination reacts to changes in surface albedo.
  • Static area light sources that cast soft shadows.

That’s basically about it. There are a few of other areas I’m tempted to look into once I’ve cleaned the code up a bit:

  • Calculate directly and indirect illumination at different frequencies. This would allow scaling to much larger scenes.
  • Perform the last two lighting steps multiple times to approximate more light bounces.
  • Project the indirect illumination into SH, HL2 or the Half-Life basis.
  • Light probes for dynamic objects.

You can grab the source code from here. Expect a mess, since it’s a C++ port of a C# proof of concept with liberal use of vector and hash_map. Scene construction is particularly bad and may take a minute to run. You can fly around using WASD and left-click dragging with the mouse.

4-bit random numbers

January 20, 2011

While I accept that a toy 4-bit microprocessor probably isn’t a subject that warrant multiple posts, I’m unable to resist. Apologies.

After playing with my GMC-4 for a couple of days, the initial novelty of hand assembling programs had well and truly worn off, so I turned to the assembler and simulator available on the web. After a couple of minutes of use though, it was obvious that a simulator wasn’t at all what I wanted.

After all, if I wanted to type in a program nibble by nibble, I may as well do it on the hardware itself. What I really wanted was an integrated assembler and debugger. Luckily, the instruction set is very basic, so it didn’t take very long to put one together:

The left pane is the source window. The three columns on the right show the contents of memory; program memory is shown in Wheat & Cornsilk, data memory in DarkSeaGreen & PaleGreen, the current instruction is highlighted in DarkRed. The colouring is intended to make it easier to follow through the code when typing in the machine code on the hardware.

The jumble of text below the memory view shows the state of the eight registers, the status flag (see my previous post), program counter, seven-segment display and binary LEDs. There are buttons for copying the contents of memory to the clipboard, running the program, single-stepping and finally simultaneously compiling the source and resetting the state of the machine. The bottom panel shows any exceptions that might get thrown by the assembler or emulator (I didn’t bother to spend much time on error handling).

The reason op-codes are still shown for data memory is that (aside from addressable ranges) the GMC-4 makes no distinction between code and data memory. The neat thing about this is that the hardware has no problems executing code that lives in data memory; it also doesn’t mind if those instructions get modified during program execution, which means… self-modifying code! I’ve not found a way to make reasonable use of this, but it’s kind of cool nevertheless.

In the unlikely event that anyone’s interested, I’ve shoved the code up on Google. It’s written in Good Ol’ WinForms, as my fleeting love affair with WPF ended once I decided that I’d rather be productive than fashionable.

For one of the “games” I wrote, I needed a random number generator. Unfortunately, with 4-bit addition and subtraction my only available mathematical operations, it wasn’t immediately obvious how to go about it.

A bit of searching around led inevitably to Wikipedia and multiply-with-carry random number generators. I went with lag-1 MWC for simplicity, which is defined as:

x_n = (ax_{n-1}+c_{n-1})\,mod\,b
c_n = \lfloor\frac{ax_{n-1}+c_{n-1}}{b}\rfloor

Being limited to 4-bit numbers, a natural choice for b was 16 and a quick exhaustive search for a showed that a value of 15 yielded the sequence with the longest period (around 120) and distribution of numbers that wasn’t wholly intolerable (it at least covered all the digits). As an added bonus, using these values for a and b meant that the multiplications and divisions could be done away with completely:

x_n = (15x_{n-1}+c_{n-1})\,mod\,16
x_n = 1 + \tilde{x}_{n-1} + c_{n-1}

c_n = \lfloor\frac{15x_{n-1}+c_{n-1}}{16}\rfloor
c_n = \begin{cases} x_{n-1}-1, & c_{n-1} < x_{n-1} \\ x_{n-1}, & otherwise \end{cases}

That crappy little tilde above the x is meant to represent a bitwise not – my LaTeX is pretty weak I’m afraid.

4-bit random number generation is not exactly a widely discussed topic on the internet, so I don’t know if there’s a better method. This worked well enough for my purposes though. Quality wise, it’s probably on par with RANDU.

“Made it! And with one and a half bytes left to spare!”

I was down in San Francisco’s Japantown a few days ago, browsing the magazine section of the Kinokuniya bookstore, when I stumbled across something totally awesome – a magazine series called Otona no Kagaku (lit. Adult’s Science). Each edition comes packaged with a build-it-yourself kit for some kind of science experiment and the magazine itself contains the assembly instructions, ideas for experiments and other background information.

The subjects covered in the series are diverse and include a steam engine, movie projector, theremin and even a bird organ (no idea). The one that caught my eye, however, was a 4-bit microcomputer kit. The kit itself was very simple to assemble and just involved screwing together a few prefabbed parts and putting in batteries. Once assembled, I held in my hands a working GMC-4 microcomputer. It’s a beast of a machine, with a staggering eighty nibbles of program memory, sixteen nibbles of data memory and eight 4-bit registers (although only two of them are available for use at any one time).


Those primitive scratchings beneath are my first working program.

Once built, the next step was to make it actually do something. I was feeling particularly masochistic, so I decided to figure it out the hard way – without the internet. Armed with a Nintendo DS and a copy of Kanji Sonomama Rakubiki Jiten I spent the next couple of hours translating the operating guide and the instruction set. Once I had a rough idea of how the thing worked and had managed to get a couple of the sample programs running, it was time to write a program for myself.

Since the GMC-4’s built-in arithmetic is limited to 4-bit addition and subtraction, I figured an achievable enough goal for an afternoon’s work would be a 16-bit adder. Five hours later, all I had was a program that thought that 1 + 1 == F. It was slow going – writing the program out on paper and then translating the mnemonics into machine code by hand. Still, I found the process perversely satisfying once everything was finally working and it reminded me of my college days when we had to do the same thing for a 6502. It took another four hours to fix all the bugs and then fit the code into memory. In the end, it exactly filled the available program memory and used 13 of the available 16 nibbles of data memory. Three nibbles to spare!

Here’s a video of the adder in action, calculating the following sums:

0x3978 + 0x2BD6 = 0x0654E
0xA5E3 + 0xD687 = 0x17C6A

Full source code is below. From left to right, the columns are as follows: code address as displayed by the binary LEDs on the system, code address in hexadecimal, operation mnemonic, opcode value, comments.

-------  00   TIY  A  ; INPUT PHASE: Init data pointer to most significant digit of first number - digits
------*  01   <7>  7  ; are read into addresses 7,6,5,4 for the first number and 3,2,1,0 for the second.
-----*-  02    KA  0  ; Wait for user input.
-----**  03  JUMP  F
----*--  04   <0>  0
----*-*  05   <2>  2
----**-  06    AM  4  ; Store and display digit.
----***  07    AO  1
---*---  08   CAL  E  ; BEEP! this both provides feedback and creates a short delay,
---*--*  09  SHTS  9  ; which prevents the press being registered multiple times.
---*-*-  0A   AIY  B  ; Decrement data pointer
---*-**  0B   <F>  F
---**--  0C  JUMP  F  ; If the pointer is still >= 0, loop again.
---**-*  0D   <0>  0
---***-  0E   <2>  2
---****  0F   TIY  A  ; Store the first carry value (which has value 0) in the location of the first output digit (address 8).
--*----  10   <8>  8
--*---*  11   TIA  8
--*--*-  12   <0>  0
--*--**  13    AM  4
--*-*--  14   TIY  A  ; Reset data pointer to point to the least significant digit of the second number.
--*-*-*  15   <0>  0
--*-**-  16    MA  5  ; MAIN LOOP: Load a digit from the first number.
--*-***  17   AIY  B
--**---  18   <4>  4
--**--*  19    M+  6  ; Add the corresponding digit of the second number.
--**-*-  1A  JUMP  F  ; Check for overflow.
--**-**  1B   <2>  2
--***--  1C   <C>  C
--***-*  1D   AIY  B  ; The addition caused no overflow, add the carry value from the previous step.
--****-  1E   <4>  4
--*****  1F    M+  6
-*-----  20  JUMP  F  ; Check for overflow again.
-*----*  21   <2>  2
-*---*-  22   <F>  F
-*---**  23    AM  4  ; Still no overflow, store a carry value of 0 in the address of the next output digit.
-*--*--  24   AIY  B
-*--*-*  25   <1>  1
-*--**-  26   TIA  8
-*--***  27   <0>  0
-*-*---  28    AM  4
-*-*--*  29  JUMP  F  ; Skip to end of the loop.
-*-*-*-  2A   <3>  3
-*-*-**  2B   <5>  5
-*-**--  2C   AIY  B  ; Overflow caused by the initial digit addition - add the carry value from the previous step.
-*-**-*  2D   <4>  4
-*-***-  2E    M+  6
-*-****  2F    AM  4  ; We can get here from either of the two possible overflow conditions,
-**----  30   AIY  B  ; store a carry value of 1 in the address of the next output digit.
-**---*  31   <1>  1
-**--*-  32   TIA  8
-**--**  33   <1>  1
-**-*--  34    AM  4
-**-*-*  35   AIY  B  ; Move on to next digit.
-**-**-  36   <8>  8
-**-***  37   CIY  D  ; Check if we've reached the last digit.
-***---  38   <4>  4
-***--*  39  JUMP  F  ; If not, run the loop again.
-***-*-  3A   <1>  1
-***-**  3B   <6>  6
-****--  3C   CAL  E  ; DISPLAY PHASE: Clear the display.
-****-*  3D  RSTO  0
-*****-  3E   TIY  A  ; Set the data pointer to point past the most significant digit of the output.
-******  3F   <D>  D
*------  40   AIY  B  ; Decrement the data pointer.
*-----*  41   <F>  F
*----*-  42   TIA  8  ; Pause for a short while.
*----**  43   <6>  6
*---*--  44   CAL  E
*---*-*  45  TIMR  C
*---**-  46    MA  5  ; Load and display the value of output digit.
*---***  47    AO  1
*--*---  48   CIY  D  ; Check if we've stepped past the least significant digit of the output.
*--*--*  49   <7>  7
*--*-*-  4A  JUMP  F  ; If so, jump back and clear the display.
*--*-**  4B   <4>  4
*--**--  4C   <0>  0
*--**-*  4D  JUMP  F  ; If not, jump back and move on to the next digit.
*--***-  4E   <3>  3
*--****  4F   <C>  C

Curtis Hoffmann has written a comprehensive description of the GMC-4 and his page also has links to a GMC-4 simulator and assembler.

One feature of the CPU that caused me trouble is that it has only one status flag, the value of which is modified by every instruction. The instruction for reading the keypad sets this flag to 0 if a key is pressed, and 1 if not; the compare instructions set it to 1 if a register is not equal to some constant, and 0 otherwise; the arithmetic instructions set the flag to 1 on overflow and 0 otherwise; all other instructions set the flag to 1. There is only one direct branch instruction and branches are only taken if the status flag is 1 at the time.

The upshot of this is that tests must be acted upon immediately, otherwise their results will be discarded as soon as the next instruction executes. Couple this with a limited instruction set and a scarcity of registers and I ended up having to duplicate many sequences of instructions. Not what you want to be doing with only 40 bytes of memory.

As an example, here’s pseudo code for adding two values stored in addresses 0 and 1, writing the 4-bit result to address 2 and the carry bit to address 3. A and Y are registers, [Y] denotes a reference to the data at address Y:

        Y  =  0
        A  = [Y]
        Y  =  1
        A += [Y]
        goto overflow     ; only taken if addition overflowed
        Y  =  2
       [Y] =  A
        A  =  0
        goto store_carry  ; alway taken
        Y  = 2
       [Y] = A
        A  = 1
        Y  = 3
       [Y] = A

It’s possible to remove this duplication by using some of the remaining six registers, but without any direct way to load data into any register other than A, it’s more effort (and code) that it’s worth.

Here’s another video showing the input process for a much shorter program. You can see how the binary LEDs update to show the current program address as the opcodes are entered. There’s a light show at the end as a payoff, so stick with it! (Or just skip to 1:05)

And here’s the source code:

-------  00   TIA  8  ; Register A stores the delay between each update.
------*  01   <0>  0  ; Register Y stores the current LED position.
-----*-  02   TIY  A  ; Start scrolling left.
-----**  03   <0>  0
----*--  04   CAL  E
----*-*  05  TIMR  C
----**-  06   CAL  E
----***  07  RSTR  2
---*---  08   AIY  B
---*--*  09   <3>  3
---*-*-  0A    AM  4  ; Redundant operation whose purpose is to make sure the status flag is set to 1
---*-**  0B   CAL  E  ; otherwise, this call won't get executed.
---**--  0C  SETR  1
---**-*  0D   AIY  B
---***-  0E   <E>  E
---****  0F   CIY  D
--*----  10   <4>  4
--*---*  11  JUMP  F  ; Continue scrolling left.
--*--*-  12   <0>  0
--*--**  13   <4>  4
--*-*--  14   TIY  A  ; Start scrolling right.
--*-*-*  15   <6>  6
--*-**-  16   CAL  E
--*-***  17  TIMR  C
--**---  18   CAL  E
--**--*  19  RSTR  2
--**-*-  1A   AIY  B
--**-**  1B   <D>  D
--***--  1C    AM  4  ; Redundant operation whose purpose is to make sure the status flag is set to 1
--***-*  1D   CAL  E  ; otherwise, this call call won't get executed.
--****-  1E  SETR  1
--*****  1F   AIY  B
-*-----  20   <2>  2
-*----*  21   CIY  D
-*---*-  22   <2>  2
-*---**  23  JUMP  F  ; Continue scrolling right.
-*--*--  24   <1>  1
-*--*-*  25   <6>  6
-*--**-  26  JUMP  F  ; Start scrolling left again.
-*--***  27   <0>  0
-*-*---  28   <2>  2

That’s about all I’ve done with the GMC-4 so far. It’s not much, but it’s been a lot of fun programming a bit closer to the metal for a change.