My favourite moment from the game occurs around a third of the way into St Francis’ Folly, where you run up some stairs, turn a corner and are confronted by this:

Holy shit! *You can’t see the floor!*

I very nearly soiled myself as a young lad when I saw that; it made such an impression on me that I still to this day dig out my copy of Tomb Raider every couple of years and play it through.

Thoughts of St. Francis’ Folly prompted me to try knocking together a simple Tomb Raider level viewer of my own. After all, the assets couldn’t be that complicated, so how hard could it be? Not hard at all as it turns out, thanks to some excellent documentation of the file format put together years ago by dedicated fans.

Parsing the first part of the level pack, extracting a mesh and getting it up on screen took surprisingly little time. Here it is. It certainly looks like *something*. Perhaps it’s a rock?

Skipping through a few more meshes that didn’t really look like anything much, I finally found something recognisable, an upside-down pistol.

It was at this point I realised that Tomb Raider uses a slightly unusual coordinate system; X and Z form the horizontal plane, but positive Y points *down*. After a bit of Y flipping and winding order reversal, things looked a little better. Here’s some shotgun ammo.
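As a minimal sketch of that fix-up (in Python rather than whatever the viewer itself was written in, with hypothetical function names), flipping Y mirrors the mesh, so each triangle's winding has to be reversed to keep it facing the same way:

```python
def to_y_up(vertices, triangles):
    """Flip Y on every vertex and reverse each triangle's winding order."""
    flipped = [(x, -y, z) for (x, y, z) in vertices]
    rewound = [(a, c, b) for (a, b, c) in triangles]
    return flipped, rewound


def signed_normal_y(vertices, tri):
    """Y component of the unnormalised face normal (cross product of two edges)."""
    (ax, ay, az), (bx, by, bz), (cx, cy, cz) = (vertices[i] for i in tri)
    ux, uz = bx - ax, bz - az
    vx, vz = cx - ax, cz - az
    return uz * vx - ux * vz


# A floor triangle whose normal points towards -Y, which is "up" when Y points down.
verts = [(0, 5, 0), (1, 5, 0), (0, 5, 1)]
tris = [(0, 1, 2)]
new_verts, new_tris = to_y_up(verts, tris)
```

After the conversion the same triangle's normal points towards +Y, so it still faces up in the new coordinate system.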

It’s interesting to see how differently meshes were put together 15 years ago; they’re made up of a mixture of textured and untextured quads and triangles, with flat shaded quads being used wherever possible. I did similar things back when I used to play with my Net Yaroze in my spare time (the embarrassing fruits of my labours have been thoughtfully uploaded by someone for posterity).

In order to make more sense of the meshes, I moved on to texture extraction. In the first Tomb Raider, one 256-entry colour palette is used for all textures in a particular level. Each level uses around ten 256 by 256 texture atlases. Here’s texture atlas #7 from level one. I’d like to draw your attention to the bottom right corner, where if you look closely, you’ll find a couple of pixelated nipples.
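Expanding a palettised atlas is little more than a table lookup. The sketch below is illustrative only: `decode_atlas` is a made-up name, and the `component_scale` of 4 assumes the palette stores 6-bit colour components, which is my reading of the fan documentation rather than something stated above.

```python
def decode_atlas(indices, palette, size=256, component_scale=4):
    """Expand an 8-bit indexed atlas into rows of (r, g, b) tuples."""
    if len(indices) != size * size:
        raise ValueError("atlas must contain size * size palette indices")
    if len(palette) != 256 * 3:
        raise ValueError("palette must contain 256 RGB triples")
    rows = []
    for y in range(size):
        row = []
        for x in range(size):
            i = indices[y * size + x] * 3
            # Assumption: components are 0..63 and need scaling up to 0..255.
            row.append(tuple(min(255, c * component_scale) for c in palette[i:i + 3]))
        rows.append(row)
    return rows
```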

Anyway, after hooking up the textures and taking another look at the meshes, it turned out that the first mesh was in fact Lara’s bum.

The next few meshes in the level pack are all the bits and pieces that make up Lara’s body. Skinning still wasn’t commonplace in those days, so each body part is a separate mesh. Incidentally, Lara’s forehead is *much* bigger than I remember.

Once textured meshes were rendering correctly, it was time to move on to the world geometry. Each level in Tomb Raider is made up of a number of rooms, connected via portals. Rooms are made up of square sectors, 1024 world units in size (since everything was fixed point back then). Each sector stores only one floor height and one ceiling height, which means that if a level designer wanted to put overhangs into a level, they had to stack multiple rooms on top of each other. Given this limitation, the complexity of some of the levels they managed to put together is amazing.

My first attempts at rendering a whole world were somewhat less than successful.

World geometry vertex positions are stored as 16bit XYZ triples, which are defined relative to a per-room origin. According to the documentation, the room positions are defined in world space, but even taking that into account I couldn’t get the rooms to fit together properly. Instead, since rooms are all connected via portals, it was easier to traverse the portal graph and stitch rooms together based on the vertex positions of the connecting portals. For the most part, this worked very well.

*Update: I was two bytes off when reading the room position; the rooms line up just fine now.*

After fixing up the baked vertex lighting and adding a little depth-based fog, I had something resembling a Tomb Raider level.

Here’s The Lost Valley. Any minute now, a T-Rex is going to come stomping round the corner.

And here’s the first stage of Natla’s Mines. It’s huge!

There’s still a whole bunch of features that I could add in: sprites, static meshes, dynamic meshes, animated textures to name just a few, but I’m pretty pleased with how far it’s come with just a couple of evenings’ work. So huge thanks go to those folks who figured all this out over a decade ago.


Real valued spherical harmonics can be defined as:

Where…

are the associated Legendre polynomials:

etc…

Assuming points on a unit sphere are defined in Cartesian coordinates as:

Then the first three bands of the SH basis are simply:

Note the change in sign of odd *m* harmonics, which is consistent with the above definitions of *x*, *y*, *z* and *P*. In many sources the basis function constants are all positive, which can be explained by assuming that they’re defined using the Condon-Shortley phase. *That* took me a while to figure out.

Projecting incident radiance *L* into the SH basis is done using the following integral:

This is actually a *spectacularly bad* approximation for low numbers of SH bands. For example, here’s Paul Debevec’s light probe of Grace cathedral:

And here’s that same probe’s projection into three SH bands (negative values have been clamped to zero):

Wow.

Fortunately, while spherical harmonics aren’t generally good at representing *incident radiance*, they totally kick arse at representing *irradiance*. (Very roughly speaking, incident radiance is the amount of light falling on a surface from a *particular* direction, while irradiance is the total sum of light falling on a surface from *all* directions.)

In SH form the conversion from radiance *L* to irradiance *E* is marvelously simple:

The definition of *A* isn’t exactly straightforward, but luckily smart people have already done the hard work for us:

The fact that the terms of *A* beyond band 2 fall off very quickly is what makes it possible to approximate irradiance fairly accurately with only three bands.

Given a set of spherical harmonic irradiance coefficients, the diffuse illumination for a particular direction is calculated by:
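As a rough sketch of the whole pipeline described above (projection, the radiance-to-irradiance conversion and the final lookup), here it is in Python for a single colour channel. The function names are mine, the constants are the usual rounded values for the first three bands with the odd *m* terms negated as noted earlier, and the band factors are the commonly quoted π, 2π/3 and π/4.

```python
import math
import random

# Convolution factors A_l for bands l = 0, 1, 2.
A = [math.pi, 2.0 * math.pi / 3.0, math.pi / 4.0]
BAND = [0, 1, 1, 1, 2, 2, 2, 2, 2]  # band index l of each of the nine coefficients


def sh_basis(x, y, z):
    """Nine real SH basis values for unit direction (x, y, z), odd-m terms negated."""
    return [
        0.282095,
        -0.488603 * y, 0.488603 * z, -0.488603 * x,
        1.092548 * x * y, -1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        -1.092548 * x * z, 0.546274 * (x * x - y * y),
    ]


def project_radiance(radiance, samples=20000, seed=0):
    """Monte Carlo estimate of L_lm using uniformly distributed directions."""
    rng = random.Random(seed)
    coeffs = [0.0] * 9
    for _ in range(samples):
        z = 1.0 - 2.0 * rng.random()
        phi = 2.0 * math.pi * rng.random()
        s = math.sqrt(max(0.0, 1.0 - z * z))
        d = (s * math.cos(phi), s * math.sin(phi), z)
        value = radiance(*d)
        for i, basis in enumerate(sh_basis(*d)):
            coeffs[i] += value * basis
    weight = 4.0 * math.pi / samples  # sphere area divided by the sample count
    return [c * weight for c in coeffs]


def to_irradiance(radiance_coeffs):
    """E_lm = A_l * L_lm."""
    return [A[BAND[i]] * c for i, c in enumerate(radiance_coeffs)]


def diffuse(irradiance_coeffs, normal):
    """Evaluate E(n) as the sum of E_lm * Y_lm(n)."""
    return sum(e * b for e, b in zip(irradiance_coeffs, sh_basis(*normal)))
```

A handy sanity check is a uniformly white environment: its only non-zero radiance coefficient is the first one, √(4π), and the irradiance should come out as π for every normal.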

Condon-Shortley phase aside, something else that confused me was the results from An Efficient Representation for Irradiance Environment Maps. As far as I can tell, the gamma is not correct for some (possibly all) of the images in that paper, which made comparing results from my own code frustrating. The problems are compounded by the fact that the authors chose to apply some undefined tone mapping operator to some images, but not others.

Here’s the Grace cathedral probe again, this time with an exposure of -2.5 stops:

Below is the result I got from performing a diffuse convolution of the probe using Monte Carlo integration and 1024 samples per pixel (I was too impatient to wait for a brute force convolution to finish). The exposure is set to -2.5 stops. It’s very close to the result of applying a diffuse convolution in HDR Shop:

And here’s the result of projecting the light probe into three SH bands and converting the coefficients from radiance to irradiance. Again, the exposure is -2.5 stops. It’s pretty close to the Monte Carlo result, which is reassuring:

Now, let’s compare these results to those from the irradiance environment maps paper. First, their brute force diffuse convolution:

And now their SH approximation:

My guess is that they either applied a gamma curve of 2.2 for some reason, or didn’t correctly account for sRGB colour space when performing the HDR to LDR conversion. Or am I doing it wrong?

Here are the papers that I cribbed from:

Ramamoorthi & Hanrahan’s paper that introduced SH to the rendering community: An Efficient Representation for Irradiance Environment Maps

Their earlier paper actually contains a more rigorous treatment of spherical harmonics: On the relationship between radiance and irradiance: determining the illumination from images of a convex Lambertian object

Robin Green’s great introduction to the topic: Spherical Harmonic Lighting: The Gritty Details

And Volker Schönefeld wrote my favourite introduction; the way he describes SH in terms of separate functions of theta and phi made everything fall into place: Spherical Harmonics


**Cornell box, I choose you!**

The above screenshots are of real-time single-bounce GI in a static scene with fully dynamic lighting. There are 7182 patches, and the lighting calculations take 36ms per frame on one thread of an Intel Core 2 @2.4 GHz. The code is not optimized.

The basic idea is simple and is split into two phases.

A one-time scene compilation phase:

- Split all surfaces in the scene into roughly equal sized patches.
- For each patch *i*, build a list of all other patches visible from *i* along with the form factor between patches *i* and *j*:

where:

*F_{ij}* is the form factor between patches *i* and *j*

*A_{j}* is the area of patch *j*

*r* is the vector from *i* to *j*

*Φ_{i}* is the angle between *r* and the normal of patch *i*

*Φ_{j}* is the angle between *r* and the normal of patch *j*

And a per-frame lighting phase:

- For each patch *i*, calculate the direct illumination.
- For each patch *i*, calculate single-bounce indirect illumination from all visible patches:

where:

*I _{j}* is the single-bounce indirect illumination for patch
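The two phases above can be sketched in a few lines of Python. This is a hedged illustration rather than the code from the download: the form factor uses the textbook point-to-patch approximation built from the symbols defined earlier (the π|r|² denominator is my assumption), the angle at patch *j* is measured against the reversed vector so that facing patches give a positive cosine, and albedo and visibility tests are left out.

```python
import math


def form_factor(pos_i, normal_i, pos_j, normal_j, area_j):
    """F_ij = A_j * cos(phi_i) * cos(phi_j) / (pi * |r|^2), zero for back-facing pairs."""
    r = [b - a for a, b in zip(pos_i, pos_j)]
    dist_sq = sum(c * c for c in r)
    if dist_sq == 0.0:
        return 0.0
    dist = math.sqrt(dist_sq)
    cos_i = sum(n * c for n, c in zip(normal_i, r)) / dist
    cos_j = -sum(n * c for n, c in zip(normal_j, r)) / dist
    if cos_i <= 0.0 or cos_j <= 0.0:
        return 0.0
    return area_j * cos_i * cos_j / (math.pi * dist_sq)


def single_bounce(direct, links):
    """Gather indirect light; links[i] is the precomputed list of (j, F_ij) pairs."""
    return [sum(direct[j] * f for j, f in links[i]) for i in range(len(direct))]
```

Only `single_bounce` runs per frame; the `links` lists are built once during scene compilation, which is what keeps the lighting fully dynamic while the geometry stays static.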

So far, so radiosity. If I understand Michal Iwanicki’s GDC presentation correctly, this is similar to the lighting tech on Milo and Kate, only they additionally project the bounce lighting into SH.

The problem with this approach is that the running time is O(N^{2}) in the number of patches. We could work around this by making the patches quite large, running on the GPU, or both. Alternatively, we can bring the running time down to O(N log N) by borrowing from Michael Bunnell’s work on dynamic AO and clustering patches into a hierarchy. I chose to perform bottom-up patch clustering similarly to the method that Miles Macklin describes in his (Almost) realtime GI blog post.

Scene compilation is now:

- Split all surfaces in the scene into roughly equal sized patches.
- Build a hierarchy of patches using k-means clustering.
- For each patch *i*, build a list of all other patches visible from *i* along with the form factor between patches *i* and *j*. If a visible patch *j* is too far from patch *i*, look further up the hierarchy.

And the lighting phase:

- For each leaf patch *i* in the hierarchy, calculate the direct illumination.
- Propagate the direct lighting up the hierarchy.
- For each patch *i*, calculate single-bounce indirect illumination from all visible patches and clusters.
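The propagation step might look something like the sketch below. The `Patch` class is hypothetical, and the area-weighted average is an assumption on my part about how a cluster should summarise its children; the released source may well do it differently.

```python
class Patch:
    """A leaf patch (no children) or a cluster of patches."""

    def __init__(self, area=0.0, children=()):
        self.area = area
        self.children = list(children)
        self.direct = 0.0


def propagate_up(node):
    """Fill in the area and direct lighting of every cluster at or below `node`."""
    if not node.children:
        return  # leaves already carry their own area and direct lighting
    for child in node.children:
        propagate_up(child)
    node.area = sum(c.area for c in node.children)
    # Assumption: a cluster's lighting is the area-weighted mean of its children,
    # and every leaf has a positive area.
    node.direct = sum(c.direct * c.area for c in node.children) / node.area
```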

Although this technique is really simple, it supports a feature set similar to that of Enlighten:

- Global illumination reacts to changes in surface albedo.
- Static area light sources that cast soft shadows.

That’s basically about it. There are a few other areas I’m tempted to look into once I’ve cleaned the code up a bit:

- Calculate direct and indirect illumination at different frequencies. This would allow scaling to much larger scenes.
- Perform the last two lighting steps multiple times to approximate more light bounces.
- Project the indirect illumination into SH, HL2 or the Half-Life basis.
- Light probes for dynamic objects.

You can grab the source code from here. Expect a mess, since it’s a C++ port of a C# proof of concept with liberal use of vector and hash_map. Scene construction is particularly bad and may take a minute to run. You can fly around using WASD and left-click dragging with the mouse.


I had suggested applying a scale and bias to the result in order to limit a light’s influence, which is a serviceable solution, but far from ideal. Unfortunately, applying such a bias causes the gradient of the curve to become non-zero at the limit of the light’s influence.

Here’s the attenuation curve for a light of radius 1.0:

And after applying a scale and bias (shown in red):

You can see that the gradient at the zero-crossing is close to, but not quite zero. This is problematic because the human eye is irritatingly sensitive to discontinuities in illumination gradients and we might easily end up with Mach bands.

I was discussing this problem with a colleague of mine, Jerome Scholler, and he came up with an excellent suggestion – to transform *d* in the attenuation equation by some function whose value tends to infinity as its input reaches our desired maximum distance of influence. My first thought was of using tan:

That worked well: the resulting curve has roughly the same shape as the original, while also having both a gradient and value of zero at the desired maximum distance. It does have the disadvantage of using a trig function, which isn’t so hot, so we went looking for something else. After a few minutes playing around we came up with the following rational function:

It’s very similar to the tan version, but may run faster, depending on your hardware.

Below are some examples of the different methods, using a light with high intensity and small influence. On the left of each is the original image, on the right is the result of a levels adjustment, which emphasizes the tail of the attenuation curve.

Disclaimer: The parameters for the analytic functions were chosen to highlight their different characteristics, not to look good.

Original ray traced reference:

The graphs today were brought to you courtesy of the awesome fooplot.com.


where:

d = distance between the light and the surface being shaded

kc = constant attenuation factor

kl = linear attenuation factor

kq = quadratic attenuation factor

Since I first read about light attenuation in the Red Book I’ve often wondered where this equation came from and what values should actually be used for the attenuation factors, but I could never find a satisfactory explanation. Pretty much every reference to light attenuation in both books and online simply presents some variant of this equation, along with screenshots of objects being lit by lights with different attenuation factors. If you’re lucky, there’s sometimes an accompanying bit of handwaving.

Today, I did some experimentation with my path tracer and was pleasantly surprised to find a correlation between the direct illumination from a physically based spherical area light source and the point light attenuation equation.

I set up a simple scene in which to conduct the tests: a spherical area light above a diffuse plane. By setting the light’s radius and distance above the plane to different values and then sampling the direct illumination at a point on the plane directly below the light, I built up a table of attenuation values. Here’s a plot of some of the results; the distance on the horizontal axis is that between the plane and the light’s *surface*, not its centre.

After looking at the results from a series of tests, it became apparent that the attenuation of a spherical light can be modeled as:

where:

d = distance between the light’s surface and the point being shaded

r = the light’s radius

Expanding this out, we get:

which is the original point light attenuation equation with the following attenuation factors:
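Worked through, and assuming the fitted model is the 1 / (*d*/*r* + 1)² form used by the GLSL snippet at the end of this post, the expansion and the resulting factors are:

```latex
\frac{1}{\left(\frac{d}{r} + 1\right)^{2}}
  = \frac{1}{1 + \frac{2}{r}\,d + \frac{1}{r^{2}}\,d^{2}}
\quad\Rightarrow\quad
k_c = 1, \qquad k_l = \frac{2}{r}, \qquad k_q = \frac{1}{r^{2}}
```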

Below are a couple of renders of four lights above a plane. The first is a ground-truth render of direct illumination calculated using Monte Carlo integration:

In this second render, direct illumination is calculated analytically using the attenuation factors derived from the light radius:

The only noticeable difference between the two is that in the second image, an area of the plane to the far left is slightly too bright due to a lack of a shadowing term.

Maybe this is old news to many people, but I was pretty happy to find out that an equation that had seemed fairly arbitrary to me for so many years actually had some physical motivation behind it. I don’t really understand why this relationship is never pointed out, not even in Foley and van Dam’s venerable tome*.

Unfortunately this attenuation model is still problematic for real-time rendering, since a light’s influence is essentially unbounded. We can, however, artificially enforce a finite influence by clipping all contributions that fall below a certain threshold. Given a spherical light of radius *r* and intensity *Li*, the illumination *I* at distance *d* is:

Assuming we want to ignore all illumination that falls below some cutoff threshold *Ic*, we can solve for *d* to find the maximum distance of the light’s influence:

Biasing the calculated illumination by *-Ic* and then scaling by *1/(1-Ic)* ensures that illumination drops to zero at the furthest extent, and the maximum illumination is unchanged.
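For reference, here is how the cutoff works out numerically, as a small Python sketch with my own function names. It assumes the same 1 / (*d*/*r* + 1)² attenuation as the GLSL snippet below, in which case solving for the distance where the illumination equals the cutoff gives *d* = *r*(√(*Li*/*Ic*) − 1).

```python
import math


def attenuation(d, r):
    """Unbounded sphere-light attenuation; d is measured from the light's surface."""
    return 1.0 / (d / r + 1.0) ** 2


def max_distance(r, intensity, cutoff):
    """Distance at which intensity * attenuation(d, r) has fallen to `cutoff`."""
    return r * (math.sqrt(intensity / cutoff) - 1.0)


def bounded_attenuation(d, r, cutoff):
    """Bias by -cutoff and rescale so the value is 1 at d = 0 and 0 at the limit."""
    return max((attenuation(d, r) - cutoff) / (1.0 - cutoff), 0.0)
```

With a radius of 2, unit intensity and a cutoff of 0.01, the light stops contributing 18 units from its surface.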

Here’s the result of applying these changes with a cutoff threshold of 0.001; in the second image, areas which receive no illumination are highlighted in red:

And here’s a cutoff threshold of 0.005; if you compare to the version with no cutoff, you’ll see that the illumination is now noticeably darker:

Just to round things off, here’s a GLSL snippet for calculating the approximate direct illumination from a spherical light source. Soft shadows are left as an exercise for the reader.

    vec3 DirectIllumination(vec3 P, vec3 N, vec3 lightCentre, float lightRadius, vec3 lightColour, float cutoff)
    {
        // calculate normalized light vector and distance to sphere light surface
        float r = lightRadius;
        vec3 L = lightCentre - P;
        float dist = length(L);
        float d = max(dist - r, 0.0);
        L /= dist;

        // calculate basic attenuation
        float denom = d / r + 1.0;
        float attenuation = 1.0 / (denom * denom);

        // scale and bias attenuation such that:
        //   attenuation == 0 at extent of max influence
        //   attenuation == 1 when d == 0
        attenuation = (attenuation - cutoff) / (1.0 - cutoff);
        attenuation = max(attenuation, 0.0);

        // named NdotL and dist so the built-in dot() and distance() aren't shadowed
        float NdotL = max(dot(L, N), 0.0);
        return lightColour * NdotL * attenuation;
    }

* I always felt a little sorry for Feiner and Hughes.


After playing with my GMC-4 for a couple of days, the initial novelty of hand assembling programs had well and truly worn off, so I turned to the assembler and simulator available on the web. After a couple of minutes of use though, it was obvious that a *simulator* wasn’t at all what I wanted.

After all, if I wanted to type in a program nibble by nibble, I may as well do it on the hardware itself. What I *really* wanted was an integrated assembler and debugger. Luckily, the instruction set is very basic, so it didn’t take very long to put one together:

The left pane is the source window. The three columns on the right show the contents of memory; program memory is shown in Wheat & Cornsilk, data memory in DarkSeaGreen & PaleGreen, the current instruction is highlighted in DarkRed. The colouring is intended to make it easier to follow through the code when typing in the machine code on the hardware.

The jumble of text below the memory view shows the state of the eight registers, the status flag (see my previous post), program counter, seven-segment display and binary LEDs. There are buttons for copying the contents of memory to the clipboard, running the program, single-stepping and finally simultaneously compiling the source and resetting the state of the machine. The bottom panel shows any exceptions that might get thrown by the assembler or emulator (I didn’t bother to spend much time on error handling).

The reason op-codes are still shown for data memory is that (aside from addressable ranges) the GMC-4 makes no distinction between code and data memory. The neat thing about this is that the hardware has no problems executing code that lives in data memory; it also doesn’t mind if those instructions get modified during program execution, which means… self-modifying code! I’ve not found a way to make reasonable use of this, but it’s kind of cool nevertheless.

In the unlikely event that anyone’s interested, I’ve shoved the code up on Google. It’s written in Good Ol’ WinForms, as my fleeting love affair with WPF ended once I decided that I’d rather be productive than fashionable.

For one of the “games” I wrote, I needed a random number generator. Unfortunately, with 4-bit addition and subtraction my only available mathematical operations, it wasn’t immediately obvious how to go about it.

A bit of searching around led inevitably to Wikipedia and multiply-with-carry random number generators. I went with lag-1 MWC for simplicity, which is defined as:

Being limited to 4-bit numbers, a natural choice for *b* was 16 and a quick exhaustive search for *a* showed that a value of 15 yielded the sequence with the longest period (around 120) and distribution of numbers that wasn’t wholly intolerable (it at least covered all the digits). As an added bonus, using these values for *a* and *b* meant that the multiplications and divisions could be done away with completely:

That crappy little tilde above the *x* is meant to represent a bitwise *not* – my LaTeX is pretty weak I’m afraid.

4-bit random number generation is not exactly a widely discussed topic on the internet, so I don’t know if there’s a better method. This worked well enough for my purposes though. Quality wise, it’s probably on par with RANDU.
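For anyone who wants to poke at the generator, here’s a Python sketch of lag-1 MWC with *a* = 15 and *b* = 16. `mwc_step` is the textbook definition; `mwc_step_4bit` is one possible multiply-free rearrangement using a bitwise *not*, and isn’t necessarily the exact form I entered on the GMC-4.

```python
A, B = 15, 16  # multiplier and base chosen in the post


def mwc_step(x, c):
    """Reference lag-1 MWC step: returns (next x, next carry)."""
    t = A * x + c
    return t % B, t // B


def mwc_step_4bit(x, c):
    """The same step without multiplying or dividing.

    15x + c == 16x + (c - x), and (~x & 0xF) + 1 is the 4-bit negation of x.
    """
    s = (~x & 0xF) + c + 1
    if s > 0xF:          # c >= x: no borrow from the carry digit
        return s & 0xF, x
    return s, x - 1      # c < x: borrow one from the carry digit


def period(x, c):
    """Number of steps before the (x, c) state first comes round again."""
    start, steps = (x, c), 0
    while True:
        x, c = mwc_step(x, c)
        steps += 1
        if (x, c) == start:
            return steps
```

Seeding with *x* = 1 and a carry of 0 gives a period of 119, which is where the “around 120” figure comes from. The seeds (0, 0) and (15, 14) are fixed points and should be avoided.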


“Made it! And with one and a half bytes left to spare!”

I was down in San Francisco’s Japantown a few days ago, browsing the magazine section of the Kinokuniya bookstore, when I stumbled across something totally awesome – a magazine series called Otona no Kagaku (lit. Adult’s Science). Each edition comes packaged with a build-it-yourself kit for some kind of science experiment and the magazine itself contains the assembly instructions, ideas for experiments and other background information.

The subjects covered in the series are diverse and include a steam engine, movie projector, theremin and even a bird organ (no idea). The one that caught my eye, however, was a 4-bit microcomputer kit. The kit itself was very simple to assemble and just involved screwing together a few prefabbed parts and putting in batteries. Once assembled, I held in my hands a working GMC-4 microcomputer. It’s a beast of a machine, with a staggering *eighty nibbles* of program memory, sixteen nibbles of data memory and eight 4-bit registers (although only two of them are available for use at any one time).

**Behold!**

Those primitive scratchings beneath are my first working program.

Once built, the next step was to make it actually do something. I was feeling particularly masochistic, so I decided to figure it out the hard way – without the internet. Armed with a Nintendo DS and a copy of Kanji Sonomama Rakubiki Jiten I spent the next couple of hours translating the operating guide and the instruction set. Once I had a rough idea of how the thing worked and had managed to get a couple of the sample programs running, it was time to write a program for myself.

Since the GMC-4’s built-in arithmetic is limited to 4-bit addition and subtraction, I figured an achievable enough goal for an afternoon’s work would be a 16-bit adder. Five hours later, all I had was a program that thought that 1 + 1 == F. It was slow going – writing the program out on paper and then translating the mnemonics into machine code by hand. Still, I found the process perversely satisfying once everything was finally working and it reminded me of my college days when we had to do the same thing for a 6502. It took another four hours to fix all the bugs and then fit the code into memory. In the end, it exactly filled the available program memory and used 13 of the available 16 nibbles of data memory. Three nibbles to spare!

Here’s a video of the adder in action, calculating the following sums:

0x3978 + 0x2BD6 = 0x0654E

0xA5E3 + 0xD687 = 0x17C6A
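As a cross-check on those sums, here’s the same digit-by-digit algorithm as a short Python sketch (not a translation of the GMC-4 program itself). Like the real thing, it needs two overflow checks per digit: one after adding the two digits and one after adding the previous carry.

```python
def add16_by_nibbles(a, b):
    """Add two 16-bit values four bits at a time, returning a 17-bit result."""
    result, carry = 0, 0
    for shift in range(0, 16, 4):            # least significant digit first
        total = ((a >> shift) & 0xF) + ((b >> shift) & 0xF)
        overflow = total > 0xF               # first overflow check
        total = (total & 0xF) + carry
        overflow = overflow or total > 0xF   # second overflow check
        result |= (total & 0xF) << shift
        carry = 1 if overflow else 0
    return result | (carry << 16)
```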

Full source code is below. From left to right, the columns are as follows: code address as displayed by the binary LEDs on the system, code address in hexadecimal, operation mnemonic, opcode value, comments.

    ------- 00 TIY  A ; INPUT PHASE: Init data pointer to most significant digit of first number - digits
    ------* 01 <7>  7 ; are read into addresses 7,6,5,4 for the first number and 3,2,1,0 for the second.
    -----*- 02 KA   0 ; Wait for user input.
    -----** 03 JUMP F
    ----*-- 04 <0>  0
    ----*-* 05 <2>  2
    ----**- 06 AM   4 ; Store and display digit.
    ----*** 07 AO   1
    ---*--- 08 CAL  E ; BEEP! this both provides feedback and creates a short delay,
    ---*--* 09 SHTS 9 ; which prevents the press being registered multiple times.
    ---*-*- 0A AIY  B ; Decrement data pointer
    ---*-** 0B <F>  F
    ---**-- 0C JUMP F ; If the pointer is still >= 0, loop again.
    ---**-* 0D <0>  0
    ---***- 0E <2>  2
    ---**** 0F TIY  A ; Store the first carry value (which has value 0) in the location of the first output digit (address 8).
    --*---- 10 <8>  8
    --*---* 11 TIA  8
    --*--*- 12 <0>  0
    --*--** 13 AM   4
    --*-*-- 14 TIY  A ; Reset data pointer to point to the least significant digit of the second number.
    --*-*-* 15 <0>  0
    --*-**- 16 MA   5 ; MAIN LOOP: Load a digit from the first number.
    --*-*** 17 AIY  B
    --**--- 18 <4>  4
    --**--* 19 M+   6 ; Add the corresponding digit of the second number.
    --**-*- 1A JUMP F ; Check for overflow.
    --**-** 1B <2>  2
    --***-- 1C <C>  C
    --***-* 1D AIY  B ; The addition caused no overflow, add the carry value from the previous step.
    --****- 1E <4>  4
    --***** 1F M+   6
    -*----- 20 JUMP F ; Check for overflow again.
    -*----* 21 <2>  2
    -*---*- 22 <F>  F
    -*---** 23 AM   4 ; Still no overflow, store a carry value of 0 in the address of the next output digit.
    -*--*-- 24 AIY  B
    -*--*-* 25 <1>  1
    -*--**- 26 TIA  8
    -*--*** 27 <0>  0
    -*-*--- 28 AM   4
    -*-*--* 29 JUMP F ; Skip to end of the loop.
    -*-*-*- 2A <3>  3
    -*-*-** 2B <5>  5
    -*-**-- 2C AIY  B ; Overflow caused by the initial digit addition - add the carry value from the previous step.
    -*-**-* 2D <4>  4
    -*-***- 2E M+   6
    -*-**** 2F AM   4 ; We can get here from either of the two possible overflow conditions,
    -**---- 30 AIY  B ; store a carry value of 1 in the address of the next output digit.
    -**---* 31 <1>  1
    -**--*- 32 TIA  8
    -**--** 33 <1>  1
    -**-*-- 34 AM   4
    -**-*-* 35 AIY  B ; Move on to next digit.
    -**-**- 36 <8>  8
    -**-*** 37 CIY  D ; Check if we've reached the last digit.
    -***--- 38 <4>  4
    -***--* 39 JUMP F ; If not, run the loop again.
    -***-*- 3A <1>  1
    -***-** 3B <6>  6
    -****-- 3C CAL  E ; DISPLAY PHASE: Clear the display.
    -****-* 3D RSTO 0
    -*****- 3E TIY  A ; Set the data pointer to point past the most significant digit of the output.
    -****** 3F <D>  D
    *------ 40 AIY  B ; Decrement the data pointer.
    *-----* 41 <F>  F
    *----*- 42 TIA  8 ; Pause for a short while.
    *----** 43 <6>  6
    *---*-- 44 CAL  E
    *---*-* 45 TIMR C
    *---**- 46 MA   5 ; Load and display the value of output digit.
    *---*** 47 AO   1
    *--*--- 48 CIY  D ; Check if we've stepped past the least significant digit of the output.
    *--*--* 49 <7>  7
    *--*-*- 4A JUMP F ; If so, jump back and clear the display.
    *--*-** 4B <4>  4
    *--**-- 4C <0>  0
    *--**-* 4D JUMP F ; If not, jump back and move on to the next digit.
    *--***- 4E <3>  3
    *--**** 4F <C>  C

Curtis Hoffmann has written a comprehensive description of the GMC-4 and his page also has links to a GMC-4 simulator and assembler.

One feature of the CPU that caused me trouble is that it has only one status flag, the value of which is modified by *every instruction*. The instruction for reading the keypad sets this flag to 0 if a key is pressed, and 1 if not; the compare instructions set it to 1 if a register is not equal to some constant, and 0 otherwise; the arithmetic instructions set the flag to 1 on overflow and 0 otherwise; all other instructions set the flag to 1. There is only one direct branch instruction and branches are only taken if the status flag is 1 at the time.

The upshot of this is that tests must be acted upon *immediately*, otherwise their results will be discarded as soon as the next instruction executes. Couple this with a limited instruction set and a scarcity of registers and I ended up having to duplicate many sequences of instructions. Not what you want to be doing with only 40 bytes of memory.

As an example, here’s pseudo code for adding two values stored in addresses 0 and 1, writing the 4-bit result to address 2 and the carry bit to address 3. A and Y are registers, [Y] denotes a reference to the data at address Y:

    Y = 0
    A = [Y]
    Y = 1
    A += [Y]
    goto overflow       ; only taken if addition overflowed
    no_overflow:
    Y = 2
    [Y] = A
    A = 0
    goto store_carry    ; always taken
    overflow:
    Y = 2
    [Y] = A
    A = 1
    store_carry:
    Y = 3
    [Y] = A

It’s possible to remove this duplication by using some of the remaining six registers, but without any direct way to load data into any register other than A, it’s more effort (and code) than it’s worth.

Here’s another video showing the input process for a much shorter program. You can see how the binary LEDs update to show the current program address as the opcodes are entered. There’s a light show at the end as a payoff, so stick with it! (Or just skip to 1:05)

And here’s the source code:

    ------- 00 TIA  8 ; Register A stores the delay between each update.
    ------* 01 <0>  0 ; Register Y stores the current LED position.
    -----*- 02 TIY  A ; Start scrolling left.
    -----** 03 <0>  0
    ----*-- 04 CAL  E
    ----*-* 05 TIMR C
    ----**- 06 CAL  E
    ----*** 07 RSTR 2
    ---*--- 08 AIY  B
    ---*--* 09 <3>  3
    ---*-*- 0A AM   4 ; Redundant operation whose purpose is to make sure the status flag is set to 1
    ---*-** 0B CAL  E ; otherwise, this call won't get executed.
    ---**-- 0C SETR 1
    ---**-* 0D AIY  B
    ---***- 0E <E>  E
    ---**** 0F CIY  D
    --*---- 10 <4>  4
    --*---* 11 JUMP F ; Continue scrolling left.
    --*--*- 12 <0>  0
    --*--** 13 <4>  4
    --*-*-- 14 TIY  A ; Start scrolling right.
    --*-*-* 15 <6>  6
    --*-**- 16 CAL  E
    --*-*** 17 TIMR C
    --**--- 18 CAL  E
    --**--* 19 RSTR 2
    --**-*- 1A AIY  B
    --**-** 1B <D>  D
    --***-- 1C AM   4 ; Redundant operation whose purpose is to make sure the status flag is set to 1
    --***-* 1D CAL  E ; otherwise, this call won't get executed.
    --****- 1E SETR 1
    --***** 1F AIY  B
    -*----- 20 <2>  2
    -*----* 21 CIY  D
    -*---*- 22 <2>  2
    -*---** 23 JUMP F ; Continue scrolling right.
    -*--*-- 24 <1>  1
    -*--*-* 25 <6>  6
    -*--**- 26 JUMP F ; Start scrolling left again.
    -*--*** 27 <0>  0
    -*-*--- 28 <2>  2

That’s about all I’ve done with the GMC-4 so far. It’s not much, but it’s been a lot of fun programming a bit closer to the metal for a change.


One of the biggest headaches I encountered was caused by “fireflies”: those bright pixels that can occur when sampling a strong response combined with a small PDF somewhere along the path. For a long time, I was “fixing” these by hand painting over the offending pixels and pretending like nothing ever happened. Eventually though, the guilt of this gnawed away at me long enough to motivate finding some kind of better solution.

My first thought was to write a filter that estimated variance in an image and replaced any “bad” pixels it found with a weighted average of their neighbours. Luckily, my second thought was of shadow maps, the only other context in which I’d read about variance before. Based on the ideas in that paper, I accumulate two separate per-pixel buffers: one storing the running sum of the samples and the other storing the sum of their squares. Having these two buffers then makes it trivial to compute the sample variance of each pixel in the image.

My path tracer already had support for progressive refinement, so it was straightforward to add a separate “variance reduction” pass that would run at the touch of a button. During this pass, the N pixels with the highest variance are identified and oversampled a few hundred times, which hopefully reduces their variance sufficiently. If not, I just run the pass again.
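The bookkeeping is tiny. Here’s a Python sketch of the idea for a single scalar value per sample (luminance, say); `PixelStats` and `worst_pixels` are illustrative names, not lifted from my path tracer.

```python
class PixelStats:
    """Running sum and sum of squares for one pixel's samples."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, sample):
        self.count += 1
        self.total += sample
        self.total_sq += sample * sample

    def variance(self):
        """Unbiased sample variance; 0 until there are at least two samples."""
        if self.count < 2:
            return 0.0
        mean = self.total / self.count
        return max(0.0, (self.total_sq - self.count * mean * mean) / (self.count - 1))


def worst_pixels(stats, n):
    """Indices of the n pixels with the highest sample variance."""
    order = sorted(range(len(stats)), key=lambda i: stats[i].variance(), reverse=True)
    return order[:n]
```

The pixels returned by `worst_pixels` are the ones that get the extra few hundred samples, with each new sample fed back through `add` so the next pass sees the updated variance.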

As an example, here’s a render of the Manifold mesh from Torolf Sauermann’s awesome model repository, stopped after only a few paths have been traced per pixel:

I’ve highlighted a few areas that contain fireflies and below is a comparison of those areas before and after running the variance reduction pass:

And here’s the complete result of running the pass; it’s still noisy, but the pixels with particularly high variance have been cleaned up reasonably well:

I’m not too hot at statistics, but I would guess that this adds bias to the final result, which is frowned upon in some circles (but not others). Admittedly, a better solution would be to simply not generate so much variance in the first place, but this will do as a kludge until then. At least it’s better than painting pixels by hand!

Ok, since this post was mostly just an excuse to dump some results from my path tracer, here they are.

The XYZ RGB Dragon from Stanford’s 3D Scanning Repository:

A heat map of the BIH built for the same model.

The *other* Dragon from Stanford:

Manifold again, from jotero.com.

Some Stanford Bunnies:

Crytek’s updated version of Marko Dabrovic’s Sponza model:

And Stanford’s Lucy, just to prove that I can:

I’ve not been able to find any close up renders of Lucy to compare with, but I believe that the “pimples” on the model are noise in the original dataset.

]]>

In order to keep myself distracted from its dirty looks, I’ve been tinkering around with fluid simulation. Miles Macklin has done some great work with Eulerian (grid based) solvers, so in an effort to distance myself from the competition, I’m sticking to 2D Lagrangian (particle based) simulation.

Until recently, I’d always thought that particle based fluid simulation was complicated and involved *heavy maths*. This wasn’t helped by the fact that most of the papers on the subject have serious sounding names like Particle-based Viscoelastic Fluid Simulation, Weakly compressible SPH for free surface flows, or even Smoothed Particle Hydrodynamics and Magnetohydrodynamics.

It wasn’t until I finally took the plunge and tried writing my own Smoothed Particle Hydrodynamics simulation that I found that it can be quite easy, provided you work from the right papers. SPH has a couple of advantages over grid based methods: it is trivial to ensure that mass is exactly conserved, and free-surfaces (the boundary between fluid and non-fluid) come naturally. Unfortunately, SPH simulations have a tendency to explode if the time step is too large and getting satisfactory results is heavily dependent on finding “good” functions with which to model the inter-particle forces.

I had originally intended to write an introduction to SPH, but soon realised that it would make this post intolerably long, so instead I’ll refer to the papers that I used when writing my own sim. Pretty much every SPH paper comes with an introduction to the subject, invariably in section **2. Related Work**.

The first paper I tried implementing was Particle-Based Fluid Simulation for Interactive Applications by Müller et al. It serves as a great introduction to SPH with a very good discussion of kernel weighting functions, but I had real difficulty getting decent results. In the paper, pressure, viscosity and surface tension forces are modeled using the following equations:

The pressure for each particle is calculated from its density using:

$$p = k(\rho - \rho_0)$$

where $k$ is a stiffness constant and $\rho_0$ is some non-zero rest density.

The first problem I encountered was with the pressure model; it only acts as a repulsive force if the particle density is greater than the rest density. If a particle has only a small number of neighbours, the pressure force will attract them to form a cluster of particles all sharing the same space. In my experiments, I often found large numbers of clusters of three or four particles all in the same position. It took me a while to figure out what was going on because Müller states that the value of the rest density “mathematically has no effect on pressure forces”, which is only true given a fairly uniform density of particles far from the boundary.
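The sign flip is easy to see in isolation. The sketch below assumes the linear pressure model p = k(ρ − ρ0); `Pressure` and `IsRepulsive` are hypothetical helpers, and the values in the usage note borrow the rest density and stiffness constants from the code listing later in this post purely as plausible numbers.

```cpp
#include <cassert>

// Linear equation of state: p = stiffness * (density - restDensity).
inline float Pressure(float density, float restDensity, float stiffness)
{
    assert(restDensity > 0.0f);
    return stiffness * (density - restDensity);
}

// True when the pressure pushes neighbours apart,
// false when it is zero or pulls them together.
inline bool IsRepulsive(float density, float restDensity, float stiffness)
{
    return Pressure(density, restDensity, stiffness) > 0.0f;
}
```

With a rest density of 82 and a stiffness of 0.08, a particle in a crowded region with a density of 100 gets a positive pressure, while an isolated particle with a density of 10 gets a negative one and so drags its few neighbours towards itself.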

The second problem I found was with the surface tension force. It was originally developed for multiphase fluid situations with no free surfaces and doesn’t behave well near the surface boundary; in fact it can actually pull the fluid into concave shapes. Additionally, because it’s based on a Laplacian, it’s very sensitive to fluctuations in the particle density, which are the norm at the surface boundary.

After a week or so of trying, this was my best result:

From the outset, you can see the surface tension force is doing weird things. Even worse, once the fluid starts to settle, the particles tend to stack on top of each other and form a very un-fluid blob.

On the up side, I did create possibly my best ever bug when implementing the surface tension model; I ended up with something resembling microscopic life floating around under the microscope:

The next paper I tried was Particle-based Viscoelastic Fluid Simulation by Clavet et al. I actually had a lot of success with their paper and had a working implementation of their basic model up and running in less than two hours, albeit minus the viscoelasticity. In addition to the pressure force described in Müller’s paper, they model “near” density and pressure, which are similar to their regular counterparts but with a zero rest density and different kernel functions:

This near pressure ensures a minimum spacing and as an added bonus performs a decent job of modelling surface tension too. This is the first simulation I ran using their pressure and viscosity forces:

Although initial results were promising, I struggled when tweaking the parameters to find a good balance between a fluid that was too compressible and one that was too viscous. Also, what I really wanted was to do multiphase fluid simulation. This wasn’t covered in the viscoelastic paper, so my next port of call was Weakly compressible SPH for free surface flows by Becker et al. In this paper, surface tension is modeled as:

They also discuss using Tait’s equation for the pressure force, rather than one based on the ideal gas law:

$$p = B\left(\left(\frac{\rho}{\rho_0}\right)^{\gamma} - 1\right)$$

with

$$\gamma = 7$$

I gave that a shot, but the large exponent caused the simulation to explode unless I used a *really* small time step. Instead, I found that modifying the pressure forces from the viscoelastic paper slightly gave a much less compressible fluid without the requirement for a tiny time step:
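To get a feel for why the exponent is so punishing, here is a small sketch comparing a linear pressure model with Tait’s equation. It assumes the usual form p = B((ρ/ρ0)^γ − 1) with γ = 7; the function names and the parameter `B` are illustrative and not taken from any of the papers’ code.

```cpp
#include <cassert>
#include <cmath>

// Ideal-gas style pressure: grows linearly with the density error.
inline float LinearPressure(float density, float restDensity, float stiffness)
{
    return stiffness * (density - restDensity);
}

// Tait's equation: p = B * ((density / restDensity)^gamma - 1).
inline float TaitPressure(float density, float restDensity, float B, float gamma = 7.0f)
{
    assert(restDensity > 0.0f);
    return B * (std::pow(density / restDensity, gamma) - 1.0f);
}
```

With unit rest density and unit constants, a 10% overshoot in density gives a linear pressure of about 0.1 but a Tait pressure of about 0.95, because 1.1 raised to the seventh power is roughly 1.95. Small density errors therefore turn into large forces, which is consistent with needing a much smaller time step to stay stable.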

Here’s one of my more successful runs:

And here is a slightly simplified version of the code behind it. Be warned, it’s quite messy; I’m rather enjoying hacking code together these days:

    #include <float.h>
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <assert.h>
    #include <memory.h>
    #include <glut.h>

    #define kScreenWidth 640
    #define kScreenHeight 480
    #define kViewWidth 10.0f
    #define kViewHeight (kScreenHeight*kViewWidth/kScreenWidth)
    #define kPi 3.1415926535f
    #define kParticleCount 3000

    #define kRestDensity 82.0f
    #define kStiffness 0.08f
    #define kNearStiffness 0.1f
    #define kSurfaceTension 0.0004f
    #define kLinearViscocity 0.5f
    #define kQuadraticViscocity 1.0f

    #define kParticleRadius 0.05f
    #define kH (6*kParticleRadius)
    #define kFrameRate 20
    #define kSubSteps 7

    #define kDt ((1.0f/kFrameRate) / kSubSteps)
    #define kDt2 (kDt*kDt)
    #define kNorm (20/(2*kPi*kH*kH))
    #define kNearNorm (30/(2*kPi*kH*kH))

    #define kEpsilon 0.0000001f
    #define kEpsilon2 (kEpsilon*kEpsilon)

    struct Particle
    {
        float x;
        float y;

        float u;
        float v;

        float P;
        float nearP;

        float m;

        float density;
        float nearDensity;
        Particle* next;
    };

    struct Vector2
    {
        Vector2() { }
        Vector2(float x, float y) : x(x) , y(y) { }
        float x;
        float y;
    };

    struct Wall
    {
        Wall() { }
        Wall(float _nx, float _ny, float _c) : nx(_nx), ny(_ny), c(_c) { }
        float nx;
        float ny;
        float c;
    };

    struct Rgba
    {
        Rgba() { }
        Rgba(float r, float g, float b, float a) : r(r), g(g), b(b), a(a) { }
        float r, g, b, a;
    };

    struct Material
    {
        Material() { }
        Material(const Rgba& colour, float mass, float scale, float bias)
            : colour(colour) , mass(mass) , scale(scale) , bias(bias) { }
        Rgba colour;
        float mass;
        float scale;
        float bias;
    };

    #define kMaxNeighbourCount 64
    struct Neighbours
    {
        const Particle* particles[kMaxNeighbourCount];
        float r[kMaxNeighbourCount];
        size_t count;
    };

    size_t particleCount = 0;
    Particle particles[kParticleCount];
    Neighbours neighbours[kParticleCount];
    Vector2 prevPos[kParticleCount];
    Vector2 relaxedPos[kParticleCount];
    Material particleMaterials[kParticleCount];
    Rgba shadedParticleColours[kParticleCount];

    #define kWallCount 4
    Wall walls[kWallCount] =
    {
        Wall( 1,  0, 0),
        Wall( 0,  1, 0),
        Wall(-1,  0, -kViewWidth),
        Wall( 0, -1, -kViewHeight)
    };

    #define kCellSize kH
    const size_t kGridWidth = (size_t)(kViewWidth / kCellSize);
    const size_t kGridHeight = (size_t)(kViewHeight / kCellSize);
    const size_t kGridCellCount = kGridWidth * kGridHeight;
    Particle* grid[kGridCellCount];
    size_t gridCoords[kParticleCount*2];

    struct Emitter
    {
        Emitter(const Material& material, const Vector2& position, const Vector2& direction, float size, float speed, float delay)
            : material(material), position(position), direction(direction), size(size), speed(speed), delay(delay), count(0)
        {
            float len = sqrt(direction.x*direction.x + direction.y*direction.y);
            this->direction.x /= len;
            this->direction.y /= len;
        }
        Material material;
        Vector2 position;
        Vector2 direction;
        float size;
        float speed;
        float delay;
        size_t count;
    };

    #define kEmitterCount 2
    Emitter emitters[kEmitterCount] =
    {
        Emitter(
            Material(Rgba(0.6f, 0.7f, 0.9f, 1), 1.0f, 0.08f, 0.9f),
            Vector2(0.05f*kViewWidth, 0.8f*kViewHeight), Vector2(4, 1), 0.2f, 5, 0),
        Emitter(
            Material(Rgba(0.1f, 0.05f, 0.3f, 1), 1.4f, 0.075f, 1.5f),
            Vector2(0.05f*kViewWidth, 0.9f*kViewHeight), Vector2(4, 1), 0.2f, 5, 6),
    };

    float Random01() { return (float)rand() / (float)(RAND_MAX-1); }
    float Random(float a, float b) { return a + (b-a)*Random01(); }

    void UpdateGrid()
    {
        // Clear grid
        memset(grid, 0, kGridCellCount*sizeof(Particle*));

        // Add particles to grid
        for (size_t i=0; i<particleCount; ++i)
        {
            Particle& pi = particles[i];
            int x = pi.x / kCellSize;
            int y = pi.y / kCellSize;

            if (x < 1)
                x = 1;
            else if (x > kGridWidth-2)
                x = kGridWidth-2;

            if (y < 1)
                y = 1;
            else if (y > kGridHeight-2)
                y = kGridHeight-2;

            // Push the particle onto the front of its cell's linked list
            pi.next = grid[x+y*kGridWidth];
            grid[x+y*kGridWidth] = &pi;

            gridCoords[i*2] = x;
            gridCoords[i*2+1] = y;
        }
    }

    void ApplyBodyForces()
    {
        for (size_t i=0; i<particleCount; ++i)
        {
            Particle& pi = particles[i];
            pi.v -= 9.8f*kDt;
        }
    }

    void Advance()
    {
        for (size_t i=0; i<particleCount; ++i)
        {
            Particle& pi = particles[i];

            // preserve current position
            prevPos[i].x = pi.x;
            prevPos[i].y = pi.y;

            pi.x += kDt * pi.u;
            pi.y += kDt * pi.v;
        }
    }

    void CalculatePressure()
    {
        for (size_t i=0; i<particleCount; ++i)
        {
            Particle& pi = particles[i];
            size_t gi = gridCoords[i*2];
            size_t gj = gridCoords[i*2+1]*kGridWidth;

            neighbours[i].count = 0;

            float density = 0;
            float nearDensity = 0;
            for (size_t ni=gi-1; ni<=gi+1; ++ni)
            {
                for (size_t nj=gj-kGridWidth; nj<=gj+kGridWidth; nj+=kGridWidth)
                {
                    for (Particle* ppj=grid[ni+nj]; NULL!=ppj; ppj=ppj->next)
                    {
                        const Particle& pj = *ppj;

                        float dx = pj.x - pi.x;
                        float dy = pj.y - pi.y;
                        float r2 = dx*dx + dy*dy;
                        if (r2 < kEpsilon2 || r2 > kH*kH)
                            continue;

                        float r = sqrt(r2);
                        float a = 1 - r/kH;
                        density += pj.m * a*a*a * kNorm;
                        nearDensity += pj.m * a*a*a*a * kNearNorm;

                        if (neighbours[i].count < kMaxNeighbourCount)
                        {
                            neighbours[i].particles[neighbours[i].count] = &pj;
                            neighbours[i].r[neighbours[i].count] = r;
                            ++neighbours[i].count;
                        }
                    }
                }
            }

            pi.density = density;
            pi.nearDensity = nearDensity;
            pi.P = kStiffness * (density - pi.m*kRestDensity);
            pi.nearP = kNearStiffness * nearDensity;
        }
    }

    void CalculateRelaxedPositions()
    {
        for (size_t i=0; i<particleCount; ++i)
        {
            const Particle& pi = particles[i];

            float x = pi.x;
            float y = pi.y;

            for (size_t j=0; j<neighbours[i].count; ++j)
            {
                const Particle& pj = *neighbours[i].particles[j];
                float r = neighbours[i].r[j];
                float dx = pj.x - pi.x;
                float dy = pj.y - pi.y;

                float a = 1 - r/kH;

                float d = kDt2 * ((pi.nearP+pj.nearP)*a*a*a*kNearNorm + (pi.P+pj.P)*a*a*kNorm) / 2;

                // relax
                x -= d * dx / (r*pi.m);
                y -= d * dy / (r*pi.m);

                // surface tension
                if (pi.m == pj.m)
                {
                    x += (kSurfaceTension/pi.m) * pj.m*a*a*kNorm * dx;
                    y += (kSurfaceTension/pi.m) * pj.m*a*a*kNorm * dy;
                }

                // viscosity
                float du = pi.u - pj.u;
                float dv = pi.v - pj.v;
                float u = du*dx + dv*dy;
                if (u > 0)
                {
                    u /= r;

                    float a = 1 - r/kH;
                    float I = 0.5f * kDt * a * (kLinearViscocity*u + kQuadraticViscocity*u*u);

                    x -= I * dx * kDt;
                    y -= I * dy * kDt;
                }
            }

            relaxedPos[i].x = x;
            relaxedPos[i].y = y;
        }
    }

    void MoveToRelaxedPositions()
    {
        for (size_t i=0; i<particleCount; ++i)
        {
            Particle& pi = particles[i];
            pi.x = relaxedPos[i].x;
            pi.y = relaxedPos[i].y;
            pi.u = (pi.x - prevPos[i].x) / kDt;
            pi.v = (pi.y - prevPos[i].y) / kDt;
        }
    }

    void ResolveCollisions()
    {
        for (size_t i=0; i<particleCount; ++i)
        {
            Particle& pi = particles[i];

            for (size_t j=0; j<kWallCount; ++j)
            {
                const Wall& wall = walls[j];
                float dis = wall.nx*pi.x + wall.ny*pi.y - wall.c;
                if (dis < kParticleRadius)
                {
                    float d = pi.u*wall.nx + pi.v*wall.ny;
                    if (dis < 0)
                        dis = 0;
                    pi.u += (kParticleRadius - dis) * wall.nx / kDt;
                    pi.v += (kParticleRadius - dis) * wall.ny / kDt;
                }
            }
        }
    }

    void Render()
    {
        glClearColor(0.02f, 0.01f, 0.01f, 1);
        glClear(GL_COLOR_BUFFER_BIT);

        glMatrixMode(GL_PROJECTION);
        glLoadIdentity();
        glOrtho(0, kViewWidth, 0, kViewHeight, 0, 1);

        glEnable(GL_POINT_SMOOTH);
        glEnable(GL_BLEND);
        glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);

        for (size_t i=0; i<particleCount; ++i)
        {
            const Particle& pi = particles[i];
            const Material& material = particleMaterials[i];

            Rgba& rgba = shadedParticleColours[i];
            rgba = material.colour;
            rgba.r *= material.bias + material.scale*pi.P;
            rgba.g *= material.bias + material.scale*pi.P;
            rgba.b *= material.bias + material.scale*pi.P;
        }

        glEnableClientState(GL_VERTEX_ARRAY);
        glEnableClientState(GL_COLOR_ARRAY);

        glPointSize(2.5f*kParticleRadius*kScreenWidth/kViewWidth);

        glColorPointer(4, GL_FLOAT, sizeof(Rgba), shadedParticleColours);
        glVertexPointer(2, GL_FLOAT, sizeof(Particle), particles);
        glDrawArrays(GL_POINTS, 0, particleCount);

        glDisableClientState(GL_COLOR_ARRAY);
        glDisableClientState(GL_VERTEX_ARRAY);

        glutSwapBuffers();
    }

    void EmitParticles()
    {
        if (particleCount == kParticleCount)
            return;

        static int emitDelay = 0;
        if (++emitDelay < 3)
            return;

        for (size_t emitterIdx=0; emitterIdx<kEmitterCount; ++emitterIdx)
        {
            Emitter& emitter = emitters[emitterIdx];
            if (emitter.count >= kParticleCount/kEmitterCount)
                continue;

            emitter.delay -= kDt*emitDelay;
            if (emitter.delay > 0)
                continue;

            size_t steps = emitter.size / (2*kParticleRadius);

            for (size_t i=0; i<=steps && particleCount<kParticleCount; ++i)
            {
                Particle& pi = particles[particleCount];
                Material& material = particleMaterials[particleCount];
                ++particleCount;

                ++emitter.count;

                float ofs = (float)i / (float)steps - 0.5f;

                ofs *= emitter.size;
                pi.x = emitter.position.x - ofs*emitter.direction.y;
                pi.y = emitter.position.y + ofs*emitter.direction.x;
                pi.u = emitter.speed * emitter.direction.x*Random(0.9f, 1.1f);
                pi.v = emitter.speed * emitter.direction.y*Random(0.9f, 1.1f);
                pi.m = emitter.material.mass;

                material = emitter.material;
            }
        }

        emitDelay = 0;
    }

    void Update()
    {
        for (size_t step=0; step<kSubSteps; ++step)
        {
            EmitParticles();

            ApplyBodyForces();
            Advance();
            UpdateGrid();
            CalculatePressure();
            CalculateRelaxedPositions();
            MoveToRelaxedPositions();
            UpdateGrid();
            ResolveCollisions();
        }

        glutPostRedisplay();
    }

    int main (int argc, char** argv)
    {
        glutInitWindowSize(kScreenWidth, kScreenHeight);
        glutInit(&argc, argv);
        glutInitDisplayString("samples stencil>=3 rgb double depth");
        glutCreateWindow("SPH");
        glutDisplayFunc(Render);
        glutIdleFunc(Update);

        memset(particles, 0, kParticleCount*sizeof(Particle));
        UpdateGrid();

        glutMainLoop();

        return 0;
    }

I’m pretty happy with the results, even if at three seconds per frame for the video above, my implementation isn’t exactly fast. Here are a few other videos from various stages of development:

]]>

Now that TFU2 is *almost* out of the door, I’ve been catching up on this year’s GDC presentations. One that was of particular interest to me was John Hable’s Uncharted 2 HDR Lighting talk because I think we’re all in agreement about how awesome the game looks. That led me to checking out his blog and his discussions of various tone mapping operators.

I agree with him on most of his points and I really like the results of his operator, but I was a bit disappointed by the treatment of Erik Reinhard’s tone mapping operator.

In order to explain why, I’ve pinched the HDR photo from John’s blog and it’s accompanied by some colour ramps to illustrate the results of applying various operations. The ramps go from a luminance of 0 up to a luminance of er… *very large* and are in linear RGB space.

Shown below are the source images (the photo is exposed at +4 stops). Click through for less tiny versions:

In both his blog and GDC presentation, John describes a simplified version of Reinhard’s operator as applying the following function to each colour channel:

$$F(x) = \frac{x}{x + 1}$$

Let’s do that to our test image and see what happens:

The top end isn’t nearly so blown out, but where did all my colour go?! That’s no good at all!

Let’s check out how John’s operator does:

It’s *much* better, especially in the blacks, but it’s still rather desaturated towards the top end. Perhaps that’s the price one pays for compressing the dynamic range so heavily.

Actually it doesn’t have to be.

The problem with the tone mapping operators that John describes is that they all operate on the RGB channels independently. Applying *any* non-linear transform in this way will result in both hue and saturation shifts, which is something that should be performed during final colour grading, not tone mapping. Instead, Reinhard’s tone mapping operator should be applied on each pixel’s *luminance*, which will preserve both the hue and saturation of the original image.
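A quick numerical sketch makes the shift concrete. It applies the simplified x / (x + 1) curve to each channel of a single colour; `Rgb`, `Curve` and `ToneMapPerChannel` are hypothetical names used only for this illustration.

```cpp
#include <cassert>

struct Rgb
{
    double r, g, b;
};

// The simplified curve x / (x + 1), applied to a single value.
inline double Curve(double x)
{
    assert(x >= 0.0);
    return x / (x + 1.0);
}

// Applies the curve to each channel independently.
inline Rgb ToneMapPerChannel(const Rgb& c)
{
    return Rgb{ Curve(c.r), Curve(c.g), Curve(c.b) };
}
```

Feeding in a bright, saturated orange of (4, 1, 0.25) gives back (0.8, 0.5, 0.2). The ratio of red to blue has dropped from 16:1 to 4:1, so the colour has been pushed towards grey even though nothing asked for it to be desaturated.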

There’s some confusion on the internet about the correct way to apply Reinhard’s operator. Some sources recommend converting from linear RGB into CIE xyY, a colour space derived from CIE XYZ. The advantage of this colour space is that luminance is stored in the Y channel, independently of chromaticity in xy. The idea is that you convert your image from RGB to xyY, perform the tone mapping on the Y channel only and then convert back to RGB.

While not complicated, the transform between linear RGB and xyY isn’t exactly trivial either. Here’s a simple implementation in C#, for a reference white of D65:

    void RGBtoxyY(double R, double G, double B,
                  out double x, out double y, out double Y)
    {
        // Convert from RGB to XYZ.
        // Y is an out parameter, so it is assigned rather than redeclared.
        double X = R * 0.4124 + G * 0.3576 + B * 0.1805;
        Y        = R * 0.2126 + G * 0.7152 + B * 0.0722;
        double Z = R * 0.0193 + G * 0.1192 + B * 0.9505;

        // Convert from XYZ to xyY (undefined for pure black, where L is zero)
        double L = (X + Y + Z);
        x = X / L;
        y = Y / L;
    }

    void xyYtoRGB(double x, double y, double Y,
                  out double R, out double G, out double B)
    {
        // Convert from xyY to XYZ
        double X = x * (Y / y);
        double Z = (1 - x - y) * (Y / y);

        // Convert from XYZ to RGB
        R = X *  3.2406 + Y * -1.5372 + Z * -0.4986;
        G = X * -0.9689 + Y *  1.8758 + Z *  0.0415;
        B = X *  0.0557 + Y * -0.2040 + Z *  1.0570;
    }

I was using this colour space transform for a couple of days, until my esteemed colleague Miles pointed out that I was doing it wrong. A much simpler approach is to calculate your luminance directly from the RGB values, perform the tone mapping on this value and then scale the original RGB values appropriately:

    double L = 0.2126 * R + 0.7152 * G + 0.0722 * B;
    double nL = ToneMap(L);
    double scale = nL / L;
    R *= scale;
    G *= scale;
    B *= scale;

This yields the same results as the conversion to and from xyY and has fewer magic numbers, which is always a win.
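The equivalence follows from the linearity of the colour space transform. As a short sketch, let $M$ be the linear RGB to XYZ matrix and $s = L' / L$ the ratio of tone mapped to original luminance (both symbols are introduced here for illustration only). Tone mapping just the Y channel of xyY leaves $x$ and $y$ alone, so:

$$
\begin{aligned}
(x,\ y,\ Y) &\mapsto (x,\ y,\ sY) \\
X' &= x\,\frac{sY}{y} = sX, \qquad Z' = (1 - x - y)\,\frac{sY}{y} = sZ \\
\begin{pmatrix} R' \\ G' \\ B' \end{pmatrix}
&= M^{-1} \begin{pmatrix} sX \\ sY \\ sZ \end{pmatrix}
= s\,M^{-1} M \begin{pmatrix} R \\ G \\ B \end{pmatrix}
= s \begin{pmatrix} R \\ G \\ B \end{pmatrix}
\end{aligned}
$$

With matrices rounded to four decimal places, $M^{-1}M$ is only approximately the identity, so in practice the two routes agree to within rounding error rather than bit for bit.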

Now let’s see what happens when we apply the same *x / (x+1)* to each pixel’s *luminance*:

Balls. This has preserved the colours, but at a terrible price; now all the whites are gone. The reason is that by preserving the hue and saturation, the operator prevents any colours being blown out to a full white. Luckily, Reinhard’s got our back. In his paper, the preceding operation is written as:

$$L_d(x, y) = \frac{L(x, y)}{1 + L(x, y)}$$

Almost immediately after this equation, Reinhard goes on to say:

“This formulation is guaranteed to bring all luminances within displayable range. However, as mentioned in the previous section, this is not always desirable”

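He then presents the following:

$$L_d(x, y) = \frac{L(x, y)\left(1 + \dfrac{L(x, y)}{L_{white}^2}\right)}{1 + L(x, y)}$$

Here, $L_{white}$ is the smallest luminance that will be mapped to 1.0.

Let’s give that a whirl, with an $L_{white}$ of 4 for the colour ramps and an $L_{white}$ of 2.4 for the condo photo: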
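Putting the pieces together, a luminance-based version of the extended operator might look like the sketch below. It assumes the extended curve has the form L(1 + L/Lwhite²)/(1 + L) and reuses the luminance weights from the snippet above; `Rgb`, `ReinhardExtended` and `ToneMap` are illustrative names, and the black-pixel guard is an addition that isn’t in the earlier snippet.

```cpp
#include <cassert>

struct Rgb
{
    double r, g, b;
};

// Extended Reinhard curve: L * (1 + L / Lwhite^2) / (1 + L).
// Any luminance equal to Lwhite maps to exactly 1.
inline double ReinhardExtended(double L, double Lwhite)
{
    assert(Lwhite > 0.0);
    return L * (1.0 + L / (Lwhite * Lwhite)) / (1.0 + L);
}

// Tone maps a linear RGB colour by scaling it with the ratio of
// tone mapped luminance to original luminance.
inline Rgb ToneMap(const Rgb& c, double Lwhite)
{
    const double L = 0.2126 * c.r + 0.7152 * c.g + 0.0722 * c.b;
    if (L <= 0.0)
        return Rgb{ 0.0, 0.0, 0.0 }; // avoid dividing by zero for black pixels

    const double scale = ReinhardExtended(L, Lwhite) / L;
    return Rgb{ c.r * scale, c.g * scale, c.b * scale };
}
```

With an Lwhite of 4, a neutral grey of (4, 4, 4) comes out as (1, 1, 1), so bright areas can reach full white again. Saturated colours can still end up with individual channels above 1 after scaling, so the result needs clamping before display.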

Well that’s much better than the previous effort; whether it’s better than John’s results is up for debate. What I like about it is that because it mostly preserves the hues, you haven’t lost any information; if you want to crisp the blacks or desaturate the whites, you can do that in your final colour grade. If you’re colour grading via a volume texture lookup, this is for free!

Having said all that, the TFU2 environment artists were most happy with a simple brightness & contrast adjustment with a small 0 – 0.05 toe, which in the end only costs one extra madd instruction at the end of our shaders. Whatever works for your game, makes the artists happy and means you get to go home at 6pm, right?

A full implementation of Reinhard’s operator adapts to the local changes in luminance around each pixel and is a little more involved than what I’ve described here, but hopefully I’ve managed to contribute something useful to the tone mapping in games discussion and not just reduced the signal to noise ratio on the subject.

]]>