Last year I had a go at the GCHQ hacking challenge and managed to solve it, so when a colleague informed me that there was a new challenge I thought I might give it a shot if I found some time.

Step 1
The first step is to figure out what the data on the website is. The payload is as follows:


The first thing I did was to paste this into HxD and remove the spaces (which I suspected were just there to break up the structure of the message). I generated a histogram of the bytes, which looked like this:


Apart from the anomalous ‘Q’ the distribution of the letters is basically consistent with average English, implying that the message has not been subject to a substitution cipher or Caesar shift (or any variant thereof) and is probably just transposed. At that point I looked at the length of the payload, 143 bytes (which has the obvious factors 11 and 13); arranging the characters in this configuration yielded the following:


Substituting the Q for a space or . gives the sentence:


The URL embedded in Turing’s quote is the start of the second phase of the challenge. With “turing” an obvious candidate for the first of the five answers.
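For the curious, undoing this kind of transposition is only a few lines of C. This is a toy sketch with made-up data, not the actual 143-byte payload (which would be an 11 × 13 grid); the function name is mine.

```c
#include <string.h>

/* Undo a plain columnar transposition: the ciphertext was produced by
   writing the plaintext row-wise into a rows x cols grid and reading it
   out column-wise, so we invert that mapping. */
void untranspose(const char *in, char *out, int rows, int cols) {
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            out[r * cols + c] = in[c * rows + r];
    out[rows * cols] = '\0';
}
```

For example, "ABCDEF" written into a 2 × 3 grid and read column-wise gives "ADBECF"; running that back through `untranspose` recovers "ABCDEF".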

Step 2
You’re presented with an RSA key.


The key is base64 encoded (the equals sign as padding gives that away). If you decode it you get what looks mostly like a load of crap, except there is some text that kinda looks like a URL in there:

@ww.whtsisilguoectsrehsri.eocu./klbtehcel y

I was actually able to read it just by looking at it, but while writing this I tried to work out how you would explain to someone how to read it: it’s just pairs of letters switched around, think 16-bit endianness-style.

Of course, “bletchley” is the second answer. I’m starting to see a pattern here. Before I began the next section I tried a few keywords in the answer boxes; “tunny”, “tutte”, “enigma” and even “entscheidungsproblem” all yielded nothing. “Colossus”, on the other hand, was the answer to Step 4. I thought this sort of obvious pattern actually spoilt the puzzle somewhat, but ah well. I carried on.

Step 3
At stage 3 you are met with the following sequence:


I tried all sorts for this one: disassembling it, checking various hashes, looking for patterns which would generate bits for the gaps on the last line. After about 30 minutes of fail I wondered if you were supposed to decrypt it with the key you were given in Step 2, so I used OpenSSL to take a closer look at that key.

C:\OpenSSL\bin>openssl pkey -in comp1.key -text
WARNING: can't open config file: /usr/local/ssl/openssl.cnf
Private-Key: (1022 bit)
publicExponent: 65537 (0x10001)

If I hadn’t been lazy, I’d probably have noticed the URL for step 3 staring right at me in the prime 2 chunk. Anyway, the bits that are of interest are the public modulus and the private exponent. Plug these values into your favourite decrypt tool, feed it the payload from the Step 3 webpage and you get:

20 20 20 20 20 20 20 20 77 77 2E 77 68 74 72 65 67 65 73 69 65 74 2E 72 6F 63 75 2E 2F 6B 6E 65 67 69 61 6D 30 32 33 31 20 20 20 20 20 20 20 20

After applying the same 16bit endian swapping you get the URL:
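The pair swap itself is trivial to express in C; a minimal sketch (the helper name is mine, not from any tool):

```c
#include <string.h>

/* Swap each adjacent pair of bytes, i.e. undo a 16-bit little-endian
   store of a character stream. */
void swap_pairs(char *s) {
    size_t len = strlen(s);
    for (size_t i = 0; i + 1 < len; i += 2) {
        char t = s[i];
        s[i] = s[i + 1];
        s[i + 1] = t;
    }
}
```

Feeding it the decrypted payload above with the padding spaces stripped, "ww.whtregesiet.rocu./knegiam0231", yields www.theregister.co.uk/enigma2013.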

This means answer 3 is “enigma2013”.

Step 4
Okay, so it’s actually a picture of the Colossus computer. I was actually fortunate enough to see the working replica running at Bletchley Park a year or so ago, so I recognised it immediately. Sadly, that doesn’t help me get the URL for Step 5; I know what the folder will be, but not the domain. Time to take a deeper look into the image…

First of all I checked the EXIF data; nothing interesting there. I tried decompressing the image with djpeg; again, this yielded nothing interesting. I then opened the JPEG file in a hex editor and searched for the word colossus (and byte-swapped versions thereof)… Nothing.
After some time skimming through the JPEG file I found what looked like a second header. I cut and pasted that data into a new file in HxD and saved it. A thumbnail was immediately generated in Explorer, and double-clicking the file revealed this guy hiding all along:


So my answer was correct, but now I knew the URL of the 5th and final stage.

Step 5
Step 5 is a bit disappointing; I get the feeling they couldn’t be arsed at this point, it just bloody tells you the answer. :-(

It turns out to be: “Secured”
Ah well. Bit of an anti-climax.

Maybe I’ll post on the blog again in 12 months time.

I have held off posting this solution until the deadline on GCHQ’s website had expired, my hope is that this will help to ensure that those who are “unworthy” (in their view) do not abuse the solutions I give here. Before we jump right in my special thanks go to my awesome colleague Lionel Lemarié who was also working on this at the same time. Bouncing ideas and thoughts off such a knowledgeable chap was super useful and really does go to prove that two heads are better than one! This post is also reproduced on my personal blog.

On the 1st of December 2011 I saw the tag GCHQ trending on Twitter. Apparently they had set a challenge to British nationals to try and crack a code they released on their website. As someone who enjoys puzzles I thought it’d be fun to have a go at trying to crack the code they put up. This blog post details my efforts in solving the code.

Stage 1:

If you visit the site you will be greeted with something that looks like a really bad IT system in a Hollywood blockbuster. Most prominently there is an image with 160 bytes of hexadecimal, split into 10 rows of 16 entries. The first stage of solving this puzzle was to get this data into binary format. There was no real good way of doing this except to painstakingly transcribe the bytes by hand into my favourite hex editor. I double-checked the data and also had a friend check it to make sure there were no errors at this point. My initial feeling was that perhaps it was some “cryptographic in-joke”, where the data hashed to a readable bit of plain text. I ran the data through a bunch of hashing functions, a bunch of SHA variants, CRCs, the works. Nothing! It wasn’t until I noticed the string of bytes 0xef 0xbe 0xad 0xde (0xdeadbeef stored little-endian) that alarm bells started ringing in my head. This was an executable payload! Once this realisation set in it only took a cursory glance at the first byte of the data, 0xeb, to confirm that it was likely to be x86. Here is the binary file I created for those who are interested; please excuse the .txt extension (WordPress flags my .bin files and even .c files as security risks, so a lot of downloads from this post are .txt).

It actually turns out that the payload in the image on the site is only half the story. There is a second part to it, which is inside cyber.png itself. So download the main image from the site and fire up your favourite hex editor. You’re looking for an iTXt chunk, which is the header of some metadata.


The presence of the == at the end was a hint that the data was actually base64 encoded, so you’ll need to decode it; I used this site. Once you have that, your payload is complete.
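As an aside, decoding base64 is simple enough to do by hand in C. This is a minimal sketch (no error handling; ‘=’ padding and whitespace are just skipped), not the web tool I actually used:

```c
#include <stddef.h>

/* Map one base64 character to its 6-bit value, or -1 for padding or
   anything else. */
static int b64val(char c) {
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/* Decode a base64 string into out, returning the number of bytes
   produced. Accumulate 6 bits per character, emit a byte whenever
   8 or more bits are buffered. */
size_t b64decode(const char *in, unsigned char *out) {
    size_t n = 0;
    unsigned int acc = 0;
    int bits = 0;
    for (; *in; ++in) {
        int v = b64val(*in);
        if (v < 0)
            continue;   /* skip '=' and non-alphabet characters */
        acc = (acc << 6) | (unsigned int)v;
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out[n++] = (unsigned char)((acc >> bits) & 0xff);
        }
    }
    return n;
}
```

For instance, `b64decode("aGVsbG8=", buf)` produces the 5 bytes "hello".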

001C13AE jmp main+24h (1C13B4h)
001C13B0 scas dword ptr es:[edi]
001C13B1 ret 0A3BFh
001C13B4 sub esp,100h
001C13BA xor ecx,ecx
001C13BC mov byte ptr [esp+ecx],cl
001C13BF inc cl
001C13C1 jne main+2Ch (1C13BCh)
001C13C3 xor eax,eax
001C13C5 mov edx,0DEADBEEFh
001C13CA add al,byte ptr [esp+ecx]
001C13CD add al,dl
001C13CF ror edx,8
001C13D2 mov bl,byte ptr [esp+ecx]
001C13D5 mov bh,byte ptr [esp+eax]
001C13D8 mov byte ptr [esp+eax],bl
001C13DB mov byte ptr [esp+ecx],bh
001C13DE inc cl
001C13E0 jne main+3Ah (1C13CAh)
001C13E2 jmp main+0B3h (1C1443h)
001C13E7 mov ebx,esp
001C13E9 add ebx,4
001C13EF pop esp
001C13F0 pop eax
001C13F1 cmp eax,41414141h
001C13F6 jne main+0ABh (1C143Bh)
001C13F8 pop eax
001C13F9 cmp eax,42424242h
001C13FE jne main+0ABh (1C143Bh)
001C1400 pop edx
001C1401 mov ecx,edx
001C1403 mov esi,esp
001C1405 mov edi,ebx
001C1407 sub edi,ecx
001C1409 rep movs byte ptr es:[edi],byte ptr [esi]
001C140B mov esi,ebx
001C140D mov ecx,edx
001C140F mov edi,ebx
001C1411 sub edi,ecx
001C1413 xor eax,eax
001C1415 xor ebx,ebx
001C1417 xor edx,edx
001C1419 inc al
001C141B add bl,byte ptr [esi+eax]
001C141E mov dl,byte ptr [esi+eax]
001C1421 mov dh,byte ptr [esi+ebx]
001C1424 mov byte ptr [esi+eax],dh
001C1427 mov byte ptr [esi+ebx],dl
001C142A add dl,dh
001C142C xor dh,dh
001C142E mov bl,byte ptr [esi+edx]
001C1431 mov dl,byte ptr [edi]
001C1433 xor dl,bl
001C1435 mov byte ptr [edi],dl
001C1437 inc edi
001C1438 dec ecx
001C1439 jne main+89h (1C1419h)
001C143B xor ebx,ebx
001C143D mov eax,ebx
001C143F inc al
001C1441 nop
001C1442 nop
001C1443 nop
001C1444 nop
001C1445 call main+57h (1C13E7h)
001C144A inc ecx
001C144B inc ecx
001C144C inc ecx
001C144D inc ecx
001C144E inc edx
001C144F inc edx
001C1450 inc edx
001C1451 inc edx
001C1452 xor al,byte ptr [eax]
001C1454 add byte ptr [eax],al
001C1456 xchg eax,ecx
001C1457 fdiv st,st(1)
001C1459 ins dword ptr es:[edi],dx
001C145A jo main+0ECh (1C147Ch)
001C145C cmp ch,byte ptr [ebx-3BF46599h]
001C1462 xchg eax,ecx
001C1463 sti
001C1464 db c7h
001C1465 paddb xmm1,xmm5
001C1469 int 3
001C146A mov ah,2
001C146C cli
001C146D xlat byte ptr [ebx]
001C146E ja main+94h (1C1424h)
001C1470 push esp
001C1471 cmp byte ptr [ebx-711CF1E1h],ch
001C1477 ror dword ptr ds:[93C399EBh],cl
001C147D db feh
001C147E shr dword ptr [ebx],1
001C1480 sbb edx,dword ptr [ecx]
001C1482 db c6h
001C1483 adc edi,ebp
001C1485 enter 2FCAh,0Dh

As it would be less than smart to simply execute a chunk of unknown x86 binary on my machine (regardless of its source), I disassembled the binaries to make sure they weren’t harmful prior to running them. There are two different ways to run the payload. You can use a little C program: dump the data as an array of unsigned chars, then make a function pointer with the prototype typedef void(*executePayload)();. A lovely typecast from the unsigned char array to a function pointer will do the trick. However, most systems these days come with various security measures to stop malicious code chewing through your system. One such feature is DEP (Data Execution Prevention). In order to execute a payload in this way you need to make sure the memory that contains the payload is mapped with the correct executable flags. Here is the source file that will execute the payload. You can compile it with the following command in your Linux terminal and then use GDB to debug it:

gcc -g -O0 payload.c -o payload

In Visual Studio, you can just use __asm _emit for each of the bytes after you byte swap them. You can download my source file for this here.

The first thing the payload does is jump the next 4 bytes (this will be important later on), then it allocates a buffer on the stack by moving esp, and then does a bunch of ops to build strings in these buffers. It’s not really important or interesting what it’s actually doing; to get hold of what you want, just run until the int 0x80 call, then check out the stack frame in the memory window. It’ll contain a URL to a .js file.

Stage 2:

If we download the file we are presented with some JavaScript source code. It’s just a case of implementing the virtual machine it describes, running it until it hits a halt instruction and checking the memory buffer for the URL.

The virtual machine as described had a handful of 8-bit registers, 8 instructions and a segmented memory model with 16-byte segments.

Instruction Fetch & Decode

The op codes described in the JS file take the format shown in the image below; it is trivial to knock up a bit of C code to fetch and decode each op.

/* fetch & decode: */
const uint8_t byte = g_mem[calculateIndex(g_cpu.cs, g_cpu.ip)];
const uint8_t opCode = (byte & 0xe0) >> 5;
const uint8_t mod = (byte & 0x10) >> 4;
const uint8_t op1 = (byte & 0xf);
const uint8_t op2 = g_mem[calculateIndex(g_cpu.cs, g_cpu.ip)+1]; /* optional */


Segmented addressing means that in order to address something you use a segment index (which in this case selects which 16-byte location of memory you are in), and then an offset from that segment’s base. The offset is not necessarily constrained to [0..15], check out the C code below. Here is the source file for my VM written in C. If that wasn’t enough Lionel has very kindly made available his JS source code for stage 2, which you can get here. Alternatively just click here if you just want to run Lionel’s VM in your browser and see his cool debug output.

int16_t calculateIndex(const uint8_t segment, const uint8_t offset) {
    const uint16_t segment16 = (uint16_t)segment;
    const uint16_t offset16 = (uint16_t)offset;
    const int16_t index = (int16_t)((segment16 << 4) + offset16);
    return index;
}


jmp

A simple unconditional jump: it just sets cs and ip to whatever the jump target is. If the mod bit is not set, then the cs register is left unaffected.
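A self-contained sketch of that behaviour (using a cut-down CPU struct for illustration; my real VM keeps cs and ip in globals):

```c
#include <stdint.h>

/* Cut-down CPU state for illustration only. */
typedef struct { uint8_t cs, ip; } cpu_t;

/* Unconditional jump: always load ip; only load cs when the mod bit
   is set (a "far" jump into another code segment). */
void do_jmp(cpu_t *cpu, int mod, uint8_t new_ip, uint8_t new_cs) {
    cpu->ip = new_ip;
    if (mod)
        cpu->cs = new_cs;
}
```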

movr & movm

These instructions just move data between registers and also to and from memory.

add, xor and cmp

These three instructions have similar characteristics from the mod flag perspective, and basically do exactly what it says on the tin. However, in the case of cmp there are some additional changes required to the flags register. Comparisons are usually implemented as a subtraction, if the result is 0 then the values are equal, and hence the flags register is set to 0. If the result of the subtraction yields a positive value, then the first operand is larger than the second. In the specifications for this virtual machine the flags register is required to be set to 0x1 in this case. And if the sign bit is set, (i.e.: the subtract yields a negative result) then the flags register should be set to 0xff. This is shown in the code snippet below.

const int8_t result = getRegisterValue(op1) - getRegisterValue(op2);
if(result == 0)
    setRegisterValue(REGISTER_FLAG, 0);
else if(result < 0)
    setRegisterValue(REGISTER_FLAG, 0xff);
else
    setRegisterValue(REGISTER_FLAG, 1);


jmpe

This instruction will jump when the flags register is set to 0x0, which indicates that the result of a preceding cmp instruction was 0.

if(g_cpu.fl == 0x0) {
    if(!mod) {
        g_cpu.ip = getRegisterValue(op1);
        printf("jmpe r%d\n", op1);
    } else {
        g_cpu.cs = op2;
        g_cpu.ip = getRegisterValue(op1);
        printf("jmpe #%x:r%d\n", op2, op1);
    }
}


hlt

The halt instruction (as the name suggests) simply brings the virtual machine to a halt.

Wrapping it up…

On examining the source code an astute reader may notice that the firmware is not used at all. I tried disassembling this stuff with the disassembler I wrote before I wrote the VM (also included in my little C file; just change the argument to cpuInit to true). It was a load of rubbish, so it didn’t make sense to run it. Initially, why it was there was a mystery (but it became obvious later)!

Another thing to note is that their specifications aren’t the entire picture. If you take a look at the CS register you will notice that it is never non-zero. It seems odd to have a completely redundant register, but it turns out it *is* required. The CS register I assume stands for code segment (matching the DS register, which I guess is data segment). This should have been my first clue.

Looks like another URL, this time to a .exe file.

Stage 3:

So after downloading the executable file my first stop was to disassemble it. My tool of choice here was IDA Freeware. IDA is an awesome tool which I would highly recommend; you can grab it here. I also used a hex editor, my tool of choice being Hex Workshop, which you can grab here. Once you have it disassembled in IDA you need to run the exe and step through it (you will need to install Cygwin for this). You will notice that it attempts to load a file called license.txt, so I created a file with this name and continued to step through. Eventually I hit the following bit of code. It is checking the first 4 bytes of the buffer it reads from license.txt to see if they match some special magic number (GCHQ).

.text:00401160 mov [ebp+var_4C], 0
.text:00401167 cmp [ebp+gchqHeaderBytes], 71686367h
.text:0040117F mov [esp+78h+salt?], eax
.text:00401182 call crypt
.text:00401187 mov edx, eax
.text:00401189 mov eax, DEShash
.text:0040118E mov [esp+78h+var_74], eax
.text:00401192 mov [esp+78h+salt?], edx
.text:00401195 call strcmp
.text:0040119A test eax, eax

So the first 4 bytes of license.txt are easy: you just replicate this value in there. Be careful to byte swap it. The next bit of the code is a bit more tricky. It calculates a hash value for the next few bytes in license.txt and compares the result to some other magic value (again stored in the data segment of the executable). The first 2 bytes of the result buffer from crypt() will be the salt for the hashing function (to try and guard against dictionary-based attacks). To break this password I fired up John the Ripper. This is a free program you can get from here (just make sure you get the jumbo version that supports the Markov mode). To build John the Ripper Jumbo, fire up your Cygwin bash terminal and navigate to the src directory. You will then need to run the following:

make clean (system)

Where (system) is replaced with your target system, for me it was win32-cygwin-x86-any. Once it’s built, make a new file in the run directory called “password.txt” and type the following in it:


Then use the following command line to try and break it:

./john -markov:220 password.txt

Now for me 220 was enough to break the password, but other individuals I’ve spoken to about this said that they had to go up to 240. I’m not exactly sure how the Markov mode works in John the Ripper, but you might need to run a few times at 220 to break the password, or up the level. On my laptop, which has an Intel Core i5 in it, it took about 5 minutes to break the password.

$ ./john -markov:220 password.txt
Loaded 1 password hash (Traditional DES [24/32 4K])
Warning: MaxLen = 12 is too large for the current hash type, reduced to 8
MKV start (lvl=220 len=8 pwd=1601142515)
cyberwin (user)
guesses: 1 time: 0:00:04:39 DONE (Sat Dec 3 17:29:12 2011) c/s: 302851 trying: cybervam - cyberwo
Use the "--show" option to display all of the cracked passwords reliably

At this point it’s probably worth mentioning that actually breaking the password for license.txt is a waste of time. You can crack the .exe without having to generate the password. Later on in the execution flow the .exe will try to set up the stage 1 and stage 2 license keys. All this involves is a copy (or two) from the license.txt buffer that was read in via the standard C file I/O stuff. From examining the code you can work out the exact offset into this buffer at which the copy is performed. Once you have this, you can subtract the size in bytes of the magic number (which was 4 bytes, as you will recall). Congratulations, you now know the password’s length without needing to actually crack the password itself. You can pad license.txt with whatever you like and then just hex edit the conditional jump just after the strcmp on the output buffer from the call to crypt, making it unconditional.

So what are the final pieces of the puzzle? It took me a while to work it out, but mentions of “Stage 1” and “Stage 2” in the printf are actually clues about using values from stage 1 and stage 2 (duh!). You need a single 4 byte value from stage 1 and 2×4 byte values from stage 2. Stage 2 to me was obvious: the unused firmware values from the VM (just remember to byte swap them!). Stage 1 you can get by looking at the disassembly again: the first instruction jumps over the next 4 bytes. :) This data is already byte swapped, as shown by the presence of 0xdeadbeef in the data.
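Putting it together, the license file is just magic + password + stage 1 dword + two stage 2 dwords, concatenated. A sketch of building it; the dword values below are placeholders, not the real stage 1/2 values (those come from your own run of the earlier stages):

```c
#include <stdint.h>
#include <stdio.h>

/* Write license.txt as: "gchq" + password + one stage 1 dword + two
   stage 2 dwords. Writing the ASCII "gchq" matches the little-endian
   cmp against 0x71686367 in the disassembly. The dwords are assumed
   to be already byte-swapped by the caller. */
int write_license(const char *path, const char *password,
                  uint32_t stage1, const uint32_t stage2[2]) {
    FILE *f = fopen(path, "wb");
    if (!f)
        return -1;
    fwrite("gchq", 1, 4, f);
    fputs(password, f);
    fwrite(&stage1, sizeof stage1, 1, f);
    fwrite(stage2, sizeof stage2[0], 2, f);
    fclose(f);
    return 0;
}
```

With the cracked password "cyberwin" and the three dwords filled in, the resulting file is 24 bytes: 4 of magic, 8 of password, and 12 of stage data.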

So append these to your license.txt file and you’re done! Here is mine. If you step the code again you will eventually see keygen.exe try to make a BSD socket and connect to the host you passed in as the argument (as shown below).

.text:004012C4 call connect
.text:004012C9 test eax, eax
.text:004012CB jns short loc_4012EF
.text:004012CD mov eax, [ebp+arg_0]
.text:004012D0 mov [esp+148h+var_144], eax
.text:004012D4 mov [esp+148h+var_148], offset aErrorConnectSF ; "error: connect(\"%s\") failed\n"
.text:004012DB call printf
.text:004012E0 mov [ebp+var_144], 0FFFFFFFFh
.text:004012EA jmp loc_401423

I actually stepped over this stuff until I got to the call to sprintf, where it builds the path to request from the HTTP server.

.text:00401315 mov [esp+148h+pathPart1], eax
.text:00401319 mov [esp+148h+var_144], offset aGetSXXXKey_txt ; "GET /%s/%x/%x/%x/key.txt HTTP/1.0\r\n\r\n"
.text:00401321 lea eax, [ebp+removeFilePath]
.text:00401327 mov [esp+148h+var_148], eax
.text:0040132A call sprintf
.text:0040132F lea eax, [ebp+removeFilePath]

The value at removeFilePath in the disasm is the file you want. Fire up your favourite browser and bash that path in. It turns out to be

The file contains a single line of text:


And that’s that… :) I hope you enjoyed this little detour into how I spent a few hours of my life. It was good fun, maybe the games industry should do recruitment tests more like this! :) Until next time.

A draft of this was first penned back in 2009 and appeared in GPU Pro: Advanced Rendering Techniques. It was co-written with my good friend Stephen McAuley. Some of the stuff in here I don’t really agree with anymore, but I thought I’d post it up pretty much unedited (with only minor corrections).

The light pre-pass renderer [Engel08, Engel09, Engel09a] is becoming an ever more popular choice of rendering architecture for modern real-time applications that have extensive dynamic lighting requirements. In this article we introduce and describe techniques that can be used to accelerate the real-time lighting of an arbitrary 3D scene on the Cell Broadband Engine™ without adding any additional frames of latency to the target application. The techniques described in this article were developed for the PLAYSTATION®3[1] version of Blur, which was released in May 2010 [2].

Figure 1. A screenshot from Blur.

1. Introduction

As GPUs have become more powerful, people have sought to use them for purposes other than graphics. This has opened an area of research called GPGPU (General Purpose GPU), which major graphics card manufacturers are embracing. For example, all NVIDIA® GeForce® GPUs now support PhysX® technology, which enables physics calculations to be performed on the GPU. However, much less has been made of the opposite phenomenon – with the increase in speed and number of CPUs in a system, it is becoming feasible on some architectures to move certain graphics calculations from the GPU back onto the CPU. Forthcoming hardware such as Intel’s Larrabee even combines both components [Seiler08], which will certainly open the door for CPU-based approaches to previously GPU-only problems becoming more popular. Today, one such architecture is the PLAYSTATION®3 where the powerful Cell Broadband Engine™ was designed from the outset to support the GPU in its processing activities [Shippy09]. This article explains how the Cell Broadband Engine™ can be used to calculate lighting within the context of a light pre-pass rendering engine.

2. Light Pre-Pass Rendering

A commonly encountered problem in computer graphics has been how to construct a renderer that can handle many dynamic lights in a scene. Traditional forward rendering does not perform well with multiple lights. For example, if a pixel shader is written for up to four point lights, then only four point lights can be drawn (and no spotlights). We could either increase the number of pixel shader combinations to handle as many cases as possible, or we could render the geometry multiple times, once more for each additional light. Neither of these solutions is desirable as they increase the number of state changes and draw calls to uncontrollable levels.
A popular solution to this problem is to use a deferred renderer [Deering88]. Instead of writing out fully lit pixels from the pixel shader, we instead write out information about the surface into a G-Buffer, which would include depth, normal and material information. An example G-buffer format is shown below:

Figure 2. An example G-Buffer format from a deferred rendering engine (after [Valient07]).

We then additively blend the lights into the scene, using the information provided in the G-Buffer. Thus many lights can be rendered, without additional geometry cost or shader permutations. In addition, by rendering closed volumes for each light, we can ensure that only calculations for pixels directly affected by a light are carried out. However, with deferred rendering, all materials must use the same lighting equation, and can only vary by the properties stored in the G-Buffer. There are also huge memory bandwidth costs to rendering to (and reading from) so many buffers, which increases with MSAA.

In order to solve these problems, Engel suggested the light pre-pass renderer, first online in [Engel08] and then later published in [Engel09], although a similar idea was independently developed by others and has been recently used in games such as Uncharted: Drake’s Fortune [Balestra08]. Instead of rendering out the entire G-Buffer, the light pre-pass renderer stores depth and normals in one or two render targets. The lighting phase is then performed, with the properties of all lights accumulated into a lighting buffer. The scene is then rendered for a second time, sampling the lighting buffer to determine the lighting on that pixel.

Using a Blinn-Phong lighting model means that the red, green and blue channels of the lighting buffer store the diffuse calculation, while we can fit a specular term in the alpha channel, the details of which are described in [Engel09]. This means that unlike a deferred renderer, different materials can handle the lighting values differently. This increased flexibility, combined with reduced memory bandwidth costs, has seen the light pre-pass renderer quickly increase in popularity and is now in use in many recent games on a variety of hardware platforms.

Yet the deferred renderer and light pre-pass renderer share the fact that lighting is performed in image space, and as such requires little to no rasterization. This makes the lighting pass an ideal candidate to move from the GPU back onto the CPU. Swoboda first demonstrated this method with a deferred renderer on the PLAYSTATION®3 and Cell Broadband Engine™ in [Swoboda09], and now we expand upon his work and apply similar techniques to the light pre-pass renderer.

3. The PLAYSTATION®3 and the CBE™

Sony Computer Entertainment released the PLAYSTATION®3 in 2006. It contains the Cell Broadband Engine™ which was developed jointly by Sony Computer Entertainment, Toshiba Inc. and IBM Corp. [Shippy09, Möller08, IBM08]. The Cell is the Central Processing Unit (CPU) of the PLAYSTATION®3. In addition to the Cell chip, the PLAYSTATION®3 also has a GPU, the Reality Synthesizer, or RSX®. The RSX® was developed by NVIDIA Corporation, and is essentially a modified GeForce®7800 [Möller08]. A high-level view of the architecture can be found in figure 3.

Figure 3. The PLAYSTATION®3 architecture. (Illustration after [Möller08, Perthuis06]).

Inside the Cell chip one can find two distinctly different types of processor. There is the PowerPC Processing Element (PPE) and eight[3] pure SIMD processors [Möller08] known as Synergistic Processing Elements (SPEs) all of which are connected by a high speed, “token-ring style” bus known as the Element Interconnect Bus (EIB), see figure 4. The techniques introduced and described in this paper are chiefly concerned with the usage of the SPEs and as such further discussion of the PPE has been omitted.

Figure 4. The Cell Broadband Engine (after [IBM08]).

One interesting quirk of the SPE is that it does not directly have access to the main address space and instead has its own internal memory known as the local store. The local store on current implementations of the CBE is 256KB in size. The memory is unified, untranslatable and unprotected [Bader06, IBM08] and must contain the SPE’s program code, call stack and any data that it may happen to be processing. To load or store data from or to the main address space a programmer must explicitly use the Memory Flow Controller (MFC). Each SPE has its own MFC which is capable of queuing up to sixteen Direct Memory Accesses (DMAs) [IBM08].

As the SPU ISA operates primarily on SIMD vector operands, both fixed-point and floating-point [IBM09], it is very well equipped to process large quantities of vectorised data. It has a very large register file (4KB) which is helpful for hiding the latencies of pipelined and unrolled loops, and while the local store is relatively small in capacity, it is usually sufficient to allow a programmer to hide the large latency of main memory accesses[4] through effective multi-buffering. Code that is to execute efficiently on the SPE should be written to play to the SPE’s strengths.
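The multi-buffering idea is simple enough to sketch in plain C: while the SPE works on one local buffer, the MFC is already streaming the next chunk in. In the sketch below memcpy stands in for an asynchronous mfc_get (and there is no real latency to hide), so this only illustrates the access pattern, not real SPU code:

```c
#include <stdint.h>
#include <string.h>

#define CHUNK 16

/* Process `count` chunks from `src` using two local buffers: kick off
   the "DMA" for chunk i+1, then do the work on chunk i. On a real SPE
   the memcpy would be an mfc_get, and you would wait on its tag group
   before touching the buffer. The "work" here just sums the bytes. */
uint32_t process_stream(const uint8_t *src, int count) {
    uint8_t local[2][CHUNK];
    uint32_t sum = 0;
    if (count <= 0)
        return 0;
    memcpy(local[0], src, CHUNK);                      /* prime buffer 0 */
    for (int i = 0; i < count; ++i) {
        if (i + 1 < count)                             /* start next transfer */
            memcpy(local[(i + 1) & 1], src + (i + 1) * CHUNK, CHUNK);
        for (int j = 0; j < CHUNK; ++j)                /* work on current */
            sum += local[i & 1][j];
    }
    return sum;
}
```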

A more in-depth discussion of the PLAYSTATION®3 and the Cell Broadband Engine™ is outside the scope of this paper; interested readers can refer to IBM’s website for more in-depth details about the Cell chip [IBM09], and Möller, Haines and Hoffman describe some of the PLAYSTATION®3 architecture in [Möller08].

4. GPU/SPE Synchronization

As the number of processors in our target platforms becomes ever greater, the need to automate the scheduling of work being carried out by these processing elements also becomes greater. This has continued to the point where game development teams now build their games and technology around the concept of the job scheduler [Capcom06]. Our engine is no exception to this trend, and the solution we propose for GPU/SPE inter-processor communication relies on close integration with such technology. It is for this reason we believe ours to be a robust and viable solution to the problem of RSX®/SPE communication, one that many others can easily incorporate into their existing scheduling frameworks. In order to perform fragment shading on the SPE without introducing unwanted latency into the rendering pipeline there needs to be a certain amount of inter-processor communication between the GPU and SPEs. This section discusses the approach we used in achieving this synchronization.
Each SPE has several Memory Mapped I/O (MMIO) registers it can use for inter-processor communication with other SPEs or the PPU. However, these are unfortunately not trivially writable from the RSX®. An alternative approach is required in order to have the RSX® signal the SPEs that the rendering of the normal/depth buffer is complete and that they can now begin their work, without having the desired SPE programs spinning on all six of the available SPEs wasting valuable processing time.

When adding a job to our job scheduler it is optionally given an address in RSX®-mapped memory upon which the job is dependent. When the scheduler is pulling the next job from the job queue it polls this address to check that it has been written to a known value by the RSX®. If this is not the case, the job is skipped and the next one is fetched from the queue and processed; if the location in memory has been written, however, then our job is free to run. This dependency is visualized in figure 5.

Figure 5. The RSX® and SPE communication, the RSX® writes a 128 byte value when the normal/depth buffer is available for processing. The SPEs poll the same location to know when to begin their work.
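The dependency check itself amounts to a polled read of a label in RSX®-mapped memory. A sketch of the idea (an ordinary volatile pointer and a 32-bit label stand in for the mapped memory; the real label is larger, as in figure 5):

```c
#include <stdint.h>

/* A job is runnable once the GPU has written the expected value to its
   dependency label; a job with no label is always runnable. The
   scheduler calls this before dequeuing a job, and skips to the next
   job in the queue when it returns 0. */
int job_dependency_met(const volatile uint32_t *label, uint32_t expected) {
    return label == 0 || *label == expected;
}
```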

The problem of ensuring that the GPU waits for the light buffer to be available from the SPEs is solved by a technique that is well-known to PLAYSTATION®3 developers, but unfortunately we cannot disclose it here. Interested PLAYSTATION®3 developers can consult Sony’s official development support website.

It is desirable for the RSX® to continue doing useful work in parallel with the SPEs performing the lighting calculations. In Blur we are fortunate in that we have a number of additional views that are rendered which do not rely on the lighting buffer, for example planar reflections and a rear-view mirror (in other applications these might also include the rendering of a shadow buffer). This is shown in figure 6. If no useful work can be performed on the RSX®, it may be possible (depending on your memory budget and the requirements of your application) to perform the lighting calculations one frame latent as in [Swoboda09]; this approach also has the added benefit of reducing the likelihood of stalling the RSX®.

Figure 6. The RSX® continues to do useful work as the SPEs calculate the dynamic lighting for our scene.

5. The Pre-Pass

To begin the lighting pre-pass we must first construct the normal and depth buffers. We store view space normals in an 8:8:8:8 format, and since we are able to read from the depth buffer on our chosen hardware, we have no need for a separate target for the depth. We chose to perform our lighting in view space as we find it faster compared with world space.

Next we render the relevant solid and alpha-test geometry into the buffers. We only render the geometry affected by the lights – we cull all draw calls against the bounding spheres of the lights and also bring in the far clip plane. (Note that a simple sphere test is not sufficient, since we also need to render near objects that occlude the light spheres.) These methods of culling reduce the cost of drawing the pre-pass geometry by approximately half.

When rendering the scene, we enable stencil writes with a reference value of 0xff. The stencil buffer is cleared to 0x00 beforehand, which gives us the relevant region of the screen masked out in the stencil buffer. Whether rendering lights on the RSX® or the SPE, this enables us to use the early stencil to ensure that we only light relevant pixels.

We do not currently render the pre-pass or the light buffers with MSAA. This has a number of disadvantages, including some lighting artefacts around the edges of objects, and the loss of the ability to use the depth buffer as a depth pre-pass for the main scene (which we render with MSAA). However, we found the artefacts minimal, and the extra cost of rendering the light pre-pass with MSAA outweighed the saving from having a depth pre-pass. This is still an area we wish to return to in the future.

Once we have the normal and depth buffers, we are able to perform the lighting. Currently, we use the Lambert diffuse model for our lights, and render the lights into an 8:8:8:8 buffer. This is for simplicity and performance reasons, but with the cost of no specular and limited lighting range. This also means that the alpha channel of the lighting buffer is unused. Some ideas for its use are explained in the section on further work.

We maintain a GPU implementation of our lighting model for reference and for other platforms. First, the stencil test is set to “equals” with a reference value of 0xff, so we only render to pixels marked in the stencil buffer. Then, the lights are rendered, with point lights and spot lights using two very different methods.

Point lights are rendered as in [Balestra08] – the frame buffer is split into tiles, and we gather lists of lights (up to a maximum of eight) that affect each tile. We then render each tile using a shader corresponding to its number of lights. This method saves on fill rate, and enables us to perform the reconstruction of view space position and normal from our normal and depth buffers only once per pixel, no matter the number of point lights.

Spot lights use the more standard method of rendering the bounding volumes of the lights – in this case, cones. We render front faces, unless we are inside the volume, in which case we render back faces.

We further optimize the lighting code by making use of the depth bounds test, when it is available on the target hardware. The depth bounds test compares the depth buffer value at the current fragment’s coordinates to a given minimum and maximum depth value. If the stored depth value is outside the given range, then the fragment is discarded. When drawing either a tile of point lights, or a spot light volume, we set the depth bounds range to be the minimum and maximum depth extents of the light (or lights, in case of the point lights).

This gives us a fast, optimized GPU implementation of our lighting model. However, it is still a significant percentage of our frame rendering time, and its image space nature makes it a perfect candidate to offload from the GPU onto the SPEs.

6. The Lighting SPE Program

This section describes in detail the SPE program that performs our lighting calculations. In order to contextualize each sub-section, we have included figure 7, which shows the high-level structure of the SPE program as a whole.

Figure 7. The high-level flow of our SPE lighting program.

6.1 The Atomic Tile Arbiter

Due to the relatively large memory footprint of a 720p frame buffer; the limitations imposed by the size of an SPE’s local store; and the internal format of a surface created by PLAYSTATION®3’s RSX®, our lighting SPE program works on tiles of the frame buffer, 64×64 pixels in size, as shown in figure 8. Thus, there is a need to keep track of which tile is free to bring in to local store for processing. The simplest and most concurrent way we found of achieving this was by way of an atomically incremented tile index which resides in main memory. It should be noted that the SPE and RSX® are only able to efficiently co-operate on the processing of resources that are placed into correctly mapped main memory.

Figure 8. Each SPE assigns itself a task by atomically incrementing a tile index held in main memory.

For efficiency (and to avoid contention for the cache line) the tile index is aligned to a 128 byte boundary and is padded to 128 bytes in size to exactly match the cache line width of the SPE’s atomic unit (ATO) [IBM08, IBM07]. The effective address (EA) of the tile is given by the product of the tile index and the total size of a single tile, summed with the address of the beginning of the frame buffer, as in equation 6.1. For our chosen tile size, the resulting effective address always falls on a 16 byte boundary since our tile size is itself a 16 byte multiple.

EA(tile) = EA(frame buffer) + (tile index × tile size)        (6.1)


6.2. Multi-Buffering

Multi-buffering is a must for almost all SPE programs that process any significant volume of data [Bader07] and our lighting program is no exception. In our implementation we use triple buffering to minimize the latency of accesses to the normal/depth buffer in main memory. Each buffer in the triple buffer has enough space to support a single unit of work (i.e.: a single tile of the frame buffer). The first of the buffers in our triple buffer is used as the target for inbound DMA; it utilizes its own tag group, and DMAs into this buffer are initiated as soon as the tile decoding process[5] for the previous tile has completed. The second and third buffers are used as the output targets for the decoding process. In addition to this, they act as scratch memory for the lighting calculations and are the source of the outgoing DMA from the running SPE program back to the light buffer in main memory[6]. The two buffers are used alternately in order to allow outgoing DMA to complete asynchronously with the tile decoding and lighting of the next tile. A high level view of our multi-buffering strategy is depicted in figure 9.

Figure 9. The triple-buffering strategy in our lighting SPE program.

The multi-buffering system described here works so effectively that our SPE program spends an average of only 5μs per SPU per frame waiting for data to be transferred to and from main memory, with the bulk of this delay being introduced early in the program’s execution, as one should expect.

6.3. Light Gathering

When the incoming DMAs for the normal buffer and depth-stencil buffer tiles have completed, we can begin processing. Before we light, we first gather the lights that affect a given tile. We do this by constructing a view frustum for each tile and culling the bounding spheres of the lights against the frustum. In addition, we also cull against the stencil buffer. This is a vital optimization as it minimizes the work done in the expensive lighting phase.

In order to perform the culling and the lighting, we actually work on sub-tiles of the frame buffer tile, 32×16 pixels in size. Culling over a smaller region is more effective, and we found sub-tiles of the above size to be optimal in our case.

Next we iterate over the depth-stencil tile to collect the minimum and maximum depth values and the minimum and maximum stencil values for each sub-tile. The depth values will form the near and far clip planes of our sub-tile’s view frustum and the stencil values allow us to do a stencil test akin to the early-stencil hardware on a GPU.

In section 5 above we describe how we write 0xff into the stencil buffer when rendering the pre-pass buffers; hence we do not wish to light any pixels with a stencil value of 0x00. However, we do not stencil on a per-pixel basis, but instead skip the lighting on a sub-tile if its maximum stencil value is equal to 0x00 (hence it is 0x00 across the entire sub-tile).

Once a sub-tile has passed the stencil test, we construct its view frustum. Knowing the screen-space positions of the corners of the sub-tile, we can use values from the projection matrix to construct their positions in view space, at a fixed distance of one meter from the camera (see equation 6.5). Multiplication by the minimum and maximum view-space depth values then gives us the eight vertices of the frustum, from which we can construct the frustum’s six planes (see equation 6.4 for how to construct view-space z from a depth buffer value).

We then construct separate lists of point lights and spot lights which intersect this view frustum. The bounding sphere of each light is tested against each frustum plane, with successful point lights added to a point light list, and successful spot lights added to a spot light list.


If no lights are gathered, then the sub-tile follows the same path as one which fails the stencil test – the lighting is skipped and we output zero to the frame buffer. However, if at least one light does affect the sub-tile, then the lists of point lights and spot lights are passed into our lighting function to perform the most important part of the work.

6.4. Point Lights

The SPE program used for lighting is written in C and makes heavy use of SPE-specific language extensions. We made the choice early on to favour the si-style of intrinsic over the higher-level spu-style. This is due to a closer mapping to the underlying opcodes generated by the compiler [Acton08].

Lighting code is an excellent candidate for both software pipelining and loop unrolling; our lighting is performed on batches of 16 pixels at a time. We found that 16 pixels gave us a very small number of wasted cycles per iteration of our lighting loops while still allowing us to fit everything we needed into the 4KB (128 × 16 byte) register file[7]. The large number of independent instructions that results from lighting a relatively large set of pixels means that the latency caused by dependent instructions closely following one another is almost completely eliminated and overall performance is massively increased (limited only by the number of issued instructions). Non-dependent instructions are interleaved with one another, with the results being used some time later when they are available; this well-known optimization technique also has the side effect of improving the balance of instructions over the odd and even execution pipelines, because there is a greater number of suitable, non-dependent instructions that can occupy a single fetch group in the Synergistic Execute Unit (SXU). We found that we were able to achieve approximately three times the pixel throughput by batching pixels into groups of 16 over our earlier attempts, which loosely mimicked RSX® quads by lighting smaller groups of four pixels.


Before any lighting can begin it is important to reconstruct the correct input to our lighting equation expressed in equation 6.3. Equation 6.4 demonstrates how to reconstruct the z component of the view-space position of a pixel given its depth buffer value and the near and far planes of the view frustum. Calculating the x and y components of the view-space position is equally trivial when given the x and y coordinates of the pixel in screen-space and the view projection matrix. This is shown by equation 6.5.



In HLSL/Cg shaders it is quite common to use the saturate intrinsic function to clamp values to a [0..1] range. To achieve this on the SPE there is a clever trick that we feel is certainly worthy of mention here. Day et al. introduced the fast saturate/clamp technique which uses the SPU’s floating-point conversion instructions in order to achieve clamping of a floating point value to a variety of different ranges. This depends on the combination of scale bias operands issued with the instructions [Day08]. In a pipelined loop, such as our lighting loop, instruction count is oftentimes the overriding determinant of the code’s execution speed and as such we are able to employ this trick to great effect. Listing 1 demonstrates this technique.

/* HLSL saturate, clamp to [0..1]. */
qword x = si_cfltu(q, 0x20);
qword y = si_cuflt(x, 0x20);

Listing 1. Saturate a qword in two even pipeline instructions.

One interesting difference between the GPU implementation of our lighting and the SPE implementation is the switch from the default Array of Structures (AoS) data layout on the GPU, to the transposed, SIMD-friendly Structure of Arrays (SoA)[8] data layout on the SPE. The difference in format of the data is illustrated below in figure 10. By storing, loading and shuffling data into a SoA layout we are able to perform our lighting calculations much more optimally on the SPEs. A pleasant side-effect of the switch is that the resulting C code becomes much more scalar-like in appearance, making it easier for other programmers to follow [Bader06].

Figure 10. Shuffling an AoS into a SoA.

The SPE is only equipped to deal with 16 byte aligned writes and reads to and from its local store [Bader06, IBM08, IBM08a]. The target addresses of all load and store operations first undergo a logical ‘and’ with the LSLR register (set to 0x3ffff for current implementations of the CBE) before the SPU Store and Load unit (SLS) fetches or writes the address [IBM08, IBM08a]. Writing scalar values is achieved by way of a load-shuffle-store pattern. It is therefore desirable to perform loads and stores on 16 byte boundaries only. As our program required a lot of 4 byte loads from our normal/depth buffer and a lot of similarly sized writes to our light buffer, we ended up batching these loads and stores into 16 byte chunks in order to eliminate the overhead of the additional code that would be required if we were to perform these operations on a pixel-by-pixel basis. This proved to deliver a significant performance increase, especially in the case of storing, where nearly all pipeline bubbles were eliminated. We present a portion of our pixel writing code in listing 2 for a single four pixel block:

qword c0        = si_cfltu(dif0, 0x20);
qword c1        = si_cfltu(dif1, 0x20);
qword c2        = si_cfltu(dif2, 0x20);
qword c3        = si_cfltu(dif3, 0x20);
      dif       = si_ila(0x8000);
qword scale     = si_ilh(0xff00);
      dif0      = si_mpyhhau(c0, scale, dif);
      dif1      = si_mpyhhau(c1, scale, dif);
      dif2      = si_mpyhhau(c2, scale, dif);
      dif3      = si_mpyhhau(c3, scale, dif);
const vector unsigned char _shuf_uint = {
    0xc0, 0x00, 0x04, 0x08,
    0xc0, 0x10, 0x14, 0x18,
    0xc0, 0x00, 0x04, 0x08,
    0xc0, 0x10, 0x14, 0x18 };

qword s_uint    = (const qword)_shuf_uint;
qword base_addr = si_from_ptr(result);
qword p0_01     = si_shufb(dif0, dif1, s_uint);
qword p0_23     = si_shufb(dif2, dif3, s_uint);
qword p0        = si_selb(p0_01, p0_23, m_00ff);
                  si_stqd(p0, base_addr, 0x00);

Listing 2. Pixels are converted from their floating-point representations into 32 bit values, batched into 16 byte chunks, and stored.

6.5 Spot Lights

In the interest of completeness we present the mathematics used for our ‘regular’ spot lights in equation 6.6.


Note that this is the same as the equation for the point lights, with an additional term at the end. d is the direction of the light (as opposed to l, which is the direction from the light to the point), θinner is the angle of the inner cone and θouter is the angle of the outer cone. However, we store their cosines on the light rather than calculating them every time. All lighting values for both point and spot lights are summed for each pixel, yielding the equation in 6.7.


7. The Main Pass

When the SPEs have finished calculating the light buffer, they then signal to the RSX® that the main pass can be rendered. As mentioned above, the synchronization at this stage is very important – we do not want to be reading from an incomplete light buffer. To composite the light buffer with the main scene, we read it as a texture in the pixel shaders. However, as not every pixel in our scene receives light from our pre-pass (see above, we only render geometry into the pre-pass that receives light), we use two shader techniques in the scene: one which samples from the light buffer, and one which does not. For the former technique, each pixel looks up its lighting in the light buffer using its screen-space coordinate, and then composites the light value as a diffuse light, as follows:


It might be tempting to simply additively or multiplicatively blend the lighting buffer over the scene, but as can be seen above, that method will result in incorrect lighting. This is due to the presence of additional static lighting in our scene.

It is also possible to read from the normal buffer in the main pass. This means that reading from normal maps and converting from tangent space to view (or world) space only happens once. However, this also means that the low precision of the normals stored in the pre-pass becomes more noticeable (only 8 bits per component). For this reason and others we did not use this option.

At the end of rendering we have a scene with many dynamic lights rendered using the Cell Broadband Engine™. Not only does this open up exciting new possibilities for our rendering engine, but it does so with minimal GPU cost, with a large amount of work performed on the CPU.

8. Conclusion

We have presented a method which splits the work of light pre-pass rendering between the RSX® and the SPEs on the Cell Broadband Engine™. We use the strengths of both components to our advantage: the rasterization performance of the RSX® to render the pre-pass geometry, and the vector maths performance of the SPEs to calculate the lighting buffer. By parallelizing the lighting calculation on the SPEs with some other rendering on the RSX® (for instance, a dynamic cube map), the lighting becomes free and thus this can be a major GPU optimization. In fact, even without the added bonus of parallelization, we found that in some cases, five SPEs running carefully crafted programs could outperform the RSX® when performing lighting calculations.

Figure 11. Another screenshot from Blur.

As new architectures emerge we believe there will be increasing opportunities to take processing load off the GPU and place it back onto the CPU. It remains to be seen how things will pan out when the two are combined in Intel’s Larrabee [Seiler08], but on the Cell Broadband Engine™ we offer that the GPU can be massively accelerated in cases such as deferred lighting or light pre-pass rendering by writing a custom CPU implementation that executes on the SPEs.

9. Further Work

There are many improvements that could be made to the techniques we describe. Firstly, we currently omit specular from our lighting model. We propose either writing out specular to a separate lighting buffer or placing a monochrome specular term in the alpha channel of the lighting buffer as in [Engel09]. Material properties could be controlled by adding a specular power in the alpha channel of the normal buffer. Another problem is that our lighting is currently LDR, as it is stored in an 8:8:8:8 integer format. One option is moving to a 16:16:16:16 float, but Wilson suggests instead using the CIE Luv colour space [Wilson09]. Using this method, we can still use an 8:8:8:8 buffer, but with the luminance part of the colour using 16 bits. This technique has problems on the GPU as additive blending of lights on top of each other no longer works, but in the SPE program we have no such problem and thus this becomes more feasible; if one wished to implement a more GPU-friendly technique then diffuse light intensity could also be stored in the alpha channel as in [Valient07].

Both of the previous suggestions involve making use of the currently unused alpha channel in the lighting buffer. While there are certainly many possible uses for this byte, one idea we are currently investigating is storing the amount of fog for each pixel. We believe this could be especially beneficial for more expensive fogging equations, for instance, if height fog is being used. This is an example of adding value to the SPE program [Swoboda09a].

Given the amount of work already being done, including processing the entire normal and depth buffers, there is extra rendering work that could be done in the SPE program. One simple example is performing a down-sample of the depth buffer to a quarter resolution – this could be output asynchronously through the MFC, adding little overhead to the SPE program, and would be useful for many reduced resolution effects such as motion blur, soft particles, occlusion culling and even screen-space ambient occlusion. It would also be possible to reduce the amount of processing on the normal/depth buffers by combining the view-space normals and depth into a single 32 bit buffer: by encoding the x and y components of the normal into the first two channels (or by converting them to spherical coordinates), and packing linear view-space depth into the remaining 16 bits, we halve the amount of data needed by our SPE program. In fact, this approach is the method we chose for the final version of Blur.

Finally, it is our intention to remove the decoding of the buffers altogether and perform lighting on encoded normal/depth buffers; this has several advantages. The decoding process can be replaced with a simple pass over all the pixels in the frame buffer tile, which should yield a minor increase in overall lighting performance together with saving the memory required for the lighting buffer. However, this extra performance and improved memory footprint come at the cost of added mathematical complexity, as deriving the view-space position of pixels becomes non-trivial. This is due to the need to take into account the effects of the encoded buffer’s format on the final view-space position of the pixel.

10. Acknowledgements

First and foremost we would like to extend our unending thanks to Matt Swoboda of SCEE R&D for laying the groundwork for our continuing efforts and for his suggestions for our implementation. We would also like to thank Colin Hughes of SCEE R&D for his help and suggestions with optimizations.

We also extend our thanks to all the supremely talented individuals that form the Core Technologies Team at Bizarre Creations Ltd., especially to Ian Wilson, Paul Malin, Lloyd Wright, Ed Clay, Jose Sanchez, Charlie Birtwistle, Jan van Valburg, Kier Storey, Fengyun Lu, Jason Denton, Dave Hampson, Chris Cookson and Richard Thomas.

11. Bibliography

[Acton08] M. Acton, E. Christensen, “Insomniac SPU Best Practices,” accessed on 2nd July 2009.

[Bader07] D. A. Bader, “Cell Programming Tips & Techniques,” accessed on 2nd July 2009.

[Balestra08] C. Balestra and P. Engstad, “The Technology of Uncharted: Drake’s Fortune,” Game Developers Conference 2008, available online at

[Capcom06] Capcom Inc., “The MT Framework,” accessed on 3rd July 2009.

[Day08] M. Day, J. Garrett, “Faster SPU Clamp,” accessed on 2nd July 2009.

[Deering88] M. Deering, “The Triangle Processor and Normal Vector Shader: A VLSI System for High Performance Graphics,”

[Engel08] W. Engel, “Light Pre-Pass Renderer,” accessed on 4th July 2009.

[Engel09] W. Engel, “Designing a Renderer for Multiple Lights: The Light Pre-Pass Renderer,” in ShaderX7, edited by Wolfgang Engel, Charles River Media, 2009: pp. 655-666.

[Engel09a] W. Engel, “The Light Pre-Pass Renderer Mach III,” to appear in proceedings of ACM SIGGRAPH09, 2009.

[IBM08] IBM Corp., “Cell Broadband Engine Programming Handbook Version 1.11”.

[IBM08a] IBM Corp., “Synergistic Processing Unit Instruction Set Architecture”.

[IBM09] IBM Corp., “The Cell Project at IBM,” accessed on 4th July 2009.

[Möller08] T. Akenine-Möller, E. Haines, N. Hoffman, “Real-Time Rendering 3rd Edition,” 978-1-56881-424-7, A.K. Peters Ltd. 2008.

[Perthuis06] C. Perthuis, “Introduction to the Graphics Pipeline of the PS3,” Eurographics 2006.

[Seiler08] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, P. Hanrahan, “Larrabee: A Many-Core x86 Architecture for Visual Computing,” ACM Transactions on Graphics, Volume 27, Issue 3, ACM SIGGRAPH 2008, August 2008.

[Shippy09] D. Shippy, M. Phipps, “The Race for a New Games Machine: Creating the Chips Inside the New Xbox 360 & the PlayStation 3,” 978-0-8065-3101-4, Citadel Press, 2009.

[Swoboda09] M. Swoboda, “Deferred Lighting and Post Processing on PLAYSTATION®3,” Game Developers Conference 2009, available online at

[Swoboda09a] M. Swoboda, Private Correspondence, 2009.

[Valient07] M. Valient, “Deferred Rendering in Killzone 2,” accessed on 4th July 2009.

[Wilson09] P. Wilson, “Light Pre-Pass Renderer: Using the CIE Luv Color Space,” in ShaderX7, edited by Wolfgang Engel, Charles River Media, 2009: pp. 667-677.

[1] “PlayStation”, “PLAYSTATION” and the “PS” family logo are registered trademarks and “Cell Broadband Engine” is a trademark of Sony Computer Entertainment Inc. The “Blu-ray Disc” and “Blu-ray Disc” logos are trademarks.

[2] Screenshots of Blur appear courtesy of Activision Blizzard Inc. and Bizarre Creations Ltd.

[3] One of the eight SPEs is locked out to increase chip yield and another is reserved by Sony’s Cell OS. Applications running on the PLAYSTATION®3 actually have six SPEs to take advantage of.

[4] As one might expect, linear access patterns fare significantly better than random access.

[5] For information on how to decode, Sony PLAYSTATION®3 developers can consult the RSX® User’s Manual.

[6] We do not currently encode the lighting results; please see further work for more information.

[7] Any more pixels in a single loop in our implementation would risk causing registers to be spilled.

[8] SoA organization is also known as “parallel-array”.


Hi everyone, I know it’s been a while since my last post; I’m looking to change that as soon as I get some time. I want to get back into the AltDevBlogADay routine, but for now this is a quick post to say that if you’re a PlayStation Vita developer, I’m going to be speaking at closed events in both San Francisco (on the 28th June) and then back here in London (on the 5th July). Feel free to say hi if you’re attending :)



This post is going to be much shorter as I’ve sadly been very pressed for time these last couple of weeks, and as a result been unable to finish the other (longer) post I had originally planned, to the high standard that you all deserve. This post talks about measuring stuff in frames per second and why you shouldn’t do it.

Many people who haven’t got any experience in profiling and optimising a game will approach performance measurements in a frankly scary way. I’m talking, of course, about measuring performance in terms of frame rate. Of course it’s unfair to say the games industry never mentions ‘frame rate’, we do! We have target frame rates for our games, which are usually 60Hz or 30Hz (increasingly the latter), but for anything beyond this I think it’s fair to say that most sane people take the leap into a different, more logical and intuitive space. It’s called frame time and it’s measured in, you guessed it, units of time! Typical units include ms, µs or even ns. The two have a relationship which allows you to convert between them, given by t = 1/r.

So why am I concerned about this if it’s trivial to convert between them? Well, the reason is that measuring anything in terms of the frame rate suffers from some big flaws which measuring in frame time does not and can lead to problems with your view of the situation:

1. Traditional frame rate measurements are often dependent on synchronisation with a hardware-driven event such as the vertical reset on your display device. If you’ve locked your application to this vertical sync to avoid frame tearing, this can mean that no matter how fast your application is going it will always appear to run at, say, 60fps. If your application is straddling the border of 60fps and locked to vsync, this can give rise to occasions where small fluctuations in the amount of time something takes can make the game appear to go either twice as quickly or twice as slowly!

2. Frame rate can be very deceptive! To show what I mean let’s use an example that crops up time and time again among beginners, it goes as follows:

Coder A: “Check this shit out, my game is running at 2000fps! I’m such a ninja!”
Coder B: “Nice one chief, but you’re accidentally frustum culling the player there…”
Coder A: “Ah shit, okay… Let me fix that… Sorted!”
Coder A hits F5 and looks on truly horrified at the frame rate counter.
Coder B: “Heh! Not so ninja now are you? Running at 300fps!”
Coder A slopes off looking dejected.

What happened here? Apparently performance has dropped drastically by a whopping 1700 frames per second to a mere three hundred! However, if this pair of idiots worked in a set of sensible units they’d instantly see that it’s only taking about 2.8ms to render that player mesh (1/300s − 1/2000s ≈ 2.8ms). Small drops in frame rate in an application with an already low frame rate can point to big performance problems, whereas large drops in an application with a very high frame rate are usually not much to worry about.

3. And perhaps the most compelling reason, which should give any remaining doubters a firm nudge into reality: all the wonderful profiling tools out there work in time, not rate.

At Bizarre Creations (R.I.P.) we had a large cardboard cut-out of the Stig (among others, such as Hulk Hogan and Austin Powers) for anyone that broke the build. In honour of Bizarre, which sadly closed last Friday, I offer these final Stig-inspired words to bring this post to a close: next time you even contemplate telling your colleague that your code runs at a particular frame rate, remember that what you’re saying is more backwards than Jeremy Clarkson telling you that the Stig got the Ariel Atom 500 round the Top Gear track at 37.54993342 metres per second!

Obligatory Introductory Parable

I really like sushi; it’s tasty and convenient. I like the immediacy of being able to go into a sushi restaurant, complete with conveyor belt, and being able to take a seat and grab something fresh and delicious from the belt without blowing my entire lunch hour. Having said that, what I really wouldn’t like is to be a member of staff in a sushi restaurant, especially if it was my job to show the diners to their seats, and here’s why…

A group of three diners walk in and ask to be seated. Delighted at having some custom, you kindly oblige, showing the diners to their seats. No sooner have they sat down and helped themselves to some tasty-looking ‘Tekka Maki’ than the door opens again and four more diners walk in! Wow, you guys are on a roll (see what I did there?). Your restaurant now looks like this…

Since it’s lunch time, the restaurant quickly fills up to capacity. Finally, after eating all they could and settling the bill, the first party to arrive (the party of three) leaves, and a group of two walk in; you offer the newly vacant seats to your new customers. This occurrence happens a few more times until your restaurant looks like this…

Finally a group of four diners walk in and ask to be seated. Ever the pragmatist, you’ve been carefully keeping track of how many empty seats you have left, and it’s your lucky day: you’ve got four spare seats! There is one snag though; these diners are the social type and would like to be seated together. You look around frantically, but while you have four empty seats you can’t seat the diners together! It would be very rude to ask your existing customers to move mid-meal, which sadly leaves you no option but to turn your new customers away, probably never to return again. This makes everyone very sad. If you’re sat there wondering how this little tale relates in the slightest bit to games development, then read on.

In programming we have to manage memory. Memory is a precious resource which must be carefully managed, much like the seats in our metaphorical sushi restaurant. Every time we allocate memory dynamically we’re reserving memory from something called the “heap”. In C and C++ this is typically done through the use of the malloc function and the new operator respectively. To continue the somewhat fishy analogy (last one I promise!), this is like our intrepid groups of diners asking to be seated in our sushi restaurant. The real shame though is that what happened in our hypothetical scenario also happens in the context of memory, and the results are much worse than a couple of empty tummies. It is called fragmentation and it is a nightmare!

What’s wrong with malloc and new?

Sadly, the rest of the discussion won’t have such a fishy flavour to it as this post is going to talk about malloc and new and why they have a very limited place in the context of embedded systems (such as games consoles). While fragmentation is just one facet of problems caused by dynamic memory allocation, it is perhaps the most serious, but before we can come up with some alternatives we should take a look at all of the problems:

1. malloc and new try to be all things to all programmers…
They will just as soon allocate you a few bytes as they will a few megabytes. They have no concept of what the data they’re allocating for you is, or what its lifetime is likely to be. Put another way, they don’t have the bigger picture that we have as programmers.

2. Run-time performance is relatively bad…
Allocations from the standard library functions or operators typically require descending into the kernel to service the allocation requests (this can involve all sorts of nasty side effects on your application’s performance, including flushing of translation lookaside buffers, copying blocks of memory, etc.). For this reason alone, using dynamic memory allocation can be very expensive in terms of performance. The cost of the free or delete operations in some allocation schemes can also be high, as in many cases a lot of extra work is done to try to improve the state of the heap for subsequent allocations. “Extra work” is a fairly vague term, but it can mean the merging of memory blocks and in some cases walking an entire list of allocations your application has made! Certainly not something you’d want to be wasting valuable processor cycles on if you can avoid it!

3. They cause fragmentation of your heap…
If you’ve never been on a project that has suffered from the problems associated with fragmentation then count yourself very lucky, but the rest of us know that heap fragmentation can be a complete and utter nightmare to address.

4. Tracking dynamic allocations can be tricky…
Dynamic allocation comes with the inherent risk of memory leaks. I’m sure we all know what memory leaks are, but if not, have a read of this. Most studios try to build some tracking infrastructure on top of their dynamic allocations to try and track what memory is in play and where.

5. Poor locality of reference…
There is essentially no way of knowing where the memory you get back from malloc or new will be in relation to any of the other memory in your application. This can lead to us suffering more (and increasingly expensive) cache misses than we need to, as we end up dancing through memory like Billy Elliot on Ecstasy.

So what is the alternative? The idea behind this post is to provide you with the details (and some code!) for a few different alternatives that you can use in place of malloc and new to help combat the problems we’ve just discussed.

The Humble Linear Allocator

As the name of this section suggests this allocator is certainly the simplest of all those I’m going to present, although truth be told, they’re all simple (and that’s part of the beauty). The linear allocator essentially assumes that there is no fine-grained de-allocation of allocated memory resources, and proceeds to make allocations from a fixed pool of memory one after the other in a linear fashion. Check out the diagram below.

A good example of where a linear allocator is exceedingly useful is found in nearly all SPU programming. The persistence of data in the local store is not important beyond the scope of the currently executing job, and in many cases the amount of data one brings into the local store (at least data that needs some degree of variance in its size) tends to be fairly restricted. Don’t be fooled though, that’s far from its only application. Here’s an example of how one might go about implementing a simple linear allocator in C.

/** Header structure for linear buffer. */
typedef struct _LinearBuffer {
    uint8_t *mem; /*!< Pointer to buffer memory. */
    uint32_t totalSize; /*!< Total size in bytes. */
    uint32_t offset; /*!< Offset. */
} LinearBuffer;

/* non-aligned allocation from linear buffer. */
void* linearBufferAlloc(LinearBuffer* buf, uint32_t size) {

    if(!buf || !size)
        return NULL;

    uint32_t newOffset = buf->offset + size;
    if(newOffset <= buf->totalSize) {

        void* ptr = buf->mem + buf->offset;
        buf->offset = newOffset;
        return ptr;
    }
    return NULL; /* out of memory */
}
It is of course possible to support aligned allocations by applying the required bitwise operations to the offset during the course of the allocation. This can be incredibly useful, but be aware that depending on the size of the data you’re allocating from your buffer (and in some cases the order in which those allocations are made) you may find that you get some wasted space in the buffer between allocations. This wasted space is typically okay for alignments of a reasonable size, but can get prohibitively wasteful if you are allocating memory which requires a much larger alignment, for example 1MB. The ordering of your allocations from linear allocators can have a drastic effect on the amount of wasted memory in these types of situations.
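As a sketch of what those bitwise operations might look like, here is one way an aligned variant could read. This assumes power-of-two alignments (the usual case), and note that it aligns offsets relative to the buffer, so the buffer itself should be allocated with at least the largest alignment you intend to request. The function and demo names are mine, purely for illustration.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct _LinearBuffer {
    uint8_t *mem;       /*!< Pointer to buffer memory. */
    uint32_t totalSize; /*!< Total size in bytes. */
    uint32_t offset;    /*!< Offset. */
} LinearBuffer;

/* Aligned allocation: round the current offset up to the next multiple
   of 'alignment' (assumed to be a power of two) before allocating. */
void* linearBufferAllocAligned(LinearBuffer* buf, uint32_t size, uint32_t alignment) {
    if(!buf || !size || !alignment || (alignment & (alignment - 1)))
        return NULL;

    const uint32_t alignedOffset = (buf->offset + (alignment - 1)) & ~(alignment - 1);
    if(alignedOffset + size <= buf->totalSize) {
        void* ptr = buf->mem + alignedOffset;
        buf->offset = alignedOffset + size;
        return ptr;
    }
    return NULL; /* out of memory */
}

/* Demo: after a 3 byte allocation, a 16 byte aligned request should
   land at offset 16, wasting the 13 bytes in between. */
int alignedAllocDemo(void) {
    static uint8_t storage[256];
    LinearBuffer lb = { storage, 256, 0 };
    linearBufferAllocAligned(&lb, 3, 1);             /* offset 0 -> 3  */
    void* p = linearBufferAllocAligned(&lb, 16, 16); /* offset 3 -> 16 */
    return p == storage + 16;
}
```

The wasted 13 bytes in the demo are exactly the kind of alignment slack described above; with 1MB alignments, that slack can become enormous.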

To reset the allocator (perhaps at the end of a level), all we need to do is set the value of offset to zero. Just as with any allocations, clients of the allocator need to ensure they’re not hanging on to any pointers that have effectively been de-allocated by doing this, otherwise you risk corrupting the allocator. Any C++ objects you’ve allocated from the buffer will also need their destructors called manually.

The Stack

Let’s get something out of the way before we start into this allocator. When I talk about a “stack allocator” in this particular case, I’m not talking about the call stack, however, that stack does have an important part to play in avoiding run-time heap allocations as we shall see later on. So what am I talking about then? Well, the linear allocator I just described above is an excellent solution to many allocation problems, but what if you want slightly more control over how you free up your resources? The stack allocator will give you this.

Towards the end of my description of the linear allocator, I mentioned that to reset the allocator you can simply set the offset to zero in order to free up all the resources in the allocator. This principle of setting the offset to a particular value is what guides the implementation of the stack allocator. If you are not familiar with the concept of the stack data structure then now is probably a good time to get acquainted; you can do so here.

Back? Okay. Each allocation from our stack allocator can optionally return a handle which represents the state of the stack allocator just before that allocation was made. This means that if we restore the allocator to this state (by changing its offset) we are effectively ‘freeing’ up the memory to be reused again. This is shown in the diagram below.

This can be a very useful thing if you want some memory allocated temporarily for data which has a limited scope, for example the lifetime of a specific function or sub-system. This strategy can also be useful for things like level resources, which have a well-defined order in which objects need to be freed up (that is, the reverse of the order in which they were allocated). Here is some example C code to illustrate what I’ve just explained:

typedef uint32_t StackHandle;
void* stackBufferAlloc(StackBuffer* buf, uint32_t size, StackHandle* handle) {

    if(!buf || !size)
        return NULL;

    const uint32_t currOffset = buf->offset;
    if(currOffset + size <= buf->totalSize) {

        uint8_t* ptr = buf->mem + currOffset;
        buf->offset += size;
        if(handle)
            *handle = currOffset; /* set the handle to the old offset */
        return (void*)ptr;
    }
    return NULL; /* out of memory */
}

void stackBufferSet(StackBuffer* buf, StackHandle handle) {

    buf->offset = handle;
}
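By way of a usage sketch, the snippet below shows a temporary allocation whose lifetime is bounded by a handle. It is self-contained, so it repeats the routines; note that the StackBuffer header here simply mirrors the LinearBuffer one, which is an assumption on my part since it isn’t defined above, and `stackDemo` is purely illustrative.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t StackHandle;

/* Assumed header layout, mirroring the LinearBuffer above. */
typedef struct _StackBuffer {
    uint8_t *mem;
    uint32_t totalSize;
    uint32_t offset;
} StackBuffer;

void* stackBufferAlloc(StackBuffer* buf, uint32_t size, StackHandle* handle) {
    if(!buf || !size)
        return NULL;
    const uint32_t currOffset = buf->offset;
    if(currOffset + size <= buf->totalSize) {
        if(handle)
            *handle = currOffset; /* handle records the old offset */
        buf->offset = currOffset + size;
        return (void*)(buf->mem + currOffset);
    }
    return NULL; /* out of memory */
}

void stackBufferSet(StackBuffer* buf, StackHandle handle) {
    if(buf)
        buf->offset = handle;
}

/* Demo: allocate, then roll the allocator back via the handle. */
int stackDemo(void) {
    static uint8_t storage[64];
    StackBuffer buf = { storage, 64, 0 };

    StackHandle handle;
    void* tmp = stackBufferAlloc(&buf, 32, &handle);
    if(!tmp || buf.offset != 32)
        return 0;

    stackBufferSet(&buf, handle); /* 'frees' tmp and everything after it */
    return buf.offset == 0 && stackBufferAlloc(&buf, 32, NULL) == tmp;
}
```

Restoring the handle frees everything allocated after it in one go, which is precisely the stack discipline described above.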

The Double Stack

If you’re comfortable with the stack concept above, we can now move on to the double-ended stack. This is similar to the stack allocator we just described except that there are two stacks, one which grows from the bottom of the buffer upward and one which grows from the top of the buffer downward. This is shown in the diagram below.

Where would this be useful? A good example would be any situation where you have data of a similar type but with distinctly different lifetimes, or perhaps data that is related and should be allocated from the same static memory buffer (i.e.: part of the same subsystem) but has different size properties. It should be noted that where the offsets of the two stacks meet is not mandated; they don’t have to meet exactly halfway through the buffer. In some cases the bottom stack can grow very large and the top stack stay small, and vice versa. This added flexibility can sometimes be required to make the best use of your memory buffers.

I don’t think I really need to insult your intelligence by including a code sample for the double stack allocator, due to its inherent resemblance to the single stack allocator discussed previously.

The Pool

We’re going to shift gears a little now from the family of allocators described above, which are based on linearly advancing pointers or offsets, and move to something a little different. The pool allocator I’m about to describe is designed to work with data types that are of the same kind or size. It splits the memory buffer it is managing into equally sized chunks, and then allows the client to allocate and free these chunks at will (see the diagram below). To do this, it must keep track of which chunks are free, and I’ve seen several ways of implementing this. I personally shy away from implementations such as those using a stack of indices (or pointers) to available chunks, due to the extra storage required, which can often be prohibitive, but I’ve seen them around. The implementation I shall describe here uses no additional storage to manage which chunks in the pool are free. In fact the header structure for this allocator contains only two member variables, making it the smallest of all the allocators described in this post.

So how does it work internally? To manage which chunks are free we’re going to use a data structure known as a linked list. Again, if you’re not acquainted with this data structure then try reading this. Coming from a PlayStation3 and Xbox360 background, where memory access is expensive, I generally find node-based structures (such as the linked list) leave a rather sour taste, but I think this is perhaps one of the applications I may approve of. Essentially the header structure for the allocator will contain a pointer to a linked list. The linked list itself is spread throughout the pool, occupying the same space as the free chunks in the memory buffer. When we initialise the allocator, we move through the memory buffer’s chunks and write a pointer into the first four (or eight) bytes of each chunk, with the address (or index) of the next free chunk in the buffer. The header then points to the first element in that linked list. A limitation of storing pointers in the pool’s free chunks in this way is that chunks must be at least the same size as a pointer on your target hardware. See the diagram below.

When allocations are made from the pool we simply need to make the linked list pointer in the header structure point at the second element in the linked list and then return the pointer we originally had to the first element. It’s very simple: we always return the first element in the linked list when allocating. Similarly, to free a chunk and return it to the pool, we simply insert it at the front of the linked list. Inserting chunks we want to free at the front of the list as opposed to the back has a couple of benefits: firstly we don’t need to traverse the linked list (or store an extraneous tail pointer in the header structure), and secondly (and more importantly) we stand a better chance of the node we just freed up being in the cache for subsequent allocations from the pool. After a few allocations and de-allocations your pool might look a little like the diagram below.

Some C code for initialising the allocator and making allocations and de-allocations from it is provided below.

/* allocate a chunk from the pool. */
void* poolAlloc(Pool* buf) {

    if(!buf)
        return NULL;

    if(!buf->head)
        return NULL; /* out of memory */

    uint8_t* currPtr = buf->head;
    buf->head = (*((uint8_t**)(buf->head)));
    return currPtr;
}

/* return a chunk to the pool. */
void poolFree(Pool* buf, void* ptr) {

    if(!buf || !ptr)
        return;

    *((uint8_t**)ptr) = buf->head;
    buf->head = (uint8_t*)ptr;
}

/* initialise the pool header structure,
and set all chunks in the pool as empty. */
void poolInit(Pool* buf, uint8_t* mem, uint32_t size, uint32_t chunkSize) {

    if(!buf || !mem || !size || !chunkSize)
        return;

    const uint32_t chunkCount = (size / chunkSize) - 1;
    for(uint32_t chunkIndex=0; chunkIndex<chunkCount; ++chunkIndex) {

        uint8_t* currChunk = mem + (chunkIndex * chunkSize);
        *((uint8_t**)currChunk) = currChunk + chunkSize;
    }

    *((uint8_t**)&mem[chunkCount * chunkSize]) = NULL; /* terminating NULL */
    buf->mem = buf->head = mem;
}
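To see the three routines working together, here’s a self-contained usage sketch. It repeats the functions so it compiles on its own, and adds the two-member Pool header described above; the chunk size, buffer size and `poolDemo` are illustrative assumptions of mine.

```c
#include <stdint.h>
#include <stddef.h>

/* Pool header: just the buffer base and the free-list head. */
typedef struct _Pool {
    uint8_t *mem;  /*!< Pointer to buffer memory. */
    uint8_t *head; /*!< Head of the free chunk list. */
} Pool;

void poolInit(Pool* buf, uint8_t* mem, uint32_t size, uint32_t chunkSize) {
    if(!buf || !mem || !size || !chunkSize)
        return;
    const uint32_t chunkCount = (size / chunkSize) - 1;
    for(uint32_t chunkIndex=0; chunkIndex<chunkCount; ++chunkIndex) {
        uint8_t* currChunk = mem + (chunkIndex * chunkSize);
        *((uint8_t**)currChunk) = currChunk + chunkSize;
    }
    *((uint8_t**)&mem[chunkCount * chunkSize]) = NULL; /* terminating NULL */
    buf->mem = buf->head = mem;
}

void* poolAlloc(Pool* buf) {
    if(!buf || !buf->head)
        return NULL; /* bad pool, or out of memory */
    uint8_t* currPtr = buf->head;
    buf->head = *((uint8_t**)buf->head);
    return currPtr;
}

void poolFree(Pool* buf, void* ptr) {
    if(!buf || !ptr)
        return;
    *((uint8_t**)ptr) = buf->head;
    buf->head = (uint8_t*)ptr;
}

/* Demo: exhaust a four-chunk pool, free a chunk, allocate again. */
int poolDemo(void) {
    static _Alignas(void*) uint8_t storage[4 * 16]; /* chunks hold a pointer */
    Pool pool;
    poolInit(&pool, storage, sizeof(storage), 16);

    void* chunks[4];
    for(int i=0; i<4; ++i)
        if(!(chunks[i] = poolAlloc(&pool)))
            return 0;

    if(poolAlloc(&pool) != NULL) /* the pool should now be empty */
        return 0;

    poolFree(&pool, chunks[2]);
    return poolAlloc(&pool) == chunks[2]; /* freed chunk comes straight back */
}
```

Note how the freed chunk is handed out again immediately: that is the front-of-list insertion (and its cache benefit) at work.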

A Note on Stack-based Allocation (alloca is your friend)

Earlier on, you may recall that I said there’d be a mention of stack based allocations in the context of the call stack. I’m sure you’ve seen code which conceptually looks something like this:

void myFunction(void) {

    void* myTemporaryMemoryBuffer = malloc(myMemorySize);
    /* do processing limited to this function. */

    free(myTemporaryMemoryBuffer);
}
There is a function which comes with most C compilers which should mean (depending on the size of your allocation) that you won’t have to resort to heap allocations for temporary buffers of this ilk. That function is alloca. How alloca works internally is architecture-dependent, but essentially it performs allocations by adjusting the stack frame for your function to allow you to write data to an area; this can be as simple as moving the stack pointer (just like the linear allocator we mentioned right at the start). The memory returned to you by alloca is then freed up when the function returns.
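A minimal sketch of the pattern follows. The function and sizes are made up for illustration, and note that alloca isn’t part of the C standard; I’m assuming a GCC/Clang-style toolchain where it is declared in alloca.h.

```c
#include <alloca.h> /* some toolchains declare alloca in <malloc.h> or <stdlib.h> */
#include <string.h>
#include <stdint.h>

/* Sums 'count' values after staging them in a stack-allocated scratch
   buffer. The scratch memory vanishes automatically when we return:
   no heap allocation, and no explicit free. */
uint32_t sumWithScratch(const uint32_t* values, uint32_t count) {
    uint32_t* scratch = (uint32_t*)alloca(count * sizeof(uint32_t));
    memcpy(scratch, values, count * sizeof(uint32_t));

    uint32_t total = 0;
    for(uint32_t i=0; i<count; ++i)
        total += scratch[i];
    return total;
}

int allocaDemo(void) {
    const uint32_t vals[4] = { 1, 2, 3, 4 };
    return sumWithScratch(vals, 4) == 10;
}
```

Keep the count small and bounded here; a large or attacker-controlled count is exactly the stack-overflow hazard discussed below.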

There are a few caveats with using alloca that you should be aware of, however. Be sure to check the size of your allocation to make sure you’re not requesting an unreasonable amount from the stack; this can cause nasty crashes later on if your allocation gets so large that the stack overflows. For this reason it is also best to know all the places where your function will be called in the context of the program’s overall flow if you’re contemplating allocating a large chunk with alloca. Use of alloca can affect portability in some limited circumstances (apparently), but I’ve yet to come across a compiler that doesn’t support it.

For more details you can consult your favourite search engine.

A Final Thought…

Often the memory one allocates during the course of a processing task is temporary and persists only for the lifetime of a single frame. Taking advantage of this type of knowledge and moving allocations of this type to separate memory buffers is essential to combating the adverse effects of fragmentation in a limited memory system. Any of the above allocation schemes will work, but I would probably suggest going with one of the linear ones, as they are much easier to reset than the pool implementation I’ve described here. Noel Llopis talks about this topic in more detail in this excellent blog post.

The right allocator for you depends on many factors and on what the problem you’re trying to solve demands. The advice I would offer is to think carefully about the patterns of allocations and de-allocations you wish to perform with your system, think about the sizes and lifetimes of those allocations, and try to manage them with allocators that make sense in that context. It can sometimes help to draw memory layouts on paper (yeah, with a pen and everything) or to graph roughly how you expect your memory usage to look over time; believe it or not, this can really help you to understand how a system produces and consumes data. Doing these things can help you decide how to separate your allocations to make memory management as easy, quick and fragmentation-free as possible. Remember, what I’ve talked about here is just a small selection of some of the simplest strategies that I tend to favour when writing code for console games. It is my hope that if you’ve made it this far and you weren’t already doing this stuff, your brain is now alive with ideas about code you’ve written in the past which could have taken advantage of these techniques to mitigate the substantial drawbacks of dynamic memory allocation, or about other strategies you can exploit to solve your memory management problems without dynamic allocation in limited memory systems.

In closing, I believe that programmers should be mindful of the impact of making dynamic allocations (especially in console games) and think twice about using the malloc function or the new operator. It is easy to have the attitude that you’re not doing many allocations so it doesn’t really matter, but this type of thinking spread across a whole team quickly snowballs and results in death by a thousand paper cuts. If not nipped in the bud, fragmentation and the performance penalties arising from the use of dynamic memory allocation can have catastrophic consequences later on in your development lifecycle which are not easy to solve. Projects where memory allocation and management is not thought through and managed properly often suffer from random crashes after prolonged play sessions due to out-of-memory conditions (which, by the way, are near impossible to reproduce) and cost hundreds of programmer hours spent trying to free up memory and reorganise allocations.

Remember: You can never start thinking about memory early enough in your project and you’ll always wish you had started earlier!

More Information

Here are some links to related topics, or to topics that I didn’t have time to cover:

Thanks to Sarah, Jaymin, Richard and Dat for proof reading this drivel.

Well, I’m mostly working with this thing… More details to come in the future, perhaps 😉


Welcome to this, my first technical blog post for a good while and also my debut contribution to the excellent blogging effort!

On my travels around the games industry I have noticed that many people know of the existence of the Radix Sort and most know that it’s typically quick (some even know it’s non-comparison based, and linear time). What a great number of individuals I’ve met don’t seem to know, however, is the nuts and bolts of how it actually works! Given this, I thought I’d have a crack at explaining Radix Sort for us mere mortals. With any luck, if you’re scratching your head at the notion of a sort that doesn’t do any comparisons, or wondering how one is able to break free from the shackles of that O(n log n) thing that your Comp Sci. professor told you about, then this post will help you through it!

The high-level concept of Radix Sort can be imagined by thinking about what you could do if you had an array of, say, 128 unique 16bit integers that you wanted to sort. What is perhaps the most obvious way to do it, if you didn’t care about memory utilisation or the locality of your results? A simple way would be to use each value in the array as an index into a results array. Something akin to the following:

for(uint32_t currentValue=0; currentValue<maxCount; ++currentValue) {
    uint32_t value = unsortedArray[currentValue];
    sortedArray[value] = value;
}
Obviously this is pretty wasteful in this form, but this extremely basic concept gives rise to the foundation of how Radix Sort works, put another way, we simply aim to just put the values into the correct places straight away without any need to compare with any other values. Obviously in the above example there are numerous points of failure and problems that we need to deal with, but we’ll get to those shortly.

Okay, so we’ve got the concept of roughly what a Radix Sort aims to do; now how do we go about solving the problems with our naive first-pass implementation? Like all good programmers, we’ll begin by defining the problems we’re trying to solve. Perhaps the most fatal blow to the code above is that the larger the values in the list of unsorted values, the more memory you will need to accommodate their placement directly into the correct spot in the results array. This is because of the way we’re placing the values: we just use the value itself to determine where to store it in the corresponding results array. Another problem is collisions. The word “collision” here simply means, “when two items map to the same location in the results array,” for example when we have the same value twice (or more) in the input array that we’re attempting to sort. For the acquainted, the concept is analogous to that of a collision in the context of a hash map. Another problem that we actually wouldn’t have with the above (due to it working on 16bit unsigned integers) is what to do with negative values or values of a different type, such as floating point numbers. It is not a stretch of the imagination to contrive use-cases in which our sort should be robust enough to deal with this type of data.

Analysing Data Using Elementary Statistical Tools

One of the reasons I like Radix Sort is that the solutions to these seemingly death-dealing problems come in the form of some mathematical tools you were likely taught at primary school (for those of you that only speak American, read “elementary school”). By performing some rudimentary analysis on the data, we’re able to sort it robustly and efficiently. The analysis that I speak of is something called a histogram, a simple mathematical tool used to depict the distribution of a dataset. In its traditional form a histogram is just a bar chart where the height of the column represents the number of times a given value is present in a set. Check out the example below:

To calculate a histogram for an arbitrary set of values we can simply iterate over all the values in our set. For each value in our array we maintain a running total of the number of times it has been encountered… That’s it! (My apologies if you were expecting something more complex). We have a histogram for our array, and we’re well on our way to conquering Radix Sort. Pause for a moment here, and actually take a look at the histogram for the dataset you’re sorting. It can be interesting and very illuminating to see just how your data is spread out and you may gain some interesting insight.

An astute reader may have spotted that I actually glossed over an important detail here. How exactly do you keep track of a running total for each unique value? Ideally we’d like this to be nice and quick, so we don’t want to use some nefarious mapping data structure to keep track of the histogram. Besides, I did promise that I’d keep things simple, so how about a nice, simple, flat array? The problem is if we’re sorting, for example, 32bit integers and want to use the values themselves as array indices denoting the counter for a specific value, we’re left with the rather inconvenient problem of requiring 2^32 elements just to store the histogram! Not to fear though, as this is where the concept of the radix enters the fray.

If we have a smaller radix, of say 8 bits, and multiple histograms (one for each byte of the 32bit integers we’re sorting, a total of four), then we can use a relatively small amount of memory for the histograms which is proportional to the assumable range of the radix. Calculation of multiple histograms can be done in parallel (or in the same loop), no matter how many histograms you seek to calculate for the dataset; you just shift and mask off the number of bits you’re interested in for each histogram. Here’s a quick example of what I mean: for an 8bit radix, you’d simply do (x & 0xff), ((x>>8)&0xff), ((x>>16)&0xff) and (x>>24) to access each byte of the 32bit value individually. This type of bit shifting should be immediately familiar to any graphics coders out there, as it is often used to access individual channels in a 32bit colour value. The code to calculate four histograms (one for each byte) from our 32bit integers ends up looking a little something like this:

/* histogram0..histogram3 are four uint32_t[256] arrays, zero-initialised. */
for(uint32_t currentItem=0; currentItem<maxCount; ++currentItem) {
    const uint32_t value = unsortedArray[currentItem];
    const uint32_t value0 = value & 0xff;
    const uint32_t value1 = (value>>0x8) & 0xff;
    const uint32_t value2 = (value>>0x10) & 0xff;
    const uint32_t value3 = (value>>0x18);
    ++histogram0[value0];
    ++histogram1[value1];
    ++histogram2[value2];
    ++histogram3[value3];
}
At this point I’d like to stress that there is absolutely no reason why you must have an 8bit radix. It is common, yes, but 11bit, 16bit, or whatever you want will also work. When you’re actually implementing this algorithm you should probably try out a few different radix sizes to see which gives you the best results on your target hardware. It’s essentially a trade-off between cache performance when accessing the supporting data structures and making fewer passes over the input array. Increasingly, in my experience, the more cache-optimal solution (i.e.: a smaller radix size and hence a smaller histogram/offset table) tends to perform best, as memory latency keeps increasing relative to CPU performance. It’s also worth noting that histograms for different subsets of a full dataset can be summed to produce the histogram for the entire set; this trick can be useful when doing this type of thing in parallel, but we’ll get to that in due course.

Offset Calculation

If you cast your mind back to the beginning of this post, I stated that the guiding principle of Radix Sort was to place values at the correct places in the output array immediately, without performing any comparisons. To do this we need to know the offset into our output array at which we should start writing each unique value from our input array. The histogram was actually an intermediate step in calculating this list of offsets. To illustrate this I will use a worked example; consider the following list of unsorted numbers:

1, 2, 4, 3, 1, 1, 3, 1, 7, 6, 5

If we wanted to place the value ‘2’ directly into the output array at its correct place we would actually need to place it at index 4 (i.e.: the 5th slot). So the table we’re computing will simply tell us that for any value of ‘2’ we should begin placing at index 4. We then increment the offset for each value as we place it, so that any subsequently occurring values which are the same (assuming there are any) go just after the one we’ve placed. The offset table we want to calculate for the above example would look something like this:

0 – N/A (There are no 0’s in the set, so it doesn’t matter!)
1 – 0
2 – 4
3 – 5
4 – 7
5 – 8
6 – 9
7 – 10

So how do we calculate this offset table from a histogram? That’s easy; each location in the table is just the running total of the histogram counts up to that point. So in this example, the offset for ‘2’ would be 4, because we have no ‘0’s, but then four ‘1’s. This, unsurprisingly, is a total of 4! The next offset, for ‘3’, is 5 because in our data set we only have one ‘2’, and 0+4+1 is 5. The technique of increasing the offset for a given value as you place it in the output array gives rise to a very subtle but important property of Radix Sort that is vital for implementations which begin at the least significant byte (LSB Radix Sort): the sort is stable. In the context of sorting algorithms, a sort is said to be stable if the ordering of like values in the unsorted list is preserved in the sorted one; we shall see why this is so important for LSB Radix Sort later on. Incidentally, don’t worry about what LSB means just yet, we’ll get to that!
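In code, turning a histogram into the offset table is just that running total. The sketch below hard-codes the histogram for the worked example above; the names are mine, purely for illustration.

```c
#include <stdint.h>

/* Each offset is the running total of all histogram counts before it. */
void buildOffsetTable(const uint32_t* histogram, uint32_t* offsetTable, uint32_t count) {
    uint32_t runningTotal = 0;
    for(uint32_t i=0; i<count; ++i) {
        offsetTable[i] = runningTotal;
        runningTotal += histogram[i];
    }
}

/* Histogram of 1, 2, 4, 3, 1, 1, 3, 1, 7, 6, 5 for the values 0..7. */
uint32_t offsetForValue(uint32_t value) {
    const uint32_t histogram[8] = { 0, 4, 1, 2, 1, 1, 1, 1 };
    uint32_t offsetTable[8];
    buildOffsetTable(histogram, offsetTable, 8);
    return offsetTable[value];
}
```

Running this against the worked example reproduces the table above: ‘2’ maps to offset 4, ‘3’ to 5, and so on.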

A quick note at this point: the application of these types of analysis tricks can also be done offline to help you better design a good solution around the data you’re trying to process. I’m hoping that a certain Mr. Acton will be kind enough to share some of the increasingly important statistical tools that have made their way into his bag of tricks at some point on this blog in the future. How about it Mike? :)

What Have We Got So Far?

The data flow chart below shows what we’ve got so far. We’ve successfully applied some elementary data analysis to our data, and in the process computed the only supporting data structure we need: a table of offsets for each radix which tells us where to begin inserting each datum in our output array. I find a fun and useful way to visualise what we’ve just done is to imagine taking the bars of the histogram for each number in our data set, rotate them 90 degrees, and then lay them end to end. You can imagine these bars come together to form a contiguous block of memory, with each bar being big enough to hold all the occurrences of the particular value it represents. The offset table is just the indices along this array where each bucket begins. Now comes the fun part, actually moving the data.

Things are going to get a little more complicated here, but only very, very slightly, I promise. There are actually two ways to approach data movement in Radix Sort: we can either start with the least significant bits or with the most significant bits. Most people with a serial programming mindset tend to go with LSB, so we’ll start there and then cover MSB later on, as that has some advantages, predominantly for parallel programming.

Serial Offender!

Radix Sort is typically not an in-place sort; that is to say, implementations typically require some auxiliary storage, proportional to the number of elements in the original unsorted list, in order to operate. To move the data around we need to do one pass through our unsorted array for each radix (the number of passes will change depending on the size of each datum being sorted and the size of your radix, of course). Each time we perform a pass we are actually sorting the data with respect to the particular radix we are considering. I’ll begin by discussing this for LSB and then discuss MSB.

The first pass will typically sort by the first byte, the second by the second (but respecting the positioning of those elements with respect to the first) and so on. Hopefully at this point you are able to see both why the stable property of LSB Radix Sort is so important and where it comes from. The data movement pass is just a single pass over the input array for each radix as mentioned. For each pass, you just read the value in the input array, mask off the bits that are relevant to the particular radix you are considering and then look-up the offset from the table of indices we computed earlier. This offset tells you where in the corresponding output array we should move the value to. When you do this, you need to be sure to increment the offset table so that the next value from the input array with the same binary signature will be written to the next element in the array (not on top of the one you just wrote!). That’s all there is to it really! You just do this once for each radix, something like this:

for(uint32_t i=0; i<maxCount; ++i) {
    // Read the next unsorted value and mask off the byte for this radix.
    const uint32_t value = unsortedArray[i];
    const uint32_t offsetTableIndex = value & 0xff;
    // Look up where to write it, then bump the offset so the next value
    // with the same byte lands in the following slot (this is what keeps
    // the sort stable).
    const uint32_t writePosition = offsetTable0[offsetTableIndex]++;
    auxiliaryArray[writePosition] = value;
}
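For completeness, the single pass above can be extended into a full sort. Here's a self-contained sketch (with illustrative names of my own, not the original snippet's variables) that builds each offset table via a histogram and an exclusive prefix sum, then ping-pongs between the input and auxiliary arrays for the four byte-sized passes:

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch: full LSB Radix Sort over 32-bit keys with a byte-sized radix,
   i.e. four counting passes and four stable data-movement passes,
   ping-ponging between the input and an auxiliary array. */
void lsb_radix_sort(uint32_t *a, size_t count) {
    uint32_t *aux = malloc(count * sizeof *aux);
    uint32_t *src = a, *dst = aux;
    for (int shift = 0; shift < 32; shift += 8) {
        /* histogram of the current radix */
        size_t offsets[256] = {0};
        for (size_t i = 0; i < count; ++i)
            offsets[(src[i] >> shift) & 0xff]++;
        /* exclusive prefix sum -> write offset for each byte value */
        size_t sum = 0;
        for (int b = 0; b < 256; ++b) {
            size_t c = offsets[b];
            offsets[b] = sum;
            sum += c;
        }
        /* stable scatter into the destination array */
        for (size_t i = 0; i < count; ++i)
            dst[offsets[(src[i] >> shift) & 0xff]++] = src[i];
        /* swap roles for the next pass */
        uint32_t *tmp = src; src = dst; dst = tmp;
    }
    /* an even number of passes, so the result is back in the caller's array */
    free(aux);
}
```

Note that with four (an even number of) passes the sorted data ends up back in the original array, which is why the function can free the auxiliary buffer and return nothing.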

So that’s our LSB Radix Sort for integers, but what if you want to sort floating point numbers, or indeed negative numbers? If you’re reading this you’re probably aware of the common storage formats of floating point values (the most common being IEEE 754) and of two’s complement for signed integers. How does Radix Sort deal with these things? The answer is: perfectly well, if you apply a simple transformation to the input data. In time-honoured fashion I’ll leave this as an exercise for the reader, as I don’t have time to delve into it now.
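For the curious, one well-known transformation goes like this (a sketch under my own naming, not code from this post): for signed integers you flip the sign bit, and for IEEE 754 floats you flip every bit of negative values and just the sign bit of non-negative ones. Both map the values to unsigned keys whose numeric order matches the original order, so they can be fed straight into the unsigned sort:

```c
#include <stdint.h>
#include <string.h>

/* Map a float's bit pattern to a uint32_t whose unsigned ordering
   matches the float ordering: negative floats (sign bit set) have all
   bits flipped, non-negative floats have only the sign bit flipped. */
static uint32_t float_flip(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* safe type punning */
    uint32_t mask = (uint32_t)(-(int32_t)(bits >> 31)) | 0x80000000u;
    return bits ^ mask;
}

/* Two's complement integers only need the sign bit flipped. */
static uint32_t int_flip(int32_t v) {
    return (uint32_t)v ^ 0x80000000u;
}
```

You apply the transform before sorting, sort the unsigned keys as normal, and apply the inverse transform afterwards (the inverses are easy to derive from the above).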

Making the Best of a Parallel Situation

So we’ve covered how Radix Sort works when starting from the LSB, and it seems to work nicely, so why would we want to complicate things any further? This part will talk about how sorting data beginning with the most significant digits changes the nature of the problem, and how we can use this to our advantage in an increasingly parallel world.

If you think about what you actually have at each stage of the data movement passes, you will see that starting from the MSB gives rise to a particularly interesting property: the first pass over the data actually produces n independent sorting problems, where n is the number of values the radix can assume (256 for a byte-sized radix). If you can’t see why the parallel programmer in you is now rubbing his/her hands together, don’t worry, here’s why!

The first pass over the data in the context of an MSB Radix Sort has the effect of roughly bucketing the data according to its most significant byte (or whatever size radix you’re working with). To put it another way, you know that after each pass of MSB Radix Sort, none of the values in a bucket will ever need to move into another bucket (which is of course not the case with LSB Radix Sort). This important property opens up the potential for us to distribute the sorting of the different buckets over multiple processing elements, crucially with each element working on independent data, making synchronisation much simpler. To perform MSB Radix Sort in serial, you just re-apply Radix Sort to each bucket independently. Another interesting side effect of MSB Radix Sort is that each application makes the data more and more cache coherent for subsequent passes.
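To make the recursion concrete, here is a hedged serial sketch (the names are my own, not from this post). One counting pass buckets the data by the current byte, and then each bucket is sorted completely independently, which is exactly the spot where a parallel implementation could hand each bucket to a separate thread:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: serial MSB Radix Sort. Call with shift = 24 for 32-bit keys.
   Each recursive call only ever touches its own bucket. */
void msb_radix_sort(uint32_t *a, size_t count, int shift) {
    if (count < 2 || shift < 0) return;

    /* histogram of the current (most significant remaining) byte */
    size_t counts[256] = {0};
    for (size_t i = 0; i < count; ++i)
        counts[(a[i] >> shift) & 0xff]++;

    /* exclusive prefix sum -> start offset of each bucket */
    size_t starts[256], next[256], sum = 0;
    for (int b = 0; b < 256; ++b) {
        starts[b] = next[b] = sum;
        sum += counts[b];
    }

    /* scatter into an auxiliary array, then copy back */
    uint32_t *aux = malloc(count * sizeof *aux);
    for (size_t i = 0; i < count; ++i)
        aux[next[(a[i] >> shift) & 0xff]++] = a[i];
    memcpy(a, aux, count * sizeof *a);
    free(aux);

    /* recurse: each bucket is now an independent sub-problem,
       so this loop is the natural place to fork work in parallel */
    for (int b = 0; b < 256; ++b)
        msb_radix_sort(a + starts[b], counts[b], shift - 8);
}
```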

So, there you have it: Radix Sort explained (hopefully well enough) for humans. I hope after reading this you have a strong mental model of how Radix Sort works in practice and moves data around in memory, and, importantly, how it is able to break free of the O(n log n) lower limit on comparison-based sorting algorithms. Just a quick disclaimer for some of my more performance-minded readers: I make no guarantees as to the optimality of my implementation; the purpose here was to explain the concepts behind the algorithm, not to write an optimal implementation.

The 2nd game I worked on for Bizarre Creations, James Bond 007: Blood Stone, is in stores now in the US and will be available everywhere else on November the 5th! :) I hope everyone who manages to grab a copy has a great time playing it. Many congrats to the rest of the team for shipping a great little game :). Looking forward very much to the reviews :)

I got featured on a new site dedicated to GameDev Blogs. I would definitely recommend having a look over there, as this directory of game developers’ blogs grows. If you’re a developer, why not consider adding yours to the list? :)

