A draft of this was first penned back in 2009 and appeared in GPU Pro : Advanced Rendering Techniques. It was co-written with my good friend, Stephen McAuley. Some of the stuff in here I don’t really agree with anymore, but I thought I’d post it up pretty much unedited (only minor corrections).

The light pre-pass renderer [Engel08, Engel09, Engel09a] is becoming an ever more popular choice of rendering architecture for modern real-time applications that have extensive dynamic lighting requirements. In this article we introduce and describe techniques that can be used to accelerate the real-time lighting of an arbitrary 3D scene on the Cell Broadband Engine™ without adding any additional frames of latency to the target application. The techniques described in this article were developed for the forthcoming PLAYSTATION®3[1] version of Blur which was released in May 2010 [2].

Figure 1. A screenshot from Blur.

1. Introduction

As GPUs have become more powerful, people have sought to use them for purposes other than graphics. This has opened an area of research called GPGPU (General Purpose GPU), which major graphics card manufacturers are embracing. For example, all NVIDIA® GeForce® GPUs now support PhysX® technology, which enables physics calculations to be performed on the GPU. However, much less has been made of the opposite phenomenon – with the increase in speed and number of CPUs in a system, it is becoming feasible on some architectures to move certain graphics calculations from the GPU back onto the CPU. Forthcoming hardware such as Intel’s Larrabee even combines both components [Seiler08], which will certainly open the door for CPU-based approaches to previously GPU-only problems becoming more popular. Today, one such architecture is the PLAYSTATION®3 where the powerful Cell Broadband Engine™ was designed from the outset to support the GPU in its processing activities [Shippy09]. This article explains how the Cell Broadband Engine™ can be used to calculate lighting within the context of a light pre-pass rendering engine.

2. Light Pre-Pass Rendering

A commonly encountered problem in computer graphics has been how to construct a renderer that can handle many dynamic lights in a scene. Traditional forward rendering does not perform well with multiple lights. For example, if a pixel shader is written for up to four point lights, then only four point lights can be drawn (and no spotlights). We could either increase the number of pixel shader combinations to handle as many cases as possible, or we could render the geometry multiple times, once more for each additional light. Neither of these solutions is desirable as they increase the number of state changes and draw calls to uncontrollable levels.
A popular solution to this problem is to use a deferred renderer [Deering88]. Instead of writing out fully lit pixels from the pixel shader, we instead write out information about the surface into a G-Buffer, which would include depth, normal and material information. An example G-buffer format is shown below:

Figure 2. An example G-Buffer format from a deferred rendering engine (after [Valient07]).

We then additively blend the lights into the scene, using the information provided in the G-Buffer. Thus many lights can be rendered, without additional geometry cost or shader permutations. In addition, by rendering closed volumes for each light, we can ensure that only calculations for pixels directly affected by a light are carried out. However, with deferred rendering, all materials must use the same lighting equation, and can only vary by the properties stored in the G-Buffer. There are also huge memory bandwidth costs to rendering to (and reading from) so many buffers, which increases with MSAA.

In order to solve these problems, Engel suggested the light pre-pass renderer, first online in [Engel08] and then later published in [Engel09], although a similar idea was independently developed by others and has been recently used in games such as Uncharted: Drake’s Fortune [Balestra08]. Instead of rendering out the entire G-Buffer, the light pre-pass renderer stores depth and normals in one or two render targets. The lighting phase is then performed, with the properties of all lights accumulated into a lighting buffer. The scene is then rendered for a second time, sampling the lighting buffer to determine the lighting on that pixel.

Using a Blinn-Phong lighting model means that the red, green and blue channels of the lighting buffer store the diffuse calculation, while we can fit a specular term in the alpha channel, the details of which are described in [Engel09]. This means that unlike a deferred renderer, different materials can handle the lighting values differently. This increased flexibility, combined with reduced memory bandwidth costs, has seen the light pre-pass renderer quickly increase in popularity and is now in use in many recent games on a variety of hardware platforms.

Yet the deferred renderer and light pre-pass renderer share the fact that lighting is performed in image space, and as such requires little to no rasterization. This makes the lighting pass an ideal candidate to move from the GPU back onto the CPU. Swoboda first demonstrated this method with a deferred renderer on the PLAYSTATION®3 and Cell Broadband Engine™ in [Swoboda09], and now we expand upon his work and apply similar techniques to the light pre-pass renderer.

3. The PLAYSTATION®3 and the CBE™

Sony Computer Entertainment released the PLAYSTATION®3 in 2006. It contains the Cell Broadband Engine™ which was developed jointly by Sony Computer Entertainment, Toshiba Inc. and IBM Corp. [Shippy09, Möller08, IBM08]. The Cell is the Central Processing Unit (CPU) of the PLAYSTATION®3. In addition to the Cell chip, the PLAYSTATION®3 also has a GPU, the Reality Synthesizer, or RSX®. The RSX® was developed by NVIDIA Corporation, and is essentially a modified GeForce®7800 [Möller08]. A high-level view of the architecture can be found in figure 3.

Figure 3. The PLAYSTATION®3 architecture. (Illustration after [Möller08, Perthuis06]).

Inside the Cell chip one can find two distinctly different types of processor. There is the PowerPC Processing Element (PPE) and eight[3] pure SIMD processors [Möller08] known as Synergistic Processing Elements (SPEs) all of which are connected by a high speed, “token-ring style” bus known as the Element Interconnect Bus (EIB), see figure 4. The techniques introduced and described in this paper are chiefly concerned with the usage of the SPEs and as such further discussion of the PPE has been omitted.

Figure 4. The Cell Broadband Engine (after [IBM08]).

One interesting quirk of the SPE is that it does not directly have access to the main address space and instead has its own internal memory known as the local store. The local store on current implementations of the CBE is 256KB in size. The memory is unified, untranslatable and unprotected [Bader06, IBM08] and must contain the SPE’s program code, call stack and any data that it may happen to be processing. To load or store data from or to the main address space a programmer must explicitly use the Memory Flow Controller (MFC). Each SPE has its own MFC which is capable of queuing up to sixteen Direct Memory Accesses (DMAs) [IBM08].

As the SPU ISA operates primarily on SIMD vector operands, both fixed-point and floating-point [IBM09], it is very well equipped to process large quantities of vectorised data. It has a very large register file (4KB) which is helpful to hide the latencies of pipelined and unrolled loops, and while the local store is relatively small in capacity, it is usually sufficient to allow a programmer is able to hide the large latency of main memory accesses[4] through effective multi-buffering. Code that is to efficiently execute on the SPE should be written to play to the SPE’s strengths.

A more in-depth discussion of the PLAYSTATION®3 and the Cell Broadband Engine™ is out of the scope of this paper, interested readers can refer to IBM’s website for  more in depth details about the Cell chip [IBM09], and Möller, Haines and Hoffman describe some of the PLAYSTATION®3 architecture in [Möller08].

4. GPU/SPE Synchronization

As the number of processors in our target platforms becomes ever greater, the need to automate the scheduling of work being carried out by these processing elements also becomes greater. This has continued to the point where game development teams now build their games and technology around the concept of the job scheduler [Capcom06]. Our engine is no exception to this trend and the solution we propose for GPU/SPE inter-processor communication relies on close integration with such technology. It is for this reason we believe our solution to be a robust and viable solution to the problem of RSX®/SPE communication that many others can easily foster into their existing scheduling frameworks. In order to perform fragment shading on the SPE without introducing unwanted latency into the rendering pipeline there needs to be a certain amount of inter-processor communication between the GPU and SPEs. This section discusses the approach we used in achieving this synchronization.
Each SPE has several Memory Mapped I/O (MMIO) registers it can use for inter-processor communication with other SPEs or the PPU. However, these are unfortunately not trivially writable from the RSX®. An alternative approach is required in order to have the RSX® signal the SPEs that the rendering of the normal/depth buffer is complete and that they can now begin their work, without having the desired SPE programs spinning on all six of the available SPEs wasting valuable processing time.

When adding a job to our job scheduler it is optionally given an address in RSX®-mapped memory upon which the job is dependent. When the scheduler is pulling the next job from the job queue it polls this address to ensure that it is written to a known value by the RSX®. If this is not the case, the job is skipped and the next one fetched from the queue and processed, if the location in memory is written however, then our job is free to run. This dependency is visualized in figure 5.

Figure 5. The RSX® and SPE communication, the RSX® writes a 128 byte value when the normal/depth buffer is available for processing. The SPEs poll the same location to know when to begin their work.

The problem of ensuring that the GPU waits for the light buffer to be available from the SPEs is solved by a technique that is well-known to PLAYSTATION®3 developers, but unfortunately we cannot disclose it here. Interested PLAYSTATION®3 developers can consult Sony’s official development support website.

It is desirable for the RSX® to continue doing useful work in parallel with the SPEs performing the lighting calculations. In Blur we are fortunate in that we have a number of additional views that are rendered which do not rely on the lighting buffer, for example planar reflections and a rear-view mirror (in other applications these might also include the rendering of a shadow buffer). This is shown in figure 6. If no useful work can be performed on the RSX®, it may be possible (depending on your memory budget and the requirements of your application) to perform the lighting calculations one frame latent as in [Swoboda09], this approach also has the added benefit of reducing the likelihood of stalling the RSX®.

Figure 6. The RSX® continues to do useful work as the SPEs calculate the dynamic lighting for our scene.

5. The Pre-Pass

To begin the lighting pre-pass we must first construct the normal and depth buffers. We store view space normals in an 8:8:8:8 format, and since we are able to read from the depth buffer on our chosen hardware, we have no need for a separate target for the depth. We chose to perform our lighting in view space as we find it faster compared with world space.

Next we render the relevant solid and alpha-test geometry into the buffers. We only render the geometry affected by the lights – we cull all draw calls against the bounding spheres of the lights and also bring in the far clip plane. (Note that a simple sphere test is not sufficient, since we also need to render near objects that occlude the light spheres.) These methods of culling reduce the cost of drawing the pre-pass geometry by approximately half.

When rendering the scene, we enable stencil writes with a reference value of 0xff. The stencil buffer is cleared to 0x00 beforehand, which gives us the relevant region of the screen masked out in the stencil buffer. Whether rendering lights on the RSX® or the SPE, this enables us to use the early stencil to ensure that we only light relevant pixels.

We do not currently render the pre-pass or the light buffers with MSAA. This has a number of disadvantages, including some lighting artefacts around the edges of objects, and the loss of the ability to use the depth buffer as a depth pre-pass with the main scene (which we render with MSAA). However, we found the artefacts minimal, and the extra cost of rendering the light pre-pass MSAA outweighed the saving from having a depth pre-pass. This is still an area we wish to return to in the future.

Once we have the normal and depth buffers, we are able to perform the lighting. Currently, we use the Lambert diffuse model for our lights, and render the lights into an 8:8:8:8 buffer. This is for simplicity and performance reasons, but with the cost of no specular and limited lighting range. This also means that the alpha channel of the lighting buffer is unused. Some ideas for its use are explained in the section on further work.

We maintain a GPU implementation of our lighting model for reference and for other platforms. First, the stencil test is set to “equals” with a reference value of 0xff, so we only render to pixels marked in the stencil buffer. Then, the lights are rendered, with point lights and spot lights using two very different methods.

Point lights are rendered as in [Balestra08] – the frame buffer is split into tiles, and we gather lists of lights (up to a maximum of eight) that affect each tile. We then render each tile using a shader corresponding to its number of lights. This method saves on fill rate, and enables us to perform the reconstruction of view space position and normal from our normal and depth buffers only once per pixel, no matter the number of point lights.

Spot lights use the more standard method of rendering the bounding volumes of the lights – in this case, cones. We render front faces, unless we are inside the volume, in which case we render back faces.

We further optimize the lighting code by making use of the depth bounds test, when it is available on the target hardware. The depth bounds test compares the depth buffer value at the current fragment’s coordinates to a given minimum and maximum depth value. If the stored depth value is outside the given range, then the fragment is discarded. When drawing either a tile of point lights, or a spot light volume, we set the depth bounds range to be the minimum and maximum depth extents of the light (or lights, in case of the point lights).

This gives us a fast, optimized GPU implementation of our lighting model. However, it is still a significant percentage of our frame rendering time, and its image space nature makes it a perfect candidate to offload from the GPU onto the SPEs.

6. The Lighting SPE Program

This section describes in detail the SPE program that performs our lighting calculations. In order to try to contextualize each sub-section we have included figure 7 which shows the high-level structure of the SPE program as a whole.

Figure 7. The high-level flow of our SPE lighting program.

6.1 The Atomic Tile Arbiter

Due to the relatively large memory footprint of a 720p frame buffer; the limitations imposed by the size of an SPE’s local store; and the internal format of a surface created by PLAYSTATION®3’s RSX®, our lighting SPE program works on tiles of the frame buffer, 64×64 pixels in size, as shown in figure 8. Thus, there is a need to keep track of which tile is free to bring in to local store for processing. The simplest and most concurrent way we found of achieving this was by way of an atomically incremented tile index which resides in main memory. It should be noted that the SPE and RSX® are only able to efficiently co-operate on the processing of resources that are placed into correctly mapped main memory.

Figure 8. Each SPE assigns itself a task by atomically incrementing a tile index held in main memory.

For efficiency (and to avoid contention for the cache line) the tile index is aligned to a 128 byte boundary and is padded to 128 bytes in size to exactly match the cache line width of the SPEs atomic unit (ATO) [IBM08, IBM07]. The effective address (EA) of the tile is given by the product of the tile index and the total size of a single tile summed with the address of the beginning of the frame buffer, as in equation 6.1. For our chosen tile size, the resulting effective address always falls on a 16 byte boundary since our tile size is itself a 16 byte multiple.


6.2. Multi-Buffering

Multi-buffering is a must for almost all SPE programs that process any significant volume of data [Bader07] and our lighting program is no exception. In our implementation we use triple buffering to minimize the latency of accesses to the normal/depth buffer in main memory. Each buffer in the triple buffer has enough space to support a single unit of work (i.e.: a single tile of the frame buffer). The first of the buffers in our triple buffer is used as the target for inbound DMA, it utilizes its own tag group and DMA into this buffer are initiated as soon as the tile decoding process[5] on the previous tile has completed. The second and third buffers are used as the output targets for the decoding process. In addition to this, they act as scratch memory for the lighting calculations and are the source of the outgoing DMA from the running SPE program back to the light buffer in main memory[6]. This is achieved by using the two buffers alternately in order to allow outgoing DMA to complete asynchronously of the tile decoding and lighting of the next tile. A high level view of our multi-buffering strategy is depicted in figure 9.

Figure 9. The triple-buffering strategy in our lighting SPE program.

The multi-buffering system described here works so effectively that our SPE program spends an average of 5μs per frame waiting for data to be transferred to and from main memory per SPU, with the bulk of this delay being introduced early in the program’s execution as one should expect.

6.3. Light Gathering

When the incoming DMAs for the normal buffer and depth-stencil buffer tiles have completed, we can begin processing. Before we light, we first gather the lights that affect a given tile. We do this by constructing a view frustum for each tile and culling the bounding spheres of the lights against the frustum. In addition, we also cull against the stencil buffer. This is a vital optimization as it minimizes the work done in the expensive lighting phase.

In order to perform the culling and the lighting, we actually work on sub-tiles of the frame buffer tile, 3,216 pixels in size. Culling over a smaller region is more effective, and we found sub-tiles of the above size to be optimal in our case.

Next we iterate over the depth-stencil tile to collect the minimum and maximum depth values and the minimum and maximum stencil values for each sub-tile. The depth values will form the near and far clip planes of our sub-tile’s view frustum and the stencil values allow us to do a stencil test akin to the early-stencil hardware on a GPU.

In section 5 above we describe how we write 0xff into the stencil buffer when rendering the pre-pass buffers, hence any pixels with a stencil of 0x00 we do not wish to light. However, we do not stencil on a per-pixel basis, but instead skip the lighting on a sub-tile if the maximum stencil value is equal to 0x00 (hence it is 0x00 across the entire sub-tile).

Once a sub-tile has passed the stencil test, we construct its view frustum. Knowing the screen-space position of the corners of the sub-tile, using values from the projection matrix we can construct its position in view-space space, at a fixed distance of one meter from the camera (see equation 6.5). Multiplication by the minimum and maximum view-space depth values then gives us the eight vertices of the frustum, from which we can construct the frustum’s six planes (see equation 6.4 for how to construct view-space z from a depth buffer value).

We then construct separate lists of point lights and spot lights which intersect this view frustum. The bounding sphere of each light is tested against each frustum plane, with successful point lights added to a point light list, and successful spot lights added to a spot light list.


If no lights are gathered, then the sub-tile follows the same path as one which fails the stencil test – the lighting is skipped and we output zero to the frame buffer. However, if at least one light does affect the sub-tile, then the lists of point lights and spot lights are passed into our lighting function to perform the most important part of the work.

6.4. Point Lights

The SPE program used for lighting is written in C and makes heavy use of SPE-specific language extensions. We made the choice early on to favour the si-style of intrinsic over the higher-level spu-style. This is due to a closer mapping to the underlying opcodes generated by the compiler [Acton08].

Lighting code is an excellent candidate for both software pipelining and loop unrolling; our lighting is performed on batches of 16 pixels at a time. We found that 16 pixels gave us a very small number of wasted cycles per iteration of our lighting loops while still allowing us to fit everything we needed into the 4KB (128, 16 byte) register file[7]. The large numbers of independent instructions that result from lighting a relatively large set of pixels mean that the latency caused by dependent instructions closely following one another is almost completely eliminated and overall performance is massively increased (limited only by the number of issued instructions). Non-dependent instructions are interleaved with one-another with the results being used some time later when they are available; this well-known optimization technique also has the side effect of improving the balance of instructions over the odd and even execution pipelines because there are a greater number of suitable, none-dependent instructions that can occupy a single fetch group in the Synergistic Execute Unit (SXU). We found that we were able to achieve approximately three times the pixel throughput from batching pixels into groups of 16 over our earlier attempts which loosely mimicked RSX® quads by lighting smaller groups of four pixels.


Before any lighting can begin it is important to reconstruct the correct input to our lighting equation expressed in equation 6.3. Equation 6.4 demonstrates how to reconstruct the z component of the view-space position of a pixel given its depth buffer value and the near and far planes of the view frustum. Calculating the x and y components of the view-space position is equally trivial when given the x and y coordinates of the pixel in screen-space and the view projection matrix. This is shown by equation 6.5.



In HLSL/Cg shaders it is quite common to use the saturate intrinsic function to clamp values to a [0..1] range. To achieve this on the SPE there is a clever trick that we feel is certainly worthy of mention here. Day et al. introduced the fast saturate/clamp technique which uses the SPU’s floating-point conversion instructions in order to achieve clamping of a floating point value to a variety of different ranges. This depends on the combination of scale bias operands issued with the instructions [Day08]. In a pipelined loop, such as our lighting loop, instruction count is oftentimes the overriding determinant of the code’s execution speed and as such we are able to employ this trick to great effect. Listing 1 demonstrates this technique.

/* HLSL saturate, clamp to [0..1]. */
qword x = si_cfltu(q, 0x20);
qword y = si_cuflt(x, 0x20);

Listing 1. Saturate a qword in two even pipeline instructions.

One interesting difference between the GPU implementation of our lighting and the SPE implementation is the switch from the default Array of Structures (AoS) data layout on the GPU, to the transposed, SIMD-friendly Structure of Arrays (SoA)[8] data layout on the SPE. The difference in format of the data is illustrated below in figure 10. By storing, loading and shuffling data into a SoA layout we are able to perform our lighting calculations much more optimally on the SPEs. A pleasant side-effect of the switch is that the resulting C code becomes much more scalar-like in appearance, making it easier for other programmers to follow [Bader06].

Figure 10. Shuffling an AoS into a SoA.

The SPE is only equipped to deal with 16 byte aligned writes and reads to and from its local store [Bader06, IBM08, IBM08a]. The targets from all load and store operations first undergo a logical ‘and’ with the LSLR register (set to 0x3ffff for current implementations of the CBE) before the SPU Store and Load unit (SLS) fetches or writes the address [IBM08, IBM08a]. Writing scalar values is achieved by way of a load-shuffle-store pattern. It is therefore desirable to perform loads and stores on 16 byte boundaries only. As our program required a lot of 4 byte loads from our normal/depth buffer and a lot of similarly sized writes to our light buffer we ended up batching these loads and stores into 16 byte chunks in order to eliminate the overhead of the additional code that would be required if we were to perform these operations on a pixel-by-pixel basis. This proved to deliver a significant performance increase, especially in the case of storing where nearly all pipeline bubbles were eliminated. We present a portion of our pixel writing code in listing 2 for a single four pixel block:

qword c0        = si_cfltu(dif0, 0x20);
qword c1        = si_cfltu(dif1, 0x20);
qword c2        = si_cfltu(dif2, 0x20);
qword c3        = si_cfltu(dif3, 0x20);
      dif       = si_ila(0x8000);
qword scale     = si_ilh(0xff00);
      dif0      = si_mpyhhau(c0, scale, dif);
      dif1      = si_mpyhhau(c1, scale, dif);
      dif2      = si_mpyhhau(c2, scale, dif);
      dif3      = si_mpyhhau(c3, scale, dif);
const vector unsigned char _shuf_uint = {
    0xc0, 0x00, 0x04, 0x08,
    0xc0, 0x10, 0x14, 0x18,
    0xc0, 0x00, 0x04, 0x08,
    0xc0, 0x10, 0x14, 0x18 };

qword s_uint    = (const qword)_shuf_uint;
qword base_addr = si_from_ptr(result);
qword p0_01     = si_shufb(dif0, dif1, s_uint);
qword p0_02     = si_shufb(dif2, dif3, s_uint);
qword p0        = si_selb(p0_01, p0_02, m_00ff);
                  si_stqd(pixel0, base_addr, 0x00);

Listing 2. Pixels are converted from their floating-point representations into 32 bit values, batched into 16 byte chunks, and stored.

6.5 Spot Lights

In the interest of completeness we present the mathematics used for our, ‘regular’ spotlights in equation 6.6.


Note that this is the same as the equation for the point lights, with an additional term at the end. d is the direction of the light (as opposed to l, which is the direction from the light to the point),  is the angle of the inner cone and  is the angle of the outer cone. However, we store their cosines on the light rather than calculating them every time. All lighting values for both point and spot lights are summed for each pixel, yielding the equation in 6.7.


7. The Main Pass

When the SPEs have finished calculating the light buffer, they then signal to the RSX® that the main pass can be rendered. As mentioned above, the synchronization at this stage is very important – we do not want to be reading from an incomplete light buffer. To composite the light buffer with the main scene, we read it as a texture in the pixel shaders. However, as not every pixel in our scene receives light from our pre-pass (see above, we only render geometry into the pre-pass that receives light), we use two shader techniques in the scene: one which samples from the light buffer, and one which does not. For the former technique, each pixel looks up its lighting in the light buffer using its screen-space coordinate, and then composites the light value as a diffuse light, as follows:


It might be tempting to simply additively or multiplicatively blend the lighting buffer over the scene, but as can be seen above, that method will result in incorrect lighting. This is due to the presence of additional static lighting in our scene.

It is also possible to read from the normal buffer in the main pass. This means that reading from normal maps and converting from tangent space to view (or world) space only happens once. However, this also means that the low precision of the normals stored in the pre-pass becomes more noticeable (only 8 bits per component). For this reason and others we did not use this option.

At the end of rendering we have a scene with many dynamic lights rendered using the Cell Broadband Engine™. Not only does this open up exciting new possibilities for our rendering engine, but it does so with minimal GPU cost, with a large amount of work performed on the CPU.

8. Conclusion

We have presented a method which splits the work of light pre-pass rendering between the RSX® and the SPEs on the Cell Broadband Engine™. We use the strengths of both components to our advantage: the rasterization performance of the RSX® to render the pre-pass geometry, and the vector maths performance of the SPEs to calculate the lighting buffer. By parallelizing the lighting calculation on the SPEs with some other rendering on the RSX® (for instance, a dynamic cube map), the lighting becomes free and thus this can be a major GPU optimization. In fact, even without the added bonus of parallelization, we found that in some cases, five SPEs running carefully crafted programs could outperform the RSX® when performing lighting calculations.

Figure 11. Another screenshot from Blur.

As new architectures emerge we believe there will be increasing opportunities to take processing load off the GPU and place it back onto the CPU. It remains to be seen how things will pan out when the two are combined in Intel’s Larrabee [Seiler08], but on the Cell Broadband Engine™ we offer that the GPU can be massively accelerated in cases such as deferred lighting or light pre-pass rendering by writing a custom CPU implementation that executes on the SPEs.

9. Further Work

There are many improvements that could be done to techniques we describe. Firstly, we currently omit specular from our lighting model. We propose either writing out specular to a separate lighting buffer or placing a monochrome specular term in the alpha channel of the lighting buffer as in [Engel09]. Material properties could be controlled by adding a specular power in the alpha channel of the normal buffer. Another problem is that our lighting is currently LDR, as it is stored in an 8:8:8:8 integer format. One option is moving to a 16:16:16:16 float, but Wilson suggests instead using the CIE Luv colour space [Wilson09]. Using this method, we can still use an 8:8:8:8 buffer, but with the luminance part of the colour using 16 bits. This technique has problems on the GPU as additive blending of lights on top of each other no longer works, but in the SPE program we have no such problem and thus this becomes more feasible; if one wished to implement a more GPU-friendly technique then diffuse light intensity could also be stored in the alpha channel as in [Valient07].

Both of the previous suggestions involve making use of the currently unused alpha channel in the lighting buffer. While there are certainly many possible uses for this byte, one idea we are currently investigating is storing the amount of fog for each pixel. We believe this could be especially beneficial for more expensive fogging equations, for instance, if height fog is being used. This is an example of adding value to the SPE program [Swoboda09a].

Given the amount of work already being done, including processing the entire normal and depth buffers, there is extra rendering work that could be done in the SPE program. One simple example is performing a down-sample of the depth buffer to a quarter resolution – this could be output asynchronously through the MFC, adding little overhead to the SPE program, and would be useful for many reduced resolution effects such as motion blur, soft particles, occlusion culling and even screen-space ambient occlusion. It would be possible to reduce the amount of processing on the normal depth buffers by combining the view-space normals and depth into a single 32 bit buffer. By encoding the x and y components of the normal into the first two channels (or by converting them to spherical coordinates), and packing linear view-space depth into the remaining 16 bits. This halves the amount of data needed by our SPE program. In fact, this approach is the method we chose for the final version of Blur.

Finally, it is our intention to remove the decoding of the buffers altogether and perform lighting on encoded normal/depth buffers, this has several advantages. The decoding process can be replaced with a simple pass over all the pixels in the frame buffer tile, which should yield a minor increase in overall lighting performance together with saving the memory required for the lighting buffer. However, this extra performance and improved memory footprint come at the cost of added mathematical complexity, as deriving the view-space position of pixels becomes non-trivial. This is due to the need to take into account the effects of the encoded buffer’s format on the final view-space position of the pixel.

10. Acknowledgements

First and foremost we would like to extend our unending thanks to Matt Swoboda of SCEE R&D for laying the groundwork for our continuing efforts and for his suggestions for our implementation. We would also like to thank Colin Hughes of SCEE R&D for his help and suggestions with optimizations.

We also extend our thanks to all the supremely talented individuals that form the Core Technologies Team at Bizarre Creations Ltd., especially to Ian Wilson, Paul Malin, Lloyd Wright, Ed Clay, Jose Sanchez, Charlie Birtwistle, Jan van Valburg, Kier Storey, Fengyun Lu, Jason Denton, Dave Hampson, Chris Cookson and Richard Thomas.

11. Bibliography

[Acton08] M. Acton, E. Christensen, “Insomniac SPU Best Practices,”
http://www.insomniacgames.com/tech/articles/0208/insomniac_spu_programming_gdc08.php, accessed on 2nd July 2009.

[Bader07] D. A. Bader, “Cell Programming Tips & Techniques,”
http://www.cc.gatech.edu/~bader/CellProgramming.html, accessed on 2nd July 2009.

[Balestra08] C. Balestra and P. Engstad, “The Technology of Uncharted: Drake’s Fortune,” Game
Developers Conference 2008, available online at

[Capcom06] Capcom Inc. “The MT Framework,” http://game.watch.impress.co.jp/docs/20070131/3dlp.htm,
accessed on 3rd July 2009.

[Day08] M. Day, J. Garrett, “Faster SPU Clamp,”
http://www.insomniacgames.com/tech/articles/0308/faster_spu_clamp.php, accessed on 2nd July 2009.

[Deering88] M. Deering, “The Triangle Processor and Normal Vector Shader: A VLSI System for High
Performance Graphics,”

[Engel08] W. Engel, “Light Pre-Pass Renderer,”
http://diaryofagraphicsprogrammer.blogspot.com/2008/03/light-pre-pass-renderer.html, accessed on 4th July 2009.

[Engel09] W. Engel, “Designing a Renderer for Multiple Lights: The Light Pre-Pass Renderer,” in
ShaderX7, edited by Wolfgang Engel, Charles River Media, 2008: pp. 655-666.

[Engel09a] W. Engel, “The Light Pre-Pass Renderer Mach III,” to appear in proceedings of ACM
SIGGRAPH09, 2009.

[IBM08] IBM Corp., “Cell Broadband Engine Programming Handbook Version 1.11”.

[IBM08a] IBM Corp., “Synergistic Processing Unit Instruction Set Architecture”.

[IBM09] IBM Corp. “The Cell Project at IBM,” http://researchweb.watson.ibm.com/cell/home.html, accessed
on 4th July 2009.

[Möller08] T. Akenine-Möller, E. Haines, N. Hoffman, “Real-Time Rendering 3rd Edition,” 978-1-56881-424-7, A.K. Peters Ltd. 2008.

[Perthuis06] C. Perthuis, “Introduction to the Graphics Pipeline of the PS3,” Eurographics 2006.

[Seiler08] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juni, P. Hanrahan, “Larabee: a many core x86 architecture for visual computing,” ACM Transactions on Graphics, Volume 27, Issue 3, ACM SIGGRAPH 2008, August 2008.

[Shippy09] D. Shippy, M. Phipps “The Race for a New Games Machine: Creating The Chips Inside The New Xbox360 & The Playstation 3,” 978-0-8065-3101-4, Citadel Press, 2009.

[Swoboda09] M. Swoboda, “Deferred Lighting and Post Processing on PLAYSTATION®3,” Game Developers Conference 2009, available online at

[Swoboda09a] M. Swoboda, Private Correspondance, 2009.

[Valient07] M. Valient, “Deferred Rendering in Killzone 2,http://www.dimension3.sk/mambo/Download-document/Deferred-Rendering-In-Killzone.php, Accessed on 4th July 2009.

[Wilson09] P. Wilson, “Light Pre-Pass Renderer: Using the CIE Luv Color Space,” in ShaderX7, edited by Wolfgang Engel, Charles River Media, 2009: pp.667-677.

[1] “PlayStation”, “PLAYSTATION” and the “PS” family logo are registered trademarks and “Cell Broadband Engine” is a trademark of Sony Computer Entertainment Inc. The “Blu-ray Disc” and “Blu-ray Disc” logos are trademarks.

[2] Screenshots of Blur appear courtesy of Activision Blizzard Inc. and Bizarre Creations Ltd.

[3] One of the eight SPEs is locked out to increase chip yield and another is reserved by the Sony’s Cell OS. Applications running on the PLAYSTATION®3 actually have six SPEs to take advantage of.

[4] As one might expect, linear access patterns fare significantly better than random access.

[5] For information on how to decode, Sony PLAYSTATION®3 developers can consult the RSX® User’s Manual.

[6] We do not currently encode the lighting results; please see further work for more information.

[7] Any more pixels in a single loop in our implementation would risk causing registers to be spilled.

[8] SOA organization is also known as “parallel-array”.

Obligatory Introductory Parable

I really like Sushi, it’s tasty and convenient. I like the immediacy of being able to go into a Sushi restaurant complete with conveyor belt and being able to take a seat and grab something fresh and delicious from the belt without blowing my entire lunch hour. Having said that, what I really wouldn’t like is to be a member of staff in a Sushi restaurant, especially if it was my job to show the diners to their seats and here’s why…

A group of three diners walk in and ask to be seated. Delighted at having some custom, you kindly oblige showing the diners to their seats. No sooner have they sat down and helped themselves to some tasty looking ‘Tekka Maki’, when the door opens again and four more diners walk in! Wow, you guys are on a roll (see what I did there?). Your restaurant now looks like this…

Since its lunch time, the restaurant quickly fills up to capacity. Finally after eating all they could and settling the bill, the first party to arrive (the party of three leave), and a group of two walk in and you offer the newly vacant seats to your new customers. This occurrence happens a few more times until your restaurant looks like this…

Finally a group of four dinners walk in and ask to be seated. Ever the pragmatist, you’ve been carefully keeping track of how many empty seats you have left and it’s your lucky day, you’ve got four spare seats! There is one snag though, these diners are the social type and would like to be seated together. You look around frantically, but while you have four empty seats you can’t seat the diners together! It would be very rude to your ask existing customers to move mid-meal, which sadly leaves you no option but to turn your new customers away, probably never to return again. This makes everyone very sad. If you’re sat there wondering how this little tale relates in the slightest bit to games development then read on.

In programming we have to manage memory. Memory is a precious resource which must be carefully managed, much like the seats in our metaphorical sushi restaurant. Every time we allocate memory dynamically we’re reserving memory from something called the “heap”. In C and C++ this is typically done through the use of the malloc function and the new operator respectively. To continue the somewhat fishy analogy (last one I promise!), this is like our intrepid groups of diners asking to be seated in our sushi restaurant. The real shame though is what happened in our hypothetical scenario happens in the context of memory also, but the results are much worse than a couple of empty tummies. It is called fragmentation and it is a nightmare!

What’s wrong with malloc and new?

Sadly, the rest of the discussion won’t have such a fishy flavour to it as this post is going to talk about malloc and new and why they have a very limited place in the context of embedded systems (such as games consoles). While fragmentation is just one facet of problems caused by dynamic memory allocation, it is perhaps the most serious, but before we can come up with some alternatives we should take a look at all of the problems:

1. malloc and new try to be all things to all programmers…
They will as soon as allocate you a few bytes as they will a few megabytes. They have no concept of what the data is that they’re allocating for you and what its lifetime is likely to be. Put another way, they don’t have the bigger picture that we have as programmers.

2. Run-time performance is relatively bad…
Allocations from the standard library functions or operators typical require descending into the kernel to service the allocation requests (this can involve all sorts of nasty side effects to your application’s performance, including flushing of translation lookaside buffers, copying blocks of memory, etc). For this reason alone using dynamic memory allocation can be very expensive in terms of performance. The cost of the free or delete operations in some allocation schemes can also be expensive as in many cases a lot of extra work is done to try to improve the state of the heap for subsequent allocations. “Extra work” is a fairly vague term, but can mean the merging of memory blocks and in some cases can mean walking an entire list of allocations your application has made! Certainly not something you’d want to be wasting valuable processor cycles on if you can avoid it!

3. They cause fragmentation of your heap…
If you’ve never been on a project that has suffered from the problems associated with fragmentation then count yourself very lucky, but rest of us know that heap fragmentation can be a complete and utter nightmare to address.

4. Tracking dynamic allocations can be tricky…
Dynamic allocation comes with the inherent risk of memory leaks. I’m sure we all know what memory leaks are, but if not, have a read of this. Most studios try to build some tracking infrastructure on top of their dynamic allocations to try and track what memory is in play and where.

5. Poor locality of reference…
There is essentially no way of knowing where the memory you get back from malloc or new will be in relation to any of the other memory in your application. This can lead to more us suffering more increasingly expensive cache misses than we need to, as we end up dancing through memory like Billy Elliot on Ecstasy.

So what is the alternative? The idea behind this post is provide you with the details (and some code!) for a few different alternatives that you can use in place of malloc and new to help combat the problems we’ve just discussed.

The Humble Linear Allocator

As the name of this section suggests this allocator is certainly the simplest of all those I’m going to present, although truth be told; they’re all simple (and that’s part of the beauty). The linear allocator essentially assumes that there is no fine grained de-allocation of allocated memory resources and proceeds to make allocations from a fixed pool of memory one after the other in a linear fashion. Check out the diagram below.

A good example of where a linear allocator is exceedingly useful is found in nearly all SPU programming. The persistence of data in the local store is not important beyond the scope of the currently executing job and in many cases the amount of data one brings into the local store (at least data that needs some degree of variance in its size) tends to fairly restricted. However, don’t be fooled that’s far from its only application. Here’s an example of how one might go about implementing a simple, linear allocator in C.

/** Header structure for linear buffer. */
typedef struct _LinearBuffer {
    uint8_t *mem; /*!< Pointer to buffer memory. */
    uint32_t totalSize; /*!< Total size in bytes. */
    uint32_t offset; /*!< Offset. */
} LinearBuffer;

/* non-aligned allocation from linear buffer. */
void* linearBufferAlloc(LinearBuffer* buf, uint32_t size) {

    if(!buf || !size)
        return NULL;

    uint32_t newOffset = buf->offset + size;
    if(newOffset <= buf->totalSize) {

        void* ptr = buf->mem + buf->offset;
        buf->offset = newOffset;
        return ptr;
    return NULL; /* out of memory */

It is of course possible to support aligned allocations by applying the required bitwise operations to the offset during the course of the allocation. This can be incredibly useful, but be aware that depending on the size of the data you’re allocating from your buffer (and in some cases the order in which those allocations are made) you may find that you get some wasted space in the buffer between allocations. This wasted space is typically okay for alignments of a reasonable size, but can get prohibitively wasteful if you are allocating memory which requires a much larger alignment, for example 1MB. The ordering of your allocations from linear allocators can have a drastic effect on the amount of wasted memory in these types of situations.

To reset the allocator (perhaps at the end of a level), all we need to do is set the value of offset to zero. Just like with all allocations you would do, clients of the allocator need to ensure they’re not hanging on to any pointers that you’ve effectively de-allocated by doing this, otherwise you risk corrupting the allocator. Any C++ objects you’ve allocated from the buffer will also need their destructors calling manually.

The Stack

Let’s get something out of the way before we start into this allocator. When I talk about a “stack allocator” in this particular case, I’m not talking about the call stack, however, that stack does have an important part to play in avoiding run-time heap allocations as we shall see later on. So what am I talking about then? Well, the linear allocator I just described above is an excellent solution to many allocation problems, but what if you want slightly more control over how you free up your resources? The stack allocator will give you this.

Towards the end of my description of the linear allocator, I mentioned that to reset the allocator you can simply set the offset to zero in order to free up all the resources in the allocator. The principle of setting the offset to a particular value is the principle that guides the implementation of the stack allocator. If you are not familiar with the concept of the stack data structure then now is probably a good time to get acquainted, you can do so here.

Back? Okay, each allocation from our stack allocator will optionally be able to get a handle which represents the state of the stack allocator just before that allocation was made. This means that if we restore the allocator to this state (by changing its offset) we are effectively ‘freeing’ up the memory to be reused again. This is shown in the diagram below.

This can be a very useful thing if you want some memory allocated temporarily for data which has a limited scope. For example; the life time of a specific function or sub-system. This strategy can also be useful for things like level resources, which have a well defined order that objects need to be freed up in (that is reverse order to which they are allocated). Here is some example C code to illustrate what I’ve just explained:

typedef uint32_t StackHandle;
void* stackBufferAlloc(StackBuffer* buf, uint32_t size, StackHandle* handle) {

    if(!buf || !size)
        return NULL;

    const uint32_t currOffset = buf->offset;
    if(currOffset + size <= buf->totalSize) {

        uint8_t* ptr = buf->mem + currOffset;
        buf->offset += size;
            *handle = currOffset; /* set the handle to old offset */
        return (void*)ptr;
    return NULL;

void stackBufferSet(StackBuffer* buf, StackHandle handle) {

    buf->offset = handle;

The Double Stack

If you’re comfortable with the stack concept above, we can now move on to the double-ended stack. This is similar to the stack allocator we just described except that there are two stacks, one which grows from the bottom of the buffer upward and one which grows from the top of the buffer downward. This is shown in the diagram below.

Where would this be useful? A good example would be any situation where you have data of a similar type, but which have distinctly different lifetimes or perhaps if you had data that was related and should be allocated from the same static memory buffer (i.e.: part of the same subsystem), but had different size properties. It should be noted that it is not mandated where the offsets of the two stacks meet, they don’t have to be exactly half the buffer. In one case the bottom stack can grow very large and the top stack smaller and vice versa. This added flexibility can sometimes be required to make the best use of your memory buffers.

I don’t think I really need insult your intelligence by including a code sample for the double stack allocator due to its inherent resemblance to the single stack allocator discussed previously.

The Pool

We’re going to shift gears a little now from the family of allocators described above that are based on linearly advancing pointers or offsets and move to something a little different. The pool allocator I’m about to describe is designed to work with data types that are of the same kind or size. It splits up the memory buffer it is managing into equally sized chunks, and then allows the client to allocate and free these chunks at will (see the diagram below). To do this, it must keep a track of which chunks are free and I’ve seen several ways of implementing this. I personally shy away from implementations such as those using a stack of indices (or pointers) to available chunks, due to the extra storage required which can often be prohibitive, but I’ve seen them around. The implementation I shall describe here uses no additional storage to manage which chunks in the pool are free. In fact the header structure for this allocator contains only two member variables, making it the smallest of all the allocators described in this post.

So how does it work internally? To manage which chunks are free we’re going to use a data structure known as a linked list. Again if you’re not acquainted with this data structure then try reading this. Coming from a PlayStation3 and Xbox360 background, where memory access is expensive I generally find node-based structures (such as the linked list) leave a rather sour taste, but I think this is perhaps one of the applications I may approve of. Essentially the header structure for the allocator will contain a pointer to a linked list. The linked list itself is spread throughout out the pool, occupying the same space as the free chunks in the memory buffer. When we initialise the allocator, we move through the memory buffer’s chunks and write a pointer in the first four (or eight) bytes of each chunk, with the address (or index) of the next free chunk in the buffer. The header then points to the first element in that linked list. A limitation of storing pointers in the pool’s free chunks in this way is that chunks must be at least the same size as a pointer on your target hardware. See the diagram below.

When allocations are made from the pool we simply need to make the linked list pointer in the header structure point at the second element in the linked list and then return the pointer we originally had to the first element. It’s very simple, we always return the first element in the linked list when allocating. Similarly, to free a chunk and return it to the pool, we simply insert it into the front of the linked list. Inserting chunks we want to free at the front of the list as opposed to the back has a couple of benefits, firstly we don’t need to a traverse the linked list (or store an extraneous tail pointer in the header structure) and secondly (and more importantly) we stand a better chance of the node we just freed up being in the cache for subsequent allocations from the pool. After a few allocations and de-allocations your pool might look a little like the diagram below.

Some C code for initialising the allocator and making allocations and de-allocations from it is provided below.

/* allocate a chunk from the pool. */
void* poolAlloc(Pool* buf) {

        return NULL;

        return NULL; /* out of memory */

    uint8_t* currPtr = buf->head;
    buf->head = (*((uint8_t**)(buf->head)));
    return currPtr;

/* return a chunk to the pool. */
void poolFree(Pool* buf, void* ptr) {

    if(!buf || !ptr)

    *((uint8_t**)ptr) = buf->head;
    buf->head = (uint8_t*)ptr;

/* initialise the pool header structure,
and set all chunks in the pool as empty. */
void poolInit(Pool* buf, uint8_t* mem, uint32_t size, uint32_t chunkSize) {

    if(!buf || !mem || !size || !chunkSize)

    const uint32_t chunkCount = (size / chunkSize) - 1;
    for(uint32_t chunkIndex=0; chunkIndex<chunkCount; ++chunkIndex) {

        uint8_t* currChunk = mem + (chunkIndex * chunkSize);
        *((uint8_t**)currChunk) = currChunk + chunkSize;

    *((uint8_t**)&mem[chunkCount * chunkSize]) = NULL; /* terminating NULL */
    buf->mem = buf->head = mem;

A Note on Stack-based Allocation (alloca is your friend)

Earlier on, you may recall that I said there’d be a mention of stack based allocations in the context of the call stack. I’m sure you’ve seen code which conceptually looks something like this:

myFunction() {

    myTemporaryMemoryBuffer = malloc(myMemorySize);
    /* do processing limited to this function. */

There is a function you can use which comes with most C compilers which should mean (depending on the size of your allocation) that you won’t have to resort to heap allocations for temporary buffers of this ilk. That function is alloca. How alloca works internally is architecture dependant, but essentially it performs allocations by adjusting the stack frame for your function to allow you to write data to an area, this can be as simple as moving the stack pointer (just like the linear allocator we mentioned right at the start). The memory returned to you by alloca is then freed up when the function returns.

There are a few caveats with using alloca that you should be aware of however. Be sure to check the size of your allocation to make sure you’re not requesting an unreasonable amount from the stack, this can cause nasty crashes later on, if your gets so large that it stack overflows. For this reason it is also best to know all the places where your function will be called in the context of the program’s overall flow if you’re contemplating allocating a large chunk with alloca. Use of alloca can affect portability in some limited circumstances (apparently), but I’ve yet to come across a compiler that doesn’t support it.

For more details you can consult your favourite search engine.

A Final Thought…

Often the memory one allocates during the course of a processing task is temporary and persists only for the lifetime of a single frame. Taking advantage of this type of knowledge and moving allocations of this type to separate memory buffers is essential to combating the adverse affects of fragmentation in a limited memory system. Any of the above allocation schemes will work, but I would probably suggest going with one of the linear ones, as they are much easier to reset than the pool implementation I’ve described here. Noel Llopis talks about this topic in more detail in this excellent blog post.

The right allocator for you depends on many factors and what the problem you’re trying to solve demands. The advice I would offer is to think carefully about the patterns of allocations and de-allocations you wish to perform with your system, think about the sizes and lifetimes of these allocations and try to manage them with allocators that make sense in that context. It can sometimes help to draw memory layouts on paper (yeah, with a pen and everything) or to graph roughly how you expect your memory usage will look over time, believe it or not, this can really help you to understand how a system produces and consumes data. Doing these things can help you make calls about how to separate your allocations to make memory management as easy, quick and fragmentation-free as possible. Remember, what I’ve talked about here is just a small selection of the some of the simplest strategies that I tend to favour when writing code for console games. It is my hope that if you’ve made it this far and you weren’t already doing this stuff, that your brain is alive with ideas about code you’ve written in the past which could have taken advantage of these techniques to mitigate the substantial drawbacks associated with dynamic memory allocations, or of other strategies you can exploit to solve your memory management problems without dynamic allocation in limited memory systems.

In closing, I believe that programmers should be mindful of the impact of making dynamic allocations (especially in console games) and think twice about using the malloc function or the new operator. It is easy to have the attitude that you’re not doing many allocations so it doesn’t really matter, but this type of thinking spread across a whole team quickly snowballs and results in a death by a thousand paper cuts. If not nipped in the bud, fragmentation and the performance penalties arising from the use of dynamic memory allocations can have catastrophic consequences later on in your development lifecycle which are not easy to solve. Projects where memory allocation and management is not thought through and managed properly often suffer from random crashes after prolonged playing session due to out of memory conditions (which by the way are near impossible to reproduce) and cost hundreds of programmer hours trying to free up memory and reorganise allocations.

Remember: You can never start thinking about memory early enough in your project and you’ll always wish you had started earlier!

More Information

Here are some links to related topics, or to topics that I didn’t have time to cover:





Thanks to Sarah, Jaymin, Richard and Dat for proof reading this drivel.

Hello all,

First of all apologies that this blog has been a little light on content this year thus far. I have a few bits and pieces I’d like to write about, but I’ve been very busy of late with work (and also with a few bits outside of work), both of which I hope to be able to discuss more openly soon.

Anyways, a couple of books came out recently which I have contributed chapters to. The first is GPU Pro which just dropped (at least here in Europe). For GPU Pro, I co-authored a chapter about our SPU-based deferred lighting system that I wrote for Blur. The article was actually penned about 12 months ago, but getting books out takes a little time so apologies for not being able to share this stuff sooner. The other book I contributed a chapter to has been out a couple of months now and is the latest installment of the hugely popular “Game Programming Gems” series. The chapter is about general techniques for SPU-assisted rendering on the PlayStation3 platform.

As always I welcome contact about any of my work, so please feel free to drop me a line or hit me up on twitter if you’d like to discuss any aspects of these chapters.

GPU pro covergems 8 cover



A few days back I did an interview with fellow Bizarreo, Charlie Birtwistle, about the technology behind Blur. I’m pleased to announe that it is now up for your reading pleasure over at Eurogamer, see the link below. Any comments or questions are most welcome.

Eurogamer Technology Interview


Just a quick note for those of you asking me for the Develop North slides. It’s going to be 2010 now until I might be able to release them. They have to go through some pre-approval process at work before they can be release onto the public at large. Unfortunately for the slides, but fortunately for everyone else, everyone is kinda busy right now making Blur even more stunningly beautiful and fun than it already is, :). So it’s not so high on our priority list right now.

Theme © 2005 - 2009 FrederikM.de
BlueMod is a modification of the blueblog_DE Theme by Oliver Wunder