A draft of this was first penned back in 2009 and appeared in GPU Pro: Advanced Rendering Techniques. It was co-written with my good friend, Stephen McAuley. Some of the stuff in here I don’t really agree with anymore, but I thought I’d post it up pretty much unedited (only minor corrections).

The light pre-pass renderer [Engel08, Engel09, Engel09a] is becoming an ever more popular choice of rendering architecture for modern real-time applications that have extensive dynamic lighting requirements. In this article we introduce and describe techniques that can be used to accelerate the real-time lighting of an arbitrary 3D scene on the Cell Broadband Engine™ without adding any additional frames of latency to the target application. The techniques described in this article were developed for the forthcoming PLAYSTATION®3[1] version of Blur which was released in May 2010 [2].

Figure 1. A screenshot from Blur.

1. Introduction

As GPUs have become more powerful, people have sought to use them for purposes other than graphics. This has opened an area of research called GPGPU (General Purpose GPU), which major graphics card manufacturers are embracing. For example, all NVIDIA® GeForce® GPUs now support PhysX® technology, which enables physics calculations to be performed on the GPU. However, much less has been made of the opposite phenomenon – with the increase in speed and number of CPUs in a system, it is becoming feasible on some architectures to move certain graphics calculations from the GPU back onto the CPU. Forthcoming hardware such as Intel’s Larrabee even combines both components [Seiler08], which will certainly open the door for CPU-based approaches to previously GPU-only problems becoming more popular. Today, one such architecture is the PLAYSTATION®3 where the powerful Cell Broadband Engine™ was designed from the outset to support the GPU in its processing activities [Shippy09]. This article explains how the Cell Broadband Engine™ can be used to calculate lighting within the context of a light pre-pass rendering engine.

2. Light Pre-Pass Rendering

A commonly encountered problem in computer graphics has been how to construct a renderer that can handle many dynamic lights in a scene. Traditional forward rendering does not perform well with multiple lights. For example, if a pixel shader is written for up to four point lights, then only four point lights can be drawn (and no spotlights). We could either increase the number of pixel shader combinations to handle as many cases as possible, or we could render the geometry multiple times, once more for each additional light. Neither of these solutions is desirable as they increase the number of state changes and draw calls to uncontrollable levels.
A popular solution to this problem is to use a deferred renderer [Deering88]. Instead of writing out fully lit pixels from the pixel shader, we instead write out information about the surface into a G-Buffer, which would include depth, normal and material information. An example G-buffer format is shown below:

Figure 2. An example G-Buffer format from a deferred rendering engine (after [Valient07]).

We then additively blend the lights into the scene, using the information provided in the G-Buffer. Thus many lights can be rendered, without additional geometry cost or shader permutations. In addition, by rendering closed volumes for each light, we can ensure that only calculations for pixels directly affected by a light are carried out. However, with deferred rendering, all materials must use the same lighting equation, and can only vary by the properties stored in the G-Buffer. There are also huge memory bandwidth costs to rendering to (and reading from) so many buffers, which increases with MSAA.

In order to solve these problems, Engel suggested the light pre-pass renderer, first online in [Engel08] and then later published in [Engel09], although a similar idea was independently developed by others and has been recently used in games such as Uncharted: Drake’s Fortune [Balestra08]. Instead of rendering out the entire G-Buffer, the light pre-pass renderer stores depth and normals in one or two render targets. The lighting phase is then performed, with the properties of all lights accumulated into a lighting buffer. The scene is then rendered for a second time, sampling the lighting buffer to determine the lighting on that pixel.

Using a Blinn-Phong lighting model means that the red, green and blue channels of the lighting buffer store the diffuse calculation, while we can fit a specular term in the alpha channel, the details of which are described in [Engel09]. This means that unlike a deferred renderer, different materials can handle the lighting values differently. This increased flexibility, combined with reduced memory bandwidth costs, has seen the light pre-pass renderer quickly increase in popularity, and it is now in use in many recent games on a variety of hardware platforms.

Yet the deferred renderer and light pre-pass renderer share the fact that lighting is performed in image space, and as such requires little to no rasterization. This makes the lighting pass an ideal candidate to move from the GPU back onto the CPU. Swoboda first demonstrated this method with a deferred renderer on the PLAYSTATION®3 and Cell Broadband Engine™ in [Swoboda09], and now we expand upon his work and apply similar techniques to the light pre-pass renderer.

3. The PLAYSTATION®3 and the CBE™

Sony Computer Entertainment released the PLAYSTATION®3 in 2006. It contains the Cell Broadband Engine™ which was developed jointly by Sony Computer Entertainment, Toshiba Inc. and IBM Corp. [Shippy09, Möller08, IBM08]. The Cell is the Central Processing Unit (CPU) of the PLAYSTATION®3. In addition to the Cell chip, the PLAYSTATION®3 also has a GPU, the Reality Synthesizer, or RSX®. The RSX® was developed by NVIDIA Corporation, and is essentially a modified GeForce®7800 [Möller08]. A high-level view of the architecture can be found in figure 3.

Figure 3. The PLAYSTATION®3 architecture. (Illustration after [Möller08, Perthuis06]).

Inside the Cell chip one can find two distinctly different types of processor. There is the PowerPC Processing Element (PPE) and eight[3] pure SIMD processors [Möller08] known as Synergistic Processing Elements (SPEs) all of which are connected by a high speed, “token-ring style” bus known as the Element Interconnect Bus (EIB), see figure 4. The techniques introduced and described in this paper are chiefly concerned with the usage of the SPEs and as such further discussion of the PPE has been omitted.

Figure 4. The Cell Broadband Engine (after [IBM08]).

One interesting quirk of the SPE is that it does not directly have access to the main address space and instead has its own internal memory known as the local store. The local store on current implementations of the CBE is 256KB in size. The memory is unified, untranslatable and unprotected [Bader06, IBM08] and must contain the SPE’s program code, call stack and any data that it may happen to be processing. To load or store data from or to the main address space a programmer must explicitly use the Memory Flow Controller (MFC). Each SPE has its own MFC which is capable of queuing up to sixteen Direct Memory Accesses (DMAs) [IBM08].

As the SPU ISA operates primarily on SIMD vector operands, both fixed-point and floating-point [IBM09], it is very well equipped to process large quantities of vectorised data. It has a very large register file (4KB) which is helpful to hide the latencies of pipelined and unrolled loops, and while the local store is relatively small in capacity, it is usually sufficient to allow a programmer to hide the large latency of main memory accesses[4] through effective multi-buffering. Code that is to efficiently execute on the SPE should be written to play to the SPE’s strengths.

A more in-depth discussion of the PLAYSTATION®3 and the Cell Broadband Engine™ is out of the scope of this paper. Interested readers can refer to IBM’s website for more in-depth details about the Cell chip [IBM09], and Möller, Haines and Hoffman describe some of the PLAYSTATION®3 architecture in [Möller08].

4. GPU/SPE Synchronization

As the number of processors in our target platforms becomes ever greater, the need to automate the scheduling of work being carried out by these processing elements also becomes greater. This has continued to the point where game development teams now build their games and technology around the concept of the job scheduler [Capcom06]. Our engine is no exception to this trend and the solution we propose for GPU/SPE inter-processor communication relies on close integration with such technology. It is for this reason we believe our approach to be a robust and viable solution to the problem of RSX®/SPE communication that many others can easily adopt into their existing scheduling frameworks. In order to perform fragment shading on the SPE without introducing unwanted latency into the rendering pipeline there needs to be a certain amount of inter-processor communication between the GPU and SPEs. This section discusses the approach we used in achieving this synchronization.
Each SPE has several Memory Mapped I/O (MMIO) registers it can use for inter-processor communication with other SPEs or the PPU. However, these are unfortunately not trivially writable from the RSX®. An alternative approach is required in order to have the RSX® signal the SPEs that the rendering of the normal/depth buffer is complete and that they can now begin their work, without having the desired SPE programs spinning on all six of the available SPEs wasting valuable processing time.

When adding a job to our job scheduler it is optionally given an address in RSX®-mapped memory upon which the job is dependent. When the scheduler is pulling the next job from the job queue it polls this address to ensure that it is written to a known value by the RSX®. If this is not the case, the job is skipped and the next one is fetched from the queue and processed. If the location in memory has been written, however, then our job is free to run. This dependency is visualized in figure 5.
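The dependency check described above can be sketched in portable C. This is only an illustration under stated assumptions and not our scheduler: the `Job` layout, `scheduler_step` and `count_job` are hypothetical names, and a single 32 bit word stands in for the value written by the RSX®.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical job record; all names here are illustrative only. */
typedef struct Job {
    void (*run)(void *arg);        /* work to execute */
    void *arg;
    const volatile uint32_t *dep;  /* optional GPU-written address, NULL if none */
    uint32_t dep_value;            /* value that marks the dependency as met */
    int done;
} Job;

/* Example job body used for illustration: bumps an integer counter. */
static void count_job(void *arg)
{
    ++*(int *)arg;
}

/* A job may run if it has no dependency or the expected value has landed. */
static int job_ready(const Job *job)
{
    return job->dep == NULL || *job->dep == job->dep_value;
}

/* Runs the first pending job whose dependency is met. Blocked jobs are
   skipped rather than waited on. Returns the index run, or -1 if none. */
static int scheduler_step(Job *queue, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        if (!queue[i].done && job_ready(&queue[i])) {
            queue[i].run(queue[i].arg);
            queue[i].done = 1;
            return (int)i;
        }
    }
    return -1;
}
```

The point of the design is that a blocked lighting job costs one memory read per poll and never occupies an SPE while it waits.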

Figure 5. RSX® and SPE communication: the RSX® writes a 128 byte value when the normal/depth buffer is available for processing. The SPEs poll the same location to know when to begin their work.

The problem of ensuring that the GPU waits for the light buffer to be available from the SPEs is solved by a technique that is well-known to PLAYSTATION®3 developers, but unfortunately we cannot disclose it here. Interested PLAYSTATION®3 developers can consult Sony’s official development support website.

It is desirable for the RSX® to continue doing useful work in parallel with the SPEs performing the lighting calculations. In Blur we are fortunate in that we have a number of additional views that are rendered which do not rely on the lighting buffer, for example planar reflections and a rear-view mirror (in other applications these might also include the rendering of a shadow buffer). This is shown in figure 6. If no useful work can be performed on the RSX®, it may be possible (depending on your memory budget and the requirements of your application) to perform the lighting calculations one frame latent as in [Swoboda09]. This approach also has the added benefit of reducing the likelihood of stalling the RSX®.

Figure 6. The RSX® continues to do useful work as the SPEs calculate the dynamic lighting for our scene.

5. The Pre-Pass

To begin the lighting pre-pass we must first construct the normal and depth buffers. We store view space normals in an 8:8:8:8 format, and since we are able to read from the depth buffer on our chosen hardware, we have no need for a separate target for the depth. We chose to perform our lighting in view space as we find it faster compared with world space.

Next we render the relevant solid and alpha-test geometry into the buffers. We only render the geometry affected by the lights – we cull all draw calls against the bounding spheres of the lights and also bring in the far clip plane. (Note that a simple sphere test is not sufficient, since we also need to render near objects that occlude the light spheres.) These methods of culling reduce the cost of drawing the pre-pass geometry by approximately half.

When rendering the scene, we enable stencil writes with a reference value of 0xff. The stencil buffer is cleared to 0x00 beforehand, which gives us the relevant region of the screen masked out in the stencil buffer. Whether rendering lights on the RSX® or the SPE, this enables us to use the early stencil to ensure that we only light relevant pixels.

We do not currently render the pre-pass or the light buffers with MSAA. This has a number of disadvantages, including some lighting artefacts around the edges of objects, and the loss of the ability to use the depth buffer as a depth pre-pass with the main scene (which we render with MSAA). However, we found the artefacts minimal, and the extra cost of rendering the light pre-pass MSAA outweighed the saving from having a depth pre-pass. This is still an area we wish to return to in the future.

Once we have the normal and depth buffers, we are able to perform the lighting. Currently, we use the Lambert diffuse model for our lights, and render the lights into an 8:8:8:8 buffer. This is for simplicity and performance reasons, but with the cost of no specular and limited lighting range. This also means that the alpha channel of the lighting buffer is unused. Some ideas for its use are explained in the section on further work.

We maintain a GPU implementation of our lighting model for reference and for other platforms. First, the stencil test is set to “equals” with a reference value of 0xff, so we only render to pixels marked in the stencil buffer. Then, the lights are rendered, with point lights and spot lights using two very different methods.

Point lights are rendered as in [Balestra08] – the frame buffer is split into tiles, and we gather lists of lights (up to a maximum of eight) that affect each tile. We then render each tile using a shader corresponding to its number of lights. This method saves on fill rate, and enables us to perform the reconstruction of view space position and normal from our normal and depth buffers only once per pixel, no matter the number of point lights.

Spot lights use the more standard method of rendering the bounding volumes of the lights – in this case, cones. We render front faces, unless we are inside the volume, in which case we render back faces.

We further optimize the lighting code by making use of the depth bounds test, when it is available on the target hardware. The depth bounds test compares the depth buffer value at the current fragment’s coordinates to a given minimum and maximum depth value. If the stored depth value is outside the given range, then the fragment is discarded. When drawing either a tile of point lights, or a spot light volume, we set the depth bounds range to be the minimum and maximum depth extents of the light (or lights, in case of the point lights).

This gives us a fast, optimized GPU implementation of our lighting model. However, it is still a significant percentage of our frame rendering time, and its image space nature makes it a perfect candidate to offload from the GPU onto the SPEs.

6. The Lighting SPE Program

This section describes in detail the SPE program that performs our lighting calculations. In order to try to contextualize each sub-section we have included figure 7 which shows the high-level structure of the SPE program as a whole.

Figure 7. The high-level flow of our SPE lighting program.

6.1 The Atomic Tile Arbiter

Due to the relatively large memory footprint of a 720p frame buffer; the limitations imposed by the size of an SPE’s local store; and the internal format of a surface created by PLAYSTATION®3’s RSX®, our lighting SPE program works on tiles of the frame buffer, 64×64 pixels in size, as shown in figure 8. Thus, there is a need to keep track of which tile is free to bring in to local store for processing. The simplest and most concurrent way we found of achieving this was by way of an atomically incremented tile index which resides in main memory. It should be noted that the SPE and RSX® are only able to efficiently co-operate on the processing of resources that are placed into correctly mapped main memory.

Figure 8. Each SPE assigns itself a task by atomically incrementing a tile index held in main memory.

For efficiency (and to avoid contention for the cache line) the tile index is aligned to a 128 byte boundary and is padded to 128 bytes in size to exactly match the cache line width of the SPE’s atomic unit (ATO) [IBM08, IBM07]. The effective address (EA) of the tile is given by the product of the tile index and the total size of a single tile summed with the address of the beginning of the frame buffer, as in equation 6.1. For our chosen tile size, the resulting effective address always falls on a 16 byte boundary since our tile size is itself a 16 byte multiple.

$$EA_{tile} = EA_{framebuffer} + (index_{tile} \times size_{tile}) \tag{6.1}$$
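A minimal sketch of the tile arbiter, written with C11 atomics as a stand-in for the SPE's atomic unit. The names `TileArbiter`, `claim_tile` and `tile_effective_address` are hypothetical, and a 32 bit surface is assumed so that a 64×64 tile occupies 16384 bytes.

```c
#include <stdatomic.h>
#include <stdint.h>

enum {
    TILE_WIDTH      = 64,
    TILE_HEIGHT     = 64,
    BYTES_PER_PIXEL = 4,  /* assumption: 32 bit surface */
    TILE_BYTES      = TILE_WIDTH * TILE_HEIGHT * BYTES_PER_PIXEL
};

/* Shared counter, aligned and padded to 128 bytes so that it owns a
   whole cache line and never shares it with unrelated data. */
typedef struct TileArbiter {
    _Alignas(128) atomic_uint next_tile;
    unsigned char pad[128 - sizeof(atomic_uint)];
} TileArbiter;

/* Claims the next unprocessed tile. Returns -1 once every tile is taken. */
static int claim_tile(TileArbiter *arb, unsigned tile_count)
{
    unsigned index = atomic_fetch_add(&arb->next_tile, 1u);
    return index < tile_count ? (int)index : -1;
}

/* Frame buffer base plus tile index times tile size. */
static uint64_t tile_effective_address(uint64_t frame_buffer_ea,
                                       unsigned tile_index)
{
    return frame_buffer_ea + (uint64_t)tile_index * TILE_BYTES;
}
```

Each worker simply loops on `claim_tile` until it returns -1, so faster workers naturally pick up more tiles without any central coordination.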


6.2. Multi-Buffering

Multi-buffering is a must for almost all SPE programs that process any significant volume of data [Bader07] and our lighting program is no exception. In our implementation we use triple buffering to minimize the latency of accesses to the normal/depth buffer in main memory. Each buffer in the triple buffer has enough space to support a single unit of work (i.e. a single tile of the frame buffer). The first of the buffers in our triple buffer is used as the target for inbound DMA; it utilizes its own tag group, and DMAs into this buffer are initiated as soon as the tile decoding process[5] on the previous tile has completed. The second and third buffers are used as the output targets for the decoding process. In addition to this, they act as scratch memory for the lighting calculations and are the source of the outgoing DMA from the running SPE program back to the light buffer in main memory[6]. This is achieved by using the two buffers alternately in order to allow outgoing DMA to complete asynchronously of the tile decoding and lighting of the next tile. A high level view of our multi-buffering strategy is depicted in figure 9.

Figure 9. The triple-buffering strategy in our lighting SPE program.

The multi-buffering system described here works so effectively that our SPE program spends an average of 5μs per frame waiting for data to be transferred to and from main memory per SPU, with the bulk of this delay being introduced early in the program’s execution as one should expect.
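The buffer rotation can be illustrated with a small portable simulation. This is a sketch under loud assumptions: `memcpy` stands in for asynchronous DMA, `process_tile` is a hypothetical placeholder that merges decoding and lighting into one step, and the tile is shrunk to 16 words. Only the order in which the three buffers are used reflects the scheme described above.

```c
#include <stddef.h>
#include <string.h>

enum { TILE_WORDS = 16 };  /* tiny stand-in for a 64x64 tile */

typedef struct TripleBuffer {
    unsigned in[TILE_WORDS];      /* target of inbound transfers */
    unsigned out[2][TILE_WORDS];  /* scratch and outbound source, alternated */
} TripleBuffer;

/* Placeholder for decode plus lighting: reads the inbound buffer and
   writes its result into one of the two output buffers. */
static void process_tile(const unsigned *src, unsigned *dst)
{
    for (size_t i = 0; i < TILE_WORDS; ++i)
        dst[i] = src[i] + 1u;
}

static void light_all_tiles(const unsigned *src, unsigned *dst,
                            size_t tile_count)
{
    TripleBuffer buf;
    if (tile_count == 0)
        return;

    memcpy(buf.in, src, sizeof buf.in);  /* fetch tile 0 up front */
    for (size_t t = 0; t < tile_count; ++t) {
        unsigned *out = buf.out[t & 1];  /* alternate output buffers */

        /* A real program would wait here for the inbound transfer and for
           the store issued from this output buffer two tiles ago. */
        process_tile(buf.in, out);

        /* The inbound buffer is free again, so fetch the next tile. */
        if (t + 1 < tile_count)
            memcpy(buf.in, src + (t + 1) * TILE_WORDS, sizeof buf.in);

        /* Send the finished tile back while the next one is processed. */
        memcpy(dst + t * TILE_WORDS, out, sizeof buf.out[0]);
    }
}
```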

6.3. Light Gathering

When the incoming DMAs for the normal buffer and depth-stencil buffer tiles have completed, we can begin processing. Before we light, we first gather the lights that affect a given tile. We do this by constructing a view frustum for each tile and culling the bounding spheres of the lights against the frustum. In addition, we also cull against the stencil buffer. This is a vital optimization as it minimizes the work done in the expensive lighting phase.

In order to perform the culling and the lighting, we actually work on sub-tiles of the frame buffer tile, 32×16 pixels in size. Culling over a smaller region is more effective, and we found sub-tiles of the above size to be optimal in our case.

Next we iterate over the depth-stencil tile to collect the minimum and maximum depth values and the minimum and maximum stencil values for each sub-tile. The depth values will form the near and far clip planes of our sub-tile’s view frustum and the stencil values allow us to do a stencil test akin to the early-stencil hardware on a GPU.

In section 5 above we describe how we write 0xff into the stencil buffer when rendering the pre-pass buffers, hence any pixels with a stencil of 0x00 we do not wish to light. However, we do not stencil on a per-pixel basis, but instead skip the lighting on a sub-tile if the maximum stencil value is equal to 0x00 (hence it is 0x00 across the entire sub-tile).
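The per-sub-tile reduction and the coarse stencil test might look as follows. This is a sketch, not our SPE code: it assumes each depth-stencil word holds depth in the upper 24 bits and stencil in the lower 8 bits, and the names `SubTileBounds`, `gather_bounds` and `sub_tile_needs_lighting` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct SubTileBounds {
    uint32_t min_depth, max_depth;      /* raw 24 bit depth values */
    uint8_t  min_stencil, max_stencil;
} SubTileBounds;

/* Assumption: depth occupies the upper 24 bits of each word and stencil
   the lower 8 bits. */
static SubTileBounds gather_bounds(const uint32_t *depth_stencil,
                                   size_t pixel_count)
{
    SubTileBounds b = { 0x00ffffffu, 0u, 0xffu, 0u };
    for (size_t i = 0; i < pixel_count; ++i) {
        uint32_t depth   = depth_stencil[i] >> 8;
        uint8_t  stencil = (uint8_t)(depth_stencil[i] & 0xffu);
        if (depth < b.min_depth)     b.min_depth = depth;
        if (depth > b.max_depth)     b.max_depth = depth;
        if (stencil < b.min_stencil) b.min_stencil = stencil;
        if (stencil > b.max_stencil) b.max_stencil = stencil;
    }
    return b;
}

/* A maximum stencil of 0x00 means no pixel in the sub-tile was marked,
   so the whole sub-tile can skip lighting. */
static int sub_tile_needs_lighting(const SubTileBounds *b)
{
    return b->max_stencil != 0x00;
}
```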

Once a sub-tile has passed the stencil test, we construct its view frustum. Knowing the screen-space position of the corners of the sub-tile, and using values from the projection matrix, we can construct their positions in view space at a fixed distance of one meter from the camera (see equation 6.5). Multiplication by the minimum and maximum view-space depth values then gives us the eight vertices of the frustum, from which we can construct the frustum’s six planes (see equation 6.4 for how to construct view-space z from a depth buffer value).

We then construct separate lists of point lights and spot lights which intersect this view frustum. The bounding sphere of each light is tested against each frustum plane, with successful point lights added to a point light list, and successful spot lights added to a spot light list.
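A scalar sketch of the sphere against frustum test used for gathering. The plane convention is an assumption: unit-length normals pointing into the frustum, with a point p inside when n·p + d ≥ 0. `Plane`, `Sphere` and `gather_lights` are hypothetical names, and a single list is built here where the text above keeps separate point and spot light lists.

```c
#include <stddef.h>

/* Assumed convention: unit normal pointing into the frustum, and a point
   p is on the inside when nx*p.x + ny*p.y + nz*p.z + d >= 0. */
typedef struct Plane  { float nx, ny, nz, d; } Plane;
typedef struct Sphere { float x, y, z, radius; } Sphere;

/* Conservative test: a sphere is rejected only when it lies entirely
   behind at least one plane. */
static int sphere_intersects_frustum(const Sphere *s, const Plane planes[6])
{
    for (int i = 0; i < 6; ++i) {
        float dist = planes[i].nx * s->x + planes[i].ny * s->y
                   + planes[i].nz * s->z + planes[i].d;
        if (dist < -s->radius)
            return 0;
    }
    return 1;
}

/* Writes the indices of the lights that touch the frustum and returns
   how many were found. */
static size_t gather_lights(const Sphere *lights, size_t count,
                            const Plane planes[6], size_t *out_indices)
{
    size_t found = 0;
    for (size_t i = 0; i < count; ++i)
        if (sphere_intersects_frustum(&lights[i], planes))
            out_indices[found++] = i;
    return found;
}
```

The test can keep a sphere that sits just outside a frustum corner, which costs some wasted lighting work but never drops a light that matters.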


If no lights are gathered, then the sub-tile follows the same path as one which fails the stencil test – the lighting is skipped and we output zero to the frame buffer. However, if at least one light does affect the sub-tile, then the lists of point lights and spot lights are passed into our lighting function to perform the most important part of the work.

6.4. Point Lights

The SPE program used for lighting is written in C and makes heavy use of SPE-specific language extensions. We made the choice early on to favour the si-style of intrinsic over the higher-level spu-style. This is due to a closer mapping to the underlying opcodes generated by the compiler [Acton08].

Lighting code is an excellent candidate for both software pipelining and loop unrolling; our lighting is performed on batches of 16 pixels at a time. We found that 16 pixels gave us a very small number of wasted cycles per iteration of our lighting loops while still allowing us to fit everything we needed into the 4KB (128 × 16 byte) register file[7]. The large numbers of independent instructions that result from lighting a relatively large set of pixels mean that the latency caused by dependent instructions closely following one another is almost completely eliminated and overall performance is massively increased (limited only by the number of issued instructions). Non-dependent instructions are interleaved with one another with the results being used some time later when they are available; this well-known optimization technique also has the side effect of improving the balance of instructions over the odd and even execution pipelines because there are a greater number of suitable, non-dependent instructions that can occupy a single fetch group in the Synergistic Execute Unit (SXU). We found that we were able to achieve approximately three times the pixel throughput from batching pixels into groups of 16 over our earlier attempts which loosely mimicked RSX® quads by lighting smaller groups of four pixels.


Before any lighting can begin it is important to reconstruct the correct input to our lighting equation expressed in equation 6.3. Equation 6.4 demonstrates how to reconstruct the z component of the view-space position of a pixel given its depth buffer value and the near and far planes of the view frustum. Calculating the x and y components of the view-space position is equally trivial when given the x and y coordinates of the pixel in screen-space and the view projection matrix. This is shown by equation 6.5.
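As an illustration of the reconstruction step, the sketch below assumes a conventional perspective projection in which the depth buffer stores values in [0..1], view-space z is positive in front of the camera, and `proj00` and `proj11` are the x and y scale terms of the projection matrix. These conventions and all names are assumptions made for the example and may differ from the exact form of equations 6.4 and 6.5.

```c
typedef struct Vec3 { float x, y, z; } Vec3;

/* Assumed convention: depth 0 maps to the near plane and depth 1 to the
   far plane, with view-space z positive in front of the camera. */
static float view_z_from_depth(float depth, float near_z, float far_z)
{
    return (near_z * far_z) / (far_z - depth * (far_z - near_z));
}

/* px and py are pixel coordinates with the origin at the top left.
   proj00 and proj11 are the x and y scales of the projection matrix. */
static Vec3 view_position_from_pixel(float px, float py, float depth,
                                     float width, float height,
                                     float proj00, float proj11,
                                     float near_z, float far_z)
{
    Vec3 p;
    /* Pixel centre to normalised device coordinates in [-1..1]. */
    float ndc_x = 2.0f * (px + 0.5f) / width - 1.0f;
    float ndc_y = 1.0f - 2.0f * (py + 0.5f) / height;

    p.z = view_z_from_depth(depth, near_z, far_z);
    /* Undo the perspective divide and the projection scale. */
    p.x = ndc_x * p.z / proj00;
    p.y = ndc_y * p.z / proj11;
    return p;
}
```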



In HLSL/Cg shaders it is quite common to use the saturate intrinsic function to clamp values to a [0..1] range. To achieve this on the SPE there is a clever trick that we feel is certainly worthy of mention here. Day et al. introduced the fast saturate/clamp technique which uses the SPU’s floating-point conversion instructions in order to achieve clamping of a floating point value to a variety of different ranges. This depends on the combination of scale bias operands issued with the instructions [Day08]. In a pipelined loop, such as our lighting loop, instruction count is oftentimes the overriding determinant of the code’s execution speed and as such we are able to employ this trick to great effect. Listing 1 demonstrates this technique.

/* HLSL saturate, clamp to [0..1]. */
qword x = si_cfltu(q, 0x20);
qword y = si_cuflt(x, 0x20);

Listing 1. Saturate a qword in two even pipeline instructions.
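For readers without access to SPU hardware, the effect of the two conversions in listing 1 can be emulated in portable scalar C. This is an approximation written for illustration: the helper names are hypothetical, the saturation is spelled out with explicit comparisons, and rounding of the final result may differ from the hardware.

```c
#include <stdint.h>

/* Emulates a float to unsigned conversion with a scale of 32: the value
   is multiplied by 2^32 and saturated to the range of a 32 bit unsigned
   integer. Negative inputs (and NaN in this portable version) give 0. */
static uint32_t float_to_u32_scaled(float value)
{
    double scaled = (double)value * 4294967296.0;
    if (!(scaled > 0.0))
        return 0u;
    if (scaled >= 4294967295.0)
        return 0xffffffffu;
    return (uint32_t)scaled;
}

/* Emulates the reverse conversion: the integer is divided by 2^32. */
static float u32_scaled_to_float(uint32_t value)
{
    return (float)((double)value / 4294967296.0);
}

/* The round trip clamps the input to [0..1] as a side effect of the
   saturating conversion. */
static float saturate_via_conversion(float value)
{
    return u32_scaled_to_float(float_to_u32_scaled(value));
}
```

The clamp falls out of the integer range rather than from any comparison, which is why the SPE version needs only the two conversion instructions.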

One interesting difference between the GPU implementation of our lighting and the SPE implementation is the switch from the default Array of Structures (AoS) data layout on the GPU, to the transposed, SIMD-friendly Structure of Arrays (SoA)[8] data layout on the SPE. The difference in format of the data is illustrated below in figure 10. By storing, loading and shuffling data into a SoA layout we are able to perform our lighting calculations much more efficiently on the SPEs. A pleasant side-effect of the switch is that the resulting C code becomes much more scalar-like in appearance, making it easier for other programmers to follow [Bader06].

Figure 10. Shuffling an AoS into a SoA.
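The transposition can be shown in plain C, with arrays of four floats standing in for SIMD registers. `Vec4`, `Vec4SoA`, `aos_to_soa` and `dot3_soa` are hypothetical names used only for this sketch.

```c
typedef struct Vec4    { float x, y, z, w; } Vec4;             /* AoS element */
typedef struct Vec4SoA { float x[4], y[4], z[4], w[4]; } Vec4SoA;

/* Transposes four xyzw vectors so that each component of all four
   vectors sits together, as it would in one SIMD register. */
static Vec4SoA aos_to_soa(const Vec4 in[4])
{
    Vec4SoA out;
    for (int i = 0; i < 4; ++i) {
        out.x[i] = in[i].x;
        out.y[i] = in[i].y;
        out.z[i] = in[i].z;
        out.w[i] = in[i].w;
    }
    return out;
}

/* Four three-component dot products at once. Each line of arithmetic
   maps onto whole-register operations, with no horizontal adds. */
static void dot3_soa(const Vec4SoA *a, const Vec4SoA *b, float out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = a->x[i] * b->x[i] + a->y[i] * b->y[i] + a->z[i] * b->z[i];
}
```

Note how `dot3_soa` reads like the scalar formula for a single dot product, which is the scalar-like appearance mentioned above.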

The SPE is only equipped to deal with 16 byte aligned writes and reads to and from its local store [Bader06, IBM08, IBM08a]. The targets from all load and store operations first undergo a logical ‘and’ with the LSLR register (set to 0x3ffff for current implementations of the CBE) before the SPU Store and Load unit (SLS) fetches or writes the address [IBM08, IBM08a]. Writing scalar values is achieved by way of a load-shuffle-store pattern. It is therefore desirable to perform loads and stores on 16 byte boundaries only. As our program required a lot of 4 byte loads from our normal/depth buffer and a lot of similarly sized writes to our light buffer we ended up batching these loads and stores into 16 byte chunks in order to eliminate the overhead of the additional code that would be required if we were to perform these operations on a pixel-by-pixel basis. This proved to deliver a significant performance increase, especially in the case of storing where nearly all pipeline bubbles were eliminated. We present a portion of our pixel writing code in listing 2 for a single four pixel block:

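/* Convert the float colours to 32 bit unsigned fixed point; the
   conversion saturates, which clamps the inputs to [0..1]. */
qword c0        = si_cfltu(dif0, 0x20);
qword c1        = si_cfltu(dif1, 0x20);
qword c2        = si_cfltu(dif2, 0x20);
qword c3        = si_cfltu(dif3, 0x20);
/* Rescale the upper halfword of each word and add a bias, leaving the
   8 bit channel value in the top byte of each word. */
      dif       = si_ila(0x8000);
qword scale     = si_ilh(0xff00);
      dif0      = si_mpyhhau(c0, scale, dif);
      dif1      = si_mpyhhau(c1, scale, dif);
      dif2      = si_mpyhhau(c2, scale, dif);
      dif3      = si_mpyhhau(c3, scale, dif);
/* Gather the top bytes of the first three words of each source into one
   32 bit pixel; the 0xc0 control byte fills the remaining byte with 0xff. */
const vector unsigned char _shuf_uint = {
    0xc0, 0x00, 0x04, 0x08,
    0xc0, 0x10, 0x14, 0x18,
    0xc0, 0x00, 0x04, 0x08,
    0xc0, 0x10, 0x14, 0x18 };

qword s_uint    = (const qword)_shuf_uint;
qword base_addr = si_from_ptr(result);
qword p0_01     = si_shufb(dif0, dif1, s_uint);
qword p0_02     = si_shufb(dif2, dif3, s_uint);
/* Merge the two pairs into one 16 byte block of four pixels and store it. */
qword p0        = si_selb(p0_01, p0_02, m_00ff);
                  si_stqd(p0, base_addr, 0x00);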

Listing 2. Pixels are converted from their floating-point representations into 32 bit values, batched into 16 byte chunks, and stored.
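A portable sketch of the same idea follows: quantise four pixels, then write them out as one 16 byte block. It does not reproduce the fixed-point arithmetic of listing 2. Plain round-to-nearest quantisation is used instead, and the channel order (0xff in the top byte, then red, green and blue) is an assumption, as are the names `PixelBlock`, `quantize_channel` and `pack_four_pixels`.

```c
#include <stdint.h>

/* Four 32 bit pixels in one 16 byte aligned block, so that the whole
   group can be written with a single aligned store. */
typedef struct PixelBlock {
    _Alignas(16) uint32_t pixel[4];
} PixelBlock;

/* Clamp to [0..1] and round to the nearest 8 bit value. */
static uint32_t quantize_channel(float value)
{
    if (!(value > 0.0f))
        return 0u;
    if (value >= 1.0f)
        return 255u;
    return (uint32_t)(value * 255.0f + 0.5f);
}

/* Assumed layout: 0xff in the top byte, then red, green and blue. */
static PixelBlock pack_four_pixels(const float rgb[4][3])
{
    PixelBlock block;
    for (int i = 0; i < 4; ++i) {
        block.pixel[i] = 0xff000000u
                       | (quantize_channel(rgb[i][0]) << 16)
                       | (quantize_channel(rgb[i][1]) << 8)
                       |  quantize_channel(rgb[i][2]);
    }
    return block;
}
```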

6.5 Spot Lights

In the interest of completeness we present the mathematics used for our ‘regular’ spotlights in equation 6.6.


Note that this is the same as the equation for the point lights, with an additional term at the end. d is the direction of the light (as opposed to l, which is the direction from the light to the point), and the two remaining symbols are the angle of the inner cone and the angle of the outer cone. However, we store their cosines on the light rather than calculating them every time. All lighting values for both point and spot lights are summed for each pixel, yielding the equation in 6.7.
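To make the role of the stored cosines concrete, here is one common form of cone falloff. It is an assumption made for illustration and not necessarily the exact extra term of equation 6.6: the attenuation ramps linearly from 0 at the outer cone to 1 at the inner cone, using only the cosine of the angle between d and l. All function names are hypothetical.

```c
static float clamp01(float v)
{
    return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
}

/* Assumed falloff: 0 at the outer cone, 1 at the inner cone, linear in
   the cosine in between. cos_inner must be greater than cos_outer. */
static float spot_falloff(float cos_angle, float cos_inner, float cos_outer)
{
    return clamp01((cos_angle - cos_outer) / (cos_inner - cos_outer));
}

/* d is the light direction and l the direction from the light to the
   point, both unit length. Their dot product is the cosine of the angle
   between them, so no trigonometry is needed per pixel. */
static float spot_term(const float d[3], const float l[3],
                       float cos_inner, float cos_outer)
{
    float cos_angle = d[0] * l[0] + d[1] * l[1] + d[2] * l[2];
    return spot_falloff(cos_angle, cos_inner, cos_outer);
}
```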


7. The Main Pass

When the SPEs have finished calculating the light buffer, they then signal to the RSX® that the main pass can be rendered. As mentioned above, the synchronization at this stage is very important – we do not want to be reading from an incomplete light buffer. To composite the light buffer with the main scene, we read it as a texture in the pixel shaders. However, as not every pixel in our scene receives light from our pre-pass (see above, we only render geometry into the pre-pass that receives light), we use two shader techniques in the scene: one which samples from the light buffer, and one which does not. For the former technique, each pixel looks up its lighting in the light buffer using its screen-space coordinate, and then composites the light value as a diffuse light, as follows:


It might be tempting to simply additively or multiplicatively blend the lighting buffer over the scene, but as can be seen above, that method will result in incorrect lighting. This is due to the presence of additional static lighting in our scene.
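A small numeric sketch shows why blending alone goes wrong. It assumes, purely for illustration, that the correct composite modulates the sum of static and dynamic lighting by the surface albedo; the function names and this exact form are assumptions and not a reproduction of the compositing equation.

```c
/* Assumed correct composite: albedo modulates all incoming light. */
static float composite(float albedo, float static_light, float dynamic_light)
{
    return albedo * (static_light + dynamic_light);
}

/* Additive blend over a scene lit only statically: the dynamic light
   is never modulated by the albedo. */
static float additive_blend(float albedo, float static_light,
                            float dynamic_light)
{
    return albedo * static_light + dynamic_light;
}

/* Multiplicative blend: the dynamic light scales the static lighting
   instead of adding to it. */
static float multiplicative_blend(float albedo, float static_light,
                                  float dynamic_light)
{
    return albedo * static_light * dynamic_light;
}
```

With an albedo of 0.5, static light of 0.5 and dynamic light of 0.25, the three functions return 0.375, 0.5 and 0.0625 respectively, so neither blend mode matches the composite.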

It is also possible to read from the normal buffer in the main pass. This means that reading from normal maps and converting from tangent space to view (or world) space only happens once. However, this also means that the low precision of the normals stored in the pre-pass becomes more noticeable (only 8 bits per component). For this reason and others we did not use this option.

At the end of rendering we have a scene with many dynamic lights rendered using the Cell Broadband Engine™. Not only does this open up exciting new possibilities for our rendering engine, but it does so with minimal GPU cost, with a large amount of work performed on the CPU.

8. Conclusion

We have presented a method which splits the work of light pre-pass rendering between the RSX® and the SPEs on the Cell Broadband Engine™. We use the strengths of both components to our advantage: the rasterization performance of the RSX® to render the pre-pass geometry, and the vector maths performance of the SPEs to calculate the lighting buffer. By parallelizing the lighting calculation on the SPEs with some other rendering on the RSX® (for instance, a dynamic cube map), the lighting becomes free and thus this can be a major GPU optimization. In fact, even without the added bonus of parallelization, we found that in some cases, five SPEs running carefully crafted programs could outperform the RSX® when performing lighting calculations.

Figure 11. Another screenshot from Blur.

As new architectures emerge we believe there will be increasing opportunities to take processing load off the GPU and place it back onto the CPU. It remains to be seen how things will pan out when the two are combined in Intel’s Larrabee [Seiler08], but on the Cell Broadband Engine™ we offer that the GPU can be massively accelerated in cases such as deferred lighting or light pre-pass rendering by writing a custom CPU implementation that executes on the SPEs.

9. Further Work

There are many improvements that could be made to the techniques we describe. Firstly, we currently omit specular from our lighting model. We propose either writing out specular to a separate lighting buffer or placing a monochrome specular term in the alpha channel of the lighting buffer as in [Engel09]. Material properties could be controlled by adding a specular power in the alpha channel of the normal buffer. Another problem is that our lighting is currently LDR, as it is stored in an 8:8:8:8 integer format. One option is moving to a 16:16:16:16 float, but Wilson suggests instead using the CIE Luv colour space [Wilson09]. Using this method, we can still use an 8:8:8:8 buffer, but with the luminance part of the colour using 16 bits. This technique has problems on the GPU as additive blending of lights on top of each other no longer works, but in the SPE program we have no such problem and thus this becomes more feasible; if one wished to implement a more GPU-friendly technique then diffuse light intensity could also be stored in the alpha channel as in [Valient07].

Both of the previous suggestions involve making use of the currently unused alpha channel in the lighting buffer. While there are certainly many possible uses for this byte, one idea we are currently investigating is storing the amount of fog for each pixel. We believe this could be especially beneficial for more expensive fogging equations, for instance, if height fog is being used. This is an example of adding value to the SPE program [Swoboda09a].

Given the amount of work already being done, including processing the entire normal and depth buffers, there is extra rendering work that could be done in the SPE program. One simple example is performing a down-sample of the depth buffer to a quarter resolution; this could be output asynchronously through the MFC, adding little overhead to the SPE program, and would be useful for many reduced resolution effects such as motion blur, soft particles, occlusion culling and even screen-space ambient occlusion. It would also be possible to reduce the amount of processing on the normal and depth buffers by combining the view-space normals and depth into a single 32-bit buffer: the x and y components of the normal are encoded into the first two channels (or converted to spherical coordinates), and linear view-space depth is packed into the remaining 16 bits. This halves the amount of data needed by our SPE program. In fact, this approach is the method we chose for the final version of Blur.
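The combined normal/depth layout can be sketched as follows. This is only an illustration under stated assumptions: the bit order, the 8-bit quantization of x and y, the `far_plane` normalization and the choice to reconstruct z as non-negative (a normal facing the viewer) are all hypothetical, and the format actually shipped in Blur may differ.

```python
import math

def pack_normal_depth(nx, ny, linear_depth, far_plane):
    """Pack normal x/y (8 bits each) and linear depth (16 bits) into one 32-bit word."""
    # Map each normal component from [-1, 1] to [0, 255].
    qx = int(round((min(max(nx, -1.0), 1.0) * 0.5 + 0.5) * 255.0))
    qy = int(round((min(max(ny, -1.0), 1.0) * 0.5 + 0.5) * 255.0))
    # Normalize depth by an assumed far plane and quantize to 16 bits.
    d = min(max(linear_depth / far_plane, 0.0), 1.0)
    qd = int(round(d * 65535.0))
    return (qx << 24) | (qy << 16) | qd

def unpack_normal_depth(word, far_plane):
    """Recover the normal (z reconstructed) and linear depth from the packed word."""
    nx = ((word >> 24) & 0xFF) / 255.0 * 2.0 - 1.0
    ny = ((word >> 16) & 0xFF) / 255.0 * 2.0 - 1.0
    # Assumption: the normal is unit length and its z component is non-negative.
    nz = math.sqrt(max(0.0, 1.0 - nx * nx - ny * ny))
    depth = (word & 0xFFFF) / 65535.0 * far_plane
    return (nx, ny, nz), depth
```

Reconstructing z from x and y trades a square root per pixel for half the memory traffic, which is the appeal on a bandwidth-bound SPE program; the sign assumption on z is the usual caveat of this kind of encoding.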

Finally, it is our intention to remove the decoding of the buffers altogether and perform lighting directly on the encoded normal/depth buffers, which has several advantages. The decoding process can be replaced with a simple pass over all the pixels in the frame buffer tile, which should yield a minor increase in overall lighting performance together with saving the memory required for the lighting buffer. However, this extra performance and improved memory footprint come at the cost of added mathematical complexity, as deriving the view-space position of pixels becomes non-trivial. This is due to the need to take into account the effects of the encoded buffer’s format on the final view-space position of the pixel.
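For reference, the straightforward case on a decoded, linear buffer looks something like the sketch below, which rebuilds a view-space position from a pixel coordinate and its linear depth. It assumes a symmetric perspective projection, depth measured along the view axis, and pixel (0, 0) at the top-left; the function and parameter names are hypothetical. Working on an encoded buffer would additionally require mapping each stored sample back to its true pixel coordinate before this step, which is where the extra complexity comes from.

```python
import math

def view_space_position(px, py, linear_depth, width, height, fov_y_radians):
    """Reconstruct the view-space position of the centre of pixel (px, py)."""
    aspect = width / height
    tan_half_fov = math.tan(fov_y_radians * 0.5)
    # Pixel centre to normalized device coordinates in [-1, 1], with y pointing up.
    ndc_x = (px + 0.5) / width * 2.0 - 1.0
    ndc_y = 1.0 - (py + 0.5) / height * 2.0
    # Scale the view ray through the pixel by the linear depth.
    x = ndc_x * aspect * tan_half_fov * linear_depth
    y = ndc_y * tan_half_fov * linear_depth
    return (x, y, linear_depth)
```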

10. Acknowledgements

First and foremost we would like to extend our unending thanks to Matt Swoboda of SCEE R&D for laying the groundwork for our continuing efforts and for his suggestions for our implementation. We would also like to thank Colin Hughes of SCEE R&D for his help and suggestions with optimizations.

We also extend our thanks to all the supremely talented individuals that form the Core Technologies Team at Bizarre Creations Ltd., especially to Ian Wilson, Paul Malin, Lloyd Wright, Ed Clay, Jose Sanchez, Charlie Birtwistle, Jan van Valburg, Kier Storey, Fengyun Lu, Jason Denton, Dave Hampson, Chris Cookson and Richard Thomas.

11. Bibliography

[Acton08] M. Acton, E. Christensen, “Insomniac SPU Best Practices,”
http://www.insomniacgames.com/tech/articles/0208/insomniac_spu_programming_gdc08.php, accessed on 2nd July 2009.

[Bader07] D. A. Bader, “Cell Programming Tips & Techniques,”
http://www.cc.gatech.edu/~bader/CellProgramming.html, accessed on 2nd July 2009.

[Balestra08] C. Balestra and P. Engstad, “The Technology of Uncharted: Drake’s Fortune,” Game
Developers Conference 2008, available online at

[Capcom06] Capcom Inc. “The MT Framework,” http://game.watch.impress.co.jp/docs/20070131/3dlp.htm,
accessed on 3rd July 2009.

[Day08] M. Day, J. Garrett, “Faster SPU Clamp,”
http://www.insomniacgames.com/tech/articles/0308/faster_spu_clamp.php, accessed on 2nd July 2009.

[Deering88] M. Deering, “The Triangle Processor and Normal Vector Shader: A VLSI System for High
Performance Graphics,”

[Engel08] W. Engel, “Light Pre-Pass Renderer,”
http://diaryofagraphicsprogrammer.blogspot.com/2008/03/light-pre-pass-renderer.html, accessed on 4th July 2009.

[Engel09] W. Engel, “Designing a Renderer for Multiple Lights: The Light Pre-Pass Renderer,” in
ShaderX7, edited by Wolfgang Engel, Charles River Media, 2009: pp. 655-666.

[Engel09a] W. Engel, “The Light Pre-Pass Renderer Mach III,” to appear in proceedings of ACM
SIGGRAPH09, 2009.

[IBM08] IBM Corp., “Cell Broadband Engine Programming Handbook Version 1.11”.

[IBM08a] IBM Corp., “Synergistic Processing Unit Instruction Set Architecture”.

[IBM09] IBM Corp. “The Cell Project at IBM,” http://researchweb.watson.ibm.com/cell/home.html, accessed
on 4th July 2009.

[Möller08] T. Akenine-Möller, E. Haines, N. Hoffman, “Real-Time Rendering 3rd Edition,” 978-1-56881-424-7, A.K. Peters Ltd. 2008.

[Perthuis06] C. Perthuis, “Introduction to the Graphics Pipeline of the PS3,” Eurographics 2006.

[Seiler08] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, P. Hanrahan, “Larrabee: a many-core x86 architecture for visual computing,” ACM Transactions on Graphics, Volume 27, Issue 3, ACM SIGGRAPH 2008, August 2008.

[Shippy09] D. Shippy, M. Phipps, “The Race for a New Games Machine: Creating The Chips Inside The New Xbox360 & The Playstation 3,” 978-0-8065-3101-4, Citadel Press, 2009.

[Swoboda09] M. Swoboda, “Deferred Lighting and Post Processing on PLAYSTATION®3,” Game Developers Conference 2009, available online at

[Swoboda09a] M. Swoboda, Private Correspondence, 2009.

[Valient07] M. Valient, “Deferred Rendering in Killzone 2,” http://www.dimension3.sk/mambo/Download-document/Deferred-Rendering-In-Killzone.php, accessed on 4th July 2009.

[Wilson09] P. Wilson, “Light Pre-Pass Renderer: Using the CIE Luv Color Space,” in ShaderX7, edited by Wolfgang Engel, Charles River Media, 2009: pp. 667-677.

[1] “PlayStation”, “PLAYSTATION” and the “PS” family logo are registered trademarks and “Cell Broadband Engine” is a trademark of Sony Computer Entertainment Inc. The “Blu-ray Disc” and “Blu-ray Disc” logos are trademarks.

[2] Screenshots of Blur appear courtesy of Activision Blizzard Inc. and Bizarre Creations Ltd.

[3] One of the eight SPEs is locked out to increase chip yield and another is reserved by Sony’s Cell OS. Applications running on the PLAYSTATION®3 actually have six SPEs to take advantage of.

[4] As one might expect, linear access patterns fare significantly better than random access.

[5] For information on how to decode, Sony PLAYSTATION®3 developers can consult the RSX® User’s Manual.

[6] We do not currently encode the lighting results; please see further work for more information.

[7] Any more pixels in a single loop in our implementation would risk causing registers to be spilled.

[8] SOA organization is also known as “parallel-array”.

Hello all,

First of all apologies that this blog has been a little light on content this year thus far. I have a few bits and pieces I’d like to write about, but I’ve been very busy of late with work (and also with a few bits outside of work), both of which I hope to be able to discuss more openly soon.

Anyways, a couple of books came out recently which I have contributed chapters to. The first is GPU Pro which just dropped (at least here in Europe). For GPU Pro, I co-authored a chapter about our SPU-based deferred lighting system that I wrote for Blur. The article was actually penned about 12 months ago, but getting books out takes a little time so apologies for not being able to share this stuff sooner. The other book I contributed a chapter to has been out a couple of months now and is the latest installment of the hugely popular “Game Programming Gems” series. The chapter is about general techniques for SPU-assisted rendering on the PlayStation3 platform.

As always I welcome contact about any of my work, so please feel free to drop me a line or hit me up on twitter if you’d like to discuss any aspects of these chapters.


A few days back I did an interview with fellow Bizarreo, Charlie Birtwistle, about the technology behind Blur. I’m pleased to announce that it is now up for your reading pleasure over at Eurogamer, see the link below. Any comments or questions are most welcome.

Eurogamer Technology Interview


Eurogamer posted the following comparison of Blur on 360 and PS3:


Finally got around to posting these guys up :) Enjoy!


Just a quick note to say I will be speaking at Develop Conference 2010. The talk is called “SPU-Assisted Rendering”, and will cover some of the SPU enhanced rendering we do in “Horizon,” the core technology at Bizarre Creations that powers Blur. Hope to see any of you there!


Just a quick note for those of you asking me for the Develop North slides. It’s going to be 2010 now until I might be able to release them. They have to go through some pre-approval process at work before they can be released to the public at large. Unfortunately for the slides, but fortunately for everyone else, everyone is kinda busy right now making Blur even more stunningly beautiful and fun than it already is :). So it’s not so high on our priority list right now.


So first of all, yet another new look to the blog. This is due to a complete re-installation of WordPress to hopefully stop people trying to break into it… I wish they’d just give up and leave it alone, ah well… Takes all sorts I guess. So this should mean anyone taking RSS feeds from me and viewing them in Google or something similar should stop getting a load of bullshit in my post content, and start getting… well okay, more bullshit, but at least stuff I actually intended to be visible :)

I have just finished and submitted the article I co-authored with Steve McAuley. It should be appearing in GPU Pro : Advanced Rendering Techniques and is entitled “Parallelized Light Pre-Pass Rendering with the Cell Broadband Engine”. As such I have been invited by Wolfgang Engel to contribute to the book’s fantastic new blog. So hopefully I will have some worthwhile material to contribute over there too.

Ste :)

Don’t you hate it when you get flashes of “inspiration”, only to find that those flashes aren’t common practice for a reason? I was talking to a colleague today about the different approaches to the humble vertex transformation and which would be quickest. After some experimentation and thought it occurred to me that it should be possible to perform a transformation of a vertex by an arbitrary 4×4 matrix with just 3 instructions, beating the mul intrinsic function’s 4 instruction implementation, which takes roughly 5 cycles on NVIDIA SM3.0 compliant hardware, by a full instruction and maybe, if we’re lucky, a cycle or two. Sure, the approach would only work if the w component of the vector being transformed is exactly 1, but that is nearly always the case (if you can see where I’m going with this, bear with me, you may be surprised).

Let me explain. When using the mul function with a row-major matrix (i.e. mul(v, m) as opposed to mul(m, v)), mul will typically compile down to the following:

dp4 oPos.x, v0, c0
dp4 oPos.y, v0, c1
dp4 oPos.z, v0, c2
dp4 oPos.w, v0, c3

This is a fairly intuitive piece of assembly: we’re just multiplying and summing each column of our matrix with our column vector. But what happens when we swap the arguments to mul around and use a transposed matrix for the shader constant?

mul r0, v0.y, c1
mad r0, c0, v0.x, r0
mad r0, c2, v0.z, r0
mad oPos, c3, v0.w, r0

We get the same number of instructions, albeit mad isn’t quite as fast as dp4, taking roughly 7 cycles to execute on most NVIDIA SM3.0 compliant hardware. The important thing to take from this, though, is the approach: with the dp4 implementation it wasn’t really possible to isolate (and hence drop) the w component from our calculation. You might be tempted to somehow replace the last dp4 with a mov, but that actually takes exactly the same time as a dp4 instruction anyway and you might just as well have the added flexibility of supporting uniform scaling with 1/s in the w component! With column-major matrices we can trivially isolate the w component in our calculation. Moreover, we certainly have room for manoeuvre in terms of our instruction usage: because the first instruction is a mul rather than a mad, we’re wasting an add operation here, which luckily for us is exactly what we need! Please note mul here is a component-wise multiplication of two vectors and should not be confused with the mul HLSL function mentioned earlier. So with a little jiggery-pokery we get this:

mad oPos, v0.y, c1, c3
mad oPos, c0, v0.x, oPos
mad oPos, c2, v0.z, oPos

Pretty, isn’t it? So is that job done? Actually, no. If you scroll back up to the first line of this entry you’ll read about how the approach is flawed; the title gives away the reason behind this. Mathematically and technically speaking this approach is sound, and indeed if you’re compiling for vs_4_0 you’re away! Unfortunately, we current-gen console devs live firmly in the world of vs_3_0, and have no such luxury. On SM3.0 the mad instruction has one crippling limitation that puts the nail in the coffin of this little optimisation: it can only access the contents of a single constant register at a time! Ah well, looks like I’ll be sticking to mul(v, m) for the foreseeable future! :)
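For anyone who wants to convince themselves of the arithmetic, here is a small sketch that mirrors the four-dp4 and three-mad sequences in plain Python. The function names and the tuple-of-rows matrix layout are assumptions made for illustration. It only checks that the maths agrees when w is 1; it says nothing about the SM3.0 constant register restriction, which is a hardware limit rather than a mathematical one.

```python
def transform_dp4(v, m):
    """Four dot products, one per output component (the mul(v, m) form)."""
    # Each constant register holds one column of m.
    columns = tuple(zip(*m))
    return tuple(sum(a * b for a, b in zip(v, c)) for c in columns)

def transform_mad3(v, m):
    """Three multiply-adds using rows of m as constants; valid only when v.w == 1."""
    x, y, z, _w = v
    c0, c1, c2, c3 = m
    # mad oPos, v0.y, c1, c3  (the translation row is folded into the first add)
    r = tuple(y * a + b for a, b in zip(c1, c3))
    # mad oPos, c0, v0.x, oPos
    r = tuple(x * a + b for a, b in zip(c0, r))
    # mad oPos, c2, v0.z, oPos
    r = tuple(z * a + b for a, b in zip(c2, r))
    return r

m = ((1, 2, 3, 4),
     (5, 6, 7, 8),
     (9, 10, 11, 12),
     (13, 14, 15, 16))
result = transform_mad3((2, 3, 4, 1), m)
```

With any other value of w the two functions disagree, which is exactly the restriction mentioned at the top of the post.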
