A draft of this was first penned back in 2009 and appeared in GPU Pro : Advanced Rendering Techniques. It was co-written with my good friend, Stephen McAuley. Some of the stuff in here I don’t really agree with anymore, but I thought I’d post it up pretty much unedited (only minor corrections).

The light pre-pass renderer [Engel08, Engel09, Engel09a] is becoming an ever more popular choice of rendering architecture for modern real-time applications that have extensive dynamic lighting requirements. In this article we introduce and describe techniques that can be used to accelerate the real-time lighting of an arbitrary 3D scene on the Cell Broadband Engine™ without adding any additional frames of latency to the target application. The techniques described in this article were developed for the PLAYSTATION®3[1] version of Blur, which was released in May 2010 [2].


Figure 1. A screenshot from Blur.

1. Introduction

As GPUs have become more powerful, people have sought to use them for purposes other than graphics. This has opened an area of research called GPGPU (General Purpose GPU), which major graphics card manufacturers are embracing. For example, all NVIDIA® GeForce® GPUs now support PhysX® technology, which enables physics calculations to be performed on the GPU. However, much less has been made of the opposite phenomenon – with the increase in speed and number of CPUs in a system, it is becoming feasible on some architectures to move certain graphics calculations from the GPU back onto the CPU. Forthcoming hardware such as Intel’s Larrabee even combines both components [Seiler08], which will certainly open the door for CPU-based approaches to previously GPU-only problems becoming more popular. Today, one such architecture is the PLAYSTATION®3 where the powerful Cell Broadband Engine™ was designed from the outset to support the GPU in its processing activities [Shippy09]. This article explains how the Cell Broadband Engine™ can be used to calculate lighting within the context of a light pre-pass rendering engine.

2. Light Pre-Pass Rendering

A commonly encountered problem in computer graphics has been how to construct a renderer that can handle many dynamic lights in a scene. Traditional forward rendering does not perform well with multiple lights. For example, if a pixel shader is written for up to four point lights, then only four point lights can be drawn (and no spotlights). We could either increase the number of pixel shader combinations to handle as many cases as possible, or we could render the geometry multiple times, once more for each additional light. Neither of these solutions is desirable as they increase the number of state changes and draw calls to uncontrollable levels.
A popular solution to this problem is to use a deferred renderer [Deering88]. Instead of writing out fully lit pixels from the pixel shader, we instead write out information about the surface into a G-Buffer, which would include depth, normal and material information. An example G-buffer format is shown below:


Figure 2. An example G-Buffer format from a deferred rendering engine (after [Valient07]).

We then additively blend the lights into the scene, using the information provided in the G-Buffer. Thus many lights can be rendered, without additional geometry cost or shader permutations. In addition, by rendering closed volumes for each light, we can ensure that only calculations for pixels directly affected by a light are carried out. However, with deferred rendering, all materials must use the same lighting equation, and can only vary by the properties stored in the G-Buffer. There are also huge memory bandwidth costs to rendering to (and reading from) so many buffers, which increases with MSAA.

In order to solve these problems, Engel suggested the light pre-pass renderer, first online in [Engel08] and then later published in [Engel09], although a similar idea was independently developed by others and has been recently used in games such as Uncharted: Drake’s Fortune [Balestra08]. Instead of rendering out the entire G-Buffer, the light pre-pass renderer stores depth and normals in one or two render targets. The lighting phase is then performed, with the properties of all lights accumulated into a lighting buffer. The scene is then rendered for a second time, sampling the lighting buffer to determine the lighting on that pixel.

Using a Blinn-Phong lighting model means that the red, green and blue channels of the lighting buffer store the diffuse calculation, while we can fit a specular term in the alpha channel, the details of which are described in [Engel09]. This means that unlike a deferred renderer, different materials can handle the lighting values differently. This increased flexibility, combined with reduced memory bandwidth costs, has seen the light pre-pass renderer quickly increase in popularity and is now in use in many recent games on a variety of hardware platforms.

Yet the deferred renderer and light pre-pass renderer share the fact that lighting is performed in image space, and as such requires little to no rasterization. This makes the lighting pass an ideal candidate to move from the GPU back onto the CPU. Swoboda first demonstrated this method with a deferred renderer on the PLAYSTATION®3 and Cell Broadband Engine™ in [Swoboda09], and now we expand upon his work and apply similar techniques to the light pre-pass renderer.

3. The PLAYSTATION®3 and the CBE™

Sony Computer Entertainment released the PLAYSTATION®3 in 2006. It contains the Cell Broadband Engine™ which was developed jointly by Sony Computer Entertainment, Toshiba Inc. and IBM Corp. [Shippy09, Möller08, IBM08]. The Cell is the Central Processing Unit (CPU) of the PLAYSTATION®3. In addition to the Cell chip, the PLAYSTATION®3 also has a GPU, the Reality Synthesizer, or RSX®. The RSX® was developed by NVIDIA Corporation, and is essentially a modified GeForce®7800 [Möller08]. A high-level view of the architecture can be found in figure 3.


Figure 3. The PLAYSTATION®3 architecture. (Illustration after [Möller08, Perthuis06]).

Inside the Cell chip one can find two distinctly different types of processor. There is the PowerPC Processing Element (PPE) and eight[3] pure SIMD processors [Möller08] known as Synergistic Processing Elements (SPEs) all of which are connected by a high speed, “token-ring style” bus known as the Element Interconnect Bus (EIB), see figure 4. The techniques introduced and described in this paper are chiefly concerned with the usage of the SPEs and as such further discussion of the PPE has been omitted.


Figure 4. The Cell Broadband Engine (after [IBM08]).

One interesting quirk of the SPE is that it does not directly have access to the main address space and instead has its own internal memory known as the local store. The local store on current implementations of the CBE is 256KB in size. The memory is unified, untranslatable and unprotected [Bader06, IBM08] and must contain the SPE’s program code, call stack and any data that it may happen to be processing. To load or store data from or to the main address space a programmer must explicitly use the Memory Flow Controller (MFC). Each SPE has its own MFC which is capable of queuing up to sixteen Direct Memory Accesses (DMAs) [IBM08].

As the SPU ISA operates primarily on SIMD vector operands, both fixed-point and floating-point [IBM09], it is very well equipped to process large quantities of vectorised data. It has a very large register file (4KB) which is helpful in hiding the latencies of pipelined and unrolled loops, and while the local store is relatively small in capacity, it is usually sufficient to allow a programmer to hide the large latency of main memory accesses[4] through effective multi-buffering. Code that is to execute efficiently on the SPE should be written to play to these strengths.

A more in-depth discussion of the PLAYSTATION®3 and the Cell Broadband Engine™ is beyond the scope of this paper. Interested readers can refer to IBM’s website for further details about the Cell chip [IBM09], and to Möller, Haines and Hoffman, who describe some of the PLAYSTATION®3 architecture in [Möller08].

4. GPU/SPE Synchronization

As the number of processors in our target platforms becomes ever greater, the need to automate the scheduling of work being carried out by these processing elements also becomes greater. This has continued to the point where game development teams now build their games and technology around the concept of the job scheduler [Capcom06]. Our engine is no exception to this trend, and the solution we propose for GPU/SPE inter-processor communication relies on close integration with such technology. It is for this reason that we believe our solution to the problem of RSX®/SPE communication is robust and viable, and one that many others can easily fold into their existing scheduling frameworks. In order to perform fragment shading on the SPE without introducing unwanted latency into the rendering pipeline there needs to be a certain amount of inter-processor communication between the GPU and SPEs. This section discusses the approach we used to achieve this synchronization.
Each SPE has several Memory Mapped I/O (MMIO) registers it can use for inter-processor communication with other SPEs or the PPU. However, these are unfortunately not trivially writable from the RSX®. An alternative approach is required in order to have the RSX® signal the SPEs that the rendering of the normal/depth buffer is complete and that they can now begin their work, without having the desired SPE programs spinning on all six of the available SPEs, wasting valuable processing time.

When adding a job to our job scheduler it is optionally given an address in RSX®-mapped memory upon which the job is dependent. When the scheduler pulls the next job from the job queue it polls this address to check whether it has been written to a known value by the RSX®. If this is not the case, the job is skipped and the next one is fetched from the queue and processed; if the location in memory has been written, however, then our job is free to run. This dependency is visualized in figure 5.
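
To make the mechanism concrete, here is a hypothetical sketch of the scheduler-side dependency check. The structure, names and the choice of ready value are illustrative only (and the real RSX®-written location is a full 128 byte value, of which a single word is enough to test).

#include <stddef.h>
#include <stdint.h>

/* Hypothetical job descriptor: 'dependency' points at RSX(R)-mapped memory
   which the RSX(R) writes when the normal/depth buffer is complete. */
enum { kDependencyReady = 0xffffffffu }; /* illustrative ready value */

typedef struct Job {
    volatile const uint32_t* dependency; /* optional, NULL if the job has no dependency */
    void (*run)(void* params);
    void* params;
} Job;

/* Returns non-zero if the job is free to run; otherwise the scheduler skips
   it and fetches the next job from the queue. */
static int jobDependencyMet(const Job* job) {

    return (job->dependency == NULL) || (*job->dependency == kDependencyReady);
}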


Figure 5. RSX®/SPE communication: the RSX® writes a 128 byte value when the normal/depth buffer is available for processing, and the SPEs poll the same location to know when to begin their work.

The problem of ensuring that the GPU waits for the light buffer to be available from the SPEs is solved by a technique that is well-known to PLAYSTATION®3 developers, but unfortunately we cannot disclose it here. Interested PLAYSTATION®3 developers can consult Sony’s official development support website.

It is desirable for the RSX® to continue doing useful work in parallel with the SPEs performing the lighting calculations. In Blur we are fortunate in that we have a number of additional views that are rendered which do not rely on the lighting buffer, for example planar reflections and a rear-view mirror (in other applications these might also include the rendering of a shadow buffer). This is shown in figure 6. If no useful work can be performed on the RSX®, it may be possible (depending on your memory budget and the requirements of your application) to perform the lighting calculations one frame latent as in [Swoboda09]; this approach also has the added benefit of reducing the likelihood of stalling the RSX®.


Figure 6. The RSX® continues to do useful work as the SPEs calculate the dynamic lighting for our scene.

5. The Pre-Pass

To begin the lighting pre-pass we must first construct the normal and depth buffers. We store view space normals in an 8:8:8:8 format, and since we are able to read from the depth buffer on our chosen hardware, we have no need for a separate target for the depth. We chose to perform our lighting in view space as we find it faster compared with world space.

Next we render the relevant solid and alpha-test geometry into the buffers. We only render the geometry affected by the lights – we cull all draw calls against the bounding spheres of the lights and also bring in the far clip plane. (Note that a simple sphere test is not sufficient, since we also need to render near objects that occlude the light spheres.) These methods of culling reduce the cost of drawing the pre-pass geometry by approximately half.

When rendering the scene, we enable stencil writes with a reference value of 0xff. The stencil buffer is cleared to 0x00 beforehand, which gives us the relevant region of the screen masked out in the stencil buffer. Whether rendering lights on the RSX® or the SPE, this enables us to use the early stencil to ensure that we only light relevant pixels.

We do not currently render the pre-pass or the light buffers with MSAA. This has a number of disadvantages, including some lighting artefacts around the edges of objects, and the loss of the ability to use the depth buffer as a depth pre-pass with the main scene (which we render with MSAA). However, we found the artefacts minimal, and the extra cost of rendering the light pre-pass MSAA outweighed the saving from having a depth pre-pass. This is still an area we wish to return to in the future.

Once we have the normal and depth buffers, we are able to perform the lighting. Currently, we use the Lambert diffuse model for our lights, and render the lights into an 8:8:8:8 buffer. This is for simplicity and performance reasons, but with the cost of no specular and limited lighting range. This also means that the alpha channel of the lighting buffer is unused. Some ideas for its use are explained in the section on further work.

We maintain a GPU implementation of our lighting model for reference and for other platforms. First, the stencil test is set to “equals” with a reference value of 0xff, so we only render to pixels marked in the stencil buffer. Then, the lights are rendered, with point lights and spot lights using two very different methods.

Point lights are rendered as in [Balestra08] – the frame buffer is split into tiles, and we gather lists of lights (up to a maximum of eight) that affect each tile. We then render each tile using a shader corresponding to its number of lights. This method saves on fill rate, and enables us to perform the reconstruction of view space position and normal from our normal and depth buffers only once per pixel, no matter the number of point lights.

Spot lights use the more standard method of rendering the bounding volumes of the lights – in this case, cones. We render front faces, unless we are inside the volume, in which case we render back faces.

We further optimize the lighting code by making use of the depth bounds test, when it is available on the target hardware. The depth bounds test compares the depth buffer value at the current fragment’s coordinates to a given minimum and maximum depth value. If the stored depth value is outside the given range, then the fragment is discarded. When drawing either a tile of point lights, or a spot light volume, we set the depth bounds range to be the minimum and maximum depth extents of the light (or lights, in case of the point lights).

This gives us a fast, optimized GPU implementation of our lighting model. However, it is still a significant percentage of our frame rendering time, and its image space nature makes it a perfect candidate to offload from the GPU onto the SPEs.

6. The Lighting SPE Program

This section describes in detail the SPE program that performs our lighting calculations. In order to try to contextualize each sub-section we have included figure 7 which shows the high-level structure of the SPE program as a whole.

Figure 7. The high-level flow of our SPE lighting program.

6.1 The Atomic Tile Arbiter

Due to the relatively large memory footprint of a 720p frame buffer; the limitations imposed by the size of an SPE’s local store; and the internal format of a surface created by PLAYSTATION®3’s RSX®, our lighting SPE program works on tiles of the frame buffer, 64×64 pixels in size, as shown in figure 8. Thus, there is a need to keep track of which tile is free to bring in to local store for processing. The simplest and most concurrent way we found of achieving this was by way of an atomically incremented tile index which resides in main memory. It should be noted that the SPE and RSX® are only able to efficiently co-operate on the processing of resources that are placed into correctly mapped main memory.


Figure 8. Each SPE assigns itself a task by atomically incrementing a tile index held in main memory.

For efficiency (and to avoid contention for the cache line) the tile index is aligned to a 128 byte boundary and is padded to 128 bytes in size to exactly match the cache line width of the SPE’s atomic unit (ATO) [IBM08, IBM07]. The effective address (EA) of the tile is given by the product of the tile index and the total size of a single tile, summed with the address of the beginning of the frame buffer, as in equation 6.1. For our chosen tile size the resulting effective address always falls on a 16 byte boundary, since our tile size is itself a 16 byte multiple.


EA_tile = EA_frame_buffer_start + (tile_index × tile_size)        (6.1)
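
As an illustration of how the arbiter might look, here is a hedged sketch built on the lock-line reservation commands exposed by spu_mfcio.h (getllar/putllc); the buffer name, the status handling and the surrounding structure are assumptions rather than a copy of our production code.

#include <stdint.h>
#include <spu_mfcio.h>

/* 128 byte, 128 byte aligned local store copy of the cache line that holds
   the shared tile index. */
static volatile uint32_t atomicLine[32] __attribute__((aligned(128)));

/* Atomically increment the tile index at effective address tileIndexEa and
   return the value it held before the increment. */
uint32_t nextTileIndex(uint64_t tileIndexEa) {

    uint32_t index;
    do {
        mfc_getllar((void*)atomicLine, tileIndexEa, 0, 0); /* load with reservation */
        mfc_read_atomic_status();                          /* wait for the getllar to complete */
        index = atomicLine[0];
        atomicLine[0] = index + 1;
        mfc_putllc((void*)atomicLine, tileIndexEa, 0, 0);  /* conditional store */
    } while(mfc_read_atomic_status() & MFC_PUTLLC_STATUS); /* lost the reservation? retry */

    /* the tile's effective address then follows equation 6.1:
       ea = frameBufferEa + (uint64_t)index * tileSizeInBytes */
    return index;
}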

6.2. Multi-Buffering

Multi-buffering is a must for almost all SPE programs that process any significant volume of data [Bader07] and our lighting program is no exception. In our implementation we use triple buffering to minimize the latency of accesses to the normal/depth buffer in main memory. Each buffer in the triple buffer has enough space to support a single unit of work (i.e. a single tile of the frame buffer). The first of the buffers in our triple buffer is used as the target for inbound DMA; it utilizes its own tag group, and DMAs into this buffer are initiated as soon as the tile decoding process[5] on the previous tile has completed. The second and third buffers are used as the output targets for the decoding process. In addition to this, they act as scratch memory for the lighting calculations and are the source of the outgoing DMA from the running SPE program back to the light buffer in main memory[6]. These two buffers are used alternately in order to allow outgoing DMA to complete asynchronously from the tile decoding and lighting of the next tile. A high level view of our multi-buffering strategy is depicted in figure 9.


Figure 9. The triple-buffering strategy in our lighting SPE program.

The multi-buffering system described here works so effectively that our SPE program spends an average of only 5μs per frame per SPU waiting for data to be transferred to and from main memory, with the bulk of this delay being introduced early in the program’s execution, as one would expect.
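
By way of illustration, the sketch below shows the core of the idea in a simplified, double-buffered form using the MFC intrinsics from spu_mfcio.h: tile n+1 is fetched under its own tag group while tile n is decoded and lit. The tile size, the processTile() placeholder, the linear tile loop (the real program claims tiles through the atomic arbiter) and the omission of the outgoing light buffer DMA are simplifications of the triple-buffered scheme described above.

#include <stdint.h>
#include <spu_mfcio.h>

enum { TILE_BYTES = 64 * 64 * 4 }; /* one 64x64 tile of 32 bit pixels = 16KB, the maximum single DMA size */

static uint8_t buffers[2][TILE_BYTES] __attribute__((aligned(128)));

extern void processTile(uint8_t* tile); /* decode + light (placeholder) */

void processAllTiles(uint64_t firstTileEa, uint32_t tileCount) {

    uint32_t buf = 0;

    /* kick off the transfer of the first tile */
    mfc_get(buffers[buf], firstTileEa, TILE_BYTES, buf, 0, 0);

    for(uint32_t i = 0; i < tileCount; ++i) {

        const uint32_t next = buf ^ 1;

        /* start fetching the next tile while we work on the current one */
        if(i + 1 < tileCount)
            mfc_get(buffers[next], firstTileEa + (uint64_t)(i + 1) * TILE_BYTES,
                    TILE_BYTES, next, 0, 0);

        /* wait only on the current tile's tag group */
        mfc_write_tag_mask(1 << buf);
        mfc_read_tag_status_all();

        processTile(buffers[buf]);

        buf = next;
    }
}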

6.3. Light Gathering

When the incoming DMAs for the normal buffer and depth-stencil buffer tiles have completed, we can begin processing. Before we light, we first gather the lights that affect a given tile. We do this by constructing a view frustum for each tile and culling the bounding spheres of the lights against the frustum. In addition, we also cull against the stencil buffer. This is a vital optimization as it minimizes the work done in the expensive lighting phase.

In order to perform the culling and the lighting, we actually work on sub-tiles of the frame buffer tile, 32×16 pixels in size. Culling over a smaller region is more effective, and we found sub-tiles of the above size to be optimal in our case.

Next we iterate over the depth-stencil tile to collect the minimum and maximum depth values and the minimum and maximum stencil values for each sub-tile. The depth values will form the near and far clip planes of our sub-tile’s view frustum and the stencil values allow us to do a stencil test akin to the early-stencil hardware on a GPU.
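
A plain scalar sketch of this gathering step is shown below (the real code is vectorised and works on the decoded tile in the local store); the separate float depth and byte stencil arrays are a simplification of the actual depth-stencil format, and only the maximum stencil value is tracked here since that is what the early-out described next uses.

#include <stdint.h>

enum { TILE_W = 64, SUBTILE_W = 32, SUBTILE_H = 16 };

/* Gather the depth range and the maximum stencil value for the sub-tile whose
   top-left corner within the tile is (x0, y0). */
static void gatherSubTileBounds(const float* depth, const uint8_t* stencil,
                                uint32_t x0, uint32_t y0,
                                float* minZ, float* maxZ, uint8_t* maxStencil) {

    float zMin = 1.0f, zMax = 0.0f;
    uint8_t sMax = 0x00;

    for(uint32_t y = y0; y < y0 + SUBTILE_H; ++y) {
        for(uint32_t x = x0; x < x0 + SUBTILE_W; ++x) {
            const float   z = depth[y * TILE_W + x];
            const uint8_t s = stencil[y * TILE_W + x];
            if(z < zMin) zMin = z;
            if(z > zMax) zMax = z;
            if(s > sMax) sMax = s;
        }
    }
    *minZ = zMin; *maxZ = zMax; *maxStencil = sMax;
}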

In section 5 above we describe how we write 0xff into the stencil buffer when rendering the pre-pass buffers; hence we do not wish to light any pixels with a stencil value of 0x00. However, we do not stencil on a per-pixel basis, but instead skip the lighting on a sub-tile if the maximum stencil value is equal to 0x00 (hence it is 0x00 across the entire sub-tile).

Once a sub-tile has passed the stencil test, we construct its view frustum. Knowing the screen-space position of the corners of the sub-tile, we can use values from the projection matrix to construct its position in view space, at a fixed distance of one meter from the camera (see equation 6.5). Multiplication by the minimum and maximum view-space depth values then gives us the eight vertices of the frustum, from which we can construct the frustum’s six planes (see equation 6.4 for how to construct view-space z from a depth buffer value).

We then construct separate lists of point lights and spot lights which intersect this view frustum. The bounding sphere of each light is tested against each frustum plane, with successful point lights added to a point light list, and successful spot lights added to a spot light list.


(6.2)
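
The test itself is the familiar sphere-versus-plane check. The scalar sketch below assumes plane normals that point into the frustum and planes stored as (n, d) with dot(n, p) + d >= 0 for points inside; our actual implementation is SIMD and works on the gathered light data in SoA form.

#include <stdint.h>

typedef struct Sphere { float x, y, z, radius; } Sphere;
typedef struct Plane  { float nx, ny, nz, d; }  Plane;

/* Returns non-zero if the sphere touches or is inside the six-plane frustum. */
static int sphereIntersectsFrustum(const Sphere* s, const Plane planes[6]) {

    for(uint32_t i = 0; i < 6; ++i) {
        const float dist = planes[i].nx * s->x +
                           planes[i].ny * s->y +
                           planes[i].nz * s->z + planes[i].d;
        if(dist < -s->radius)
            return 0; /* completely outside this plane, reject the light */
    }
    return 1; /* add the light to the sub-tile's point or spot light list */
}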

If no lights are gathered, then the sub-tile follows the same path as one which fails the stencil test – the lighting is skipped and we output zero to the frame buffer. However, if at least one light does affect the sub-tile, then the lists of point lights and spot lights are passed into our lighting function to perform the most important part of the work.

6.4. Point Lights

The SPE program used for lighting is written in C and makes heavy use of SPE-specific language extensions. We made the choice early on to favour the si-style of intrinsic over the higher-level spu-style. This is due to a closer mapping to the underlying opcodes generated by the compiler [Acton08].

Lighting code is an excellent candidate for both software pipelining and loop unrolling; our lighting is performed on batches of 16 pixels at a time. We found that 16 pixels gave us a very small number of wasted cycles per iteration of our lighting loops while still allowing us to fit everything we needed into the 4KB (128 × 16 byte) register file[7]. The large number of independent instructions that results from lighting a relatively large set of pixels means that the latency caused by dependent instructions closely following one another is almost completely eliminated and overall performance is massively increased (limited only by the number of issued instructions). Non-dependent instructions are interleaved with one another, with the results being used some time later when they are available; this well-known optimization technique also has the side effect of improving the balance of instructions over the odd and even execution pipelines, because there is a greater number of suitable, non-dependent instructions that can occupy a single fetch group in the Synergistic Execute Unit (SXU). We found that we were able to achieve approximately three times the pixel throughput by batching pixels into groups of 16 compared to our earlier attempts, which loosely mimicked RSX® quads by lighting smaller groups of four pixels.


(6.3)

Before any lighting can begin it is important to reconstruct the correct input to our lighting equation expressed in equation 6.3. Equation 6.4 demonstrates how to reconstruct the z component of the view-space position of a pixel given its depth buffer value and the near and far planes of the view frustum. Calculating the x and y components of the view-space position is equally trivial when given the x and y coordinates of the pixel in screen-space and the view projection matrix. This is shown by equation 6.5.


(6.4)


(6.5)

In HLSL/Cg shaders it is quite common to use the saturate intrinsic function to clamp values to a [0..1] range. To achieve this on the SPE there is a clever trick that we feel is certainly worthy of mention here. Day et al. introduced the fast saturate/clamp technique, which uses the SPU’s floating-point conversion instructions to clamp a floating point value to a variety of different ranges depending on the combination of scale/bias operands issued with the instructions [Day08]. In a pipelined loop, such as our lighting loop, instruction count is oftentimes the overriding determinant of the code’s execution speed, and as such we are able to employ this trick to great effect. Listing 1 demonstrates this technique.

/* HLSL saturate, clamp to [0..1]. */
qword x = si_cfltu(q, 0x20);
qword y = si_cuflt(x, 0x20);

Listing 1. Saturate a qword in two even pipeline instructions.

One interesting difference between the GPU implementation of our lighting and the SPE implementation is the switch from the default Array of Structures (AoS) data layout on the GPU, to the transposed, SIMD-friendly Structure of Arrays (SoA)[8] data layout on the SPE. The difference in format of the data is illustrated below in figure 10. By storing, loading and shuffling data into a SoA layout we are able to perform our lighting calculations much more optimally on the SPEs. A pleasant side-effect of the switch is that the resulting C code becomes much more scalar-like in appearance, making it easier for other programmers to follow [Bader06].


Figure 10. Shuffling an AoS into a SoA.
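
The shuffle itself is the standard 4×4 transpose. The sketch below shows one hedged way to express it with the si-style intrinsics for four float4 values (for example four view-space normals); the function and variable names are illustrative, and the masks select bytes from the first operand (0x00-0x0f) or the second (0x10-0x1f), cast to qword in the same way as listing 2.

#include <spu_intrinsics.h>

/* Transpose four AoS vectors {x,y,z,w} into SoA form {x0,x1,x2,x3} etc. */
static void aosToSoa(qword v0, qword v1, qword v2, qword v3,
                     qword* xxxx, qword* yyyy, qword* zzzz, qword* wwww) {

    const vector unsigned char _xy = {
        0x00,0x01,0x02,0x03, 0x10,0x11,0x12,0x13,
        0x04,0x05,0x06,0x07, 0x14,0x15,0x16,0x17 };
    const vector unsigned char _zw = {
        0x08,0x09,0x0a,0x0b, 0x18,0x19,0x1a,0x1b,
        0x0c,0x0d,0x0e,0x0f, 0x1c,0x1d,0x1e,0x1f };
    const vector unsigned char _lo = {
        0x00,0x01,0x02,0x03, 0x04,0x05,0x06,0x07,
        0x10,0x11,0x12,0x13, 0x14,0x15,0x16,0x17 };
    const vector unsigned char _hi = {
        0x08,0x09,0x0a,0x0b, 0x0c,0x0d,0x0e,0x0f,
        0x18,0x19,0x1a,0x1b, 0x1c,0x1d,0x1e,0x1f };

    qword xy01 = si_shufb(v0, v1, (const qword)_xy); /* x0 x1 y0 y1 */
    qword xy23 = si_shufb(v2, v3, (const qword)_xy); /* x2 x3 y2 y3 */
    qword zw01 = si_shufb(v0, v1, (const qword)_zw); /* z0 z1 w0 w1 */
    qword zw23 = si_shufb(v2, v3, (const qword)_zw); /* z2 z3 w2 w3 */

    *xxxx = si_shufb(xy01, xy23, (const qword)_lo);  /* x0 x1 x2 x3 */
    *yyyy = si_shufb(xy01, xy23, (const qword)_hi);  /* y0 y1 y2 y3 */
    *zzzz = si_shufb(zw01, zw23, (const qword)_lo);  /* z0 z1 z2 z3 */
    *wwww = si_shufb(zw01, zw23, (const qword)_hi);  /* w0 w1 w2 w3 */
}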

The SPE is only equipped to deal with 16 byte aligned writes and reads to and from its local store [Bader06, IBM08, IBM08a]. The targets from all load and store operations first undergo a logical ‘and’ with the LSLR register (set to 0x3ffff for current implementations of the CBE) before the SPU Store and Load unit (SLS) fetches or writes the address [IBM08, IBM08a]. Writing scalar values is achieved by way of a load-shuffle-store pattern. It is therefore desirable to perform loads and stores on 16 byte boundaries only. As our program required a lot of 4 byte loads from our normal/depth buffer and a lot of similarly sized writes to our light buffer we ended up batching these loads and stores into 16 byte chunks in order to eliminate the overhead of the additional code that would be required if we were to perform these operations on a pixel-by-pixel basis. This proved to deliver a significant performance increase, especially in the case of storing where nearly all pipeline bubbles were eliminated. We present a portion of our pixel writing code in listing 2 for a single four pixel block:


qword c0        = si_cfltu(dif0, 0x20);
qword c1        = si_cfltu(dif1, 0x20);
qword c2        = si_cfltu(dif2, 0x20);
qword c3        = si_cfltu(dif3, 0x20);
      dif       = si_ila(0x8000);
qword scale     = si_ilh(0xff00);
      dif0      = si_mpyhhau(c0, scale, dif);
      dif1      = si_mpyhhau(c1, scale, dif);
      dif2      = si_mpyhhau(c2, scale, dif);
      dif3      = si_mpyhhau(c3, scale, dif);
const vector unsigned char _shuf_uint = {
    0xc0, 0x00, 0x04, 0x08,
    0xc0, 0x10, 0x14, 0x18,
    0xc0, 0x00, 0x04, 0x08,
    0xc0, 0x10, 0x14, 0x18 };

qword s_uint    = (const qword)_shuf_uint;
qword base_addr = si_from_ptr(result);
qword p0_01     = si_shufb(dif0, dif1, s_uint);
qword p0_02     = si_shufb(dif2, dif3, s_uint);
qword p0        = si_selb(p0_01, p0_02, m_00ff); /* m_00ff (defined elsewhere) merges the two halves */
                  si_stqd(p0, base_addr, 0x00);

Listing 2. Pixels are converted from their floating-point representations into 32 bit values, batched into 16 byte chunks, and stored.

6.5 Spot Lights

In the interest of completeness we present the mathematics used for our ‘regular’ spot lights in equation 6.6.


(6.6)

Note that this is the same as the equation for the point lights, with an additional term at the end. d is the direction of the light (as opposed to l, which is the direction from the light to the point), θinner is the angle of the inner cone and θouter is the angle of the outer cone. However, we store the cosines of these angles on the light rather than calculating them every time. All lighting values for both point and spot lights are summed for each pixel, yielding the equation in 6.7.


(6.7)
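
As a scalar illustration of the extra cone term (a sketch only; the real code is SIMD and uses the saturate trick from listing 1, and the linear falloff between the two cosines is an assumption about the exact form of equation 6.6):

/* Cone attenuation for a 'regular' spot light. cosInner/cosOuter are the
   cosines stored on the light, d is the normalised light direction and l is
   the normalised direction from the light to the point being lit. */
static float spotConeTerm(const float l[3], const float d[3],
                          float cosInner, float cosOuter) {

    const float cosAngle = l[0]*d[0] + l[1]*d[1] + l[2]*d[2];
    float t = (cosAngle - cosOuter) / (cosInner - cosOuter);
    if(t < 0.0f) t = 0.0f; /* clamp to [0..1] */
    if(t > 1.0f) t = 1.0f;
    return t;              /* multiplies the point light result */
}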

7. The Main Pass

When the SPEs have finished calculating the light buffer, they then signal to the RSX® that the main pass can be rendered. As mentioned above, the synchronization at this stage is very important – we do not want to be reading from an incomplete light buffer. To composite the light buffer with the main scene, we read it as a texture in the pixel shaders. However, as not every pixel in our scene receives light from our pre-pass (see above, we only render geometry into the pre-pass that receives light), we use two shader techniques in the scene: one which samples from the light buffer, and one which does not. For the former technique, each pixel looks up its lighting in the light buffer using its screen-space coordinate, and then composites the light value as a diffuse light, as follows:


(7.1)

It might be tempting to simply additively or multiplicatively blend the lighting buffer over the scene, but as can be seen above, that method will result in incorrect lighting. This is due to the presence of additional static lighting in our scene.

It is also possible to read from the normal buffer in the main pass. This means that reading from normal maps and converting from tangent space to view (or world) space only happens once. However, this also means that the low precision of the normals stored in the pre-pass becomes more noticeable (only 8 bits per component). For this reason and others we did not use this option.

At the end of rendering we have a scene with many dynamic lights rendered using the Cell Broadband Engine™. Not only does this open up exciting new possibilities for our rendering engine, but it does so with minimal GPU cost, with a large amount of work performed on the CPU.

8. Conclusion

We have presented a method which splits the work of light pre-pass rendering between the RSX® and the SPEs on the Cell Broadband Engine™. We use the strengths of both components to our advantage: the rasterization performance of the RSX® to render the pre-pass geometry, and the vector maths performance of the SPEs to calculate the lighting buffer. By parallelizing the lighting calculation on the SPEs with some other rendering on the RSX® (for instance, a dynamic cube map), the lighting becomes free and thus this can be a major GPU optimization. In fact, even without the added bonus of parallelization, we found that in some cases, five SPEs running carefully crafted programs could outperform the RSX® when performing lighting calculations.


Figure 11. Another screenshot from Blur.

As new architectures emerge we believe there will be increasing opportunities to take processing load off the GPU and place it back onto the CPU. It remains to be seen how things will pan out when the two are combined in Intel’s Larrabee [Seiler08], but on the Cell Broadband Engine™ we offer that the GPU can be massively accelerated in cases such as deferred lighting or light pre-pass rendering by writing a custom CPU implementation that executes on the SPEs.

9. Further Work

There are many improvements that could be made to the techniques we describe. Firstly, we currently omit specular from our lighting model. We propose either writing out specular to a separate lighting buffer, or placing a monochrome specular term in the alpha channel of the lighting buffer as in [Engel09]. Material properties could be controlled by adding a specular power in the alpha channel of the normal buffer. Another problem is that our lighting is currently LDR, as it is stored in an 8:8:8:8 integer format. One option is moving to a 16:16:16:16 float, but Wilson suggests instead using the CIE Luv colour space [Wilson09]. Using this method, we can still use an 8:8:8:8 buffer, but with the luminance part of the colour using 16 bits. This technique has problems on the GPU as additive blending of lights on top of each other no longer works, but in the SPE program we have no such problem and thus this becomes more feasible; if one wished to implement a more GPU-friendly technique then diffuse light intensity could also be stored in the alpha channel as in [Valient07].

Both of the previous suggestions involve making use of the currently unused alpha channel in the lighting buffer. While there are certainly many possible uses for this byte, one idea we are currently investigating is storing the amount of fog for each pixel. We believe this could be especially beneficial for more expensive fogging equations, for instance, if height fog is being used. This is an example of adding value to the SPE program [Swoboda09a].

Given the amount of work already being done, including processing the entire normal and depth buffers, there is extra rendering work that could be done in the SPE program. One simple example is performing a down-sample of the depth buffer to a quarter resolution – this could be output asynchronously through the MFC, adding little overhead to the SPE program, and would be useful for many reduced resolution effects such as motion blur, soft particles, occlusion culling and even screen-space ambient occlusion. It would also be possible to reduce the amount of processing on the normal/depth buffers by combining the view-space normals and depth into a single 32 bit buffer, encoding the x and y components of the normal into the first two channels (or converting them to spherical coordinates) and packing linear view-space depth into the remaining 16 bits. This halves the amount of data needed by our SPE program. In fact, this approach is the method we chose for the final version of Blur.

Finally, it is our intention to remove the decoding of the buffers altogether and perform the lighting on encoded normal/depth buffers; this has several advantages. The decoding process can be replaced with a simple pass over all the pixels in the frame buffer tile, which should yield a minor increase in overall lighting performance, together with saving the memory required for the lighting buffer. However, this extra performance and improved memory footprint come at the cost of added mathematical complexity, as deriving the view-space position of pixels becomes non-trivial. This is due to the need to take into account the effects of the encoded buffer’s format on the final view-space position of the pixel.

10. Acknowledgements

First and foremost we would like to extend our unending thanks to Matt Swoboda of SCEE R&D for laying the groundwork for our continuing efforts and for his suggestions for our implementation. We would also like to thank Colin Hughes of SCEE R&D for his help and suggestions with optimizations.

We also extend our thanks to all the supremely talented individuals that form the Core Technologies Team at Bizarre Creations Ltd., especially to Ian Wilson, Paul Malin, Lloyd Wright, Ed Clay, Jose Sanchez, Charlie Birtwistle, Jan van Valburg, Kier Storey, Fengyun Lu, Jason Denton, Dave Hampson, Chris Cookson and Richard Thomas.

11. Bibliography

[Acton08] M. Acton, E. Christensen, “Insomniac SPU Best Practices,”
http://www.insomniacgames.com/tech/articles/0208/insomniac_spu_programming_gdc08.php, accessed on 2nd July 2009.

[Bader07] D. A. Bader, “Cell Programming Tips & Techniques,”
http://www.cc.gatech.edu/~bader/CellProgramming.html, accessed on 2nd July 2009.

[Balestra08] C. Balestra and P. Engstad, “The Technology of Uncharted: Drake’s Fortune,” Game
Developers Conference 2008, available online at
http://www.naughtydog.com/corporate/press/GDC%202008/UnchartedTechGDC2008.pdf

[Capcom06] Capcom Inc. “The MT Framework,” http://game.watch.impress.co.jp/docs/20070131/3dlp.htm,
accessed on 3rd July 2009.

[Day08] M. Day, J. Garrett, “Faster SPU Clamp,”
http://www.insomniacgames.com/tech/articles/0308/faster_spu_clamp.php, accessed on 2nd July 2009.

[Deering88] M. Deering, “The Triangle Processor and Normal Vector Shader: A VLSI System for High
Performance Graphics,”
SIGGRAPH 1988.

[Engel08] W. Engel, “Light Pre-Pass Renderer,”
http://diaryofagraphicsprogrammer.blogspot.com/2008/03/light-pre-pass-renderer.html, accessed on 4th July 2009.

[Engel09] W. Engel, “Designing a Renderer for Multiple Lights: The Light Pre-Pass Renderer,” in ShaderX7, edited by Wolfgang Engel, Charles River Media, 2009: pp. 655-666.

[Engel09a] W. Engel, “The Light Pre-Pass Renderer Mach III,” to appear in proceedings of ACM
SIGGRAPH09, 2009.

[IBM08] IBM Corp., “Cell Broadband Engine Programming Handbook Version 1.11”.

[IBM08a] IBM Corp., “Synergistic Processing Unit Instruction Set Architecture”.

[IBM09] IBM Corp. “The Cell Project at IBM,” http://researchweb.watson.ibm.com/cell/home.html, accessed
on 4th July 2009.

[Möller08] T. Akenine-Möller, E. Haines, N. Hoffman, “Real-Time Rendering 3rd Edition,” 978-1-56881-424-7, A.K. Peters Ltd. 2008.

[Perthuis06] C. Perthuis, “Introduction to the Graphics Pipeline of the PS3,” Eurographics 2006.

[Seiler08] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, P. Hanrahan, “Larrabee: A Many-Core x86 Architecture for Visual Computing,” ACM Transactions on Graphics, Volume 27, Issue 3, ACM SIGGRAPH 2008, August 2008.

[Shippy09] D. Shippy, M. Phipps “The Race for a New Games Machine: Creating The Chips Inside The New Xbox360 & The Playstation 3,” 978-0-8065-3101-4, Citadel Press, 2009.

[Swoboda09] M. Swoboda, “Deferred Lighting and Post Processing on PLAYSTATION®3,” Game Developers Conference 2009, available online at
http://www.technology.scee.net/files/presentations/gdc2009/DeferredLightingandPostProcessingonPS3.ppt

[Swoboda09a] M. Swoboda, Private Correspondance, 2009.

[Valient07] M. Valient, “Deferred Rendering in Killzone 2,” http://www.dimension3.sk/mambo/Download-document/Deferred-Rendering-In-Killzone.php, accessed on 4th July 2009.

[Wilson09] P. Wilson, “Light Pre-Pass Renderer: Using the CIE Luv Color Space,” in ShaderX7, edited by Wolfgang Engel, Charles River Media, 2009: pp.667-677.


[1] “PlayStation”, “PLAYSTATION” and the “PS” family logo are registered trademarks and “Cell Broadband Engine” is a trademark of Sony Computer Entertainment Inc. The “Blu-ray Disc” and “Blu-ray Disc” logos are trademarks.

[2] Screenshots of Blur appear courtesy of Activision Blizzard Inc. and Bizarre Creations Ltd.

[3] One of the eight SPEs is locked out to increase chip yield and another is reserved by Sony’s Cell OS. Applications running on the PLAYSTATION®3 actually have six SPEs to take advantage of.

[4] As one might expect, linear access patterns fare significantly better than random access.

[5] For information on how to decode, Sony PLAYSTATION®3 developers can consult the RSX® User’s Manual.

[6] We do not currently encode the lighting results; please see further work for more information.

[7] Any more pixels in a single loop in our implementation would risk causing registers to be spilled.

[8] SOA organization is also known as “parallel-array”.


Feb
28.

This post is going to be much shorter as I’ve sadly been very pressed for time this last couple of weeks and as a result been unable to finish the other (longer) post I had originally planned to the high standard that you all deserve. This post is about measuring stuff in frames per second and why you shouldn’t do it.

Many people who haven’t got any experience in profiling and optimising a game will approach performance measurements in a frankly scary way. I’m talking of course about measuring performance in terms of frame rate. Of course it’s unfair to say the games industry never mentions ‘frame rate’; we do! We have target frame rates for our games, which are usually 60Hz or 30Hz (increasingly the latter), but for anything beyond this I think it’s fair to say that most sane people take the leap into a different, more logical and intuitive space. It’s called frame time and it’s measured in, you guessed it, units of time! Typical units of time used include ms, µs or even ns. The two have a relationship which allows you to convert between them, given by t=1/r (so 60Hz corresponds to 16.7ms and 30Hz to 33.3ms).

So why am I concerned about this if it’s trivial to convert between them? Well, the reason is that measuring anything in terms of the frame rate suffers from some big flaws which measuring in frame time does not and can lead to problems with your view of the situation:

1. Traditional frame rate measurements are often dependent on synchronisation with a hardware-driven event such as the vertical reset on your display device. If you’ve locked your application to this vertical sync to avoid frame tearing, this can mean that no matter how fast your application is going it will always appear to run at, say, 60fps. If your application is straddling the border of 60fps and locked to vsync, this can give rise to occasions where small fluctuations in the amount of time something takes can make the game appear to run either twice as fast or twice as slow, as it snaps between 60fps and 30fps!

2. Frame rate can be very deceptive! To show what I mean let’s use an example that crops up time and time again among beginners; it goes as follows:

Coder A: “Check this shit out, my game is running at 2000fps! I’m such a ninja!”
Coder B: “Nice one chief, but you’re accidentally frustum culling the player there…”
Coder A: “Ah shit, okay… Let me fix that… Sorted!”
Coder A hits F5 and looks on truly horrified at the frame rate counter.
Coder B: “Heh! Not so ninja now are you? Running at 300fps!”
Coder A slopes off looking dejected.

What happened here? Apparently performance has dropped drastically by a whopping 1700 frames per second to a mere three hundred! However, if this pair of idiots worked in a set of sensible units they’d instantly see that it’s only taking around 2.8ms to render that player mesh (the frame time went from 0.5ms to roughly 3.3ms). Small drops in frame rate in an application with an already low frame rate can point to big performance problems, whereas large drops in an application with a very high frame rate are usually not much to worry about.

3. And perhaps the most compelling reason, which should give any remaining doubters a firm nudge into reality: all the wonderful profiling tools out there work in time, not rate.

At Bizarre Creations (R.I.P.) we had a large cardboard cut-out of the Stig (among others, such as Hulk Hogan and Austin Powers) for anyone that broke the build. In honour of Bizarre, which sadly closed last Friday, I offer these final Stig-inspired words to bring this post to a close: next time you even contemplate telling your colleague that your code runs at a particular frame rate, remember that what you’re saying is more backwards than Jeremy Clarkson telling you that the Stig got the Ariel Atom 500 round the Top Gear track at 37.54993342 metres per second!


Obligatory Introductory Parable

I really like Sushi, it’s tasty and convenient. I like the immediacy of being able to go into a Sushi restaurant complete with conveyor belt and being able to take a seat and grab something fresh and delicious from the belt without blowing my entire lunch hour. Having said that, what I really wouldn’t like is to be a member of staff in a Sushi restaurant, especially if it was my job to show the diners to their seats and here’s why…

A group of three diners walk in and ask to be seated. Delighted at having some custom, you kindly oblige, showing the diners to their seats. No sooner have they sat down and helped themselves to some tasty looking ‘Tekka Maki’ than the door opens again and four more diners walk in! Wow, you guys are on a roll (see what I did there?). Your restaurant now looks like this…

Since it’s lunch time, the restaurant quickly fills up to capacity. Finally, after eating all they could and settling the bill, the first party to arrive (the party of three) leaves, and a group of two walks in; you offer the newly vacant seats to your new customers. This occurrence happens a few more times until your restaurant looks like this…

Finally a group of four diners walk in and ask to be seated. Ever the pragmatist, you’ve been carefully keeping track of how many empty seats you have left and it’s your lucky day, you’ve got four spare seats! There is one snag though: these diners are the social type and would like to be seated together. You look around frantically, but while you have four empty seats you can’t seat the diners together! It would be very rude to ask your existing customers to move mid-meal, which sadly leaves you no option but to turn your new customers away, probably never to return again. This makes everyone very sad. If you’re sat there wondering how this little tale relates in the slightest bit to games development then read on.

In programming we have to manage memory. Memory is a precious resource which must be carefully managed, much like the seats in our metaphorical sushi restaurant. Every time we allocate memory dynamically we’re reserving memory from something called the “heap”. In C and C++ this is typically done through the use of the malloc function and the new operator respectively. To continue the somewhat fishy analogy (last one I promise!), this is like our intrepid groups of diners asking to be seated in our sushi restaurant. The real shame though is that what happened in our hypothetical scenario also happens in the context of memory, but the results are much worse than a couple of empty tummies. It is called fragmentation and it is a nightmare!

What’s wrong with malloc and new?

Sadly, the rest of the discussion won’t have such a fishy flavour to it as this post is going to talk about malloc and new and why they have a very limited place in the context of embedded systems (such as games consoles). While fragmentation is just one facet of problems caused by dynamic memory allocation, it is perhaps the most serious, but before we can come up with some alternatives we should take a look at all of the problems:

1. malloc and new try to be all things to all programmers…
They will just as soon allocate you a few bytes as they will a few megabytes. They have no concept of what the data is that they’re allocating for you, or what its lifetime is likely to be. Put another way, they don’t have the bigger picture that we have as programmers.

2. Run-time performance is relatively bad…
Allocations from the standard library functions or operators typically require descending into the kernel to service the allocation requests (this can involve all sorts of nasty side effects to your application’s performance, including flushing of translation lookaside buffers, copying blocks of memory, etc.). For this reason alone, using dynamic memory allocation can be very expensive in terms of performance. The cost of the free or delete operations in some allocation schemes can also be expensive, as in many cases a lot of extra work is done to try to improve the state of the heap for subsequent allocations. “Extra work” is a fairly vague term, but can mean the merging of memory blocks and in some cases can mean walking an entire list of allocations your application has made! Certainly not something you’d want to be wasting valuable processor cycles on if you can avoid it!

3. They cause fragmentation of your heap…
If you’ve never been on a project that has suffered from the problems associated with fragmentation then count yourself very lucky, but the rest of us know that heap fragmentation can be a complete and utter nightmare to address.

4. Tracking dynamic allocations can be tricky…
Dynamic allocation comes with the inherent risk of memory leaks. I’m sure we all know what memory leaks are, but if not, have a read of this. Most studios try to build some tracking infrastructure on top of their dynamic allocations to try and track what memory is in play and where.

5. Poor locality of reference…
There is essentially no way of knowing where the memory you get back from malloc or new will be in relation to any of the other memory in your application. This can lead to us suffering more expensive cache misses than we need to, as we end up dancing through memory like Billy Elliot on ecstasy.

So what is the alternative? The idea behind this post is to provide you with the details (and some code!) for a few different alternatives that you can use in place of malloc and new to help combat the problems we’ve just discussed.

The Humble Linear Allocator

As the name of this section suggests this allocator is certainly the simplest of all those I’m going to present, although truth be told, they’re all simple (and that’s part of the beauty). The linear allocator essentially assumes that there is no fine grained de-allocation of allocated memory resources and proceeds to make allocations from a fixed pool of memory one after the other in a linear fashion. Check out the diagram below.

A good example of where a linear allocator is exceedingly useful is found in nearly all SPU programming. The persistence of data in the local store is not important beyond the scope of the currently executing job, and in many cases the amount of data one brings into the local store (at least data that needs some degree of variance in its size) tends to be fairly restricted. However, don’t be fooled, that’s far from its only application. Here’s an example of how one might go about implementing a simple, linear allocator in C.


/** Header structure for linear buffer. */
typedef struct _LinearBuffer {
    uint8_t *mem; /*!< Pointer to buffer memory. */
    uint32_t totalSize; /*!< Total size in bytes. */
    uint32_t offset; /*!< Offset. */
} LinearBuffer;


/* non-aligned allocation from linear buffer. */
void* linearBufferAlloc(LinearBuffer* buf, uint32_t size) {

    if(!buf || !size)
        return NULL;

    uint32_t newOffset = buf->offset + size;
    if(newOffset <= buf->totalSize) {

        void* ptr = buf->mem + buf->offset;
        buf->offset = newOffset;
        return ptr;
    }
    return NULL; /* out of memory */
}

It is of course possible to support aligned allocations by applying the required bitwise operations to the offset during the course of the allocation. This can be incredibly useful, but be aware that depending on the size of the data you’re allocating from your buffer (and in some cases the order in which those allocations are made) you may find that you get some wasted space in the buffer between allocations. This wasted space is typically okay for alignments of a reasonable size, but can get prohibitively wasteful if you are allocating memory which requires a much larger alignment, for example 1MB. The ordering of your allocations from linear allocators can have a drastic effect on the amount of wasted memory in these types of situations.
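
For illustration, one possible aligned variant of linearBufferAlloc is sketched below; it assumes power-of-two alignments and that the underlying buf->mem pointer is itself allocated with at least the largest alignment you will request.

/* aligned allocation from linear buffer (align must be a power of two). */
void* linearBufferAllocAligned(LinearBuffer* buf, uint32_t size, uint32_t align) {

    if(!buf || !size || !align || (align & (align - 1)))
        return NULL;

    /* round the current offset up to the requested alignment; any bytes we
       skip over here are the "wasted space" mentioned above. */
    const uint32_t alignedOffset = (buf->offset + (align - 1)) & ~(align - 1);
    if(alignedOffset + size <= buf->totalSize) {

        void* ptr = buf->mem + alignedOffset;
        buf->offset = alignedOffset + size;
        return ptr;
    }
    return NULL; /* out of memory */
}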

To reset the allocator (perhaps at the end of a level), all we need to do is set the value of offset to zero. Just like with all allocations you would do, clients of the allocator need to ensure they’re not hanging on to any pointers that you’ve effectively de-allocated by doing this, otherwise you risk corrupting the allocator. Any C++ objects you’ve allocated from the buffer will also need their destructors calling manually.
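
The reset itself really is as small as it sounds; a sketch:

/* reset the linear buffer, 'freeing' everything allocated from it. */
void linearBufferReset(LinearBuffer* buf) {

    if(buf)
        buf->offset = 0;
    return;
}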

The Stack

Let’s get something out of the way before we start into this allocator. When I talk about a “stack allocator” in this particular case, I’m not talking about the call stack, however, that stack does have an important part to play in avoiding run-time heap allocations as we shall see later on. So what am I talking about then? Well, the linear allocator I just described above is an excellent solution to many allocation problems, but what if you want slightly more control over how you free up your resources? The stack allocator will give you this.

Towards the end of my description of the linear allocator, I mentioned that to reset the allocator you can simply set the offset to zero in order to free up all the resources in the allocator. The principle of setting the offset to a particular value is the principle that guides the implementation of the stack allocator. If you are not familiar with the concept of the stack data structure then now is probably a good time to get acquainted, you can do so here.

Back? Okay, each allocation from our stack allocator will optionally be able to get a handle which represents the state of the stack allocator just before that allocation was made. This means that if we restore the allocator to this state (by changing its offset) we are effectively ‘freeing’ up the memory to be reused again. This is shown in the diagram below.

This can be a very useful thing if you want some memory allocated temporarily for data which has a limited scope, for example the lifetime of a specific function or sub-system. This strategy can also be useful for things like level resources, which have a well defined order in which objects need to be freed up (that is, the reverse of the order in which they were allocated). Here is some example C code to illustrate what I’ve just explained:


/** Header structure for stack buffer (mirrors LinearBuffer). */
typedef struct _StackBuffer {
    uint8_t *mem; /*!< Pointer to buffer memory. */
    uint32_t totalSize; /*!< Total size in bytes. */
    uint32_t offset; /*!< Offset. */
} StackBuffer;

typedef uint32_t StackHandle;
void* stackBufferAlloc(StackBuffer* buf, uint32_t size, StackHandle* handle) {

    if(!buf || !size)
        return NULL;

    const uint32_t currOffset = buf->offset;
    if(currOffset + size <= buf->totalSize) {

        uint8_t* ptr = buf->mem + currOffset;
        buf->offset += size;
        if(handle)
            *handle = currOffset; /* set the handle to old offset */
        return (void*)ptr;
    }
    return NULL;
}

void stackBufferSet(StackBuffer* buf, StackHandle handle) {

    buf->offset = handle;
    return;
}
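
A small, hypothetical usage example of the two functions above: everything allocated after the handle was taken is freed in one go by restoring the allocator to that state.

void doTemporaryWork(StackBuffer* buf) {

    StackHandle marker;
    void* scratch = stackBufferAlloc(buf, 4096, &marker);
    if(scratch) {
        /* ... use the 4KB of scratch memory ... */
        stackBufferSet(buf, marker); /* frees 'scratch' and anything allocated after it */
    }
    return;
}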

The Double Stack

If you’re comfortable with the stack concept above, we can now move on to the double-ended stack. This is similar to the stack allocator we just described except that there are two stacks, one which grows from the bottom of the buffer upward and one which grows from the top of the buffer downward. This is shown in the diagram below.

Where would this be useful? A good example would be any situation where you have data of a similar type but with distinctly different lifetimes, or perhaps data that is related and should be allocated from the same static memory buffer (i.e. part of the same subsystem) but has different size properties. It should be noted that it is not mandated where the offsets of the two stacks meet; they don’t have to meet at exactly half the buffer. In one case the bottom stack can grow very large and the top stack smaller, and vice versa. This added flexibility can sometimes be required to make the best use of your memory buffers.

I don’t think I really need to insult your intelligence by including a code sample for the double stack allocator, given its inherent resemblance to the single stack allocator discussed previously.

The Pool

We’re going to shift gears a little now from the family of allocators described above that are based on linearly advancing pointers or offsets and move to something a little different. The pool allocator I’m about to describe is designed to work with data types that are of the same kind or size. It splits up the memory buffer it is managing into equally sized chunks, and then allows the client to allocate and free these chunks at will (see the diagram below). To do this, it must keep track of which chunks are free, and I’ve seen several ways of implementing this. I personally shy away from implementations such as those using a stack of indices (or pointers) to available chunks, due to the extra storage required which can often be prohibitive, but I’ve seen them around. The implementation I shall describe here uses no additional storage to manage which chunks in the pool are free. In fact the header structure for this allocator contains only two member variables, making it the smallest of all the allocators described in this post.

So how does it work internally? To manage which chunks are free we’re going to use a data structure known as a linked list. Again if you’re not acquainted with this data structure then try reading this. Coming from a PlayStation3 and Xbox360 background, where memory access is expensive, I generally find node-based structures (such as the linked list) leave a rather sour taste, but I think this is perhaps one of the applications I may approve of. Essentially the header structure for the allocator will contain a pointer to a linked list. The linked list itself is spread throughout the pool, occupying the same space as the free chunks in the memory buffer. When we initialise the allocator, we move through the memory buffer’s chunks and write a pointer in the first four (or eight) bytes of each chunk, with the address (or index) of the next free chunk in the buffer. The header then points to the first element in that linked list. A limitation of storing pointers in the pool’s free chunks in this way is that chunks must be at least the same size as a pointer on your target hardware. See the diagram below.

When allocations are made from the pool we simply need to make the linked list pointer in the header structure point at the second element in the linked list and then return the pointer we originally had to the first element. It’s very simple, we always return the first element in the linked list when allocating. Similarly, to free a chunk and return it to the pool, we simply insert it into the front of the linked list. Inserting chunks we want to free at the front of the list as opposed to the back has a couple of benefits: firstly we don’t need to traverse the linked list (or store an extraneous tail pointer in the header structure) and secondly (and more importantly) we stand a better chance of the node we just freed up being in the cache for subsequent allocations from the pool. After a few allocations and de-allocations your pool might look a little like the diagram below.

Some C code for initialising the allocator and making allocations and de-allocations from it is provided below.


#include <stdint.h>
#include <stddef.h>

/* pool header: as described above, it needs only two members -- the start of
the managed buffer and the head of the free list. */
typedef struct Pool {
    uint8_t* mem;   /* the memory buffer being managed */
    uint8_t* head;  /* first free chunk, or NULL when the pool is exhausted */
} Pool;

/* allocate a chunk from the pool. */
void* poolAlloc(Pool* buf) {

    if(!buf)
        return NULL;

    if(!buf->head)
        return NULL; /* out of memory */

    /* pop the first free chunk off the free list and return it. */
    uint8_t* currPtr = buf->head;
    buf->head = (*((uint8_t**)(buf->head)));
    return currPtr;
}

/* return a chunk to the pool. */
void poolFree(Pool* buf, void* ptr) {

    if(!buf || !ptr)
        return;

    /* push the chunk onto the front of the free list. */
    *((uint8_t**)ptr) = buf->head;
    buf->head = (uint8_t*)ptr;
    return;
}

/* initialise the pool header structure,
and set all chunks in the pool as empty. */
void poolInit(Pool* buf, uint8_t* mem, uint32_t size, uint32_t chunkSize) {

    if(!buf || !mem || !size || !chunkSize)
        return;

    /* thread the free list through the chunks themselves: each free chunk
    begins with a pointer to the next free chunk. */
    const uint32_t chunkCount = (size / chunkSize) - 1;
    for(uint32_t chunkIndex=0; chunkIndex<chunkCount; ++chunkIndex) {

        uint8_t* currChunk = mem + (chunkIndex * chunkSize);
        *((uint8_t**)currChunk) = currChunk + chunkSize;
    }

    *((uint8_t**)&mem[chunkCount * chunkSize]) = NULL; /* terminating NULL */
    buf->mem = buf->head = mem;
    return;
}
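
For completeness, here’s a rough sketch of how you might drive these functions; the buffer size, chunk size and names here are just for illustration, not anything prescriptive.

static uint8_t s_poolBuffer[64 * 128]; /* 128 chunks of 64 bytes */

void example(void) {

    Pool pool;
    poolInit(&pool, s_poolBuffer, (uint32_t)sizeof(s_poolBuffer), 64);

    void* chunk = poolAlloc(&pool);
    /* ... use the 64-byte chunk ... */
    poolFree(&pool, chunk);
}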

A Note on Stack-based Allocation (alloca is your friend)

Earlier on, you may recall that I said there’d be a mention of stack based allocations in the context of the call stack. I’m sure you’ve seen code which conceptually looks something like this:

void myFunction(void) {

    void* myTemporaryMemoryBuffer = malloc(myMemorySize);
    /* do processing limited to this function. */
    free(myTemporaryMemoryBuffer);
}

There is a function which comes with most C compilers that should mean (depending on the size of your allocation) you won’t have to resort to heap allocations for temporary buffers of this ilk. That function is alloca. How alloca works internally is architecture dependent, but essentially it performs the allocation by adjusting the stack frame of your function to give you an area to write data into; this can be as simple as moving the stack pointer (just like the linear allocator we mentioned right at the start). The memory returned to you by alloca is then freed up when the function returns.

There are a few caveats with using alloca that you should be aware of, however. Be sure to check the size of your allocation to make sure you’re not requesting an unreasonable amount from the stack; this can cause nasty crashes later on if your stack grows so large that it overflows. For this reason it is also best to know all the places where your function will be called in the context of the program’s overall flow if you’re contemplating allocating a large chunk with alloca. Use of alloca can affect portability in some limited circumstances (apparently), but I’ve yet to come across a compiler that doesn’t support it.
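
To make that concrete, here’s a rough sketch of the pattern; the threshold, function and buffer names are purely illustrative, and the header alloca lives in varies between toolchains.

#include <alloca.h>  /* some toolchains put alloca (or _alloca) in <malloc.h> instead */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define STACK_ALLOC_LIMIT 1024 /* arbitrary threshold for this example */

void processData(const uint8_t* input, size_t length) {

    /* only use the stack for modest sizes; fall back to the heap otherwise. */
    const int onStack = (length <= STACK_ALLOC_LIMIT);
    uint8_t* buffer = onStack ? (uint8_t*)alloca(length)
                              : (uint8_t*)malloc(length);
    if(!buffer)
        return;

    memcpy(buffer, input, length);
    /* ... do processing limited to this function ... */

    if(!onStack)
        free(buffer); /* the alloca'd memory vanishes when the function returns */
}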

For more details you can consult your favourite search engine.

A Final Thought…

Often the memory one allocates during the course of a processing task is temporary and persists only for the lifetime of a single frame. Taking advantage of this knowledge and moving allocations of this type into separate memory buffers is essential to combating the adverse effects of fragmentation in a limited-memory system. Any of the above allocation schemes will work, but I would probably suggest going with one of the linear ones, as they are much easier to reset than the pool implementation I’ve described here. Noel Llopis talks about this topic in more detail in this excellent blog post.

The right allocator for you depends on many factors and on what the problem you’re trying to solve demands. The advice I would offer is to think carefully about the patterns of allocations and de-allocations you wish to perform with your system, think about the sizes and lifetimes of those allocations, and try to manage them with allocators that make sense in that context. It can sometimes help to draw memory layouts on paper (yeah, with a pen and everything) or to graph roughly how you expect your memory usage to look over time; believe it or not, this can really help you understand how a system produces and consumes data. Doing these things can help you make calls about how to separate your allocations to make memory management as easy, quick and fragmentation-free as possible. Remember, what I’ve talked about here is just a small selection of some of the simplest strategies that I tend to favour when writing code for console games. It is my hope that if you’ve made it this far, and you weren’t already doing this stuff, your brain is now alive with ideas about code you’ve written in the past which could have taken advantage of these techniques to mitigate the substantial drawbacks associated with dynamic memory allocation, or about other strategies you can exploit to solve your memory management problems without dynamic allocation in limited-memory systems.

In closing, I believe that programmers should be mindful of the impact of making dynamic allocations (especially in console games) and think twice about using the malloc function or the new operator. It is easy to have the attitude that you’re not doing many allocations so it doesn’t really matter, but this type of thinking spread across a whole team quickly snowballs and results in a death by a thousand paper cuts. If not nipped in the bud, fragmentation and the performance penalties arising from the use of dynamic memory allocation can have catastrophic consequences later in your development lifecycle which are not easy to solve. Projects where memory allocation and management are not thought through and managed properly often suffer from random crashes after prolonged play sessions due to out-of-memory conditions (which, by the way, are near impossible to reproduce) and cost hundreds of programmer hours spent trying to free up memory and reorganise allocations.

Remember: You can never start thinking about memory early enough in your project and you’ll always wish you had started earlier!

More Information

Here are some links to related topics, or to topics that I didn’t have time to cover:

http://gamesfromwithin.com/start-pre-allocating-and-stop-worrying

http://en.wikipedia.org/wiki/Circular_buffer

http://en.wikipedia.org/wiki/Buddy_memory_allocation

http://www.memorymanagement.org/articles/alloc.html

Thanks to Sarah, Jaymin, Richard and Dat for proof reading this drivel.



Introduction


Welcome to this, my first technical blog post for a good while and also my debut contribution to the excellent altdevblogaday.com blogging effort!

On my travels around the games industry I have noticed that although many people know about the existence of the Radix Sort, and most know that it’s typically quick (some even know it’s non-comparison based and runs in linear time), a great number of individuals I’ve met don’t seem to know the nuts and bolts of how it actually works! Given this, I thought I’d have a crack at explaining Radix Sort for us mere mortals. With any luck, if you’re scratching your head at the notion of a sort that doesn’t do any comparisons, or wondering how one is able to break free from the shackles of that O(n log n) thing your Comp Sci. professor told you about, then this post will help you through it!

The high-level concept of Radix Sort can be imagined by thinking about what you could do if you had an array of, say, 128 unique 16bit integers that you wanted to sort. What is perhaps the most obvious way to do it, if you didn’t care about the memory utilisation or locality of your results? A simple way would be to use each value in the unsorted array as an index into a results array. Something akin to the following:

for(uint32_t currentValue=0; currentValue<maxCount; ++currentValue) {
    uint32_t value = unsortedArray[currentValue];
    sortedArray[value] = value;
}

Obviously this is pretty wasteful in this form, but this extremely basic concept gives rise to the foundation of how Radix Sort works. Put another way, we aim to put the values into their correct places straight away, without any need to compare them against any other values. Obviously in the above example there are numerous points of failure and problems that we need to deal with, but we’ll get to those shortly.

Okay, so we’ve got the concept of roughly what a Radix Sort aims to do; now how do we go about solving the problems with our naive first-pass implementation? Like all good programmers, we’ll first define the problems we’re trying to solve. Perhaps the most fatal blow to the code above is that the larger the values in the list, the more memory you need to accommodate their placement directly into the correct spot in the results array. This is because of the way we’re placing the values: we just use the value itself to determine where to store it in the corresponding results array. Another problem is collisions. The word “collision” here simply means “when two items map to the same location in the results array,” for example when the same value appears twice (or more) in the input array we’re attempting to sort. For the acquainted, the concept is analogous to that of a collision in the context of a hash map. A further problem, which we actually wouldn’t have with the above (due to it working on 16bit unsigned integers), is what to do with negative values or values of a different type, such as floating point numbers. It is not a stretch of the imagination to contrive use-cases in which our sort should be robust enough to deal with this type of data.


Analysing Data Using Elementary Statistical Tools


One of the reasons I like Radix Sort is that the solutions to these seemingly death-dealing problems come in the form of some mathematical tools you were likely taught at primary school (for those of you that only speak American, read “elementary school”). By performing some rudimentary analysis on the data, we’re able to sort it robustly and efficiently. The analysis that I speak of is something called a histogram, a simple mathematical tool used to depict the distribution of a dataset. In its traditional form a histogram is just a bar chart where the height of the column represents the number of times a given value is present in a set. Check out the example below:

To calculate a histogram for an arbitrary set of values we can simply iterate over all the values in our set. For each value in our array we maintain a running total of the number of times it has been encountered… That’s it! (My apologies if you were expecting something more complex). We have a histogram for our array, and we’re well on our way to conquering Radix Sort. Pause for a moment here, and actually take a look at the histogram for the dataset you’re sorting. It can be interesting and very illuminating to see just how your data is spread out and you may gain some interesting insight.

An astute reader may have spotted that I actually glossed over an important detail here. How exactly do you keep track of a running total for each unique value? Ideally we’d like this to be nice and quick, so we don’t want to use some nefarious mapping data structure to keep track of the histogram. Besides, I did promise that I’d keep things simple, so how about a nice, simple, flat array? The problem is if we’re sorting, for example, 32bit integers and want to use the values themselves as array indices denoting the counter for a specific value, we’re left with the rather inconvenient problem of requiring 2^32 elements just to store the histogram! Not to fear though, as this is where the concept of the radix enters the fray.

If we have a smaller radix, of say 8 bits, and multiple histograms (one for each byte of the 32bit integers we’re sorting, a total of four), then we can use a relatively small amount of memory for the histograms, proportional to the assumable range of the radix. Calculating multiple histograms can be done in parallel (or in the same loop) no matter how many histograms you seek to compute for the dataset; you just shift and mask off the bits you’re interested in for each histogram. Here’s a quick example of what I mean: for an 8bit radix, you’d simply do (x & 0xff), ((x>>8) & 0xff), ((x>>16) & 0xff) and (x>>24) to access each byte of the 32bit value individually. This type of bit shifting should be immediately familiar to any graphics coders out there, as it is often used to access the individual channels of a 32bit colour value. The code to calculate four histograms (one for each byte) from our 32bit integers ends up looking a little something like this:


for(uint32_t currentItem=0; currentItem<maxCount; ++currentItem) {
    const uint32_t value = unsortedArray[currentItem];
    const uint32_t value0 = value & 0xff;
    const uint32_t value1 = (value>>0x8) & 0xff;
    const uint32_t value2 = (value>>0x10) & 0xff;
    const uint32_t value3 = (value>>0x18);
    histogram0[value0]++;
    histogram1[value1]++;
    histogram2[value2]++;
    histogram3[value3]++;
}

At this point I’d like to stress that there is absolutely no reason why you must have an 8bit radix. It is common, yes, but 11bit, 16bit, or whatever you want will also work. When you’re actually implementing this algorithm you should probably try out a few different radix sizes to see which gives you the best results on your target hardware. It’s essentially a trade-off between cache performance when accessing the supporting data structures and doing fewer passes over the input array. In my experience the more cache-friendly solution (i.e.: the smaller radix size and hence smaller histogram/offset table) increasingly tends to perform best as memory latency grows relative to CPU performance. It’s also worth noting that histograms computed for different subsets of a full dataset can be summed to produce the histogram for the entire set; this trick can be useful when doing this type of thing in parallel, but we’ll get to that in due course.
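
As a tiny sketch of that last point (the partial histogram names here are just for illustration), two histograms computed over disjoint slices of the input can simply be added element-wise to give the histogram for the whole set:

for(uint32_t i=0; i<256; ++i)
    histogram0[i] = partialHistogramA[i] + partialHistogramB[i];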


Offset Calculation


If you cast your mind back to the beginning of this post, I stated that the guiding principle of Radix Sort was to place values at the correct places in the output array immediately, without performing any comparisons. To do this we need to know the offset into the output array at which we should start writing each unique value from our input array. The histogram was actually an intermediate step in calculating this list of offsets. To illustrate this I will use a worked example; consider the following list of unsorted numbers:

1, 2, 4, 3, 1, 1, 3, 1, 7, 6, 5

If we wanted to place the value ‘2’ directly into the output array at its correct place we would actually need to place it at index 4 (i.e.: the 5th slot). So the table we’re computing will simply tell us that for any values of ‘2’, we should begin placing them at index 4. We then increment the offset for the value we just placed as we go, so that any subsequently occurring values which are the same (assuming there are any) go just after the one we’ve placed. The offset table we want to calculate for the above example would look something like this:

0 – N/A (There are no 0s in the set, so it doesn’t matter!)
1 – 0
2 – 4
3 – 5
4 – 7
5 – 8
6 – 9
7 – 10

So how do we calculate this offset table from a histogram? That’s easy; each location in the table is just the running total of the histogram counts up to that point. So in this example, the offset for ‘2’ would be 4, because we have no ‘0’s but four ‘1’s, which is a total of 4! The next offset, for ‘3’, is 5 because in our data set we only have one ‘2’, and 0+4+1 is 5. The technique of increasing the offset for a given value as you place it in the output array gives rise to a very subtle but important property of Radix Sort that is vital for implementations which begin at the least significant byte (LSB Radix Sort): the sort is stable. In the context of sorting algorithms, a sort is said to be stable if the ordering of like values in the unsorted list is preserved in the sorted one; we shall see why this is so important for LSB Radix Sort later on. Incidentally, don’t worry about what LSB means just yet, we’ll get to that!
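
In code, the offset table for one of our 8bit histograms is just that running total (an exclusive prefix sum). A minimal sketch, reusing the histogram0/offsetTable0 names from the other snippets, might look like this:

uint32_t offsetTable0[256];
uint32_t runningTotal = 0;
for(uint32_t i=0; i<256; ++i) {
    offsetTable0[i] = runningTotal; /* where values with this byte start */
    runningTotal += histogram0[i];
}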

A quick note at this point: the application of these types of analysis tricks can also be done offline to help you better design a good solution around the data you’re trying to process. I’m hoping that a certain Mr. Acton will be kind enough to share some of the increasingly important statistical tools that have made their way into his bag of tricks at some point on this blog in the future. How about it Mike? :)


What Have We Got So Far?

The data flow chart below shows what we’ve got so far. We’ve successfully applied some elementary data analysis to our data, and in the process computed the only supporting data structure we need: a table of offsets for each radix which tells us where to begin inserting each datum into our output array. I find a fun and useful way to visualise what we’ve just done is to imagine taking the bars of the histogram for each number in our data set, rotating them 90 degrees, and then laying them end to end. You can imagine these bars coming together to form a contiguous block of memory, with each bar being big enough to hold all the occurrences of the particular value it represents. The offset table is just the list of indices along this array where each bucket begins. Now comes the fun part: actually moving the data.

Things are going to get a little more complicated here, but only very, very slightly, I promise. There are actually two ways to approach data movement in Radix Sort: we can either start with the least significant bits or with the most significant bits. Most people with a serial programming mindset tend to go with LSB, so we’ll start there and cover MSB later on, as that has some advantages predominantly for parallel programming.


Serial Offender!

Radix Sort is typically not an in-place sort; that is to say, implementations typically require some auxiliary storage, proportional to the number of elements in the original unsorted list, in order to operate. To move the data around we need to do one pass through our unsorted array for each radix (the number of passes will change depending on the size of each datum being sorted and the size of your radix, of course). Each time we perform a pass, we are actually sorting the data with respect to the particular radix we are considering. I’ll begin by discussing this for LSB and then discuss MSB.

The first pass will typically sort by the first byte, the second by the second byte (but respecting the positioning of those elements with respect to the first), and so on. Hopefully at this point you are able to see both why the stable property of LSB Radix Sort is so important and where it comes from. The data movement pass is just a single pass over the input array for each radix, as mentioned. For each pass, you read the value from the input array, mask off the bits that are relevant to the particular radix you are considering, and then look up the offset from the table of indices we computed earlier. This offset tells you where in the corresponding output array the value should be moved to. When you do this, you need to be sure to increment the offset table entry so that the next value from the input array with the same binary signature will be written to the next element in the array (not on top of the one you just wrote!). That’s all there is to it really! You just do this once for each radix, something like this:


for(uint32_t currentValue=0; currentValue<maxCount; ++currentValue) {
    const uint32_t value = unsortedArray[currentValue];
    const uint32_t offsetTableIndex = value & 0xff;
    const uint32_t writePosition = offsetTable0[offsetTableIndex];
    offsetTable0[offsetTableIndex]++;
    auxiliaryArray[writePosition] = value;
}

So that’s our LSB Radix Sort for integers, but what if you want to sort floating point numbers, or indeed negative numbers? I’m sure if you’re reading this you’re probably aware of the common storage formats of floating point values (the most common being IEEE 754) and two’s complement for signed integers. How does Radix Sort deal with these things? The answer is: perfectly well, if you apply a simple transformation to the input data. In time-honoured fashion I’ll leave this as an exercise for the reader, as I don’t have time to delve into it now.


Making the Best of a Parallel Situation

So we’ve covered how Radix Sort works when starting from the LSB, and it seems to work nicely, so why would we want to complicate things any further? This part will talk about how sorting data beginning with the most significant digits changes the nature of the problem, and how we can use this to our advantage in this increasingly parallel world.

If you think about what you actually have at each stage of the data movement passes, you will see that using MSB Radix Sort gives rise to a particularly interesting property: the first pass over the data actually produces n independent sorting problems, where n is the assumable range of the radix. If you can’t see why the parallel programmer in you should now be rubbing his/her hands together, don’t worry, here’s why!

The first pass over the data in the context of an MSB Radix Sort has the effect of roughly bucketing the data according to its most significant byte (or whatever size radix you’re working with). To put it another way, you know that after each pass of MSB Radix Sort, none of the values in a bucket will ever need to be moved into another bucket (which is of course not the case with LSB Radix Sort). This important property opens up the potential for us to distribute the sorting of the different buckets over multiple processing elements, crucially with each element working on independent data, making synchronisation much simpler. To perform MSB Radix Sort in serial you just re-apply the sort to each bucket independently, considering the next radix down. Another interesting side effect of doing MSB Radix Sort is that each application of the sort makes the data more and more cache coherent for subsequent passes.
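
To make that a bit more concrete, here’s a rough sketch of one serial MSB step with an 8bit radix; this is illustrative only, not a tuned implementation, and the function and buffer names are my own.

#include <stdint.h>
#include <string.h>

/* sort 'count' 32bit values by the byte at 'shift', then recurse into each
bucket on the next byte down. 'aux' is scratch space at least as big as
'data'. call with shift == 24 to kick the whole thing off. */
void msbRadixSort(uint32_t* data, uint32_t count, uint32_t shift, uint32_t* aux) {

    if(count < 2)
        return;

    /* histogram of the byte we're currently considering. */
    uint32_t histogram[256] = {0};
    for(uint32_t i=0; i<count; ++i)
        histogram[(data[i]>>shift) & 0xff]++;

    /* running total gives the starting offset of each bucket. */
    uint32_t offsets[256];
    uint32_t runningTotal = 0;
    for(uint32_t i=0; i<256; ++i) {
        offsets[i] = runningTotal;
        runningTotal += histogram[i];
    }

    /* scatter into the auxiliary buffer, then copy back so each bucket is
    contiguous in 'data'. */
    for(uint32_t i=0; i<count; ++i) {
        const uint32_t bucket = (data[i]>>shift) & 0xff;
        aux[offsets[bucket]++] = data[i];
    }
    memcpy(data, aux, count * sizeof(uint32_t));

    if(shift == 0)
        return;

    /* each bucket is now an independent sub-problem; recurse here, or hand
    the buckets out to other processing elements. */
    uint32_t start = 0;
    for(uint32_t i=0; i<256; ++i) {
        msbRadixSort(data + start, histogram[i], shift - 8, aux + start);
        start += histogram[i];
    }
}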

So, there you have it: Radix Sort explained (hopefully well enough) for humans. I hope after reading this you have a strong mental model of how Radix Sort works in practice and moves data around in memory, and importantly how it is able to break free of the O(n log n) lower limit on comparison-based sorting algorithms. Just a quick disclaimer for some of my more performance-minded readers: I make no guarantees as to the optimality of my implementation; the purpose here was to explain the concepts behind the algorithm, not to write an optimal implementation.



Hello!

A few days back I did an interview with fellow Bizarreo, Charlie Birtwistle, about the technology behind Blur. I’m pleased to announce that it is now up for your reading pleasure over at Eurogamer, see the link below. Any comments or questions are most welcome.

Eurogamer Technology Interview

Enjoy!
Ste


Just a quick note to say I will be speaking at Develop Conference 2010. The talk is called “SPU-Assisted Rendering”, and will cover some of the SPU enhanced rendering we do in “Horizon,” the core technology at Bizarre Creations that powers Blur. Hope to see any of you there!

Cheers,
Ste


This post switches gears a little from my usual stuff, but it’s something I feel quite strongly about so I thought I’d post it. A while back I spent some time at a previous employer playing poker, sprinting, spending a few minutes each morning in “the War Room” and generally dicking about with Post-It notes. In this post I aim to discuss a little about the scrum process and what I believe are its flaws. This post is a jab at scrum and a swipe at the way it’s haphazardly forced upon many studios by corporates after sending someone on a 3-day course and giving them a couple of books.

1. Something has to give!
It’s true enough… Something always has to give. When we consider programming and scrum, that “thing” invariably ends up being testing, as programmers try to cram in as much code as possible to sign off the user stories by the end of the sprint (why they do this is beyond me; as you become more versed in the bullshit that is scrum you become acutely aware of one truth: there are no consequences for failure!). When you work in games with people who’ve been around the block (apart from all the cool hacks and craziness) you’ll occasionally get to hear horror stories about people hacking “features” together to a standard that is just enough to pass a publisher milestone and then re-doing them completely later on in the schedule. Sprinting makes me feel like I’m working at an indie developer in the early ’90s.

2. Programmers suck at planning.
Okay, so perhaps this is a little unfair, but planning and estimating things is hard; no matter how hard you try you’ll probably always end up massively underestimating or overestimating the amount of work for a given task. When it’s the latter, user stories inevitably slip back to the backlog… This doesn’t really matter, as you can just pick them up again on the next sprint, no harm done, right?

3. Tech is implicitly unpredictable.
When you’re working on things that have never been done before, as is often the case at leading-edge development studios, your job is intrinsically unpredictable. Planning things that have never been done before is pretty tricky. Scrum advocates will claim that scrum forces you to break things down and, yeah, breaking them down is helpful, but this isn’t really an attribute of scrum. Any sensible coder will do this anyway.

4. Iteration is unsupported.
Look at any top game out there: what makes it have that 90+ Metacritic? Okay, strong gameplay, style, originality, shit-hot graphics and whatever else, but a big part of it is polish. Polish isn’t really quantifiable; it just takes time, effort and a love for the work. Look at your Drake’s Fortunes, your GTAs or your Geometry Wars. Polish is what makes these games great and guess what? It is entirely unsupported in scrum. I would argue a user story early in the production cycle simply can’t be delivered if it states “final quality”.

5. Why do I get a say?
Why do I have a say on art tasks? From a technical standpoint, I can appreciate that some degree of core technology opinion and influence is probably healthy, but in the main I have as little business telling an artist or designer how to model a character or design the flow of a level as they do telling me the best way to optimise pre-pass lighting on the SPUs. Let’s face it, artists (in the main) know nothing about code and vice versa. Scrum teams mixing disciplines give everyone a say on everything… See point 8.

6. Scrum != Agile.
Is it just me, or is taking an entire sprint to react to changes not very agile?

7. Less done.
A simple equation…
More time fucking about + more time planning to work = less actual work done. But hey, we get to look at those pretty little (hopefully, but mostly not) monotonic burn-down charts in Hansoft, right? Awesome!

8. Too Many Coders Spoil the Broth
Having four people actively developing, refactoring and writing the same systems together is usually stupid. It might increase knowledge sharing, but it reduces code quality, increases hassle and creates bottlenecks, because scrum rigidly insists that problems have to be tackled in priority order. Pair programming I don’t have a problem with; in fact I think it’s great, but again this isn’t a scrum-only thing, coders have been doing it for years.
Simple Tip: If you need to share knowledge (and simply talking to your co-worker who wrote the software in the first place won’t suffice), then just have your producer build in time for documentation…

9. Forced Demonstration
Making teams demonstrate that user stories are complete in really contrived fashions is nothing short of retarded. It wastes my time and the time of the poor, unfortunate senior manager paraded out every month to review your changes. If I have a user story to optimise a function, what benefit is it to anyone if I produce a document demonstrating this to be the case? I’ll give you a hint: fuck all. So what detriment is it? It wastes time, which wastes money; it makes other user stories slip; it invokes a mini-crunch to get features demonstrable (which is bad for morale); and if you fail the user story because the document or demonstration scenario wasn’t written properly, it means more farting about planning to repeat the entire exercise next sprint!

10. Maximum Churn!
Having a bunch of coders committing code into a branch when you’re trying to get a build together is bad enough. The amount of code churn that results when a scrum team collapses an entire sprint’s work into a development line at the same time that several other teams do the same? That just completely sucks. This may sound a little anti-branchy, but it’s really not: if these features were continuously integrated from a team’s branch there would be no problem. Of course, nothing in scrum prevents this from happening, except that in reality it’s more time you need to account for in your sprint plans up front, and who’s to say that another team hasn’t modified a load of stuff you’ve also changed and caused you untold merge pains? It’s also something that will slip at the first sign that you’re going to miss the deadline for your user stories.

Are there any positives? In short, nothing that one can really attribute to the scrum process itself. Team communication, pair programming and breaking tasks down are things that good programmers and teams will do anyway. I must stress before signing this rant off that these are just my opinions of scrum as I’ve experienced it first hand; it may not be the same for everyone else. Feel free to post any comments you have on the matter, it’d be good to hear from anyone who’s had a positive experience of the scrumming process. :) If you’ve made it this far, thanks for reading and see you soon!

Ste


Many programmers might not be aware of how DMAs actually make their way into the DMA queue (DMAQ) on the SPUs, and instead rely on high-level wrappers such as IBM’s mfc_put or mfc_get functions. An unfortunate side-effect of this hole in knowledge, and to some extent of wrapping anything up into black boxes like this, is that it tends to encourage branchy code such as:


if(si_to_uint(should_do_dma))
    mfc_put(/*arguments to mfc_put*/);

First off, depending on the point at which should_do_dma becomes known, this could be potentially awful for SPU program performance. If the value is only determined immediately before the DMA is issued (by immediately, I actually mean less than 11 cycles), then any hbr instructions won’t have enough time to initiate a prefetch into the instruction buffer and the SXU is starved of instructions. If the value is known early enough you get the branch for free and all is well… But why take the risk? ;-)

The first thing to do is to understand how DMAs can be issued from the SPU without the use of these high-level wrappers. Something like this:


wrch $MFC_LSA, $3      # local store address
wrch $MFC_EAH, $4      # effective address (high 32 bits)
wrch $MFC_EAL, $5      # effective address (low 32 bits)
wrch $MFC_Size, $6     # transfer size in bytes
wrch $MFC_TagID, $7    # tag group id
wrch $MFC_Cmd, $8      # command (e.g. put/get) -- this enqueues the DMA

The wrch instruction writes values to special purpose registers, in this case channel registers that reside in the MFC. That’s it! The functions IBM provide don’t seem to do anything more than this. This is all well and good, but how does it help us avoid the branch around all these wrch calls? The key comes in understanding the underlying MFC hardware a little. According to IBM’s presentation on DMA on their own website,

transfer sizes can be 1, 2, 4, 8, and n*16 bytes (n integer)

So what if n were 0? Can we issue 0-byte DMAs? The answer is yes, and herein lies the useful trick. 0-byte DMAs are basically equivalent to NOPs for the MFC. They do nothing. Well, not quite nothing: they still make an entry into the DMAQ, but it gets discarded immediately when the DMA engine tries to process it. Simply replacing the size with 0 in those cases where the DMA should not be issued will eliminate the branch around the DMA. Hint: for selection masks generated by the SPU’s comparison instructions you can do this with a single 2-cycle selb instruction.
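
As a rough sketch of what that might look like with the C intrinsics (this is my own illustrative code, not from a shipping codebase; the function name and parameters are invented):

#include <spu_intrinsics.h>
#include <spu_mfcio.h>

/* branch-free conditional put: 'should_do_dma' is an all-ones/all-zeros mask
straight from an SPU compare. when the mask is zero the transfer size
collapses to 0, which the MFC discards, so the DMA can always be issued. */
static inline void conditionalPut(volatile void* lsAddr,
                                  unsigned int eaHigh, unsigned int eaLow,
                                  unsigned int size, unsigned int tag,
                                  vec_uint4 should_do_dma) {

    const vec_uint4 dmaSize = spu_sel(spu_splats(0u), spu_splats(size), should_do_dma);
    spu_mfcdma64(lsAddr, eaHigh, eaLow, spu_extract(dmaSize, 0), tag, MFC_PUT_CMD);
}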


As a side-project I’m writing a z80 emulator in C. It’s almost completely branch-free, apart from a jump table in the opcode dispatcher and the equivalent return, which I’m currently investigating ways around (kinda hard to do and still maintain portable C — anyway). I was re-implementing the add operation for the z80 this evening, since my last attempt got wiped out when my hard disk at work crashed (always back your work up, kiddies!). To cut a long story short, the z80 has a few flags that get set by various arithmetic and logical operations; one such flag is the half-carry flag, which indicates a carry from bit 3 to bit 4.

After a little bit of research and a few goes at it, I finally relented and asked a particularly experienced co-worker of mine for his take on the situation. Here is the little trick we derived as a result; I thought I’d share the code for it here:

uint32_t new_flgs_h_off = _uint32_and(new_flags_z_n_c, FLAG_H_MASK); /* flags with H cleared */
uint32_t new_flgs_h_on  = _uint32_or(new_flags_z_n_c, FLAG_H);       /* flags with H set */
uint32_t half_carry_a   = _uint32_xor(result, cpu->gprs[REG_A]);
uint32_t half_carry_b   = _uint32_xor(half_carry_a, op->raw_data);
uint32_t half_carry     = _uint32_and(half_carry_b, 0x10);           /* bit 4 carries the answer */
uint32_t half_cry_mask  = _uint32_neg(half_carry);                   /* 0, or 0xfffffff0 when a half-carry occurred */
uint32_t new_flags_h    = _uint32_selb(new_flgs_h_off, new_flgs_h_on, half_cry_mask);

The code is of course branch-free (the selb function used in the final line is a trick I explained a while ago, here). Basically, if you xor together the result of your addition, the value you’re adding to the accumulator and the value in the accumulator itself, and then and the result with 0x10, you can easily find out whether a half-carry occurred during the operation.
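
A quick standalone sanity check of the trick (my own example, not from the emulator): 0x0f + 0x01 carries out of bit 3, and the xor test picks it up.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    const uint32_t a = 0x0f, b = 0x01;
    const uint32_t result = a + b;                         /* 0x10 */
    const uint32_t half_carry = (result ^ a ^ b) & 0x10;   /* 0x10 ^ 0x0f ^ 0x01 = 0x1e, & 0x10 = 0x10 */
    printf("half carry: %s\n", half_carry ? "yes" : "no"); /* prints "yes" */
    return 0;
}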

Ste


