First of all, Happy New Year to everyone! Hope 2010 is going to be a fun year. I just finished migrating this blog to a new server, all tied nicely together with the domain name. So some of your old bookmarks to articles (if you had any) might be invalid, and you’ll need to dig through and find them again :( Sorry!

Watch this space, as we will return to regular programming very soon :)


Check it out below :)


This post switches gears a little from my usual stuff, but it’s something I feel quite strongly about so I thought I’d post it. A while back I spent some time at a previous employer, playing poker, sprinting, spending a few minutes each morning in “the War Room” and generally dicking about with Post-It notes. In this post I aim to discuss a little about the scrumming process and what I believe are its flaws. This post is a jab at scrum and a swipe at the way it’s haphazardly forced upon many studios by corporates after sending someone on a 3-day course and giving them a couple of books.

1. Something has to give!
It’s true enough… Something always has to give. When we consider programming and scrum, that “thing” invariably ends up being testing, as programmers try to cram in as much code as possible to sign off the user stories by the end of the sprint. (Why they do this is beyond me; as you become more versed in the bullshit that is scrum you become acutely aware of one truth: there are no consequences for failure!) When you work in games with people who’ve been around the block you’ll occasionally get to hear (apart from all the cool hacks and craziness) horror stories about people hacking “features” together to a standard that is just enough to pass a publisher milestone and then re-doing them completely later on in the schedule. Sprinting makes me feel like I’m working at an indie developer in the early ’90s.

2. Programmers suck at planning.
Okay, so perhaps this is a little unfair, but planning and estimating things is hard. No matter how hard you try, you’ll probably always end up massively underestimating or overestimating the amount of work for a given task. When it’s the former, user stories inevitably slip back to the backlog… This doesn’t really matter, as you can just pick them up again on the next sprint, no harm done, right?

3. Tech is implicitly unpredictable.
When you’re working on things that have never been done before, as is often the case at leading-edge development studios, your job is intrinsically unpredictable. Planning things that have never been done before is pretty tricky. Scrum advocates will claim that scrum forces you to break things down, and yeah, breaking them down is helpful, but this isn’t really an attribute of scrum. Any sensible coder will just do this anyway.

4. Iteration is unsupported.
Look at any top game out there, what makes it have that 90+ metacritic? Okay, strong gameplay, style, originality, shit hot graphics and whatever else, but a big part of it is polish. Polish isn’t really quantifiable, it just takes time and effort and a love for the work. Look at your Drake’s Fortunes, your GTAs or your Geometry Wars. Polish is what makes these games great and guess what? It is entirely unsupported in scrum. I would argue a user story early in the production cycle simply can’t be delivered if it states “final quality”.

5. Why do I get a say?
Why do I have a say on art tasks? From a technical standpoint, I can appreciate that some degree of core technology opinion and influence is probably healthy, but in the main I have as little business telling an artist or designer how to model a character or design the flow of a level as they do telling me the best way to optimise pre-pass lighting on the SPUs. Let’s face it, artists (in the main) know nothing about code and vice versa. Scrum teams mixing disciplines give everyone a say on everything… See point 8.

6. Scrum != Agile.
Is it just me, or is taking an entire sprint to react to changes not very agile?

7. Less done.
A simple equation…
More time fucking about + more time planning to work = less actual work done. But hey, we get to look at those pretty little (hopefully, but mostly not) monotonic burn-down charts in Hansoft, right? Awesome!

8. Too Many Coders Spoil the Broth
Having four people actively developing, refactoring and writing the same systems together is usually stupid. It might increase knowledge, but it reduces code quality, increases hassle and creates bottlenecks because scrum rigidly insists that problems have to be tackled in priority order. Pair programming I don’t have a problem with, in fact I think it’s great, but again this isn’t a scrum-only thing, coders have been doing this for years.
Simple Tip: If you need to share knowledge (and simply talking to your co-worker who wrote the software in the first place won’t suffice), then just have your producer build in time for documentation…

9. Forced Demonstration
Making teams demonstrate that user stories are complete in really contrived fashions is nothing short of ridiculous. It wastes my time and the time of the poor, unfortunate senior manager paraded out every month to review your changes. If I have a user story to optimise a function, what benefit is it to anyone if I produce a document demonstrating this to be the case? I’ll give you a hint: Fuck all. So what detriment is it? It wastes time, which wastes money, it makes other user stories slip, it invokes a mini-crunch to get features demonstrable (which is bad for morale) and if you fail the user story because the document or demonstration scenario wasn’t written properly, it means more farting about planning to repeat the entire exercise next sprint!

10. Maximum Churn!
Having a bunch of coders committing code into a branch when you’re trying to get a build together is bad enough. The amount of code churn that results when a scrum team collapses an entire sprint’s work into a development line at the same time that several other teams do the same? That just completely sucks. This may sound a little anti-branchy but it’s really not. If these features were continuously integrated from a team’s branch there would be no problem. Of course, nothing in scrum prevents this from happening, except that in reality it’s more time that you need to account for in your sprint plans up front. Who’s to say that another team hasn’t modified a load of stuff you’ve also changed and caused you untold merge pains? It’s also something that will slip at the first sign that you’re going to miss the deadline for your user stories.

Are there any positives? In short, nothing that one can really attribute to the scrum process. Team communication, pair programming and breaking tasks down are things that good programmers and teams will do anyway. I must stress before signing this rant off that these are just my opinions of scrum based on what I’ve experienced first hand; it may not be the same for everyone else. Feel free to post any comments you have on the matter, it’d be good to hear from anyone who’s had a positive experience of the scrumming process. :) If you’ve made it this far, thanks for reading and see you soon!

Ste


Just a quick note for those of you asking me for the Develop North slides. It’s going to be 2010 before I might be able to release them. They have to go through some pre-approval process at work before they can be released to the public at large. Unfortunately for the slides, but fortunately for everyone else, everyone is kinda busy right now making Blur even more stunningly beautiful and fun than it already is :) So it’s not so high on our priority list right now.


Many programmers might not be aware of how Dmas actually make their way into the DMAQ on the SPUs and instead rely on high-level wrappers, such as IBM’s mfc_put or mfc_get functions. An unfortunate side-effect of this hole in knowledge, and also to some extent of wrapping anything into black boxes like this, is that it tends to encourage branchy code such as:


if (si_to_uint(should_do_dma))
    mfc_put(/*arguments to mfc_put*/);

First off, depending on the point at which should_do_dma becomes known, this could be potentially awful for SPU program performance. If the value is determined immediately before the Dma is issued (by immediately, I actually mean less than 11 cycles), then any hbr instructions won’t have enough time to initiate a prefetch into the instruction buffer and the SXU is starved of instructions. If it is known early enough, you get the branch for free and all is well… But why take the risk? ;-)

The first thing to do is to understand how Dmas can be issued from the SPU without the use of these high-level wrappers. Something like this:


wrch $MFC_LSA, $3
wrch $MFC_EAH, $4
wrch $MFC_EAL, $5
wrch $MFC_Size, $6
wrch $MFC_TagID, $7
wrch $MFC_Cmd, $8

The wrch instruction writes values to special purpose registers, in this case, registers that reside in the MFC. That’s it! The functions IBM provide don’t seem to do anything more than this. This is all well and good, but how does it help us avoid the branch around all these wrch instructions? The key comes in understanding the underlying MFC hardware a little. According to IBM’s presentation on Dma on their own website,

transfer sizes can be 1, 2, 4, 8, and n*16 bytes (n integer)

So what if n were 0? Can we issue 0 byte Dmas? The answer is yes, and herein lies the useful trick. 0 byte Dmas are basically equivalent to NOPs for the MFC. They do nothing. Well, not quite nothing, they still make an entry into the DMAQ, but it gets discarded immediately when the Dma engine tries to process it. Simply replacing the size with 0 for those cases where Dma should not be issued will eliminate this branch around the Dma. Hint: For selection masks generated by the SPU’s comparison instructions you can do this with a single 2 cycle selb instruction.
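
As a rough illustration of the idea, here is a minimal sketch in portable C rather than SPU intrinsics. The helper names (select_bits, mask_from_bool, dma_size_or_zero) are made up for this example, and the mask is assumed to be all ones or all zeros, as the SPU’s comparison instructions would produce:

```c
#include <stdint.h>

/* Portable model of a bitwise select: takes bits from b where mask is 1
   and from a where mask is 0. Hypothetical helper for this sketch. */
static inline uint32_t select_bits(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & ~mask) | (b & mask);
}

/* Turns a zero/non-zero condition into an all-zeros/all-ones mask
   with no explicit branch in the source. */
static inline uint32_t mask_from_bool(uint32_t condition)
{
    return 0u - (uint32_t)(condition != 0u);
}

/* Keeps the requested size when the mask is all ones and substitutes 0
   (a do-nothing Dma) when the mask is all zeros. */
static inline uint32_t dma_size_or_zero(uint32_t size, uint32_t should_do_dma_mask)
{
    return select_bits(0u, size, should_do_dma_mask);
}
```

The size computed this way would then be written to the size channel unconditionally, so the same sequence of wrch instructions runs whether or not any data actually moves.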


Just a quick note to say that I will be speaking at Develop North on 5th November 2009. The title of the talk is “A Bizarre Way to do Real-Time Lighting”, and I will be delivering the 2nd half of the session, which talks mostly about our PS3 implementation of the lighting system for Blur.

It looks like there are going to be some interesting sessions there, even though the schedule does look a little production heavy :) If anyone else is going please feel free to say hello! :)

See you there!
Ste


A common thing I do in the debugger is to take a raw value of a memory address that I know points to a vector of some particular type, cast it to a pointer to that type, and then postfix it with “, n”, where n is the number of elements I want the debugger to show me in the watch window. I guess I must have had it too good with ProDG for too long, as doing this just works. I tried the same in Visual Studio and was getting some very odd behavior: it was showing me the array, but I was unable to expand the elements of that array to view the contents of the structure encapsulated in the vector. Incidentally, Visual Studio 2008’s watch window doesn’t even come equipped with a horizontal scroll bar, making it impossible to simply drag the width of the value column to such an extent that the value could be viewed.

After scratching my head a little I called over one of my colleagues who has a particular penchant for Visual Studio. He was equally puzzled for a minute or so, before spotting the problem… The space between the comma and the value of n! “(Foo*)0xbaadf00d, n” doesn’t work as expected, but “(Foo*)0xbaadf00d,n” does. Given that sensible coders put a space between function arguments, I think Visual Studio has a bug here that needs sorting out. :)

Hopefully I can get back to ProDG land soon and make the pain go away :-)

Ste


As a side-project I’m writing a z80 emulator in C. It’s almost completely branch-free apart from a jump table in the op-code dispatcher and the equivalent return, which I’m currently investigating ways around (kinda hard to do and still maintain portable C — anyway). I was re-implementing the add operation for the z80 this evening since my last attempt got wiped out when my hard disk at work crashed (always back your work up kiddies!). To cut a long story short, the z80 had a few flags that would get set on various arithmetic and logical operations, one such flag was the half-carry flag, which indicated a carry from bit 3 to bit 4.

After a little bit of research and a few goes at it, I finally relented and asked a particularly experienced co-worker of mine for his take on the situation. Here is the little trick we derived as a result, I thought I’d share my code for it here:

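/* Two candidate flag words: H cleared (assuming FLAG_H_MASK has every bit but H set) and H set. */
uint32_t new_flgs_h_off = _uint32_and(new_flags_z_n_c, FLAG_H_MASK);
uint32_t new_flgs_h_on  = _uint32_or(new_flags_z_n_c, FLAG_H);
/* result ^ accumulator ^ operand leaves the carry into each bit position. */
uint32_t half_carry_a   = _uint32_xor(result, cpu->gprs[REG_A]);
uint32_t half_carry_b   = _uint32_xor(half_carry_a, op->raw_data);
/* Bit 4 is set if there was a carry from bit 3 into bit 4. */
uint32_t half_carry     = _uint32_and(half_carry_b, 0x10);
/* Negating 0x10 sets bit 4 and every bit above it; negating 0 stays 0. */
uint32_t half_cry_mask  = _uint32_neg(half_carry);
uint32_t new_flags_h    = _uint32_selb(new_flgs_h_off, new_flgs_h_on, half_cry_mask);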

The code is of course branch-free (the selb function used in the terminal line is a trick I explained a while ago, here). Basically, if you xor together the result of your addition, the value you’re adding to the accumulator and the value in the accumulator itself, and then and the result with 0x10, you can find out easily if a half-carry occurred during this operation.
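
For anyone who wants to try the trick outside of my emulator, here is a self-contained sketch in plain C. The helper names are invented for this example, and it assumes the H flag lives in bit 4 (0x10) of the flags byte and that the accumulator value passed in is the one from before the add:

```c
#include <stdint.h>

#define Z80_FLAG_H 0x10u /* assumed position of the half-carry flag */

/* Bitwise select: bits from b where mask is 1, from a where mask is 0. */
static inline uint32_t select_bits(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & ~mask) | (b & mask);
}

/* Returns 0x10 if adding operand to a carried out of bit 3, else 0.
   Each sum bit is a ^ operand ^ carry_in, so xoring the three values
   leaves only the carry into each bit position. */
static inline uint32_t half_carry_bit(uint32_t a, uint32_t operand, uint32_t result)
{
    return (a ^ operand ^ result) & 0x10u;
}

/* Performs an 8-bit add and returns flags with only H updated. */
static inline uint32_t add8_update_h(uint32_t flags, uint32_t a, uint32_t operand)
{
    uint32_t result = (a + operand) & 0xFFu;
    uint32_t carry  = half_carry_bit(a, operand, result);
    uint32_t mask   = 0u - (carry >> 4); /* all ones on half-carry, else 0 */
    return select_bits(flags & ~Z80_FLAG_H, flags | Z80_FLAG_H, mask);
}
```

For example, 0x0F + 0x01 carries out of bit 3 and so sets H, whereas 0x01 + 0x01 does not and clears it.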

Ste


This post is concerned with bit-wise trickery. Most games studios have a great mix of experience and youth when it comes to the coding team. Sometimes I’d be in a position to ask one of my more “ol’ skool” work colleagues for his take on a particular situation that I initially thought could be solved through some bitwise manipulation. Not to take anything away from the guys I’ve asked over the months and years, but their responses are almost always based on an encyclopedic knowledge of bitwise trickery and intuition that they’ve developed over the years of thrashing the shit out of PIC chips, BBC Micros, Amigas, Ataris, Nintendo Entertainment Systems and PSX-family hardware and not from any discernible formula or theorem. In the unlikely event that the problem is something they have never seen before, while usually being helpful in arriving at the solution, they will often have no way of immediately confirming or denying my suspicions. Enter Warren’s Word-Parallel Transformation Theorem.

While making my way through a book I’ve recently been fortunate enough to acquire, I found a reference to a somewhat obscure paper entitled “Functions Realizable with Word-Parallel Logical and Two’s Complement Addition Instructions”. The paper characterises which functions mapping one word of bits to another can be composed from such instructions. (It is worth noting at this point that the use of “word” here is not meant to convey some finite length, such as 16 bits in x86 processors, 32 bits in STI’s Cell chip or 64 bits in Larrabee, but rather any amount of contiguous binary that your imagination cares to conjure.)

THEOREM. A function mapping words to words can be implemented with word-parallel add, subtract, and, or, and not instructions if and only if each bit of the result depends only on bits at and to the right of each input operand.
- Henry S. Warren, Jr,
“Functions Realizable with Word-Parallel Logical and Two’s Complement Addition Instructions”,
Communications of the ACM 20, 1977.

Warren explains the concept much more fully (for those not following the argument) in “Hacker’s Delight”, but basically if each bit of the result can be computed from the input bits at and to the right of that bit’s position, then the transformation function can be expressed by a series of logical operations and additions. The paper talks only about logical operations (and, or, xor, nand, etc.) and two’s complement addition, but Warren offers that the technique can be extended to incorporate other operations. Remember, however, that there are nearly always optimisations one can make to the transformation function yielded by this process (the results are by no means optimal solutions), but it serves as a good indicator that you can perform such transformations in this manner to begin with. :)
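
To make the condition a bit more concrete, here are three classic functions that satisfy it, sketched in plain C with function names of my own choosing. In each one, a given bit of the result depends only on input bits at or to the right of that position, and sure enough each boils down to one subtraction plus one logical operation:

```c
#include <stdint.h>

/* Keeps only the lowest set bit, e.g. 0x58 -> 0x08. */
static inline uint32_t isolate_lowest_set_bit(uint32_t x)
{
    return x & (0u - x);
}

/* Clears the lowest set bit, e.g. 0x58 -> 0x50. */
static inline uint32_t clear_lowest_set_bit(uint32_t x)
{
    return x & (x - 1u);
}

/* Sets every bit below the lowest set bit, e.g. 0x58 -> 0x5F. */
static inline uint32_t fill_below_lowest_set_bit(uint32_t x)
{
    return x | (x - 1u);
}
```

By contrast, something like a logical shift right needs each result bit to look at an input bit to its left, so by the theorem no combination of these instructions alone can produce it.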

The book “Hacker’s Delight” is a fantastic read and without doubt one of the best references dedicated to the science (and art) of bit twiddling.

See ya!


It’s been a little while since I posted something in my mini-series of SPU tricks entries. A couple of days ago I was taking a casual look at my web traffic stats and noticed quite a few people were finding their way here through googling for examples of how the shufb instruction works. So I thought I’d provide an explanation and some examples in the hope of helping out those people who are looking for this information, as it can be a little cryptic to deduce this from the SPU ISA docs during a first read. The shufb instruction takes the form:

shufb rt, ra, rb, rm

It has a latency of 4 cycles and executes on the odd pipeline. As you can see, it has four operands, all of which are registers. The first operand (as with all instructions in the SPU ISA) is the target register, i.e. the register in which the result of the operation will be placed. The next two operands (ra and rb) are the two quadwords that will be manipulated by the qword-length pattern that resides in the register indicated by the 4th and final operand. It is how this manipulation is controlled by the 4th operand that is interesting and the topic of this post.

A shuffle pattern is a qword-length value that works on a byte level. Each of the 16 bytes in the quadword controls the contents of the corresponding byte in the target register. I.e. the 0th byte of the pattern qword controls the value that will ultimately be placed into the 0th byte of the target register, the 1st byte controls the value placed into the 1st byte of the target register, and so on, for all 16 bytes of the qword. Here is an example shuffle pattern:

const vector unsigned char _example1 =
    { 0x00, 0x11, 0x02, 0x13,
      0x04, 0x15, 0x06, 0x17,
      0x08, 0x19, 0x0a, 0x1b,
      0x0c, 0x1d, 0x0e, 0x1f };

The above pattern performs a “perfect shuffle”, but on a byte level (the term “perfect shuffle” typically refers to the interleaving of bits from two words). The lower 4 bits of each byte can essentially be thought of as an index into the 1st or 2nd operand qword. Similarly, the upper 4 bits can be thought of as an index into the registers alluded to in the instruction’s operands. Since there are only two, we need only concern ourselves with the LSB of this 4 bit group, i.e. 0x0x (where the 2nd x denotes some other value of the lower 4 bits of the byte) would index into the contents of the ra register, and 0x1x would index into rb. There are other special cases which load a corresponding byte with useful values, which I will discuss later. An example will hopefully aid us in our understanding:

const vector unsigned char _example2 =
    { 0x00, 0x01, 0x02, 0x03,   // word 0 from ra
      0x14, 0x15, 0x16, 0x17,   // word 1 from rb
      0x08, 0x09, 0x0a, 0x0b,   // word 2 from ra
      0x1c, 0x1d, 0x1e, 0x1f }; // word 3 from rb
qword pattern = (const qword)_example2;
qword ra = si_ilhu(0x3f80); // ra contains: 1.0f, 1.0f, 1.0f, 1.0f
qword rb = si_ilhu(0x4000); // rb contains: 2.0f, 2.0f, 2.0f, 2.0f
qword result = si_shufb(ra, rb, pattern); // result contains: 1.0f, 2.0f, 1.0f, 2.0f

In many programs the simple inlining of shuffle patterns for your assorted data manipulation requirements will suffice, but since the terminal operand to shufb is simply a register there is absolutely nothing to stop you computing the patterns dynamically in your program or forming them with the constant formation instructions (as should be preferred when lower latency can be achieved than the 6 cycle load from the local store; see this previous post for more details about this). Dynamic shuffle pattern computation is actually critical to performing unaligned loads from the local store in a vaguely efficient manner.

To wrap up I will quickly cover the special cases I mentioned earlier. There are three such “special values” (at least that I am aware of) defined by the current version of the SPU ISA: 0x80, 0xc0 and 0xe0. Each simply places a specific value into the corresponding byte of the target register. 0x80 loads 0x00 into the target byte, whereas 0xc0 loads 0xff and 0xe0 loads 0x80. Strictly speaking, the hardware only considers the top 2 or 3 bits respectively for these special case patterns: if the pattern 0x88 were present in one of the bytes of a shuffle pattern it would still clear the bits in the designated byte of the target register. For simplicity and readability I find it advisable to stick to the above immediately recognisable constants when hard-coding shuffle patterns.
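
If you want to experiment with patterns away from the hardware, the byte selection rules described in this post can be modelled in a few lines of portable C. This is only a sketch of my understanding of the instruction, not IBM’s code; the qword_model type and shufb_model function are names invented for the example:

```c
#include <stdint.h>

/* Stand-in for a 128-bit SPU register, byte 0 being the leftmost byte. */
typedef struct { uint8_t b[16]; } qword_model;

/* Software model of shufb: each pattern byte picks the value of the
   corresponding result byte. */
static inline qword_model shufb_model(qword_model ra, qword_model rb,
                                      qword_model pattern)
{
    qword_model rt;
    for (int i = 0; i < 16; ++i) {
        uint8_t sel = pattern.b[i];
        if ((sel & 0xC0u) == 0x80u) {
            rt.b[i] = 0x00;                  /* 10xxxxxx -> 0x00 */
        } else if ((sel & 0xE0u) == 0xC0u) {
            rt.b[i] = 0xFF;                  /* 110xxxxx -> 0xff */
        } else if ((sel & 0xE0u) == 0xE0u) {
            rt.b[i] = 0x80;                  /* 111xxxxx -> 0x80 */
        } else {
            uint8_t index = sel & 0x0Fu;     /* byte within the register */
            rt.b[i] = (sel & 0x10u) ? rb.b[index] : ra.b[index];
        }
    }
    return rt;
}
```

Feeding it a pattern byte of 0x13 picks byte 3 of rb, while 0x88 produces 0x00 just as 0x80 does, which matches the behaviour described above.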

As an aside, I have always thought it a shame that there was not a special case pattern defined for the value 0x3f, which would of course allow one to construct various IEEE 754 floating point values (notably 1.0f, whose bit pattern is 0x3f800000) trivially during shuffles, perhaps useful for loading the w component of a homogeneous vector with 1. But a SPU wish-list is another post altogether ;)

Thanks for reading,
Ste


