Sorting float array in fragment shader without causing dynamic branches in OpenGL ES 2.0 (iPad 3)

Hi guys



We have a somewhat complicated fragment shader, which we cannot split into smaller shaders.



One of the thing the shader does is calculate the distance of the fragment/pixel to 7 other points, and find the two nearest ones.



On the iPad 3 (PowerVR SGX 543), we are having huge performance problems with this part of the code, and Apple Instruments tells us the bottleneck is the shader code which leads to dynamic branches in the GPU.



We sacrificed readability and tried several code changes to avoid these dynamic branches, but up to now we haven’t found a solution:


  • We changed the loop so that it’s not really a loop, but a repeating series of instructions.
  • We simplified the code so that it contains only very simple if-like instructions.



    This is the ‘simplest’ version of the code we came up with:




bool isLowest;
float currentDistance, lowestDistance, newLowest;
lowestDistance = distances[1];
int indexOfLowest = 1, newIndexOfLowest;

// 'loop' for index 2
currentDistance = distances[2];
isLowest = currentDistance < lowestDistance;
newLowest = isLowest ? currentDistance : lowestDistance;
lowestDistance = newLowest;
newIndexOfLowest = isLowest ? 2 : indexOfLowest; // BAD LINE
indexOfLowest = newIndexOfLowest;

// 'loop' for index 3
currentDistance = distances[3];
isLowest = currentDistance < lowestDistance;
newLowest = isLowest ? currentDistance : lowestDistance;
lowestDistance = newLowest;
newIndexOfLowest = isLowest ? 3 : indexOfLowest; // BAD LINE
indexOfLowest = newIndexOfLowest;

// etc. until index 8



As you can see, the code finds the index of the lowest distance. In our opinion, this code can be executed without dynamic branches. The last four lines of one 'loop' are just simple calculations, they are not real branches.

The problem is the second-last row in each 'loop', where indexOfLowest gets a new value (marked as BAD LINE).

If we comment out one of these lines, everything works fine on 60 FPS and Instruments doesn't report dynamic branches. But with this line, it's not more than 8 FPS.

How can we rewrite this code such that it does not cause dynamic branches in the GPU?

We just analyzed the code using PowerVR's PVRShaderEditor (previously PVRUniSCoEditor), where it shows the expected GPU cycles per shader source code line. The line in question (BAD LINE in the code above) is shown with only 1-2 GPU cycles. The fact that this line causes dynamic branches which lead to huge performance problems is not mentioned. Any other ideas how we could debug this code, so we can understand why it's causing these branches?

Hi Cheesus,



The ? operator is entirely equivalent to an if/else statement and counts as a branch, there’s really no difference or saving to be had by switching between them - it’s purely a style choice. There’s not really any good way to avoid this happening, you really just need to be smart about them and reduce them to a minimal state. In point of fact there are actually 4 dynamic branches in each part of your loop, but 2 are optimised out.



So first I’ve mentioned that

a = b ? c : d

is equivalent to:

if (b)<br />
{<br />
a = c;<br />
}<br />
else<br />
{<br />
c = d;<br />
}<br />

```<br />
Each if statement counts as a single branch, as does each else statement. So an if/else block like that has two branches. You may have noticed at this point where your 4 branches are, if not, here they are:<br />
<br />

currentDistance = distances[3];

isLowest = currentDistance < lowestDistance;

/*Here’s 2 */newLowest = isLowest ? currentDistance : lowestDistance;

lowestDistance = newLowest;

/*Here’s 2 */newIndexOfLowest = isLowest ? 3 : indexOfLowest; // BAD LINE

indexOfLowest = newIndexOfLowest;

<br />
The first two are being optimised out, as the determining statement is equivalent to a min() operation, which is always doable without a branch. The second two are not however, because of how you've written them.<br />
<br />
Depending on the iOS compiler (of which I'm afraid we have no specific knowledge), it may be optimising your second branch to a single if statement, as the else statement essentially  boils down to <code>indexOfLowest=indexOfLowest</code><br />
The other dynamic branch is, however, largely unavoidable. There might be a way around it, but if you manually optimise your shader to reflect the single branch it should help. I've already done it here for your convenience:<br />
<br />
<code>	bool isLowest;<br />
float currentDistance, lowestDistance, newLowest;<br />
lowestDistance = distances[1];<br />
int indexOfLowest = 1, newIndexOfLowest;<br />
<br />
// 'loop' for index 2<br />
currentDistance = distances[2];<br />
lowestDistance = min(currentDistance,lowestDistance);<br />
if (currentDistance &lt; lowestDistance)<br />
{<br />
indexOfLowest = 2;<br />
}<br />
<br />
// 'loop' for index 3<br />
currentDistance = distances[3];<br />
lowestDistance = min(currentDistance,lowestDistance);<br />
if (currentDistance &lt; lowestDistance)<br />
{<br />
indexOfLowest = 3;<br />
}<br />
<br />
// 'loop' for index 4<br />
currentDistance = distances[4];<br />
lowestDistance = min(currentDistance,lowestDistance);<br />
if (currentDistance &lt; lowestDistance)<br />
{<br />
indexOfLowest = 4;<br />
}<br />
<br />
<br />
// 'loop' for index 5<br />
currentDistance = distances[5];<br />
lowestDistance = min(currentDistance,lowestDistance);<br />
if (currentDistance &lt; lowestDistance)<br />
{<br />
indexOfLowest = 5;<br />
}<br />
<br />
<br />
// 'loop' for index 6<br />
currentDistance = distances[6];<br />
lowestDistance = min(currentDistance,lowestDistance);<br />
if (currentDistance &lt; lowestDistance)<br />
{<br />
indexOfLowest = 6;<br />
}<br />
<br />
<br />
// 'loop' for index 7<br />
currentDistance = distances[7];<br />
lowestDistance = min(currentDistance,lowestDistance);<br />
if (currentDistance &lt; lowestDistance)<br />
{<br />
indexOfLowest = 7;<br />
}</code><br />
<br />
It's also worth noting the difference between a static and dynamic branch. If the USSE can figure out the result of a branch before shader execution (either because it uses consts or uniforms) it will be a static branch, and effectively have 0 cost. A dynamic branch is anything that branches based on values that have come in through either a varying or a texture read, and should generally be avoided.<br />
<br />
I <i>think</i> it is in fact possible to pack the data in such a way that you can avoid doing a dynamic branch, but I need to do a bit of investigation into how to do that. I'll get back to you when I figure it out. It's worth noting that this won't be a free win though - you may have to sacrifice a bit of precision in your distances variable as well use as a few cycles. So I'd suggest trying the code I've provided here first, let me know if it works.<br />
<br />
Thanks,<br />
Tobias

Thank you for your comprehensive answer. I tested the code you suggested but Instruments still reports dynamic branches, and we still get lousy FPS.



However, I came up with a possible solution which should avoid dynamic branches. Instead of if or ? : clauses, It uses the GPU function ceil() to distinguish cases:





for (i = 2; i < 8; i++) {

currentDistance= distNormSqr;



// not needed anymore

//isLowest = currentDistance < lowestDistance;



// set ftmp3 to a positive float if isLowest, else 0.0

ftmp3 = max(0.0, lowestDistance - currentDistance);



// set ftmp3 to 1.0 if isLowest, else 0.0

ftmp3 = ceil(ftmp3);



// set ftmp3 to (i - indexOfLowest) if isLowest, else 0.0

ftmp3 *= float(i - indexOfLowest);



// the following line sets indexOfLowest to i if isLowest, else it leaves it unchanged

indexOfLowest += int(ftmp3);



lowestDistance = min(currentDistance, lowestDistance);

}





When I tested it, it didn’t perform any better than what we had before and Instruments still reports dynamic branches, but I’m going to investigate it further tomorrow.

An even simpler version would be:





for (i = 2; i < 8; i++) {

currentDistance= distNormSqr;



bool isLowest = currentDistance < lowestDistance;



// set ftmp3 to 1.0 if isLowest, else 0.0

ftmp3 = float(isLowest);



// set ftmp3 to (i - indexOfLowest) if isLowest, else 0.0

ftmp3 *= float(i - indexOfLowest);



// the following line sets indexOfLowest to i if isLowest, else it leaves it unchanged

indexOfLowest += int(ftmp3);



lowestDistance = min(currentDistance, lowestDistance);

}





Still finding out why this still causes dynamic branches.

Hi Cheesus,



My guess would be that the dynamic branch is occurring because of the for loop in this particular case more than anything else, but I assume you got the same result with the unrolled loop? There’s nothing here that really sticks out as dynamic branching.



Is instruments telling you where the dynamic branches are? I tried to get the same result but got no such warning, and performance seemed okay. Which device are you actually running this on that gets you poor performance?



One other thing I noticed - are you supposed to be running your indices from 1-8? It looks like you’re only accessing 8 elements but skipping element[0] - is this intentional? Is your array 9 in size?



Thanks,

Tobias

Thank you Tobias, it was indeed the loop which caused the dynamic branches. I thought a simple loop with fixed limits wouldn’t cause dynamic branches. A static loop can either be either ‘unrolled’ by the compiler, and even if the compiler doesn’t unroll it, it shouldn’t cause performance problems because it doesn’t cause the multiple GPU cores to execute different branches.



Or am I getting something wrong?



Thank you again!

It really should be able to unroll it but it’s something that is apparently not happening in either our or Apple’s implementations. I believe the reason for the dynamic branch is to do with the way the lookup is handled when you use “i” to reference the array, but I can’t give more details than that.



Have you looked at using Aras Pranckevičius’ glsl optimiser? (aras-p.info/blog/2010/09/29/glsl-optimizer/) This will automatically unroll loops for you, alongside other optimisations, so you can code largely how you want - it usefully also obfuscates any IP sensitive shaders. It’s also built into PVRShaderEditor , which will give you cycle cost estimates of your shader as well, so might be worth using anyway if you’re not already.



Thanks,

Tobias

Thanks for your help and the suggestions!