Hi guys
We have a somewhat complicated fragment shader, which we cannot split into smaller shaders.
One of the things the shader does is calculate the distance from the fragment/pixel to 7 other points and find the two nearest ones.
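To make the task concrete: written as a plain loop, this step looks roughly like the sketch below. It is not our actual shader code. The names `u_points`, `POINT_COUNT`, and `findTwoNearest` are made up for illustration, the 0-based indexing differs from the snippet further down, and the `highp` default precision is an assumption.

```glsl
precision highp float;               // assumption: highp is available in the fragment stage

const int POINT_COUNT = 7;           // the "7 other points" from the description
uniform vec2 u_points[POINT_COUNT];  // hypothetical uniform holding the points

// Writes the indices of the nearest and second-nearest point for position p.
void findTwoNearest(vec2 p, out int nearest, out int second) {
    float dNearest = 1.0e20;         // larger than any real distance (needs highp range)
    float dSecond  = 1.0e20;
    nearest = -1;
    second  = -1;
    for (int i = 0; i < POINT_COUNT; i++) {
        float d = distance(p, u_points[i]);
        if (d < dNearest) {
            // New closest point: the old closest becomes the runner-up.
            dSecond  = dNearest;
            second   = nearest;
            dNearest = d;
            nearest  = i;
        } else if (d < dSecond) {
            // Not the closest, but closer than the current runner-up.
            dSecond = d;
            second  = i;
        }
    }
}
```

This is the naive form with real `if`/`else` branches, shown only to pin down what has to be computed.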
On the iPad 3 (PowerVR SGX 543), we are having huge performance problems with this part of the code, and Apple Instruments tells us the bottleneck is this shader code, which causes dynamic branches on the GPU.
We sacrificed readability and tried several code changes to avoid these dynamic branches, but so far we haven’t found a solution:
- We changed the loop so that it’s not really a loop, but a repeating series of instructions.
- We simplified the code so that it contains only very simple if-like instructions.
This is the ‘simplest’ version of the code we came up with:
bool isLowest;
float currentDistance, lowestDistance, newLowest;
lowestDistance = distances[1];
int indexOfLowest = 1, newIndexOfLowest;
// 'loop' for index 2
currentDistance = distances[2];
isLowest = currentDistance < lowestDistance;
newLowest = isLowest ? currentDistance : lowestDistance;
lowestDistance = newLowest;
newIndexOfLowest = isLowest ? 2 : indexOfLowest; // BAD LINE
indexOfLowest = newIndexOfLowest;
// 'loop' for index 3
currentDistance = distances[3];
isLowest = currentDistance < lowestDistance;
newLowest = isLowest ? currentDistance : lowestDistance;
lowestDistance = newLowest;
newIndexOfLowest = isLowest ? 3 : indexOfLowest; // BAD LINE
indexOfLowest = newIndexOfLowest;
// etc. until index 8
As you can see, the code finds the index of the lowest distance. In our opinion, this code can be executed without dynamic branches: the last four lines of each 'loop' are just simple selects, not real branches.
The problem is the second-to-last line in each 'loop', where indexOfLowest gets a new value (marked as BAD LINE).
If we comment out one of these lines, everything runs fine at 60 FPS and Instruments doesn't report dynamic branches. But with this line, we get no more than 8 FPS.
How can we rewrite this code such that it does not cause dynamic branches in the GPU?
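One direction is to keep the index as a float and replace the `?:` selects with the built-ins `step()`, `min()` and `mix()`, as in the sketch below. The function name `indexOfLowestDistance` and the array size of 9 (so that indices 1 to 8 are valid) are assumptions made for illustration. Whether the SGX 543 compiler really emits branch-free code for this is an open question, not a measured result.

```glsl
precision mediump float;

// Hypothetical helper: returns the index (as a float) of the smallest entry
// in distances[1..8]. Entry 0 is unused, matching the 1-based snippet above.
float indexOfLowestDistance(float distances[9]) {
    float lowestDistance = distances[1];
    float indexOfLowest  = 1.0;

    // Constant loop bounds; this can also be unrolled by hand as before.
    for (int i = 2; i <= 8; i++) {
        float currentDistance = distances[i];

        // 1.0 if currentDistance < lowestDistance, else 0.0.
        // step(edge, x) returns 0.0 when x < edge, so the result is inverted
        // to keep the strict '<' of the original comparison.
        float isLowest = 1.0 - step(lowestDistance, currentDistance);

        // Arithmetic select instead of '?:' on an int.
        indexOfLowest  = mix(indexOfLowest, float(i), isLowest);
        lowestDistance = min(lowestDistance, currentDistance);
    }
    return indexOfLowest;
}
```

The caller can convert the result with `int(...)` if an integer index is needed; the values 1.0 to 8.0 are exactly representable, so no rounding is involved.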
We just analyzed the code using PowerVR's PVRShaderEditor (previously PVRUniSCoEditor), which shows the expected GPU cycles per line of shader source code. The line in question (BAD LINE in the code above) is shown with only 1-2 GPU cycles. The tool gives no indication that this line causes dynamic branches or the huge performance problems that follow from them. Any other ideas on how we could debug this code, so we can understand why it's causing these branches?