"for" or "if" keyword performance problem

When I use "for" or "if" in a fragment shader, performance is very poor.

Why does this happen?

Like this:

for (mediump int i = 0; i < 16; i++) {
    // note: float(i) is required here; dividing an int by a float
    // does not compile in GLSL ES, which has no implicit conversion
    tmp = abs(left[0] - texture2D(Texture, TexCoord - vec2(float(i)/w, 0.0) + vec2(-1.0/w, -1.0/h)).r)
        + abs(left[1] - texture2D(Texture, TexCoord - vec2(float(i)/w, 0.0) + vec2( 0.0,   -1.0/h)).r)
        + abs(left[2] - texture2D(Texture, TexCoord - vec2(float(i)/w, 0.0) + vec2( 1.0/w, -1.0/h)).r)
        + abs(left[3] - texture2D(Texture, TexCoord - vec2(float(i)/w, 0.0) + vec2(-1.0/w,  0.0)).r)
        + abs(left[4] - texture2D(Texture, TexCoord - vec2(float(i)/w, 0.0) + vec2( 0.0,    0.0)).r)
        + abs(left[5] - texture2D(Texture, TexCoord - vec2(float(i)/w, 0.0) + vec2( 1.0/w,  0.0)).r)
        + abs(left[6] - texture2D(Texture, TexCoord - vec2(float(i)/w, 0.0) + vec2(-1.0/w,  1.0/h)).r)
        + abs(left[7] - texture2D(Texture, TexCoord - vec2(float(i)/w, 0.0) + vec2( 0.0,    1.0/h)).r)
        + abs(left[8] - texture2D(Texture, TexCoord - vec2(float(i)/w, 0.0) + vec2( 1.0/w,  1.0/h)).r);

    // presumably the intent is to keep the best match; "min" is renamed
    // here because it shadows the built-in min() function
    if (tmp < minSAD) {
        minSAD = tmp;
        dis = i;
    }
}



Hi chlghduf314,

The main problem you have here is a very large number of dependent texture reads. If you compile your code in our PVRUniSCoEditor shader text editing tool, you can obtain cycle counts for each line of code, which lets you assess the complexity of a shader without having to run your application on your target hardware. Note that the cycle counts generated are only an estimate and do not scale with the number of times a loop is iterated; the figure the compiler reports can be read as the cost of a single pass through the 'for' loop in your shader.

Every time a dependent texture read is encountered on our hardware, execution stalls until the texture data has been retrieved. When a render contains only a small number of texture reads, the performance drop you might expect from this stall is hidden by the hardware's ability to schedule other operations while the texture data is fetched. As the number of dependent texture reads increases, the pool of operations that do not themselves rely on dependent texture reads shrinks, and a significant bottleneck is introduced: the hardware runs out of work it can perform with the data it has cached and sits idle, waiting for data to be retrieved from external memory.
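To make the distinction concrete, a dependent texture read is any fetch whose coordinates are computed within the fragment shader rather than taken directly from a varying. A minimal sketch (the names here are illustrative only, not from your code):

```glsl
// Non-dependent read: the coordinate arrives unmodified from a varying,
// so the hardware can prefetch the texel before the shader runs.
lowp vec4 a = texture2D(Texture, TexCoord);

// Dependent read: the coordinate is computed in the fragment shader,
// so the fetch cannot begin until this ALU work has finished.
lowp vec4 b = texture2D(Texture, TexCoord + vec2(1.0 / w, 0.0));
```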

For this reason, even highly optimised applications running on current high-end mobile hardware would struggle to maintain interactive frame rates with more than two dependent texture reads per fragment.

One quick way to improve your code is to calculate all of your texture coordinates in your vertex shader. This not only requires less processing (there are generally far fewer vertex invocations per render than fragment invocations), but it also allows the hardware to retrieve texture information independently of the fragment shader's execution.
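As a sketch of that idea (attribute and uniform names here are assumptions, and only the three fixed column offsets of the 3x3 neighbourhood are shown, since the number of varyings available is limited):

```glsl
// Vertex shader: precompute fixed texture-coordinate offsets so the
// fragment shader can sample through varyings with no coordinate maths.
attribute vec4 aPosition;
attribute vec2 aTexCoord;
uniform mediump float w, h;     // texture width and height
varying mediump vec2 vTapLeft, vTapCentre, vTapRight;

void main()
{
    gl_Position = aPosition;
    vTapLeft   = aTexCoord + vec2(-1.0 / w, 0.0);
    vTapCentre = aTexCoord;
    vTapRight  = aTexCoord + vec2( 1.0 / w, 0.0);
}
```

The fragment shader then calls texture2D(Texture, vTapLeft) and so on directly, turning those fetches into non-dependent reads.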

If you read through the performance recommendations document in our SDK (also available here), you will find a number of ways to optimise the shader code you have presented.

Joe 2010-10-08 10:11:53

Thank you!

When applying image filters with large kernels, as it seems you are doing, it’s also important to check whether there is any work that can be shared between adjacent fragments. Is the filter separable? You may be better off splitting the filter into multiple passes.
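To illustrate separability with a simpler filter than your SAD loop: a 3x3 box blur factors into a horizontal pass and a vertical pass, reducing 9 fetches per fragment to 3 + 3 across the two passes. A sketch of the first pass (uniform and varying names are assumptions):

```glsl
// Pass 1 (horizontal): average three texels along x, writing the result
// to an intermediate render target.
uniform sampler2D Texture;
uniform mediump float w;        // texture width
varying mediump vec2 vTexCoord;

void main()
{
    mediump float s = texture2D(Texture, vTexCoord + vec2(-1.0 / w, 0.0)).r
                    + texture2D(Texture, vTexCoord).r
                    + texture2D(Texture, vTexCoord + vec2( 1.0 / w, 0.0)).r;
    gl_FragColor = vec4(s / 3.0);
}

// Pass 2 (vertical) is identical but offsets by 1.0 / h along y,
// sampling the intermediate target from pass 1.
```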

About two thirds of the texture fetches in that loop are redundant, because the same locations were already sampled in the previous loop iteration.
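One way to exploit that: since each step of the loop shifts the 3x3 window by one texel, two of its three columns were fetched in the previous iteration and can be cached instead of re-fetched, cutting the fetches per iteration from 9 to 3. A sketch against the declarations in your snippet (the comparison variable is renamed minSAD here, as "min" shadows the built-in):

```glsl
mediump float col0[3], col1[3], col2[3];  // left/centre/right columns

// prime the centre and right columns for the first window (i == 0)
col1[0] = texture2D(Texture, TexCoord + vec2(0.0,   -1.0/h)).r;
col1[1] = texture2D(Texture, TexCoord).r;
col1[2] = texture2D(Texture, TexCoord + vec2(0.0,    1.0/h)).r;
col2[0] = texture2D(Texture, TexCoord + vec2(1.0/w, -1.0/h)).r;
col2[1] = texture2D(Texture, TexCoord + vec2(1.0/w,  0.0)).r;
col2[2] = texture2D(Texture, TexCoord + vec2(1.0/w,  1.0/h)).r;

for (mediump int i = 0; i < 16; i++) {
    mediump vec2 base = TexCoord - vec2(float(i) / w, 0.0);

    // fetch only the new leftmost column of this window
    col0[0] = texture2D(Texture, base + vec2(-1.0/w, -1.0/h)).r;
    col0[1] = texture2D(Texture, base + vec2(-1.0/w,  0.0)).r;
    col0[2] = texture2D(Texture, base + vec2(-1.0/w,  1.0/h)).r;

    tmp = abs(left[0] - col0[0]) + abs(left[1] - col1[0]) + abs(left[2] - col2[0])
        + abs(left[3] - col0[1]) + abs(left[4] - col1[1]) + abs(left[5] - col2[1])
        + abs(left[6] - col0[2]) + abs(left[7] - col1[2]) + abs(left[8] - col2[2]);

    if (tmp < minSAD) {
        minSAD = tmp;
        dis = i;
    }

    // shift the window: this iteration's left and centre columns become
    // the next iteration's centre and right columns
    col2[0] = col1[0]; col2[1] = col1[1]; col2[2] = col1[2];
    col1[0] = col0[0]; col1[1] = col0[1]; col1[2] = col0[2];
}
```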