Long shader compile times on iPhone 3G S

I am getting shader compile times of up to 8 seconds on the iPhone 3G S (OS 3.1.2).  The 8 second example consists of an 18 line vertex shader and a 110 line fragment shader, plus a few #define lines.  Shark on Mac OS X tells me that 96.7% of the time is being spent in response to glValidateProgram, of which nearly all goes into CompileShader in the IMGSGX535GLDriver library.  An excerpt:

    96.7%, glValidateProgram, OpenGLES
    96.7%, glValidateProgramARB_Exec, GLEngine
    96.7%, gldUpdateDispatch, IMGSGX535GLDriver
    96.7%, glrLoadCurrentPipelinePrograms, IMGSGX535GLDriver
    94.0%, glrUpdateFragmentProgramInline, IMGSGX535GLDriver
    94.0%, glrUpdateCtxSysFragmentProgram, IMGSGX535GLDriver
    93.9%, sgxUpdateCtxSysProgram, IMGSGX535GLDriver
    93.9%, ppimgCompile, IMGSGX535GLDriver
    93.9%, PVRUniFlexCompileToHw, IMGSGX535GLDriver
    93.8%, CompileShader, IMGSGX535GLDriver

I have a more detailed breakdown if it would help.  As a point of comparison, when running on the iPhone simulator on a MacBook Pro, I get numbers like 0.0006 to 0.01 seconds.

I’m guessing there is some sort of super-linear (and maybe exponential) analysis going on during compilation in the IMGSGX535GL driver.

Strangely, I’ve been unable to find other reports of this problem of terribly slow compile times.

Any ideas on what’s going on and how I can speed things up dramatically?  I’m happy to share my shader programs if anyone is interested.

When I skip glValidateProgram, I get the same sort of delay, but during the first render call.

I am completely baffled, and I would sure appreciate help.

Apple are supporting the iPhone themselves, so I can’t directly help you with this issue. Posting on their developer forum may bring some joy and they also have a bug-tracking system known as “radar” which may be worth inquiring about.

That said, 8 seconds is a very long time - it does sound like something may be wrong here. Inline compilation can be rather slow in general and it often takes a noticeable amount of time in my own runs on iPhone - it’s possible that something about your particular shader takes a long time to compile, especially as 110 lines is rather long for a fragment shader on a mobile platform. I would be interested to see the shader in question, if you don’t mind posting it.

Hi Gordon.  Thanks for the response.  I’ve been unable to get onto the Apple developer forum so far (“There was an error processing your request.”).  I’ll keep trying.

Meanwhile, here is the shader program.  It’s generated by the compiler of a purely functional language I designed & implemented (embedded in Haskell).  The common subexpression eliminator needs work.

Regards,

First, a prelude common to the vertex & fragment shaders, to remove some differences between OpenGL 2.1 and OpenGL ES 2.0:

Code:
// These required names in OpenGL are forbidden in OpenGL ES
#define gl_ModelViewProjectionMatrix ModelViewProjectionMatrix_u
#define gl_NormalMatrix NormalMatrix_u

precision highp float;

uniform mat4 gl_ModelViewProjectionMatrix;
uniform mat3 gl_NormalMatrix;

#define _uniform time_u
#define _attribute uv_a
#define _varying_F uv_v
#define _varying_S pos_v

Next, the vertex shader:
Code:
uniform   float _uniform;
attribute vec2 _attribute;
varying   vec2 _varying_F;
varying   vec3 _varying_S;
void main () {
    vec2  x15 = _attribute;
    float x27 = x15.y;
    vec2  x39 = vec2(2.0,9.0);
    float x34 = sin(dot(x39,vec2(_uniform,sqrt(dot(x15,x15)))));
    float x53 = 1.0 / exp(log(1.0 + sqrt(dot(x15,x15))) * 3.0);
    float x79 = cos(_uniform);
    float x81 = sin(_uniform);
    vec3  x10 = vec3(x15.x
                    ,dot(vec2(x27,- x34 * x53),vec2(x79,x81))
                    ,dot(vec2(x34 * x53,x27),vec2(x79,x81)));
    gl_Position = gl_ModelViewProjectionMatrix * vec4(x10,1.0);
    _varying_F = _attribute;
    _varying_S = x10;
}

Finally, the fragment shader:
Code:
uniform   float _uniform;
varying   vec2 _varying_F;
varying   vec3 _varying_S;
void main () {
    vec2  x33 = vec2(0.3,0.3);
    vec2  x42 = vec2(1.1,1.1);
    vec2  x47 = vec2(0.11111111,0.11111111);
    vec2  x54 = vec2(1.0,1.0);
    vec2  x63 = vec2(1.5,1.5);
    vec2  x67 = vec2(0.14285715,0.14285715);
    vec2  x71 = vec2(-0.5,-0.5);
    vec2  x78 = vec2(0.5,0.5);
    vec2  x87 = vec2(1.0,1.0);
    vec2  x96 = vec2(1.5,1.5);
    vec2  x100 = vec2(0.14285715,0.14285715);
    float x110 = cos(_uniform);
    float x112 = sin(_uniform);
    float x104 = dot(vec2(x110,x112),_varying_F);
    float x117 = dot(vec2(x110,- x112),_varying_F.yx);
    vec2  x127 = vec2(1.0,1.0);
    vec2  x136 = vec2(0.3,0.3);
    vec2  x145 = vec2(1.1,1.1);
    vec2  x149 = vec2(0.11111111,0.11111111);
    vec2  x156 = vec2(1.0,1.0);
    vec2  x165 = vec2(1.5,1.5);
    vec2  x169 = vec2(0.14285715,0.14285715);
    vec2  x173 = vec2(-0.5,-0.5);
    vec2  x180 = vec2(0.5,0.5);
    vec2  x189 = vec2(1.0,1.0);
    vec2  x198 = vec2(1.5,1.5);
    vec2  x202 = vec2(0.14285715,0.14285715);
    vec2  x206 = vec2(1.0,1.0);
    bool  x16 = dot(1.0 / (x33 + cos(x42 * vec2(_uniform,_uniform)) * x47) *
                    ((x54 + sin(x63 * vec2(_uniform,_uniform)) * x67) * (x71 +
                                                                         mod(x78 +
                                                                             1.0 /
                                                                             (x87 +
                                                                              sin(x96 *
                                                                                  vec2(_uniform
                                                                                      ,_uniform)) *
                                                                              x100) *
                                                                             vec2(x104
                                                                                 ,x117)
                                                                            ,x127)))
                   ,1.0 / (x136 + cos(x145 * vec2(_uniform,_uniform)) * x149) *
                    ((x156 + sin(x165 * vec2(_uniform,_uniform)) * x169) *
                     (x173 + mod(x180 + 1.0 / (x189 + sin(x198 * vec2(_uniform
                                                                     ,_uniform)) *
                                                      x202) * vec2(x104,x117)
                                ,x206)))) <= 1.0;
    vec2  x292 = vec2(2.0,9.0);
    vec2  x303 = _varying_F;
    float x298 = dot(x303,x303);
    float x289 = dot(x292,vec2(_uniform,sqrt(x298)));
    float x287 = cos(x289);
    float x316 = 1.0 / (2.0 * sqrt(x298));
    float x327 = x303.x;
    float x329 = sin(x289);
    float x349 = dot(x303,x303);
    float x343 = 1.0 + sqrt(x349);
    float x338 = log(x343) * 3.0;
    float x336 = exp(x338);
    float x334 = 1.0 / x336;
    float x362 = 1.0 / (x336 * x336);
    float x372 = exp(x338);
    float x382 = 1.0 / x343;
    float x389 = 1.0 / (2.0 * sqrt(x349));
    float x276 = dot(vec2(x287 * (9.0 * (x316 * (x327 + x327))),x329)
                    ,vec2(x334
                         ,x362 * (x372 * (x382 * (x389 * (x327 + x327)) *
                                          3.0))));
    float x402 = sin(_uniform);
    float x404 = cos(_uniform);
    float x437 = x303.y;
    float x415 = dot(vec2(x287 * (9.0 * (x316 * (x437 + x437))),x329)
                    ,vec2(x334
                         ,x362 * (x372 * (x382 * (x389 * (x437 + x437)) *
                                          3.0))));
    float x471 = - x415 * x402;
    float x265 = dot(- (vec2(x276,x276)) * vec2(x402,x404)
                    ,vec2(x415 * x404,x404) + vec2(x402,x471));
    float x479 = - x415 * x404;
    float x484 = - x402;
    float x476 = x479 + x484;
    float x486 = x404 + x471;
    float x252 = 1.0 / sqrt(dot(vec3(vec2(x265,x476),x486)
                               ,vec3(vec2(x265,x476),x486)));
    vec3  x240 = gl_NormalMatrix * (vec3(x252,x252,x252) * vec3(x265
                                                               ,vec2(x479
                                                                    ,x404) +
                                                                vec2(x484
                                                                    ,x471)));
    vec3  x228 = vec3(1.0 / sqrt(dot(x240,x240))) * x240;
    vec3  x514 = vec3(-0.5962848,-0.2981424,-0.745356);
    float x223 = abs(dot(x228,x514));
    vec3  x546 = vec3(0.5962848,0.2981424,0.745356);
    vec3  x562 = vec3(0.0,0.0,-2.5);
    float x517 = exp(log(max(0.0
                            ,dot(vec3(2.0 * dot(x228,x514)) * x228 + x546
                                ,vec3(1.0 / sqrt(dot(x562 - _varying_S
                                                    ,x562 - _varying_S))) *
                                 (x562 - _varying_S)))) * 15.0);
    gl_FragColor = vec4(dot(vec2(x16 ? 0.0 : 1.0,0.5)
                           ,vec2(0.2 + 0.4 * x223,x517))
                       ,dot(vec2(x16 ? 0.0 : 1.0,0.5)
                           ,vec2(0.2 + 0.4 * x223,x517))
                       ,dot(vec2(x16 ? 0.0 : 1.0,0.5)
                           ,vec2(0.2 + 0.4 * x223,x517))
                       ,(x16 ? 0.0 : 1.0) * (1.0 + x223) + x517);
    
}

This shader program is a procedural surface with a procedural texture, with lighting computed per pixel, including per-pixel normal calculation.  Calling glValidateProgram (after glLinkProgram) takes about 8 seconds.

I will improve my high-level language compiler (the one that generates this code), which will probably reduce the GLSL compilation times and perhaps improve shader performance.  Still, I will want to generate shaders of about the complexity above, and 8 seconds times several shader programs is killing my start-up.

After asking around it appears you’re not the only person who has had difficulty with Apple’s forum recently - I can’t log in myself, unfortunately.

I ran your shaders through the offline compiler we have in PVRUniSCoEditor - obviously this isn’t the same compiler, it’s running on different hardware etc. etc.

The vertex shader appears to compile in a reasonable amount of time and gives 60 cycles in total.

The fragment shader takes a long time to compile (roughly 4-5 seconds) and has a total cycle count of 356. The line that seems to contribute most to the compile time is the one on which bool x16 is set - it comes out as 212 cycles all by itself. If I replace this line with something trivial then the compilation time reduces dramatically.

It’s a little difficult to do any more as I haven’t had time to really analyse what the shaders do, but this line looks like a good place to start optimising. You may be able to reduce its complexity and in turn it may compile faster and run faster.
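
As a rough sketch of how that line might be reduced (not compiled or timed on the device; the names a, b, p and v are made up here): the two arguments to dot() in the x16 statement appear to be the same expression, and every vec2 constant in it has equal components, so it can be written with scalars and evaluated once. x104 and x117 are the values already defined earlier in the fragment shader. Results may differ from the original in the last bits because of rounding.

Code:
// Hypothetical replacement for the "bool x16 = ..." statement.
// a and b depend only on the time uniform, so they could also be
// computed on the CPU and passed in as uniforms.
float a = 1.0 / (0.3 + cos(1.1 * _uniform) * 0.11111111);
float b = 1.0 + sin(1.5 * _uniform) * 0.14285715;
// The original multiplies vec2(x104,x117) by 1.0 / b before the mod.
vec2  p = vec2(x104,x117) / b;
// mod(0.5 + p, 1.0) - 0.5 wraps each component of p into [-0.5, 0.5).
vec2  v = (a * b) * (mod(0.5 + p, 1.0) - 0.5);
// dot(v,v) replaces the dot of two identical long expressions.
bool  x16 = dot(v,v) <= 1.0;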

Performance-wise: if you can move calculations from the fragment shader to the vertex shader you should get a performance boost (assuming fewer vertices than pixels); it’s also possible that adjusting the precision qualifiers will speed up the final shader. Apple’s docs on shader performance are pretty good and we have our own recommendations as well.

How many shaders do you generate this way, and what is your performance target? This shader, or others of similar complexity, are hardly suitable for real-time performance on the iPhone 3GS.

However, there are huge optimisation opportunities (you could likely achieve a 5x-15x performance increase by hand-optimising this particular shader). Especially the line for x16 can be massively simplified, but there are other cases with large possible savings as well:

- replace expressions consisting purely of uniforms and constants with uniforms
- calculate common expressions once
- move linear operations on varyings into the vertex shader
- use pow(x, y) instead of exp(log(x) * y) (or use exp2/log2)
- use normalize() to normalize a vector
- use float(bool) instead of bool ? 0.0 : 1.0 and invert the condition
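
A few of these rewrites, sketched as small helper functions so they stand on their own (not compiled or timed on the device; the function names are placeholders introduced here, and the comments point at the lines of the posted shaders they would correspond to):

Code:
// pow instead of exp(log(x) * 3.0), as in x53 and x336.
// Assumes x > 0.0, which holds for 1.0 + sqrt(...).
// (x * x * x would also do for this fixed exponent.)
float cubed(float x) { return pow(x, 3.0); }

// pow instead of exp(log(max(0.0, d)) * 15.0), as in x517.
float specular(float d) { return pow(max(0.0, d), 15.0); }

// normalize() instead of vec3(1.0 / sqrt(dot(v,v))) * v, as in x228.
vec3 unit(vec3 v) { return normalize(v); }

// float(bool) with the condition inverted, instead of x16 ? 0.0 : 1.0.
float outside(bool inside) { return float(!inside); }

With something like these, x228 would become unit(x240), and the three identical dot(...) arguments in gl_FragColor could be computed once into a local float and reused.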

Improvements in your high-level compiler should allow you to get much better performance and compile times. However, when targeting a device like a mobile phone it is important to optimise as much as possible. If the number of shaders you generate this way is relatively small, it might make sense to create hand-optimised versions of the shaders, or at least some of them.

Hi Gordon.  Thanks much for the experiments & feedback.  I will certainly improve my own compiler (generating GLSL code), which I expect will improve GLSL compilation times.   Perhaps execution performance will improve as well, though I wonder if those long GLSL compilation times are already making up for deficiencies in my compiler.

Still, I can’t help but wonder if there are some unfortunate choices of algorithms (and data structures) inside the IMG GLSL compilers that would cause such high compilation times – perhaps some quadratic, cubic, or even exponential-time algorithms.

I have hundreds of interactive shaders, all generated from simple, composable, high-level specifications.  Execution speed varies among okay, almost okay, and better than okay, and will improve as I improve my compiler.  (Examples of previous generations of this compiler can be found on my web pages about Pan, Vertigo, and Pajama.)  I want to make these shaders available in a pleasant 3D visual index, which means quick compilation is very important to me.  Moreover, I have a means of direct-manipulation (non-syntactic) construction of these interactive shaders by end-users (described in a blog post with pointers to a research paper and Google tech talk).  For interactive construction, quick compilation is even more important.

I realize that compilation on the iPhone will be slower than compilation on my MacBook Pro, but the difference is so enormous that I suspect it has more to do with the compiler implementation than the execution platform.  Of course I can only speculate.

Would someone at IMG be willing to profile the GLSL compiler with my troublesome example and see if some inefficiencies surface?  Efficiency improvements to IMG’s compiler combined with quality improvements to my front-end compiler could get start-up times down to the range I’m looking for.

I’ve also been wondering whether the iPhone supports off-line compilation (binary shaders).  So far, I’ve been unable to find any statement one way or the other.  Does anyone here know?  Off-line compilation could help me out tremendously for pre-generated shader programs.

Much appreciated,

Hi Xmas.  I appreciate your specific suggestions for optimization.  Some are on my list already, and others are new to me.

In answer to your question, there will be at least hundreds of these shaders generated by my compiler (which will be improved a good deal before releasing).  My intent is to make it easy for other people to generate these shaders, both through my purely functional high-level language and through a direct-manipulation interface, as well as automatic generation through artificial evolution (of the high-level specifications) with user-driven selection.  Then I expect there to be many thousands of generated shaders.

Regards,

By the way, if anyone wants to try running this shader code, just feed it a mesh of XY (vec2) pairs, where each of X & Y range from -1.0f to 1.0f.

conal wrote:
I've also been wondering whether the iPhone supports off-line compilation (binary shaders). So far, I've been unable to find any statement one way or the other. Does anyone here know? Off-line compilation could help me out tremendously for pre-generated shader programs.

Currently, Apple is not allowing pre-compiled shaders on the iPhone (I forget where this is actually said) - I think this is to avoid fragmentation problems. The speed of Imagination's compilers is something we're always looking at. I can't discuss the implementation on the iPhone, I'm afraid.

I've passed your fragment shader to our compiler team to help with their optimisation efforts.