iPad GL ES 2.0 poor performance on texture blend

Hi, I’ve hit a bit of a wall with some code I’m working on for one of the apps we’re developing for iPad. The whole thing was running very slowly, so it came time to pull it apart and look at each component to see why. This shader appears to be the main cause of one of our bottlenecks.


The purpose of this shader is to blend two textures together: one is created on the fly from a previous framebuffer’s render target, the other is the overlay image.


We’ve pulled the whole app apart and are now just testing this with two textures blended onto a quad the size of the glViewport. One texture is RGBA, the other is greyscale. The effect we’re looking for is the same as the Overlay layer blend mode in Photoshop.


The size of the textures doesn’t seem to have any effect on the speed: two 1024x1024 textures run at 18fps, and two 64x64 textures blended together also run at 18fps.


Changing the size of the viewport does make it run faster, however. Fullscreen on the iPad gives us 18fps; running at iPhone resolution (320x480) on the iPad gives us 60fps. Have we hit some limit for rendering in GLES 2.0 on the iPad, or is something else in our shader causing this? We’re not sure how to fix it. The fragment shader we’re using looks fairly simple, so I’m surprised it’s causing such a slowdown on the device.





This is all being run on the device, not in the simulator. Any ideas on how we could improve or simplify this shader whilst still keeping the effect we’re after? Or is the iPad GPU just not up to doing fullscreen GLSL scenes?








#define VECTOR_ONE   vec3( 1.0, 1.0, 1.0 )
#define VECTOR_HALF  vec3( 0.5, 0.5, 0.5 )

uniform sampler2D s_renderTexture;
uniform sampler2D s_overlayMap;
varying mediump vec2 myTexCoord;

lowp vec4 baseColor;
lowp float overlayTexture;
mediump vec3 val;
mediump vec3 finalMix;

//This Runs at < 18fps
void main()
{
     //Get the Texture colour values
     baseColor      = texture2D(s_renderTexture, myTexCoord);
     overlayTexture = texture2D(s_overlayMap, myTexCoord).r;

     //Set the mix value to 0.0 or 1.0 depending on the baseColor.rgb, this causes the mix to use
     //either argument 1 (if val = 0.0) or argument 2 (if 1.0) which is faster than an if() branch
     val      = step( VECTOR_HALF, baseColor.rgb );
     finalMix = mix( (2.0 * baseColor.rgb * overlayTexture),
                     (1.0 - 2.0 * (VECTOR_ONE - baseColor.rgb) * (VECTOR_ONE - overlayTexture)),
                     val );

     //Set the Fragments colour
     gl_FragColor = vec4( finalMix, 1.0 );
}

The “overlay” blend mode can be interpreted as “change the intensity of a color channel depending on the distance from 0.5 multiplied by the top layer intensity (interpreted as a number in the range [-1, 1])”.





Try this shader:


Code:

uniform sampler2D s_renderTexture;
uniform sampler2D s_overlayMap;
varying mediump vec2 myTexCoord;

void main()
{
     //Get the Texture colour values
     lowp vec3 baseColor = texture2D(s_renderTexture, myTexCoord).rgb;
     lowp float overlayTexture = texture2D(s_overlayMap, myTexCoord).r;
     
     lowp vec3 finalMix = baseColor + (overlayTexture - 0.5) * (1.0 - abs(2.0 * baseColor - 1.0));
     
     //Set the Fragments colour
     gl_FragColor = vec4( finalMix, 1.0 );
}
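
In case it helps to see why this should give the same result as the step/mix version: splitting on the base value recovers both halves of the usual overlay formula. A quick check, writing b for one channel of baseColor and o for overlayTexture (my shorthand, not names from the shader):

```latex
\begin{align*}
f(b,o) &= b + \left(o - \tfrac{1}{2}\right)\left(1 - |2b - 1|\right) \\[4pt]
b < \tfrac{1}{2}: \quad f &= b + \left(o - \tfrac{1}{2}\right)(2b) = 2bo \\[4pt]
b \ge \tfrac{1}{2}: \quad f &= b + \left(o - \tfrac{1}{2}\right)(2 - 2b) = 2b + 2o - 2bo - 1 = 1 - 2(1-b)(1-o)
\end{align*}
```

Those are the two arguments the original shader passes to mix(), selected by step(0.5, b), so the output should match apart from rounding at lowp precision.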

Thanks, this is exactly what I was looking for and has now cut the number of cycles on this shader from 36 to only 11. Much faster than before.


I think I may owe you a pint.

Actually, looking at this, perhaps I’ve hit some kind of hardware barrier. I’ve only just run the shader on the device, as it was out of the office this morning. Using this shader and rendering just one billboard with two textures (a test build of the app that simply renders a quad and overlays texture A onto texture B, with no other rendering carried out) I get 50fps. That’s much better than the 18fps I was getting before, thank you very much (you’re still getting a pint). However, I expected a better framerate seeing as the shader is now fairly lightweight (only 11 instructions); I was hoping to get 60fps.





This test was carried out with the two textures sized at 1024x1024. I then lowered the texture size to 16x16 to see if this increased speed. It didn’t: rendering two 16x16 textures with this shader to a fullscreen glViewport (768x1024 on the iPad) still runs at 50fps.


Am I missing a trick here? Resizing the glViewport to smaller resolutions does increase speed, but I didn’t think the viewport size would be such a bottleneck. I expected a performance hit when rendering textures smaller than their native size, but not 10fps. I’ve stripped everything out of the renderer other than binding a framebuffer, binding a vertex and index buffer, two texture binds, two glEnableVertexAttribArray calls (one for vertices, one for texture co-ords), a glDrawElements and the unbinding of the VBO and IBO.


Still a consistent 50 fps. There’s absolutely nothing going on in the scene other than rendering two textures with the above shader.


Have I now just hit the hardware barrier for the iPad?

It sounds like you’re fragment shader limited rather than limited by your texture bandwidth. Reducing the size of the viewport reduces the number of fragments that you have to process, which would explain the performance increase you see at lower resolutions.


11 cycles per fragment is still fairly intensive for an effect that covers the whole screen. What is the Photoshop-style effect you are trying to achieve with your greyscale image?

1024x768@50fps is ~40Mpix/s. While I can’t go into details about iPad hardware, the number of shader cycles you get per pixel on today’s handheld devices with high-res screens is quite limited.
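
Spelling that arithmetic out with the figures quoted in this thread (treating the reported per-fragment cycle counts as a rough cost estimate, which is an assumption on my part):

```latex
\begin{align*}
1024 \times 768 &= 786{,}432 \ \text{fragments per frame} \\
786{,}432 \times 50\ \text{fps} &\approx 39.3 \times 10^{6} \ \text{fragments/s} \\
39.3 \times 10^{6} \times 11\ \text{cycles} &\approx 4.3 \times 10^{8} \ \text{shader cycles/s} \\
786{,}432 \times 18\ \text{fps} \times 36\ \text{cycles} &\approx 5.1 \times 10^{8} \ \text{shader cycles/s} \\
320 \times 480 \times 60\ \text{fps} &\approx 9.2 \times 10^{6} \ \text{fragments/s}
\end{align*}
```

The old and new shaders land in the same ballpark of a few hundred million cycles per second at fullscreen, and the 320x480 case pushes roughly a quarter of the fragments. Both observations are consistent with the fragment shader being the limit rather than texture size.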





The following shader might actually run slightly better (use GL_ALPHA format instead of GL_LUMINANCE for the second texture).


Code:

uniform sampler2D s_renderTexture;
uniform sampler2D s_overlayMap;
varying mediump vec2 myTexCoord;

void main()
{
     //Get the Texture colour values
     lowp vec3 baseColor = texture2D(s_renderTexture, myTexCoord).rgb;
     //Overlay is read from .a here because the texture is uploaded as GL_ALPHA
     lowp float overlayTexture = texture2D(s_overlayMap, myTexCoord).a;

     lowp vec3 iasb = 1.0 - abs(2.0 * baseColor - 1.0);
     lowp vec3 finalMix = baseColor + (overlayTexture - 0.5) * iasb;

     //Set the Fragments colour
     gl_FragColor = vec4( finalMix, 1.0);
}

GL_LUMINANCE gives me an increase of 5fps over using GL_ALPHA.


I also changed the order of operations on the following lines

//lowp vec3 iasb = 1.0 - abs(2.0 * baseColor - 1.0);
lowp vec3 iasb = -(abs(2.0 * baseColor - 1.0)) + 1.0;
//lowp vec3 finalMix = baseColor + (overlayTexture - 0.5) * iasb;
lowp vec3 finalMix = iasb * (overlayTexture - 0.5) + baseColor;

to get an increase of 3-4fps, which brought this shader back up to the 50fps the previous version was getting.


Otherwise Apple’s shader compiler seems to get the same fps from both versions of the shader, even when carrying out the first multiply step. So no real improvement there, although I’m quite happy to get 50fps on a fullscreen render using this shader, as I’m more than likely aiming for a render speed of 30fps, which is a decent framerate for the app we’re working on.


Thanks for the assist.





Now to tackle the second, more important and more complex shader I’m trying to speed up, which comes from the following tech paper Valve released.


Valve Alpha Tested magnification





Got it down to two texture reads, one smoothstep and a step… and way too many cycles so far (42), although this is only for 64x64 textures rendered to a 512x512 viewport (this is the base image that gets the overlay applied to it). It looks like the following.





#define OUTLINE_MIN_VALUE 0.325
#define OUTLINE_MAX_VALUE 0.5282

uniform sampler2D s_shapeTexture;
varying mediump vec2 myTexCoord;
uniform mediump vec3 m_color;
varying mediump vec2 altTexCoord;

void main()
{
     //Two texture reads from the same texture, one shifted slightly for the light source's direction
     //alphaChannels - vec2( base texture's alpha at the normal co-ords, base texture's alpha at the slightly shifted co-ords )
     lowp vec2 alphaChannels = vec2( texture2D(s_shapeTexture, myTexCoord).a, texture2D(s_shapeTexture, altTexCoord).a );

     //A smoothstep to have a soft edge to the image going from 0.0 - 1.0 in the range of 0.325 to 0.528 in the alpha channel
     //values above OUTLINE_MAX_VALUE come out as 1.0, values below OUTLINE_MIN_VALUE come out as 0.0
     lowp vec2 alphaChannelMods = smoothstep( OUTLINE_MIN_VALUE, OUTLINE_MAX_VALUE, alphaChannels );

     //Add the two alpha values together (.s is the same component as .x)
     lowp float alpha = ( alphaChannelMods.y + alphaChannelMods.s );

     //Now square the alpha and multiply by one tenth, add this to the original image's alpha.
     alpha = alpha * alpha * 0.1 + alphaChannelMods.x;

     //highlight modification
     //if we are applying a highlight to the fragment best to get its newfound color value
     //Dark alpha equals a brighter highlight, light alpha equals darker highlight (shadow)
     lowp float highlightMod = step( alphaChannels.t, alphaChannels.s ) * (1.0 - alphaChannelMods.y);
     lowp float highlightVal = highlightMod * 0.1348 + 1.0; /*4*/

     //now stick it all together (if highlightVal is zero this makes the shadow color black)
     gl_FragColor = vec4( m_color * highlightVal * alphaChannelMods.x, alpha ); /*7*/
}


I think this shader could benefit greatly from dynamic branching, as I’d expect a large area of the texture to be either transparent or fully opaque. The conditionals themselves come at some cost but you will save a lot more by skipping the whole calculation for transparent pixels.
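
A minimal sketch of what that early-out could look like, reusing the names from the shader above. The choice of test (both samples at or below OUTLINE_MIN_VALUE) is my assumption about where to cut, and whether the branch pays for itself will depend on the hardware and on how coherent it is across neighbouring fragments, so treat it as something to measure rather than a guaranteed win.

```glsl
#define OUTLINE_MIN_VALUE 0.325
#define OUTLINE_MAX_VALUE 0.5282

uniform sampler2D s_shapeTexture;
uniform mediump vec3 m_color;
varying mediump vec2 myTexCoord;
varying mediump vec2 altTexCoord;

void main()
{
    // Both reads happen before the branch, so texture sampling stays outside the conditional.
    lowp vec2 alphaChannels = vec2( texture2D(s_shapeTexture, myTexCoord).a,
                                    texture2D(s_shapeTexture, altTexCoord).a );

    // Early-out (assumed cut-off): when both samples are at or below OUTLINE_MIN_VALUE,
    // smoothstep returns 0.0 for both, so the full calculation would produce vec4(0.0) anyway.
    if ( max(alphaChannels.x, alphaChannels.y) <= OUTLINE_MIN_VALUE )
    {
        gl_FragColor = vec4( 0.0 );
    }
    else
    {
        // Same math as the unbranched shader.
        lowp vec2 alphaChannelMods = smoothstep( OUTLINE_MIN_VALUE, OUTLINE_MAX_VALUE, alphaChannels );
        lowp float alpha = alphaChannelMods.x + alphaChannelMods.y;
        alpha = alpha * alpha * 0.1 + alphaChannelMods.x;
        lowp float highlightMod = step( alphaChannels.y, alphaChannels.x ) * (1.0 - alphaChannelMods.y);
        lowp float highlightVal = highlightMod * 0.1348 + 1.0;
        gl_FragColor = vec4( m_color * highlightVal * alphaChannelMods.x, alpha );
    }
}
```

A similar test against OUTLINE_MAX_VALUE could cover the fully opaque case, where both smoothstep results are 1.0 and the colour reduces to m_color.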

Tested a conditional statement first thing today; it actually slows it down by four frames. The shader was only doing two texture reads and an if( baseTexture.a > ALPHA_CUTOFF ), writing gl_FragColor if the alpha value was less than the cut-off… So it looks like I’ll have to refine this one some more.

What performance do you get with that shader? How much of the texture is transparent, and how much is fully opaque?