Using textures with premultiplied alpha efficiently

I’m attempting to switch to using textures with premultiplied alpha in my fragment shader (because the source of the texture images uses premultiplied alpha, and it seems wasteful to un-premultiply them). My intuition tells me that this should be slightly /more/ efficient than using regular non-premultiplied textures, but I’m seeing lower performance, presumably because my calculation is not as efficient as the built-in mix() function.





My testing’s being done on an Apple iPad.





This is the (straightforward) code I’m using for non-premultiplied textures:







    lowp vec4 contentsColor = texture2D(sContentsTexture, vContentsCoordinate);

    lowp vec4 zoomedColor = texture2D(sZoomedContentsTexture, vZoomedContentsCoordinate);

    contentsColor = mix(contentsColor, zoomedColor, zoomedColor);




Using the premultiplied sZoomedContentsTexture, I switch the mix() line to:

    contentsColor = (contentsColor * (1.0 - zoomedColor)) + zoomedColor;



And see my frame rate drop.



Is there anything I can do to make this more efficient (beyond giving up on the premultiplied textures, of course)?








I assume the lines should be

    contentsColor = mix(contentsColor, zoomedColor, zoomedColor.a);

and

    contentsColor = (contentsColor * (1.0 - zoomedColor.a)) + zoomedColor;

How does the following perform?

    float transparency = 1.0 - zoomedColor.a;
    contentsColor = contentsColor * transparency + zoomedColor;

In the good old days of assembler-like shading languages there was a command for linear interpolation (corresponding to mix) and also a command for the combination of a multiplication and an addition (often called MAD). Maybe the compiler is clever enough to use it when the expressions are simple enough.

Also, if you don’t need the alpha channel of the result, it might be worth avoiding the computation for the alpha channel.

Well, I’m just guessing… :)
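
As a sanity check on the arithmetic in the lines above, here is a small CPU-side sketch in Python (GLSL cannot be run outside a GPU context). The helper names `mix`, `premultiply` and `blend_premultiplied` and the sample colours are made up for illustration, and the sketch says nothing about how fast either version runs on the iPad. It only checks that the premultiplied blend gives the same RGB as mix() on the equivalent straight-alpha colour.

```python
import math

def mix(x, y, a):
    """GLSL-style mix() with a scalar factor: x*(1 - a) + y*a per component."""
    return tuple(xc * (1.0 - a) + yc * a for xc, yc in zip(x, y))

def premultiply(color):
    """Scale RGB by alpha; the alpha value itself is left unchanged."""
    r, g, b, a = color
    return (r * a, g * a, b * a, a)

def blend_premultiplied(dst, src_pm):
    """The premultiplied blend from the thread: dst*(1 - src.a) + src."""
    transparency = 1.0 - src_pm[3]
    return tuple(d * transparency + s for d, s in zip(dst, src_pm))

# Arbitrary sample values (assumptions, not taken from the original shader).
contents = (0.2, 0.4, 0.6, 1.0)   # opaque background colour
zoomed = (1.0, 0.5, 0.0, 0.25)    # straight-alpha foreground colour

straight = mix(contents, zoomed, zoomed[3])
premult = blend_premultiplied(contents, premultiply(zoomed))

# The RGB channels agree up to floating-point rounding.
rgb_match = all(math.isclose(s, p) for s, p in zip(straight[:3], premult[:3]))
print("RGB match:", rgb_match)
print("alpha via mix():", straight[3], "alpha via premultiplied blend:", premult[3])
```

The alpha channels of the two results differ: mix() interpolates alpha like any other component, while the premultiplied form adds the source alpha directly. That only matters if the alpha of the result is used afterwards.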

Ah, yes, you’re correct about the '.a’s - not sure how those went missing when I was editing the post…





Unfortunately pulling out the subtraction doesn’t seem to help much. Maybe a /little/, but I suspect it’s just rounding:





Speeds are:


With mix(): 30fps


My premultiply-aware code: 17fps


With Martin’s extracted transparency subtraction: 18fps


Hmmm, is it possible to move the computation of the transparency further up in the fragment shader? In general it is a good idea to move instructions that depend on each other as far apart as possible: since transparency is used in the computation of contentsColor, the unit might have to wait for the result of transparency before it can compute contentsColor. Of course, the computation of transparency itself depends on a texture lookup, and it should also be as far away from that line as possible. :) However, many compilers are pretty good at reordering instructions, so it might not make any difference if you change the order. In any case, it shouldn’t hurt to give the compiler a hint.

I tried moving it as far away as possible (which admittedly is not very far away - this shader doesn’t do much beyond blending textures), but it made no difference. Isn’t that to be expected though? Presumably a similar subtraction has to occur in the mix() function, and it’s plenty fast.





Is mix() a hardware routine? I’m quite confused as to why it’s faster when I’m /trying/ to do a logically simpler operation (hence my thinking there must be a better way to do what I’m trying to do).


It depends on the hardware but it probably is. Have a look at the GL_ARB_fragment_program extension (http://www.opengl.org/registry/specs/ARB/fragment_program.txt); one of the commands is LRP:
    3.11.5.14  LRP: Linear Interpolation

    The LRP instruction performs a component-wise linear interpolation
    between the second and third operands using the first operand as the
    blend factor.

      tmp0 = VectorLoad(op0);
      tmp1 = VectorLoad(op1);
      tmp2 = VectorLoad(op2);
      result.x = tmp0.x * tmp1.x + (1 - tmp0.x) * tmp2.x;
      result.y = tmp0.y * tmp1.y + (1 - tmp0.y) * tmp2.y;
      result.z = tmp0.z * tmp1.z + (1 - tmp0.z) * tmp2.z;
      result.w = tmp0.w * tmp1.w + (1 - tmp0.w) * tmp2.w;

Of course, it's up to the hardware producer how they implement the LRP instruction, but it's likely that it is rather efficient. You could also try PowerVR's shader compiler tools (I forgot the name), which should give you at least a rough idea about the number of actual instructions.
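
To make the operand order in the quoted LRP pseudocode concrete, here is a rough Python emulation. It is a sketch for checking the arithmetic only: the function names and sample colours are assumptions, and it does not show how any particular GPU implements the instruction. It checks that GLSL's mix(x, y, a) corresponds to LRP with the blend factor first and the other two operands swapped.

```python
import math

def lrp(op0, op1, op2):
    """Component-wise LRP as in the quoted pseudocode: op0*op1 + (1 - op0)*op2."""
    return tuple(f * a + (1.0 - f) * b for f, a, b in zip(op0, op1, op2))

def mix(x, y, a):
    """GLSL-style mix(x, y, a) with a scalar blend factor: x*(1 - a) + y*a."""
    return tuple(xc * (1.0 - a) + yc * a for xc, yc in zip(x, y))

# Arbitrary sample values (assumptions, not taken from the original shader).
contents = (0.2, 0.4, 0.6, 1.0)
zoomed = (1.0, 0.5, 0.0, 0.25)
alpha = zoomed[3]

via_mix = mix(contents, zoomed, alpha)
# LRP takes the blend factor as its first operand, replicated per component here.
via_lrp = lrp((alpha,) * 4, zoomed, contents)

same = all(math.isclose(m, l) for m, l in zip(via_mix, via_lrp))
print("mix(x, y, a) matches LRP(a, y, x):", same)
```

If the hardware does provide such an instruction, the whole mix() call could map to a single operation, whereas the premultiplied form needs a subtraction followed by a multiply-add. That would be consistent with the timings reported earlier in the thread, although it remains a guess without looking at the compiled shader.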

Martin Kraus, 2010-11-04 22:55:55