I’m trying to optimize a bloom pass on iPad 2, and I’m running into performance problems with my shader. In the final stage, I switch from rendering to a texture target to rendering to the real backbuffer, then draw a fullscreen quad using the texture of the original scene and a lower-resolution blurred bloom buffer.
The final pass is taking about 7ms, based on a GPU utilization of 21% according to Apple’s Instruments. Is that as fast as I can expect it to go without simplifying the model, or does that performance indicate I’m doing something wrong? Switching the shader out for a very simple add of the two textures results in a draw time of only 4ms, so I guess it could just be that my bloom combine is too complex. Any thoughts on speeding this up?
Thanks in advance
precision lowp float;
precision lowp sampler2D;
varying mediump vec2 v_TexCoord;
varying lowp vec4 v_Color;
vec3 AdjustSaturation(vec3 color, float saturation)
{
    // The constants 0.3, 0.59, and 0.11 are chosen because the
    // human eye is more sensitive to green light, and less to blue.
    float grey = dot(color, vec3(0.3, 0.59, 0.11));
    vec3 result = mix(vec3(grey, grey, grey), color, saturation);
    return result;
}
uniform sampler2D TextureSampler;
uniform sampler2D BaseSampler;
uniform float BloomIntensity;
uniform float BaseIntensity;
uniform float BloomSaturation;
uniform float BaseSaturation;
void main()
{
    vec4 vout;
    // Look up the bloom and original base image colors.
    vec3 bloom = texture2D(TextureSampler, v_TexCoord).rgb;
    vec3 base = texture2D(BaseSampler, v_TexCoord).rgb;
    // Adjust color saturation and intensity.
    bloom = AdjustSaturation(bloom, BloomSaturation) * BloomIntensity;
    base = AdjustSaturation(base, BaseSaturation);
    // Darken down the base image in areas where there is a lot of bloom,
    // to prevent things looking excessively burned-out.
    base *= (1.0 - clamp(bloom, 0.0, 1.0)) * BaseIntensity;
    // Combine the two images.
    vout.rgb = base + bloom;
    vout.a = 1.0;
    gl_FragColor = vout;
}
Does the saturation adjustment have a great impact on the final image?
Could you adjust the saturation for just the final fragment colour instead of the individual input colours from the textures?
Looking at the cycle counts, the reported 7ms do make sense.
You could speed it up by removing the adjustment completely. According to PVRUniscoEditor, this more than halves the number of instructions that are generated.
And as this is a fullscreen shader, every instruction you can avoid has an impact on your final performance.
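As a rough sketch of the "adjust only the final colour" idea, the combine could look something like the following. The `FinalSaturation` uniform is hypothetical (it stands in for the separate `BloomSaturation` and `BaseSaturation` values), and the result is not identical to the original shader, because the darkening term is now computed from the unsaturated bloom. Whether that difference is acceptable would have to be judged visually.

```glsl
precision lowp float;

varying mediump vec2 v_TexCoord;

uniform sampler2D TextureSampler;  // blurred bloom buffer
uniform sampler2D BaseSampler;     // original scene
uniform float BloomIntensity;
uniform float BaseIntensity;
uniform float FinalSaturation;     // hypothetical: replaces the two per-input saturation uniforms

void main()
{
    // Look up the bloom and original base image colors.
    vec3 bloom = texture2D(TextureSampler, v_TexCoord).rgb * BloomIntensity;
    vec3 base = texture2D(BaseSampler, v_TexCoord).rgb;

    // Darken the base image where there is a lot of bloom.
    base *= (1.0 - clamp(bloom, 0.0, 1.0)) * BaseIntensity;

    // Combine first, then adjust saturation once on the result
    // (one dot and one mix instead of two of each).
    vec3 color = base + bloom;
    float grey = dot(color, vec3(0.3, 0.59, 0.11));
    gl_FragColor = vec4(mix(vec3(grey), color, FinalSaturation), 1.0);
}
```

The instruction count for a variant like this can be compared against the original in PVRUniscoEditor before deciding whether the visual change is worth it.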
Let me know if you need more help!
Thanks for the answer, Marco. It does have a big impact, yeah. I guess I’ll just make 9 versions of the shader to handle the cases where the saturation for each of the 2 parameters is 0 / 1 / SomethingElse, to shave 1 or 2 more ms off (that happens a lot), and call it done.
From playing with the compiler, it looks like doing static branching on uniforms is cheaper, but not totally free, so writing 9 shaders by hand, or doing my own shader preprocessor would be faster at run time. Is that understanding correct?
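One way the 9 variants might be generated from a single source, rather than written by hand, is with the GLSL preprocessor. In the sketch below, `BLOOM_SAT_MODE` and `BASE_SAT_MODE` are hypothetical macros (0 = saturation is 0, 1 = saturation is 1, 2 = anything else); the application would prepend the two `#define` lines to the source before compiling each variant, for example as an extra string passed to `glShaderSource`. This is an illustration of the approach under those assumptions, not a measured result.

```glsl
// Defaults so the file still compiles if the application prepends nothing.
#ifndef BLOOM_SAT_MODE
#define BLOOM_SAT_MODE 2
#endif
#ifndef BASE_SAT_MODE
#define BASE_SAT_MODE 2
#endif

precision lowp float;

varying mediump vec2 v_TexCoord;

uniform sampler2D TextureSampler;
uniform sampler2D BaseSampler;
uniform float BloomIntensity;
uniform float BaseIntensity;
#if BLOOM_SAT_MODE == 2
uniform float BloomSaturation;
#endif
#if BASE_SAT_MODE == 2
uniform float BaseSaturation;
#endif

const vec3 kLuma = vec3(0.3, 0.59, 0.11);

void main()
{
    vec3 bloom = texture2D(TextureSampler, v_TexCoord).rgb;
    vec3 base = texture2D(BaseSampler, v_TexCoord).rgb;

    // Mode 1 (saturation == 1) leaves the colour untouched, so no code is emitted for it.
#if BLOOM_SAT_MODE == 0
    bloom = vec3(dot(bloom, kLuma));
#elif BLOOM_SAT_MODE == 2
    bloom = mix(vec3(dot(bloom, kLuma)), bloom, BloomSaturation);
#endif
    bloom *= BloomIntensity;

#if BASE_SAT_MODE == 0
    base = vec3(dot(base, kLuma));
#elif BASE_SAT_MODE == 2
    base = mix(vec3(dot(base, kLuma)), base, BaseSaturation);
#endif

    // Darken the base where there is a lot of bloom, then combine.
    base *= (1.0 - clamp(bloom, 0.0, 1.0)) * BaseIntensity;
    gl_FragColor = vec4(base + bloom, 1.0);
}
```

Because the selection happens at compile time, each variant contains no branch at all, at the cost of compiling and switching between up to 9 programs.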