Reduction: many texel lookups vs. multiple render passes


I'm currently implementing a kind of reduction.
A reduction is usually done by rendering a rectangle multiple times, halving its size in each pass.
In my case, I have to sum the values in each 11x11 patch of a texture.

My basic implementation computes a horizontal sum over 11 texels and then does the same vertically.
However, reading 11 texels in a fragment shader is generally known to be slow.
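
To make the two-pass idea concrete, here is a minimal CPU-side sketch in Python (not shader code). It assumes a sliding 11x11 window with clamp-to-edge addressing; the helper names (`texel`, `box_sum_two_pass`, `box_sum_direct`) are made up for illustration. It only checks that the horizontal-then-vertical sum matches a direct 11x11 sum, at 11 + 11 = 22 fetches per output instead of 121.

```python
R = 5  # half-width of the window: 2 * R + 1 = 11 texels


def texel(img, x, y):
    """Fetch img[y][x] with clamp-to-edge addressing (an assumption)."""
    h, w = len(img), len(img[0])
    return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]


def horizontal_pass(img):
    """Pass 1: each output texel sums 11 horizontal neighbours (11 fetches)."""
    h, w = len(img), len(img[0])
    return [[sum(texel(img, x + d, y) for d in range(-R, R + 1))
             for x in range(w)] for y in range(h)]


def vertical_pass(img):
    """Pass 2: each output texel sums 11 vertical neighbours (11 fetches)."""
    h, w = len(img), len(img[0])
    return [[sum(texel(img, x, y + d) for d in range(-R, R + 1))
             for x in range(w)] for y in range(h)]


def box_sum_two_pass(img):
    """Separable 11x11 sum: horizontal pass, then vertical pass."""
    return vertical_pass(horizontal_pass(img))


def box_sum_direct(img):
    """Reference: a single pass reading all 11 * 11 = 121 texels per output."""
    h, w = len(img), len(img[0])
    return [[sum(texel(img, x + dx, y + dy)
                 for dy in range(-R, R + 1) for dx in range(-R, R + 1))
             for x in range(w)] for y in range(h)]


# Small synthetic "texture" with integer values.
image = [[(3 * x + 5 * y) % 17 for x in range(16)] for y in range(16)]
two_pass = box_sum_two_pass(image)
direct = box_sum_direct(image)
```

The price of the two-pass version is an intermediate render target that has to be written and read back once.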

In this case, could I improve the speed by doing the sum the typical way?
(That is, by placing each 11x11 patch inside a 16x16 patch and rendering multiple times.)

Actually, the two methods don't seem much different in computational complexity to me.
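
For comparison, here is a rough Python sketch of the "typical" reduction for a single patch. It assumes the 11x11 patch is zero-padded into a 16x16 square and that each pass sums 2x2 blocks; the function names are made up for illustration. It counts passes and texel fetches so the two approaches can be compared on paper.

```python
def pad_to(patch, size):
    """Embed the patch in the top-left corner of a zero-filled size x size grid."""
    out = [[0] * size for _ in range(size)]
    for y, row in enumerate(patch):
        for x, value in enumerate(row):
            out[y][x] = value
    return out


def reduce_2x2(level):
    """One render pass: each output texel is the sum of a 2x2 input block."""
    n = len(level) // 2
    return [[level[2 * y][2 * x] + level[2 * y][2 * x + 1]
             + level[2 * y + 1][2 * x] + level[2 * y + 1][2 * x + 1]
             for x in range(n)] for y in range(n)]


def pyramid_sum(patch, size=16):
    """Reduce a zero-padded patch to one texel; return (sum, passes, fetches)."""
    level = pad_to(patch, size)
    passes = 0
    fetches = 0
    while len(level) > 1:
        level = reduce_2x2(level)
        fetches += 4 * len(level) * len(level)  # 4 fetches per output texel
        passes += 1
    return level[0][0], passes, fetches


patch = [[(3 * x + 7 * y) % 10 for x in range(11)] for y in range(11)]
total, passes, fetches = pyramid_sum(patch)
direct_total = sum(sum(row) for row in patch)
```

Under these assumptions one patch takes 4 passes (16 to 8 to 4 to 2 to 1) and 256 + 64 + 16 + 4 = 340 fetches in total, against 121 fetches for summing the patch directly in a single pass. The fetch count alone says nothing about which is faster on real hardware, since the pyramid spreads its fetches over many fragments.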

Hi WonwooLee,

Did you have a look at the Bloom training course from our SDK?
In there we reduce the number of texture fetches by using the filtering capabilities of the texture units.
If your filter coefficients allow it, you can save quite a few texture lookups that way.
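
As an illustration of the filtering trick (a sketch only, not the code from the Bloom training course), the Python below models a linearly filtered 1D fetch on the CPU. It assumes texel i is centred at coordinate i + 0.5 and uses clamp-to-edge addressing; `sample_linear`, `sum11_point` and `sum11_linear` are hypothetical names. Sampling exactly between two texels returns their average, so an unweighted sum of 11 texels needs 5 filtered fetches plus 1 plain fetch instead of 11.

```python
import math


def sample_linear(row, x):
    """Linearly filtered 1D fetch; texel i is centred at x = i + 0.5 (assumed)."""
    u = x - 0.5
    i0 = math.floor(u)
    f = u - i0  # blend weight of the right-hand texel
    a = row[min(max(i0, 0), len(row) - 1)]      # clamp to edge
    b = row[min(max(i0 + 1, 0), len(row) - 1)]
    return (1.0 - f) * a + f * b


def sum11_point(row, c):
    """Reference: 11 unfiltered fetches around centre texel c."""
    return sum(row[c + d] for d in range(-5, 6))


def sum11_linear(row, c):
    """Same sum with 6 fetches: 5 filtered pairs plus 1 single texel."""
    total = 0.0
    for k in range(5):
        left = c - 5 + 2 * k
        # Sampling on the boundary between texels left and left + 1 returns
        # their average, so twice that value is their sum.
        total += 2.0 * sample_linear(row, left + 1.0)
    # The last texel (c + 5) has no partner; fetch it at its centre.
    total += sample_linear(row, c + 5 + 0.5)
    return total


row = [(7 * i) % 13 for i in range(32)]
point_sums = [sum11_point(row, c) for c in range(5, 27)]
linear_sums = [sum11_linear(row, c) for c in range(5, 27)]
```

With 2D bilinear filtering, one fetch can cover a 2x2 block, so the same idea applies in both directions. This only works if the texture format supports linear filtering with enough precision for the values being summed.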

Do you require an 11x11 kernel to downsample your image?
You will most likely be bandwidth bound, depending on your input image size.

Best regards,

marco
2012-09-04 17:00:30