How does framebuffer fetch influence performance?

Greetings,

Is there documentation available that describes whether, and to what extent, using GL_EXT_shader_framebuffer_fetch in GLSL fragment shaders influences rendering performance?

Regards

Hi,

Framebuffer fetch is quite well supported on our modern GPUs.
In terms of performance, framebuffer fetch operations use the GPU's on-chip memory, so they do not incur any system memory bandwidth cost. This on-chip memory is significantly faster than system memory; its bandwidth is finite, but it is much harder to exhaust.
Also, the FBO pixels need to fit into this on-chip memory. If you use too much of it, we need to reduce the number of tiles processed at a given time, which makes the GPU worse at hiding memory latency.
You can check the number of tiles processed in parallel using PVRTune's "Tiles in Flight" counter. This of course needs the latest PVRTuneComplete and the latest drivers.

bests,
Marton

Hello,

Thanks for your fast answer. Does that mean framebuffer fetches are faster than, say, texture fetches (because those come from system memory, right)?

In my scenario, there will be one framebuffer fetch per fragment shader invocation.
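
For reference, a shader with a single fetch per invocation might look roughly like the sketch below. It assumes an OpenGL ES 2.0 context, where the extension exposes the current framebuffer colour as `gl_LastFragData[0]`. The uniform `u_tint`, the constant name and the `hasExtension` helper are made up for illustration; the extension list itself would come from `glGetString(GL_EXTENSIONS)`.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// GLSL ES 1.00 fragment shader that reads the current framebuffer colour
// exactly once per invocation through gl_LastFragData.
// u_tint is a hypothetical uniform, used only for illustration.
const char* const kFetchFragmentShader = R"glsl(
#extension GL_EXT_shader_framebuffer_fetch : require
precision mediump float;
uniform vec4 u_tint;
void main() {
    vec4 previous = gl_LastFragData[0];            // the single framebuffer fetch
    gl_FragColor = mix(previous, u_tint, u_tint.a);
}
)glsl";

// Returns true if `name` appears as a whole token in a space-separated
// extension list (the format returned by glGetString(GL_EXTENSIONS)).
bool hasExtension(const std::string& extensionList, const std::string& name) {
    assert(!name.empty());
    std::istringstream tokens(extensionList);
    std::string token;
    while (tokens >> token) {
        if (token == name) return true;
    }
    return false;
}
```

Matching whole tokens rather than substrings avoids mistaking a longer extension name with the same prefix for the one being queried.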

Hi,

They are only comparable if you are fetching the same data in the same pattern, e.g. reading a viewport-sized texture 1-to-1 in a fragment shader versus using framebuffer fetch to read the previous pass's results.
It could be faster, but the main benefit should be lower bandwidth usage.

What would be your usage scenario?

thanks,
Marton

The usage scenario is as follows:

I render a number of objects with depth testing. Some subtypes of objects need globally ordered drawing among themselves, meaning that object type A is always in front of object type B, and so on. This is currently done using the stencil test. The problem is that I have to issue a separate draw call every time I change the stencil reference value, which results in 10 times as many draw calls as without the test. I am now examining multipass schemes that might be able to evaluate the order in place, and framebuffer fetch may be useful for that.

Hi,

Have you thought of sorting your draw calls and submitting them in the correct order? E.g. something like this:
http://realtimecollisiondetection.net/blog/?p=86

bests,
Marton

The problem is that I cannot do that, as the objects are stored in separate VBOs and generated at different times. Performing a global sort every time the view changes would be expensive.

I think they collect the draw calls each frame, sort them using this method, and then draw them in the correct order. But it's understandable if you think that would be too expensive.
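
As a rough idea of what key-based sorting could look like, here is a minimal sketch. The key layout (object type in the top 8 bits, a quantised depth in the low 24 bits) and all names are assumptions for illustration, not the layout used in the linked article.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One recorded draw call. All field names are hypothetical.
struct DrawCall {
    std::uint64_t key;   // sort key: more significant bits are compared first
    int vbo;
    int firstVertex;
    int vertexCount;
};

// Packs the object type above a 24-bit quantised depth, so calls are ordered
// by type first and by depth within a type.
std::uint64_t makeKey(std::uint32_t objectType, std::uint32_t depth24) {
    assert(objectType < 256u && depth24 < (1u << 24));
    return (static_cast<std::uint64_t>(objectType) << 24) | depth24;
}

// Sorts the recorded calls by key; calls with equal keys keep their
// submission order.
void sortDrawCalls(std::vector<DrawCall>& calls) {
    std::stable_sort(calls.begin(), calls.end(),
                     [](const DrawCall& a, const DrawCall& b) { return a.key < b.key; });
}
```

After sorting, the calls would be submitted front to back through the vector, so the whole ordering decision lives in how the key is packed.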

Note that they also had a situation where speed took priority over flexibility:
“In the past, however, we had slightly different criteria and used a different solution. On the PS2 we were not willing to sacrifice speed for flexibility (and the needs for flexibility were lesser too) so we simply had predefined buckets for all the different combinations of layers, viewports, translucency, etc. and just inserted draw calls directly into the appropriate bin. On drawing, we would just process each bucket in whichever order we wanted to draw them, no sorting necessary! Obviously, this system is much faster with no sorting required, but it is also much less flexible, and doesn’t deal well with e.g. including depth sorting as part of the bucketing. (A very similar bucketing system was used on the PS1 as well.)”
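
A minimal sketch of the bucket idea described in that quote, assuming a fixed number of buckets (one per object type here) and hypothetical names throughout:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kBucketCount = 4;   // hypothetical number of object types

// One recorded draw call. Field names are hypothetical.
struct BucketedDraw {
    int vbo;
    int firstVertex;
    int vertexCount;
};

class DrawBuckets {
public:
    // Inserts a draw directly into its predefined bucket; nothing is sorted.
    void add(std::size_t bucket, const BucketedDraw& draw) {
        assert(bucket < kBucketCount);
        buckets_[bucket].push_back(draw);
    }

    // Visits every draw, bucket 0 first, in insertion order within a bucket.
    template <typename Fn>
    void forEachInOrder(Fn&& submit) const {
        for (const auto& bucket : buckets_) {
            for (const auto& draw : bucket) {
                submit(draw);
            }
        }
    }

private:
    std::array<std::vector<BucketedDraw>, kBucketCount> buckets_;
};
```

The trade-off is the one the quote names: insertion is cheap and no sort is needed, but the ordering is fixed by the bucket layout.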

The problem is that even if I did sort the draw calls, I would still have to do the VBO switch, which requires another draw call. Currently the outer loop switches VBOs and the inner loop switches the stencil reference value. Sorting the draw calls by stencil reference value would just move the VBO switch to the inner loop, and the result would be the same.
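
To illustrate the point with a small counting sketch: each (VBO, stencil reference) pair that has geometry needs its own draw call, so swapping the loop nesting only changes which state is switched more often. The types and names below are hypothetical, and ids are assumed to be non-negative.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// A (vbo, stencilRef) pair for which some geometry has to be drawn.
using Batch = std::pair<int, int>;

struct Cost {
    std::size_t drawCalls;
    std::size_t vboBinds;
    std::size_t stencilRefChanges;
};

// Walks the batches in the given order and counts one draw call per batch,
// plus a state change whenever the VBO or the stencil reference value
// differs from the previous batch.
Cost countCost(const std::vector<Batch>& ordered) {
    Cost cost{0, 0, 0};
    int boundVbo = -1;    // -1 means "nothing bound yet"
    int stencilRef = -1;  // -1 means "no reference value set yet"
    for (const Batch& batch : ordered) {
        assert(batch.first >= 0 && batch.second >= 0);
        if (batch.first != boundVbo) {
            boundVbo = batch.first;
            ++cost.vboBinds;
        }
        if (batch.second != stencilRef) {
            stencilRef = batch.second;
            ++cost.stencilRefChanges;
        }
        ++cost.drawCalls;
    }
    return cost;
}
```

With, say, 3 VBOs and 10 stencil reference values, both nestings give 30 draw calls; only the split between VBO binds and stencil reference changes differs.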