Implementation recommendations when using multiple FBOs

Hi!


I currently have a single input texture that needs to be filtered by three different shaders and emitted into three different output textures. It is a textbook case for MRTs; the problem is that OpenGL ES 2 does not expose MRT. I’m developing for the iPad 2+ and iPhone 4+.

In my current implementation I have 3 different FBOs, each assigned to a unique output texture. I currently perform the rendering like this:
  1. Assign the input texture to texture unit 0
  2. Make FBO 1 the render target
  3. Activate shader 1
  4. Draw full screen quad
  5. Make FBO 2 the render target
  6. Activate shader 2
  7. Draw full screen quad
  8. Make FBO 3 the render target
  9. Activate shader 3
  10. Draw full screen quad
I have a hard time measuring where the performance goes, but my gut feeling says all of this becomes highly sequential, mainly because I’m reusing texture unit 0 all the time.

My question is: how can I make this as fast as possible? I have several ideas in my head, but I thought you guys might point me to the correct one directly. Here are my ideas:

First idea:
  1. Assign the input texture to texture units 1, 2 and 3
  2. Make FBO 1 the current render target
  3. Activate shader 1 (using texture unit 1)
  4. Draw full screen quad
  5. Make FBO 2 the current render target
  6. Activate shader 2 (using texture unit 2)
  7. Draw full screen quad
  8. Make FBO 3 the current render target
  9. Activate shader 3 (using texture unit 3)
  10. Draw full screen quad
Second idea:
  1. Assign the input to texture unit 1
  2. Use a 2x bigger output texture attached to a single FBO
  3. Use conditional IFs in the shader to select what’s rendered into each quadrant of the output texture
I also have a loose idea of using a cube texture as the output, but I’m not sure about that.
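As a rough illustration of the second idea, the sketch below computes the rectangle each shader's output would occupy in the larger texture. It assumes "2x bigger" means twice the width and twice the height, which gives four quadrants of which three are used. The `Rect` type and `quadrant_rect` function are hypothetical names, not part of any API. The resulting rectangles could be compared against `gl_FragCoord` in the shader, or passed to `glViewport()` with one draw per shader.

```c
#include <assert.h>

/* Hypothetical helper type: a pixel rectangle inside the atlas texture. */
typedef struct {
    int x, y, width, height;
} Rect;

/* Returns the rectangle of quadrant `index` (0..3) in an atlas made of
   2 x 2 tiles of size tile_w x tile_h. Quadrants are numbered row by row,
   starting at the lower-left corner (the OpenGL window origin). */
Rect quadrant_rect(int tile_w, int tile_h, int index) {
    assert(tile_w > 0 && tile_h > 0);
    assert(index >= 0 && index < 4);
    Rect r;
    r.x = (index % 2) * tile_w;   /* column 0 or 1 */
    r.y = (index / 2) * tile_h;   /* row 0 or 1 */
    r.width = tile_w;
    r.height = tile_h;
    return r;
}
```

Whichever way the rectangles are used, this only changes where the output lands; it does not by itself make the three filters run in parallel.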

So, how do I “emulate” multiple render targets in the most efficient way? I would prefer to have parallelism here!

Thanks in advance!

Kind regards, Andreas Larsson

Hi,

The Series5 architecture processes a single render pass at a time. It cannot process multiple render passes in parallel.

Because of this, the approach you are already using should be suitable. Binding the texture to unit 0 for all render passes will not result in worse performance than binding the texture to a different unit for each render pass.

You are most likely bandwidth limited. If you have not done so already, you can reduce your bandwidth overhead by compressing your input texture (e.g. PVRTC 2bpp or 4bpp).
Additionally, if the input texture is static (i.e. you’re not updating the contents of the texture, its filtering modes, etc. at run-time) and the effect you are applying in a given render pass is static (i.e. the effect produces identical results each frame), then you could generate the results of the render passes off-line.
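To put the compression suggestion in numbers, here is a minimal sketch that computes the storage size of a single mip level at a given bit depth. The function name `texture_bytes` and the 1024x1024 size mentioned afterwards are assumptions for illustration only; the thread does not state the actual texture dimensions.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes occupied by one mip level of a width x height texture stored at
   bits_per_pixel (32 for RGBA8888, 4 or 2 for PVRTC 4bpp / 2bpp). */
size_t texture_bytes(size_t width, size_t height, size_t bits_per_pixel) {
    assert(bits_per_pixel > 0);
    return width * height * bits_per_pixel / 8;
}
```

For an assumed 1024x1024 texture this gives 4 MiB as RGBA8888, 512 KiB as PVRTC 4bpp and 256 KiB as PVRTC 2bpp, so every full read of the texture moves 8 to 16 times less data.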

Regards,
Joe

Hi, and thanks for the swift reply!


My input texture is fully dynamic, re-rendered each frame. Three questions:

  • Basically to improve performance, I need to reduce the resolution, simple as that?
  • How can I measure how much time each buffer takes? When I measure the time between one FBO switch and the next I always get 0 ms, whereas the whole frame takes several ms. Is everything batched towards the end of the frame by the driver? Can I “flush” somehow?
  • How would I use discard framebuffer in this case? For example, the rendered input texture is no longer of use once I’ve rendered the three outputs. If I do not discard it, will the GPU do some copy? Can I get some performance back by discarding here?
Kind regards, Andreas

Not sure if my follow up questions got hidden from you since I accidentally checked “answered” after posting the follow ups…


JackAsser said:

  • Basically to improve performance, I need to reduce the resolution, simple as that?



  • Reducing the resolution should help. As your texture is fully dynamic, the overhead of uploading new texture data every frame may also be impacting performance. Figuring out the best way to reduce this overhead is a big discussion by itself. It would be best to start a new topic for this.


    JackAsser said:

  • How can I measure how much time each buffer takes? When I measure the time between one FBO switch and the next I always get 0 ms, whereas the whole frame takes several ms. Is everything batched towards the end of the frame by the driver? Can I "flush" somehow?



  • Taking timestamps in an application to measure GPU performance is tricky for a number of reasons. One is that the driver doesn't have to submit work to the GPU immediately when you pass the data to GL. Another is that PowerVR GPUs keep multiple frames in flight at any given time (vertex and fragment processing work from different frames is processed in parallel).

    The best way to force a render is to call glReadPixels() (a 1x1 region will do). As this is a blocking operation, it's best to collect a timestamp, render the frame you want to profile a large number of times (e.g. 2000 or more frames), call glReadPixels() in the last frame before the swap, and collect a second timestamp after it returns. Dividing the elapsed time by the number of rendered frames will give an accurate average frame time (as the frame you're rendering is static, i.e. the number of triangles and blended fragments doesn't change).
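The measurement procedure described above could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested profiler: `frame_fn`, `average_frame_ms`, `elapsed_ms` and `count_call` are hypothetical names. In a real application `render_frame` would issue the GL calls for one frame and `force_finish` would wrap the 1x1 glReadPixels(). Buffer swaps are left out of the sketch.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Hypothetical callback type; `user` is passed through unchanged. */
typedef void (*frame_fn)(void *user);

/* Milliseconds elapsed from timestamp a to timestamp b. */
double elapsed_ms(struct timespec a, struct timespec b) {
    return (double)(b.tv_sec - a.tv_sec) * 1e3 +
           (double)(b.tv_nsec - a.tv_nsec) / 1e6;
}

/* Renders `frames` frames, forces completion once at the end, and
   returns the average time per frame in milliseconds. */
double average_frame_ms(frame_fn render_frame, frame_fn force_finish,
                        void *user, unsigned frames) {
    assert(frames > 0);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);   /* first timestamp */
    for (unsigned i = 0; i < frames; ++i) {
        render_frame(user);
    }
    force_finish(user);                    /* stands in for glReadPixels() */
    clock_gettime(CLOCK_MONOTONIC, &t1);   /* second timestamp */
    return elapsed_ms(t0, t1) / (double)frames;
}

/* Placeholder callback that only counts invocations, so the harness can
   be exercised without a GL context. */
void count_call(void *user) {
    ++*(unsigned *)user;
}
```

Timing each render pass separately would use the same harness with a `render_frame` that draws only that pass, then comparing the averages.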


    JackAsser said:

  • How would I use discard framebuffer in this case? For example, the rendered input texture is no longer of use once I've rendered the three outputs. If I do not discard it, will the GPU do some copy? Can I get some performance back by discarding here?



  • The discard framebuffer extension tells the driver that the contents of the specified buffers do not need to be preserved. This prevents the GPU from wasting memory bandwidth by writing out buffer data that is no longer needed.