Performance drop while using MSAA with FBO

We are noticing around a 10% performance drop when MSAA is turned on (compared to MSAA turned off) while rendering to an FBO.

Is it normal to see this amount of performance drop?

Example code for creating and rendering the FBO:
[pre]
void Createframebuffer()
{
    int width  = 1920;
    int height = 720;
    int m_iMaxSamples = 0;

    /// Query the maximum sample count and use half of it
    glGetIntegerv( GL_MAX_SAMPLES_IMG, &m_iMaxSamples );
    m_iMaxSamples = m_iMaxSamples >> 1;

    glGenFramebuffers( 1, &m_iFrameBuffer );
    glGenTextures( 1, &m_iFrameTexture );
    glGenRenderbuffers( 1, &m_iDepthBuffer );

    /// Colour texture that the FBO renders into
    glBindTexture( GL_TEXTURE_2D, m_iFrameTexture );
    glTexImage2D( GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL );

    /// Texture parameter setting
    glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE );
    glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE );
    glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR );
    glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR );

    /// Render buffer bind ( depth )
    glBindRenderbuffer( GL_RENDERBUFFER, m_iDepthBuffer );
    if( true == m_b_IMG_multisampled_FBO )
    {
        glRenderbufferStorageMultisampleIMG( GL_RENDERBUFFER, m_iMaxSamples, GL_DEPTH_COMPONENT16, width, height );
    }
    else
    {
        glRenderbufferStorage( GL_RENDERBUFFER, GL_DEPTH_COMPONENT16, width, height );
    }

    /// Frame buffer bind
    glBindFramebuffer( GL_FRAMEBUFFER, m_iFrameBuffer );
    if( true == m_b_IMG_multisampled_FBO )
    {
        glFramebufferTexture2DMultisampleIMG( GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, m_iFrameTexture, 0, m_iMaxSamples );
    }
    else
    {
        glFramebufferTexture2D( GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, m_iFrameTexture, 0 );
    }

    /// Attaching the renderbuffer as the depth attachment
    glFramebufferRenderbuffer( GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT, GL_RENDERBUFFER, m_iDepthBuffer );
    glBindFramebuffer( GL_FRAMEBUFFER, m_iDefaultFrameBuffer );
}

void Drawdone()
{
    /// Switch to the default framebuffer and draw the FBO colour texture as a full-screen quad
    glBindFramebuffer( GL_FRAMEBUFFER, 0 );

    /// Interleaved position (x, y) and texture coordinate (u, v) per vertex
    float VertexList[] =
    {
        -1.0f, -1.0f,  0.0f, 0.0f,
         1.0f, -1.0f,  1.0f, 0.0f,
        -1.0f,  1.0f,  0.0f, 1.0f,
         1.0f,  1.0f,  1.0f, 1.0f
    };

    glBindTexture( GL_TEXTURE_2D, m_iFrameTexture );

    glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE );
    glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE );
    glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR );
    glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR );

    glEnableVertexAttribArray( 0 );
    glEnableVertexAttribArray( 1 );

    glVertexAttribPointer( 0, 2, GL_FLOAT, GL_FALSE, 4 * sizeof(GLfloat), VertexList );
    glVertexAttribPointer( 1, 2, GL_FLOAT, GL_FALSE, 4 * sizeof(GLfloat), (VertexList + 2) );

    glDrawArrays( GL_TRIANGLE_STRIP, 0, 4 );

    glDisableVertexAttribArray( 0 );
    glDisableVertexAttribArray( 1 );

    eglSwapBuffers( m_eglDisplay, m_eglSurface );

    /// Re-bind the FBO for the next frame's scene rendering
    glBindFramebuffer( GL_FRAMEBUFFER, m_iFrameBuffer );
}
[/pre]

Also, does increasing/decreasing the number of samples affect performance?

Thanks in advance

With MSAA, pixels that contain an edge will have to execute the fragment shader for each sub-sample. The more triangles in a scene, the more likely that sub-sample colour values will need to be calculated.

It may be worth considering a screen-space anti-aliasing technique if your scene has a very high number of triangles. MSAA will give better visual quality in most cases, and it is possible for screen-space AA to be more expensive, particularly if you’re bandwidth limited.

But drawing to the glSurface with MSAA doesn’t cause this amount of performance drop.
The drop is only noticed while drawing to the FBO (the extra time to draw from the FBO to the glSurface has already been deducted).
So, is drawing to an FBO more expensive than drawing to the glSurface with MSAA on?

Zulkarnine, this may or may not be it, but something you might check: when rendering to the FBO, does your code discard the depth buffer (gl{Discard,Invalidate}Framebuffer)? If not, you might try that. Without this, I believe the GPU will resolve and write out your 1920x720 depth buffer to GPU/system memory, which will take time. Similarly, I would verify that you glClear both color and depth before the render-to-FBO pass, though you probably already do.
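For reference, a minimal sketch of what that discard could look like, placed after the last draw into the FBO while it is still bound (this assumes the EXT_discard_framebuffer extension is available; on GLES 3.0, glInvalidateFramebuffer takes the same kind of attachment list):

[pre]
/// Sketch only: tell the driver the depth contents are not needed after the
/// last draw into the FBO, so they never have to be written back to memory.
const GLenum discardAttachments[] = { GL_DEPTH_ATTACHMENT };

glBindFramebuffer( GL_FRAMEBUFFER, m_iFrameBuffer );
// ... scene draw calls into the FBO ...
glDiscardFramebufferEXT( GL_FRAMEBUFFER, 1, discardAttachments );

/// The colour attachment is kept, since it is sampled when drawing the
/// full-screen quad to the default framebuffer afterwards.
[/pre]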

Also, what’s GL_MAX_SAMPLES_IMG on your platform? The greater the number of samples per pixel, the smaller the screen tiles and the greater the number of screen tiles/tile lists which must be built and processed. You might compare cost between 2X MSAA and 4X or 16X (if supported) and see if that changes the 10% you’re seeing.
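If it helps with that comparison, here is a small sketch of making the sample count a knob (m_iRequestedSamples is a hypothetical variable, not something from your code; width/height are the same FBO dimensions as in your Createframebuffer, and the clamp mirrors the GL_MAX_SAMPLES_IMG query you already do):

[pre]
/// Sketch only: clamp a requested sample count (e.g. 2, 4, 8) against the
/// implementation maximum so 2X vs 4X timings can be compared easily.
int maxSamples = 0;
glGetIntegerv( GL_MAX_SAMPLES_IMG, &maxSamples );

int samples = ( m_iRequestedSamples < maxSamples ) ? m_iRequestedSamples : maxSamples;

glRenderbufferStorageMultisampleIMG( GL_RENDERBUFFER, samples, GL_DEPTH_COMPONENT16, width, height );
glFramebufferTexture2DMultisampleIMG( GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, m_iFrameTexture, 0, samples );
[/pre]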

PaulL, with MSAA you said that “pixels that contain an edge will have to execute the fragment shader for each sub-sample.” Does that really happen on PowerVR GPUs? That’s supersampling, not multisampling. If so, is there a way to turn that behavior off to restore strict multisampling behavior (1 shading sample per pixel; N geometry samples per pixel).

Another thought… Would glBlitFramebuffer (GL_NEAREST) be a faster method for getting the FBO-rendered texture to the system framebuffer than rendering a textured quad?
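Roughly this kind of thing, just as a sketch (fboWidth/fboHeight stand in for your 1920x720 dimensions; it assumes a GLES 3.0-style blit and that both the resolved FBO colour buffer and the window surface are single-sampled):

[pre]
/// Sketch only: copy the FBO colour contents to the window framebuffer
/// instead of drawing a textured quad.
const int fboWidth  = 1920;   // FBO dimensions from the creation code
const int fboHeight = 720;

glBindFramebuffer( GL_READ_FRAMEBUFFER, m_iFrameBuffer );
glBindFramebuffer( GL_DRAW_FRAMEBUFFER, 0 );

glBlitFramebuffer( 0, 0, fboWidth, fboHeight,   /* source rectangle                 */
                   0, 0, fboWidth, fboHeight,   /* destination (window) rectangle   */
                   GL_COLOR_BUFFER_BIT, GL_NEAREST );

eglSwapBuffers( m_eglDisplay, m_eglSurface );
[/pre]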

Thank you Dark_Photon for your observations and suggestions.

Yes, we are clearing the color, stencil & depth buffers. But what is the difference between glClear()’ing the depth buffer and invalidating/discarding the depth buffer after every frame?

The code didn’t discard/invalidate the depth buffer but I will try with that one.

GL_MAX_SAMPLES_IMG = 8 on our platform, so 4 samples are used in the sample code. I will check if the cost varies with 2X MSAA.

I tried glBlitFramebuffer, but for some reason it was not working on the target. On the other hand, it worked on WIN32 with MSAA off but didn’t work when MSAA was on.
Is glBlitFramebuffer() supposed to be faster than rendering a textured quad?

Dark_Photon, you’re right there. I got that entirely wrong; it does work as MSAA, where there should only be a single shading sample per pixel.

Zulkarnine, glBlitFramebuffer with MSAA is outside of the ES specification, so our drivers don’t handle it. We still think the performance drop is unusual; if possible, could you provide a trace of your use case, or otherwise describe how your scene is rendered?

Zulkarnine:
[blockquote]But what is the difference between glClear()’ing the depth buffer and invalidating/discarding the depth buffer after every frame?[/blockquote]
It’s subtle. I was hoping PaulL would take this, but I’ll give it a stab. ImgTech’s PaulS explained this to me, but I think the PowerVR docs describe it as well.

By default the driver can’t assume you’re going to replace every pixel/sample in your render target with new content (for example: suppose you only drew to half the framebuffer; what should be in the other half? Answer: the old contents).

Given that, by default the driver needs to 1) read in the full framebuffer contents from CPU memory into the GPU tile cache before rendering and 2) write out the full framebuffer from the GPU tile cache to CPU memory after rendering. #1 is typically a total waste of time/cycles for all buffers, and #2 is typically useless w.r.t. depth/stencil specifically.

So how to prevent that waste? A glClear() of all bits in a buffer at the beginning of rendering prevents #1, and the {Invalidate,Discard}Framebuffer at the end of rendering prevents #2.

Also, regarding the wording in your question: you’d glClear all bits/channels of the color, depth, and/or stencil buffers (when utilized) to eliminate the maximum amount of needless memory transfers up front, and invalidate depth and stencil afterwards. This tells the driver/GPU to read in no initial contents before rendering and to write out only the color buffer afterwards.
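Put together, the per-frame pattern would look roughly like this (a hand-written sketch, not taken from your code; glInvalidateFramebuffer is the GLES 3.0 call, glDiscardFramebufferEXT is the GLES 2.0 extension equivalent):

[pre]
/// Sketch only: full clear up front, invalidate depth at the end.
glBindFramebuffer( GL_FRAMEBUFFER, m_iFrameBuffer );

/// 1) Clear every attachment completely, so the GPU never has to load the
///    previous frame's contents into the tile cache.
glClear( GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT );

// ... draw the scene into the FBO ...

/// 2) Invalidate depth at the end (add GL_STENCIL_ATTACHMENT too if a
///    stencil buffer is attached), so only the colour attachment is written
///    back to memory when the tiles are flushed.
const GLenum toInvalidate[] = { GL_DEPTH_ATTACHMENT };
glInvalidateFramebuffer( GL_FRAMEBUFFER, 1, toInvalidate );

/// Then bind the default framebuffer and draw the resolved colour texture.
glBindFramebuffer( GL_FRAMEBUFFER, 0 );
[/pre]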

@Dark_Photon, Thanks a lot for explaining that to me in detail.

So that seems to be a good reason to invalidate/discard the depth & stencil attachments at the end, and I will obviously try that and see if it helps improve the performance.

As you mentioned too, PaulS of ImgTech is really great at explaining these kinds of concepts and how the drivers work. I had some conversations with him recently and he always tried his best to help. He is a great person.

@PaulL As far as I know, GL_IMG_multisampled_render_to_texture is not supported on WIN32 even though IsGLExtensionSupported(“GL_IMG_multisampled_render_to_texture”) returns true. The FBOs are not rendered properly and it’s a mess overall.

So, I think PVRTrace won’t be able to provide a good overall view of what we are trying to do.

The main problem is that, after loading all necessary textures and VBOs into memory, the same API calls with MSAA turned on add about 10% additional overhead compared to when MSAA is turned off for the same FBO.

So, is it supposed to be 10% slower when MSAA is on for the FBO? Or is it a driver issue?

If IMG_multisampled_render_to_texture isn’t working correctly on WIN32, it is potentially a PVRVFrame bug; a trace will help us to identify that.

A trace will also let us do some analysis on our test hardware here, and we might be able to identify the performance issue from that.

@PaulL, thanks for checking on MSAA!

@Zulkarnine, re glClear, glad to help! Re glBlitFramebuffer, since you’re rendering MSAA to your FBO (via IMG_multisampled_render_to_texture), can’t the EGLSurface be single-sampled? If so, then in GLES there should be no problem with glBlitFramebuffer as you’re blitting from/to single-sampled framebuffers. What’s not supported is the destination of the blit being multisampled.

@Dark_Photon, it would be nice to turn off multisampling for the eglSurface, but unfortunately there are some floating objects which need to be rendered after the FBO is drawn to the eglSurface (and they also need MSAA). So I think, for now, I have to go without glBlitFramebuffer :frowning:

BTW thanks for clarifying what is actually supported and what isn’t :slight_smile:

@PaulL I will try to provide you with the PVRTrace by tomorrow.

Hello @PaulL,
I will create a ticket and provide the link to the trace files there, as they contain some confidential information.