Greetings,
I tried to optimize my application using UBOs in the following way:
Scenario: About 500 draw calls per frame, triangle strips with glDrawArrays, each draw call rendering between 5,000 and 50,000 triangles, about 30 different VBOs. Each vertex has about 13 bytes of attributes. No textures at the moment. Each draw call used to make several glUniform calls, all of which I have eliminated except for one glUniformMatrix* call that sets the MVP matrix. In order to eliminate that last call, I tried the following:
- Created a UBO with the GL_DYNAMIC_DRAW usage hint. This UBO holds a set of 256 4x4 float matrices.
- The UBO is updated once per frame using glBufferData. At the moment the whole buffer is re-uploaded each time.
- The UBO is bound to an interface block something like this:
  layout(std140, binding = 0) uniform u_CentralMatrixStack { mat4 matrixStack[16 * 16]; };
- The VBOs are extended to include one additional 8-bit attribute that indexes matrixStack and retrieves the matrix relevant for the draw call inside the vertex shader.
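To put numbers on the setup above, here is a minimal C sketch of the arithmetic. The function and constant names are hypothetical and only for illustration. It assumes 4-byte floats and that std140 stores each mat4 array element as four vec4 columns (64 bytes) with no extra padding. The 13-byte vertex size is the approximate figure given above, and the driver may pad or align attributes differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constants mirroring the post: 16 * 16 matrices of 4x4 floats. */
enum { MATRIX_COUNT = 16 * 16, FLOATS_PER_MAT4 = 16 };

/* Assumption: float is 4 bytes, as std140 expects. */
_Static_assert(sizeof(float) == 4, "std140 math below assumes 4-byte floats");

/* Size in bytes of the matrixStack array under std140:
 * each mat4 element occupies four vec4 columns, i.e. 64 bytes. */
static size_t matrix_stack_bytes(void) {
    return (size_t)MATRIX_COUNT * FLOATS_PER_MAT4 * sizeof(float);
}

/* Relative growth of the per-vertex data, in percent, when extra_bytes
 * are added to a vertex that previously took old_bytes. */
static double vertex_growth_percent(size_t old_bytes, size_t extra_bytes) {
    assert(old_bytes > 0);
    return 100.0 * (double)extra_bytes / (double)old_bytes;
}
```

Calling matrix_stack_bytes() gives 16384 bytes uploaded per frame, which as far as I know is also the smallest value the specification guarantees for GL_MAX_UNIFORM_BLOCK_SIZE, so the block sits right at that limit. vertex_growth_percent(13, 1) gives roughly 7.7%, the extra vertex data the 8-bit index adds if it is stored unpadded.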
Doing so dropped my average fps from 25 to 20. Replacing matrixStack[index] with the original MVP matrix set through glUniformMatrix* brought it back up.
I compared the compiler's vertex shader assembly output for both cases. What caught my attention is that the UBO variant adds a 7 : wdf drc0 instruction to the output.
Does anyone have an idea why UBO access would cause such a performance drop when used this way?
Regards.