UBO performance drop

Greetings,

I tried to optimize my application using UBOs in the following way:
Scenario: about 500 draw calls per frame, drawing triangle strips with glDrawArrays. Each draw call covers between 5,000 and 50,000 triangles, and there are about 30 different VBOs. Each vertex has about 13 bytes of attributes. There are no textures at the moment. Each draw call used to issue several glUniform calls, all of which I have eliminated except for one glUniformMatrix* call that sets the MVP matrix. To eliminate that last call, I tried the following:

  • Created a UBO with the GL_DYNAMIC_DRAW usage hint. This UBO holds a set of 256 4x4 float matrices.
  • The UBO is updated once per frame using glBufferData. At the moment the whole buffer is re-uploaded every time.
  • The UBO is bound to an interface block something like this:
    layout(std140, binding = 0) uniform u_CentralMatrixStack { mat4 matrixStack [ 16 * 16 ]; };
  • The VBOs are extended with one additional 8-bit attribute that indexes matrixStack, so the vertex shader can fetch the matrix relevant for the draw call.
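
For reference, the setup in the list above can be mirrored on the CPU side roughly like this. It is only a sketch: the type and function names (Mat4, CentralMatrixStack, mat4_identity, stack_upload_bytes) are made up for illustration and are not from the application, and the actual GL upload call is left out. It relies on the fact that a mat4 array in a std140 block has a 64-byte stride, so a plain array of 16-float matrices matches the block layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MATRIX_COUNT (16 * 16) /* matches matrixStack[16 * 16] in the block */

/* One column-major 4x4 float matrix. Under std140 each mat4 column is stored
   as a vec4, so a mat4 array has a 64-byte stride and no padding. */
typedef struct {
    float m[16];
} Mat4;

/* Hypothetical CPU-side mirror of u_CentralMatrixStack. */
typedef struct {
    Mat4 matrixStack[MATRIX_COUNT];
} CentralMatrixStack;

/* Write the identity matrix into *out. */
static void mat4_identity(Mat4 *out)
{
    memset(out, 0, sizeof *out);
    out->m[0] = out->m[5] = out->m[10] = out->m[15] = 1.0f;
}

/* Bytes that would have to be uploaded if only the first `used` matrices
   changed this frame (hypothetical helper). */
static size_t stack_upload_bytes(size_t used)
{
    assert(used <= MATRIX_COUNT);
    return used * sizeof(Mat4);
}
```

With all 256 entries in use, stack_upload_bytes(256) is 16384 bytes, which is the full-buffer upload described above. With one matrix per VBO (about 30) it would be 1920 bytes.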

Doing so dropped my average frame rate from 25 to 20 fps. Replacing matrixStack[index] with the original glUniform MVP matrix brought it back up.

I compared the vertex shader disassembly produced by the compiler for the two cases. What I noticed is that the UBO variant adds a 7 : wdf drc0 instruction to the output.

Can you imagine why UBO access can cause such a performance drop in the way I use it?

Regards.

Note: I reduced the stack size from 16 * 16 to 10 * 10 and the frame rate went back up by 2-3 fps.
Is uploading 16 KB of data into a UBO really that slow the way I do it?

Hi,

Can you please post which device you are trying this on? The driver version as well, please.
You might be running into cache issues or something like that.
Also, can you please provide a PVRTune file?

thanks,
Marton

Hello,

It’s a GX6250, I think. I will have to look up the driver version tomorrow.
Actually, I think I have solved one half of the mystery:
I gradually reduced the size of matrixStack until it had 11 * 11 elements. At that point, the wdf and smadd64 instructions disappeared from the disassembly and I got half of my lost frames back. I suspect that the common store is limited to around 1024 bytes on PowerVR 6 and that matrixStack[16*16] broke that limit, sending the buffer to main memory instead.
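
To put numbers on the sizes mentioned so far, here is a small sketch that computes the footprint of the matrix array for a given element count. The helper names are made up for illustration. It only does the arithmetic and says nothing about where the real common store limit lies, which this thread does not establish.

```c
#include <assert.h>
#include <stddef.h>

enum { FLOATS_PER_MAT4 = 16, BYTES_PER_FLOAT = 4, MAX_MATRICES = 16 * 16 };

/* Number of 32-bit float values occupied by `matrices` mat4 entries. */
static size_t matrix_stack_floats(size_t matrices)
{
    assert(matrices <= MAX_MATRICES);
    return matrices * FLOATS_PER_MAT4;
}

/* Size in bytes of `matrices` mat4 entries (64 bytes each under std140). */
static size_t matrix_stack_bytes(size_t matrices)
{
    return matrix_stack_floats(matrices) * BYTES_PER_FLOAT;
}

/* 16 * 16 = 256 matrices -> 4096 floats, 16384 bytes
   11 * 11 = 121 matrices -> 1936 floats,  7744 bytes
   10 * 10 = 100 matrices -> 1600 floats,  6400 bytes */
```

Note that the 11 * 11 case, where the wdf and smadd64 instructions disappeared, is already 7744 bytes. If there is a hard limit, it therefore seems to sit somewhere above that figure, or to be counted in a unit other than bytes.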

Ok, this is the hardware and driver info:

GL renderer: PowerVR Rogue GX6250
EGL version: 1.4 build 1.10@5187610

I narrowed down the remaining missing 2-3 fps to accessing the matrix array in the shader. I created a minimal shader and loaded it into PVRShaderEditor 2.12:

#version 310 es
precision highp float;
layout(std140, binding = 0) uniform u_CentralMatrixStack { mat4 matrixStack [ 11 * 11 ]; };
uniform mat4 u_mvp;
in vec2 a_pos;
in float a_deltaRowColIndex;
flat out int index;
flat out int matrixIndex;

void main()
{
    matrixIndex = int( a_deltaRowColIndex );
    gl_Position = u_mvp * vec4( a_pos , 1.0, 1.0 );
    //gl_Position = matrixStack [ matrixIndex ] * vec4( a_pos , 1.0, 1.0 );
}

This is the disassembly of the shader as posted above (using u_mvp):

--------------------- Disassembled HW Code --------------------

0 : fmul ft0, sh0, vi1
pck.s32.rndzero ft2, vi0
mov i0, ft0;
uvsw.write ft2, 4;

1 : fmad ft0, sh4, vi2, i0
fmul ft1, sh1, vi1
mov i0, ft0;
mov i1, ft1;

2 : fadd ft0, sh8, i0
fmad ft1, sh5, vi2, i1
mov i0, ft0;
mov i1, ft1;

3 : fadd ft0, sh12, i0
fadd ft1, sh9, i1
mov ft0.e0.e1.e2.e3, ft0
mov i0, ft1;
uvsw.write ft0, 0;

4 : fadd ft0, sh13, i0
fmul ft1, sh2, vi1
mov ft0.e0.e1.e2.e3, ft0
mov i0, ft1;
uvsw.write ft0, 1;

5 : fmad ft0, sh6, vi2, i0
fmul ft1, sh3, vi1
mov i0, ft0;
mov i1, ft1;

6 : fadd ft0, sh10, i0
fmad ft1, sh7, vi2, i1
mov i0, ft0;
mov i1, ft1;

7 : fadd ft0, sh14, i0
fadd ft1, sh11, i1
mov ft0.e0.e1.e2.e3, ft0
mov i0, ft1;
uvsw.write ft0, 2;

8 : fadd ft0, sh15, i0
mov ft0.e0.e1.e2.e3, ft0
uvsw.write ft0, 3;

This is the disassembly with the matrixStack line instead:

--------------------- Disassembled HW Code --------------------

0 : pck.s32.rndzero ft2, vi0
mov i2, ft2;

1 : imad16 ft0, i2.e0, c16.e0, sh2.e0
mov idx0, ft0;

2 : fmul ft0, sh0[idx0], vi1
mov i0, ft0;

3 : fmad ft0, sh4[idx0], vi2, i0
mov i0, ft0;

4 : fadd ft0, sh8[idx0], i0
fmul ft1, sh1[idx0], vi1
mov i0, ft0;
mov i1, ft1;

5 : fadd ft0, sh12[idx0], i0
fmad ft1, sh5[idx0], vi2, i1
mov ft0.e0.e1.e2.e3, ft0
mov i0, ft1;
uvsw.write ft0, 0;

6 : fadd ft0, sh9[idx0], i0
fmul ft1, sh2[idx0], vi1
mov i0, ft0;
mov i1, ft1;

7 : fadd ft0, sh13[idx0], i0
fmad ft1, sh6[idx0], vi2, i1
mov ft0.e0.e1.e2.e3, ft0
mov i0, ft1;
uvsw.write ft0, 1;

8 : fadd ft0, sh10[idx0], i0
fmul ft1, sh3[idx0], vi1
mov i0, ft0;
mov i1, ft1;

9 : fadd ft0, sh14[idx0], i0
fmad ft1, sh7[idx0], vi2, i1
mov ft0.e0.e1.e2.e3, ft0
mov i0, ft1;
uvsw.write ft0, 2;

10 : fadd ft0, sh11[idx0], i0
mov ft0.e0.e1.e2.e3, i2
mov i0, ft0;
uvsw.write ft0, 4;

11 : fadd ft0, sh15[idx0], i0
mov ft0.e0.e1.e2.e3, ft0
uvsw.write ft0, 3;

The matrixStack version seems to have a few more mov instructions, and it accesses the common store through indexing ( sh15[idx0] etc. ) instead of through plain register names.
I wonder if the indexing ruins the cache behaviour in some way.

Hi,

It is possible that your access pattern hurts cache performance.
I think the new PVRTuneComplete will contain detailed compiler stats that might be helpful to you. We will release it in 2-3 weeks, I think.

bests,
Marton

One question: are the addresses sh1[idx0], sh2[idx0], sh3[idx0] etc. located next to each other in memory? If not, might this be the cause of the bad cache performance?
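
To make the question concrete, here is a small model of how I read the listing. It rests on assumptions that nothing in this thread confirms: that idx0 is the matrix index times 16 plus a base offset (which the imad16 line with c16 seems to suggest), that shN[idx0] means "slot N + idx0", and that slots are consecutive 32-bit values. The function name operand_slot is made up. The column-major part does match the u_mvp disassembly, where the x component is built from sh0, sh4, sh8 and sh12.

```c
#include <assert.h>

enum { FLOATS_PER_MAT4 = 16 };

/* Assumed slot (counted in floats from the start of the array) of element
   (row, col) of matrix `matrix_index`, with column-major storage. This models
   one reading of the disassembly, not documented hardware behaviour. */
static unsigned operand_slot(unsigned matrix_index, unsigned row, unsigned col)
{
    assert(row < 4 && col < 4);
    return matrix_index * FLOATS_PER_MAT4 + col * 4 + row;
}
```

Under this reading, the sixteen operands sh0[idx0] to sh15[idx0] of one matrix would occupy one contiguous run of 16 slots (for matrix 2, slots 32 to 47), and each output component would step through it with a stride of 4.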

Hi,

Sorry about the late reply. I think we’ve been contacted by one of your peers through our customer-facing ticketing system, and we’ve advised them through that avenue. Are you aware of that?

bests,
Marton