UBO performance drop

desperado · May 17, 2019, 9:45pm

Greetings,

I tried to optimize my application using UBOs in the following way:
Scenario: About 500 draw calls per frame, triangle strips with glDrawArrays, each draw call covering roughly 5,000 to 50,000 triangles, about 30 different VBOs. Each vertex has about 13 bytes of attributes. No textures at the moment. Each draw call used to have several glUniform calls, which I have completely eliminated except for one glUniformMatrix* call that sets the mvp matrix. In order to eliminate that last call, I tried the following:

Created a UBO with the GL_DYNAMIC_DRAW usage hint. This UBO holds a set of 256 4x4 float matrices.
The UBO is updated once per frame using glBufferData. At the moment the whole buffer is re-uploaded each time.
The UBO is bound to an interface block something like this:
layout(std140, binding = 0) uniform u_CentralMatrixStack { mat4 matrixStack [ 16 * 16 ]; };
The VBOs are extended to include one additional 8-bit attribute that indexes matrixStack and retrieves the matrix relevant for the draw call inside the vertex shader.
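
The CPU side of this setup could look roughly like the following C sketch. The names (Mat4, g_matrix_stack, stack_set, stack_upload_size) are hypothetical and not taken from the original application, and the GL upload itself is only mentioned in a comment so that the snippet does not depend on a GL context.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Number of slots, matching matrixStack[16 * 16] in the interface block. */
enum { MATRIX_STACK_SIZE = 16 * 16 };

/* One column-major 4x4 float matrix. */
typedef struct { float m[16]; } Mat4;

/* In a std140 block each mat4 array element occupies 64 bytes,
   so this struct can be copied into the buffer without repacking. */
static_assert(sizeof(Mat4) == 64, "Mat4 must match the std140 mat4 stride");

/* CPU-side staging copy of the matrix stack (hypothetical name). */
static Mat4 g_matrix_stack[MATRIX_STACK_SIZE];

/* Store the MVP of one draw call in its slot. An 8-bit slot index
   can address exactly the 256 entries of the stack. */
static void stack_set(uint8_t slot, const float mvp[16])
{
    memcpy(g_matrix_stack[slot].m, mvp, sizeof g_matrix_stack[slot].m);
}

/* Size in bytes that would be handed to glBufferData once per frame. */
static size_t stack_upload_size(void)
{
    return sizeof g_matrix_stack;
}
```

The per-vertex 8-bit attribute then only has to carry the same slot number that was passed to stack_set for that draw call.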

Doing so dropped my average fps from 25 to 20. Replacing matrixStack[index] with the original mvp uniform set via glUniformMatrix* brought it back up.

I compared the vertex shader assembly output from the compiler for the former and latter case. What came to my attention was that the UBO variant adds a 7 : wdf drc0 instruction to the output.

Can you imagine why UBO access can cause such a performance drop in the way I use it?

Regards.

desperado · May 20, 2019, 12:13pm

Note: I reduced the stack size from 16 * 16 to 10 * 10 and the frame rate went back up by 2-3 fps.
Is uploading 16 KB of data into a UBO really that slow the way I do it?

MartonTamas · May 21, 2019, 2:38pm

Hi,

Can you please post which device you are trying this on, and the driver version?
You might be having cache issues or something like that.
Also, can you please provide a PVRTune file?

thanks,
Marton

desperado · May 21, 2019, 7:25pm

Hello,

It’s a GX6250, I think. I will have to look up the driver version tomorrow.
Actually, I think I solved one half of the mystery:
I gradually reduced the size of matrixStack until I had 11 * 11 elements. At this point, the wdf and smadd64 instructions disappeared from the disassembly and I got half of my frames back. I suspect that the common store is limited to around 1024 bytes on PowerVR Series 6 and that matrixStack[16*16] exceeded that limit, sending the buffer to main memory instead.
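
For reference, the std140 footprint of the array sizes mentioned in this thread can be worked out as in the sketch below. It relies only on the std140 rule that a mat4 is stored as four 16-byte columns, giving 64 bytes per array element; the helper name is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* The byte counts below assume 32-bit floats, as GLSL specifies. */
static_assert(sizeof(float) == 4, "expected 32-bit float");

/* Bytes occupied by `count` mat4 array elements in a std140 block:
   4 columns per matrix, each column padded to 16 bytes. */
static size_t std140_mat4_array_bytes(size_t count)
{
    return count * 4u * 16u;
}
```

By this arithmetic, 16 * 16 matrices take 16384 bytes, 11 * 11 take 7744 bytes and 10 * 10 take 6400 bytes. Since even the 11 * 11 case is far above 1024 bytes, the 1024 figure, if it is right, would have to be in some unit other than bytes; the thread does not settle this.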

desperado · May 22, 2019, 11:25am

Ok, this is the hardware and driver info:

GL renderer: PowerVR Rogue GX6250
EGL version: 1.4 build 1.10@5187610

I narrowed down the remaining missing 2-3 fps to accessing the matrix array in the shader. I created a minimal shader and loaded it into PVRShaderEditor 2.12:

#version 310 es
precision highp float;
layout(std140, binding = 0) uniform u_CentralMatrixStack { mat4 matrixStack [ 11 * 11 ]; };
uniform mat4 u_mvp;
in vec2 a_pos;
in float a_deltaRowColIndex;
flat out int index;
flat out int matrixIndex;

void main()
{
    matrixIndex = int( a_deltaRowColIndex );
    gl_Position = u_mvp * vec4( a_pos , 1.0, 1.0 );
    //gl_Position = matrixStack [ matrixIndex ] * vec4( a_pos , 1.0, 1.0 );
}

This is the disassembly as it is:

--------------------- Disassembled HW Code --------------------

0 : fmul ft0, sh0, vi1
pck.s32.rndzero ft2, vi0
mov i0, ft0;
uvsw.write ft2, 4;

1 : fmad ft0, sh4, vi2, i0
fmul ft1, sh1, vi1
mov i0, ft0;
mov i1, ft1;

2 : fadd ft0, sh8, i0
fmad ft1, sh5, vi2, i1
mov i0, ft0;
mov i1, ft1;

3 : fadd ft0, sh12, i0
fadd ft1, sh9, i1
mov ft0.e0.e1.e2.e3, ft0
mov i0, ft1;
uvsw.write ft0, 0;

4 : fadd ft0, sh13, i0
fmul ft1, sh2, vi1
mov ft0.e0.e1.e2.e3, ft0
mov i0, ft1;
uvsw.write ft0, 1;

5 : fmad ft0, sh6, vi2, i0
fmul ft1, sh3, vi1
mov i0, ft0;
mov i1, ft1;

6 : fadd ft0, sh10, i0
fmad ft1, sh7, vi2, i1
mov i0, ft0;
mov i1, ft1;

7 : fadd ft0, sh14, i0
fadd ft1, sh11, i1
mov ft0.e0.e1.e2.e3, ft0
mov i0, ft1;
uvsw.write ft0, 2;

8 : fadd ft0, sh15, i0
mov ft0.e0.e1.e2.e3, ft0
uvsw.write ft0, 3;

This is the disassembly with the matrixStack line instead:

--------------------- Disassembled HW Code --------------------

0 : pck.s32.rndzero ft2, vi0
mov i2, ft2;

1 : imad16 ft0, i2.e0, c16.e0, sh2.e0
mov idx0, ft0;

2 : fmul ft0, sh0[idx0], vi1
mov i0, ft0;

3 : fmad ft0, sh4[idx0], vi2, i0
mov i0, ft0;

4 : fadd ft0, sh8[idx0], i0
fmul ft1, sh1[idx0], vi1
mov i0, ft0;
mov i1, ft1;

5 : fadd ft0, sh12[idx0], i0
fmad ft1, sh5[idx0], vi2, i1
mov ft0.e0.e1.e2.e3, ft0
mov i0, ft1;
uvsw.write ft0, 0;

6 : fadd ft0, sh9[idx0], i0
fmul ft1, sh2[idx0], vi1
mov i0, ft0;
mov i1, ft1;

7 : fadd ft0, sh13[idx0], i0
fmad ft1, sh6[idx0], vi2, i1
mov ft0.e0.e1.e2.e3, ft0
mov i0, ft1;
uvsw.write ft0, 1;

8 : fadd ft0, sh10[idx0], i0
fmul ft1, sh3[idx0], vi1
mov i0, ft0;
mov i1, ft1;

9 : fadd ft0, sh14[idx0], i0
fmad ft1, sh7[idx0], vi2, i1
mov ft0.e0.e1.e2.e3, ft0
mov i0, ft1;
uvsw.write ft0, 2;

10 : fadd ft0, sh11[idx0], i0
mov ft0.e0.e1.e2.e3, i2
mov i0, ft0;
uvsw.write ft0, 4;

11 : fadd ft0, sh15[idx0], i0
mov ft0.e0.e1.e2.e3, ft0
uvsw.write ft0, 3;

The matrixStack version seems to have some more mov instructions and uses indexing to access the common store instead of only the register names ( sh15[idx0] etc. ).
I wonder if the indexing ruins the cache behaviour in some way.
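
One way to read the two listings is as the same column-major matrix-vector product, once with fixed operands and once with a per-vertex offset. The C model below is only a sketch of that reading: the function names are hypothetical, the mapping of sh0..sh15 to the 16 matrix floats is inferred from the instruction order, and the assumption that imad16 computes an offset of 16 floats per matrix is not confirmed anywhere in the thread.

```c
#include <assert.h>

/* Direct variant: gl_Position = u_mvp * vec4(x, y, 1.0, 1.0).
   m is column-major, so m[col * 4 + row] plays the role of
   sh<col * 4 + row> in the first listing. */
static void transform_direct(const float m[16], float x, float y,
                             float out[4])
{
    for (int row = 0; row < 4; ++row) {
        /* z and w are both 1.0, so columns 2 and 3 are simply added. */
        out[row] = m[row] * x + m[4 + row] * y + m[8 + row] + m[12 + row];
    }
}

/* Indexed variant: stack is a flat array of matrices, 16 floats each.
   The base offset depends on a per-vertex value, which is why every
   operand in the second listing carries an [idx0] suffix. */
static void transform_indexed(const float *stack, int index,
                              float x, float y, float out[4])
{
    assert(index >= 0);
    const float *m = stack + index * 16; /* assumed offset computation */
    transform_direct(m, x, y, out);
}
```

Both variants do the same arithmetic; in this model the only difference is that the indexed one cannot know its operand addresses until the index has been read from the vertex.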

MartonTamas · May 22, 2019, 11:42am

Hi,

It is possible that your accesses hurt cache performance.
I think the new PVRTuneComplete will contain detailed compiler stats that might be helpful to you. We will release it in 2-3 weeks, I think.

bests,
Marton

desperado · May 22, 2019, 11:51am

One question: Are the addresses sh1[idx0], sh2[idx0], sh3[idx0] etc. located next to each other in memory? If not, could this be the cause of the bad cache performance?
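
On the buffer side at least, the layout is fixed by std140 and can be computed. The sketch below gives the byte offset of a single float inside a std140 array of mat4; the function name is hypothetical, and whether the on-chip common store mirrors this layout is not answered in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of one float in a std140 `mat4 matrixStack[N]` array.
   Array stride is 64 bytes, each column is a 16-byte vec4, and each
   component within a column is 4 bytes. */
static size_t std140_mat4_elem_offset(size_t matrix, size_t col, size_t row)
{
    assert(col < 4 && row < 4);
    return matrix * 64u + col * 16u + row * 4u;
}
```

If sh1, sh2 and sh3 correspond to rows 1 to 3 of column 0 (an assumption based on the instruction order in the disassembly), their source floats sit at offsets 4, 8 and 12 of the same matrix, so they are adjacent in the buffer, and consecutive matrices start 64 bytes apart.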

MartonTamas · June 6, 2019, 1:47pm

Hi,

Sorry about the late reply. I think we’ve been contacted by one of your peers through our customer-facing ticketing system, and we’ve advised them through that avenue. Are you aware of that?

bests,
Marton
