1. Apple's "OpenGL ES Programming Guide" says the vertex shader has 128 uniform vectors.

2. PowerVR's "PowerVR SGX OpenGL ES 2.0 Application Development Recommendations" says it has up to 512 uniform vectors.

3. I queried GL_MAX_VERTEX_UNIFORM_VECTORS with glGetIntegerv on the simulator and on an iPad device, and it returned 128.

4. But when I write

uniform mat4 matModelViewProj;

uniform vec4 BoneMatrix[256];

in my vertex shader, it compiles, links and works on both the simulator and the iPad device. The uniform location that glGetUniformLocation returns for matModelViewProj is 256! I can increase the array size from 256 to 768 and it still works. When I increase it to 1024, glLinkProgram crashes.

5. GPU skinning in the vertex shader is also very slow: it runs at the same FPS as CPU skinning.

6. The 'Real Character Demo' in the PowerVR Insider demos is said to support 42 bones. That is 42 * 3 = 126 vectors, which leaves only 2 uniform vectors for the normal transform and lighting. That would not be possible if SGX really had only 128 vectors.

I am very confused by all this. Can anyone explain? Does anyone have the source code of the 'Real Character Demo'?

The device is an iPad Wi-Fi 64 GB running iOS 3.2.

Thanks for your reply.

yelinghuye 2010-08-30 04:57:46

Hi,

Yes, I found different numbers too; that's why I asked whether PowerVR could publish the exact specifications of their chips.

But I'm now afraid that it actually depends on who wrote the drivers :s. They allow their customers to write their own drivers, so those drivers may not use everything the chip offers. That's really a question, though, and I hope I'm wrong.

For 6) : Although the meshes in the Real Characters demo contain 42 bones, each mesh is split into multiple batches containing far fewer bones (a maximum of 8 bones in this case I believe). The calculation of the batches is done at export time using PVRGeoPOD. The Chameleon Man demo in the SDK demonstrates the technique for drawing a skinned mesh in multiple batches.

Hope that’s helpful.

The document actually mentions 512 uniform scalars, which corresponds to 128 vec4s. However that's really just the number which is guaranteed to work, see below.

SGX has a limited amount of space to store uniforms (which can vary between implementations). You can use more uniforms, but these have to be stored in external memory which is slow. Unfortunately OpenGL ES doesn't have a way to distinguish between "fast" and "slow" uniforms. It's usually best to keep the number of uniforms low.
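
To make the budget concrete, here is a minimal sketch of the arithmetic in C. It assumes the skinning layout discussed in this thread (3 vec4 rows per bone) and reserves 4 vectors for one mat4; the function names are hypothetical helpers, not part of any SDK.

```c
#include <assert.h>

/* Uniform vectors (vec4 slots) a skinning shader declares:
 * bone_count bones at vectors_per_bone each, plus reserved_vectors
 * for everything else (for example 4 for one mat4). */
int uniform_vectors_used(int bone_count, int vectors_per_bone, int reserved_vectors)
{
    assert(bone_count >= 0 && vectors_per_bone > 0 && reserved_vectors >= 0);
    return bone_count * vectors_per_bone + reserved_vectors;
}

/* Largest bone count whose matrices still fit within max_vectors. */
int max_bones_in_budget(int max_vectors, int vectors_per_bone, int reserved_vectors)
{
    assert(vectors_per_bone > 0 && reserved_vectors >= 0);
    if (max_vectors <= reserved_vectors)
        return 0;
    return (max_vectors - reserved_vectors) / vectors_per_bone;
}
```

Under these assumptions, max_bones_in_budget(128, 3, 4) gives 41 bones, one short of the 42 asked about in point 6, while an 8-bone batch needs only uniform_vectors_used(8, 3, 4) = 28 vectors.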

Could you please submit a bug report to Apple?

How many bones (per vertex, per mesh) do you have, and how many skinned vertices in total? What does your shader look like? Is your application already GPU limited without skinning?

If you are GPU limited and the CPU has cycles to spare, it makes sense to perform skinning on the CPU. If that's the case, make sure that you/the compiler uses NEON SIMD instructions for the skinning code.
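
For reference, the per-vertex work that CPU skinning has to do can be sketched in plain C as below. This is a scalar baseline only, under assumptions that are not confirmed at this point in the thread: two influences per vertex, byte weights scaled by 1/255, and each bone stored as three row vectors of a 4x3 matrix. The type and function names are illustrative. A NEON version would vectorize the same arithmetic.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z; } Vec3;     /* illustrative type */
typedef struct { float x, y, z, w; } Vec4;  /* illustrative type */

/* Transform p by a 4x3 matrix stored as three consecutive Vec4 rows. */
static Vec3 transform_rows(const Vec4 *rows, Vec3 p)
{
    Vec3 r;
    r.x = rows[0].x * p.x + rows[0].y * p.y + rows[0].z * p.z + rows[0].w;
    r.y = rows[1].x * p.x + rows[1].y * p.y + rows[1].z * p.z + rows[1].w;
    r.z = rows[2].x * p.x + rows[2].y * p.y + rows[2].z * p.z + rows[2].w;
    return r;
}

/* Blend the position transformed by two bones.
 * bone_rows holds 3 * bone_count row vectors; weights are 0..255. */
Vec3 skin_vertex(const Vec4 *bone_rows, size_t bone_count, Vec3 p,
                 const unsigned char index[2], const unsigned char weight[2])
{
    assert(index[0] < bone_count && index[1] < bone_count);

    Vec3 a = transform_rows(bone_rows + 3 * (size_t)index[0], p);
    Vec3 b = transform_rows(bone_rows + 3 * (size_t)index[1], p);
    float w0 = (float)weight[0] / 255.0f;  /* scale byte weight to [0, 1] */
    float w1 = (float)weight[1] / 255.0f;

    Vec3 out;
    out.x = a.x * w0 + b.x * w1;
    out.y = a.y * w0 + b.y * w1;
    out.z = a.z * w0 + b.z * w1;
    return out;
}
```

Calling skin_vertex once per vertex and writing the result into the vertex array is the whole CPU path; the inner three dot products are the part worth handing to SIMD.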

I really appreciate your replies.

I reviewed the "PowerVR SGX OpenGL ES 2.0 Application Development Recommendations" and yes, it actually mentions 512 uniform scalars; I was wrong about that.

**About the performance:**

**1**. I render the same skinned mesh 20 times. The mesh has 1265 vertices, 1698 triangles and 70 bones, with at most 2 bones per vertex and a single texture stage. So the test scene has 1265*20=25300 vertices and 1698*20=33960 triangles.

The scene currently runs at 30 FPS on the iPad device, the same as my CPU skinning implementation.

I use a VBO for the mesh; the vertex format is

struct MY_SKIN_VERTEX

{

VECTOR3 p;

VECTOR3 n;

float u, v;

BYTE index[2]; // only max 2 bones affect a vertex

BYTE weight[2]; // unsigned byte needing scale to [0-1]

};
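
The attribute pointer for the bone data depends on where index and weight land inside this struct, so it may be worth checking the layout at compile time. The sketch below assumes VECTOR3 is three packed floats and BYTE is an unsigned char (neither definition is shown in the post) and uses C11 _Static_assert.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z; } VECTOR3;  /* assumed: three packed floats */
typedef unsigned char BYTE;                 /* assumed: 8-bit unsigned */

typedef struct MY_SKIN_VERTEX
{
    VECTOR3 p;
    VECTOR3 n;
    float u, v;
    BYTE index[2];   /* at most 2 bones affect a vertex */
    BYTE weight[2];  /* unsigned bytes, scaled to [0, 1] in the shader */
} MY_SKIN_VERTEX;

/* The 4 bytes of index + weight are read as one 4-component attribute,
 * so they must start at byte 32 and the stride must be 36 bytes. */
_Static_assert(offsetof(MY_SKIN_VERTEX, index) == 32,
               "bone index/weight attribute must start at offset 32");
_Static_assert(sizeof(MY_SKIN_VERTEX) == 36,
               "unexpected padding in MY_SKIN_VERTEX");
```

If either assertion fails on a given compiler, the offset and stride passed to the attribute pointer call would need to use offsetof and sizeof rather than hard-coded numbers.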

Below is my vertex shader code:

attribute highp vec3 my_Vertex;

attribute lowp vec4 my_VertexColor;

attribute mediump vec2 my_TexCoord0;

attribute mediump vec4 my_BoneIndexWeight; // glVertexAttribPointer(GL_BONE_INDEXWEIGHT_ARRAY, 4, GL_UNSIGNED_BYTE, GL_FALSE, sizeof(MY_SKIN_VERTEX), (const GLvoid*)32);

uniform highp mat4 my_ModelViewProj;

uniform highp vec4 my_BoneMatrix[3 * 70]; // Every 3 vec4 construct a 4x3 bone matrix

varying mediump vec2 PixelTexCoord0;

varying lowp vec4 PixelColor;

void main()

{

int nIndex1 = int(my_BoneIndexWeight.x * 3.0);

int nIndex2 = int(my_BoneIndexWeight.y * 3.0);

lowp float fWeight1 = my_BoneIndexWeight.z * 0.0039215686;// divide 255, scale to [0-1]

lowp float fWeight2 = my_BoneIndexWeight.w * 0.0039215686;

highp vec4 v1 = my_BoneMatrix[nIndex1];

highp vec4 v2 = my_BoneMatrix[nIndex1+1];

highp vec4 v3 = my_BoneMatrix[nIndex1+2];

highp vec3 vVertex1 = vec3(v1.x * my_Vertex.x + v1.y * my_Vertex.y + v1.z * my_Vertex.z + v1.w,

v2.x * my_Vertex.x + v2.y * my_Vertex.y + v2.z * my_Vertex.z + v2.w,

v3.x * my_Vertex.x + v3.y * my_Vertex.y + v3.z * my_Vertex.z + v3.w);

v1 = my_BoneMatrix[nIndex2];

v2 = my_BoneMatrix[nIndex2+1];

v3 = my_BoneMatrix[nIndex2+2];

highp vec3 vVertex2 = vec3(v1.x * my_Vertex.x + v1.y * my_Vertex.y + v1.z * my_Vertex.z + v1.w,

v2.x * my_Vertex.x + v2.y * my_Vertex.y + v2.z * my_Vertex.z + v2.w,

v3.x * my_Vertex.x + v3.y * my_Vertex.y + v3.z * my_Vertex.z + v3.w);

highp vec4 vBlendVertex = vec4(vVertex1 * fWeight1 + vVertex2 * fWeight2, 1.0);

gl_Position = my_ModelViewProj * vBlendVertex;

PixelTexCoord0 = my_TexCoord0;

PixelColor = my_VertexColor;

}

**2**. I also used PVRTBoneBatch from the PowerVR SDK to split the skin into 2 render batches of 35 bones each, and the FPS stays at 30.

**3**. I also implemented the skinning calculation on the CPU, without a VBO (plain client-side vertex arrays), and the FPS is also 30.

**4**. When I calculate only 1 bone in the vertex shader (by commenting out the second bone calculation), the FPS goes up to 40, so I think the bottleneck is the vertex shader.

**5**. When I draw only 16 skinned meshes (4 fewer), the FPS is still 30, which confuses me.

**6**. Another interesting thing I found: when I increase the 70 in the vertex shader (vec4 my_BoneMatrix[3 * 70]) to 84, the FPS drops to 23. It seems to be using the slow uniforms. When I decrease the 70 to 30 (the animation is wrong but the vertex shader still works), the FPS stays at 30.

My game is an MMORPG and its CPU budget is very limited. I thought GPU skinning would make skinning more efficient, but this result is discouraging. I even suspect that the iPad OpenGL ES implementation runs the vertex shader on the CPU; otherwise, why is there no difference between CPU skinning and GPU skinning?

yelinghuye 2010-09-01 05:02:26

On iPad (and iPhone) the frame rate is always linked to the screen refresh rate of 60 Hz. With a constant workload you will always hit framerates like 60, 40, 30, 24, 20, 17.14, … (i.e. 60/1, 60/1.5, 60/2, 60/2.5, etc.).

Thus if you get 30 fps for two different workloads it doesn’t mean they both take the same time to process. It just means that both take between 25 and 33.3 ms.
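
The quantization can be sketched in C as follows. This only models the sequence listed above (steps of half a refresh interval, never faster than the refresh rate); the function name is hypothetical and real driver behaviour may differ.

```c
#include <assert.h>

/* Reported frame rate for a constant per-frame cost of frame_ms,
 * when presentation is locked to a display running at refresh_hz.
 * Frame time is rounded up to the next half refresh interval. */
double quantized_fps(double frame_ms, double refresh_hz)
{
    assert(frame_ms > 0.0 && refresh_hz > 0.0);

    double half_interval_ms = 1000.0 / refresh_hz / 2.0;
    int half_steps = 2;  /* cannot be faster than one full refresh */

    while (half_steps * half_interval_ms < frame_ms)
        half_steps++;

    return refresh_hz * 2.0 / half_steps;
}
```

In this model a 26 ms frame and a 32 ms frame both report 30 fps at 60 Hz, while a 20 ms frame reports 40 fps, which is why two quite different workloads can show the same number.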

Also, keep in mind that SGX is a unified shader architecture. By rendering fewer meshes you decrease the vertex workload, but the fragment workload is probably much less affected. So reducing the number of meshes by 20% might be a 5% reduction in total shader workload per frame.
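
As a rough sketch of that estimate in C: the 25% vertex share used in the note below is an assumption implied by the 20% and 5% figures, not a measured value, and the fragment workload is assumed to stay unchanged. The function name is hypothetical.

```c
#include <assert.h>

/* Fraction by which total shader work drops when only the vertex part
 * shrinks. vertex_share is the vertex portion of total shader work,
 * vertex_reduction is how much that portion is cut (both in 0..1). */
double total_workload_reduction(double vertex_share, double vertex_reduction)
{
    assert(vertex_share >= 0.0 && vertex_share <= 1.0);
    assert(vertex_reduction >= 0.0 && vertex_reduction <= 1.0);
    return vertex_share * vertex_reduction;
}
```

With vertex_share = 0.25 and vertex_reduction = 0.20 this returns about 0.05, so a frame that took 30 ms would still take roughly 28.5 ms and stay in the same 30 fps bucket.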