texture2D Cycle Count

I’m examining a shader using PVRUniSCoEditor v1.3. I have a question regarding the cycle counts for texture2D in an OpenGL ES 2.0 fragment shader. The cycle counts are shown in parentheses below for reading just the rgb values of a texture:

    (4) vec3 c0 = texture2D(img0, vTexture0.xy).rgb;
    (3) vec3 c1 = texture2D(img0, vTexture0.zw).rgb;
    (3) vec3 c2 = texture2D(img0, vTexture1.xy).rgb;

If I read in the rgba values, the cycle counts are:

    (2) vec4 c0 = texture2D(img0, vTexture0.xy);
    (1) vec4 c1 = texture2D(img0, vTexture0.zw);
    (1) vec4 c2 = texture2D(img0, vTexture1.xy);

I would have thought that reading in less data would have resulted in the rgb version having lower cycle counts. I’m not entirely sure what the cycle counts mean for texture2D, I guess. (This shader will be reading from an RGBA texture.)

I would imagine the cycles aren’t including the time it takes to actually retrieve the data. It seems that would depend more on the hardware.

So, is it generally better for one to read in rgba rather than rgb?

(I know the ultimate answer is to test it out on the hardware, but I generally don’t have access to the hardware, an iPhone 4.)

Thanks.



Hi pail,


I apologise for the slow response on this.





The reason you are seeing a difference in cycle counts between the .xy and .zw samples is that the hardware treats lookups that use .zw texture coordinates as dependent texture reads. This is a result of the swizzle caused by using the zw values of the vector. As the compiler only estimates per-line cycle counts, the extra cycle is added to the first line that is encountered (the dependent texture read is not tied to a particular line of code as far as the compiler is concerned).


In the case of the xy sample, the hardware is able to sample the texture ahead of time, thereby saving the additional cycles a dependent read induces (as well as avoiding a stall it may cause).


You can resolve this issue by using two vec2 texture coordinates instead of packing them into a vec4. Both scenarios require two 32-bit registers to store this information, so there is no advantage to packing the data into a vec4.
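
As a rough sketch of that change, the fragment shader could look something like the following. The varying names vTexCoord0 to vTexCoord2 are made up for illustration (the vertex shader would need to write matching vec2 varyings), and the averaging at the end is only a placeholder for your real filter:

    precision mediump float;

    uniform sampler2D img0;

    // Hypothetical names: one vec2 varying per lookup, replacing
    // vTexture0.xy, vTexture0.zw and vTexture1.xy.
    varying vec2 vTexCoord0;
    varying vec2 vTexCoord1;
    varying vec2 vTexCoord2;

    void main()
    {
        // Each coordinate goes to texture2D unmodified, with no swizzle.
        vec3 c0 = texture2D(img0, vTexCoord0).rgb;
        vec3 c1 = texture2D(img0, vTexCoord1).rgb;
        vec3 c2 = texture2D(img0, vTexCoord2).rgb;

        // Placeholder combine step, not part of the original shader.
        gl_FragColor = vec4((c0 + c1 + c2) / 3.0, 1.0);
    }

It is worth pasting the result back into PVRUniSCoEditor to confirm that the extra cycle on the first lookup has gone.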





As for the difference between the .rgb and full rgba sampling, the difference in cycle counts is due to the data being natively stored as a vec4. This means a single vec4 move operation is required to sample this data, whereas only using a subset of this data requires a more complex operation to retrieve the required information.
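
A minimal sketch of what that might look like in practice, assuming the rest of your shader can take the .rgb components at the point of use rather than at the point of sampling (the averaging is a placeholder, and whether this ends up cheaper overall is something to check against the editor's cycle counts):

    precision mediump float;

    uniform sampler2D img0;
    varying vec4 vTexture0;
    varying vec4 vTexture1;

    void main()
    {
        // Keep the full vec4 result of each lookup.
        vec4 c0 = texture2D(img0, vTexture0.xy);
        vec4 c1 = texture2D(img0, vTexture0.zw);
        vec4 c2 = texture2D(img0, vTexture1.xy);

        // Placeholder combine step: pick out rgb only where it is used.
        gl_FragColor = vec4((c0.rgb + c1.rgb + c2.rgb) / 3.0, 1.0);
    }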





Regards,


Joe (2010-10-05 10:54:18)

Thanks. That is great information. I was able to eke out 1 more fps. While that may not seem like much, we are targeting 15 fps, so to me that is good progress. We are doing many passes with many filters. Our worst case is now about 12 fps.

Glad to hear it helped :)





If you have any other shader compiler related questions to help you further improve the performance of your code, feel free to start a new topic. I’ll try to get back sooner next time.