Need information about limits related to the OpenGL ES 3.2 implementation

Hello,

I would like some information about two limits that seem to be related to the OpenGL ES 3.2 implementation.

The first one concerns the glDispatchCompute function from OpenGL ES 3.2.
It seems that when a compute shader uses shared memory, I cannot dispatch more than 32x32x1 workgroups; beyond that, the number of workgroups actually executed no longer matches the number requested.
I would like more information about this behaviour, because I do not see such a limitation in the OpenGL ES specification, and the maximum workgroup count reported there is much higher than 32x32.
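To give a concrete idea, here is a stripped-down sketch of the kind of dispatch I mean (not my exact code; it assumes a current OpenGL ES 3.2 context, that "program" has been compiled and linked from the source string below, and that "counterSsbo" is an existing buffer object; the counter is just a minimal way to observe the mismatch):

/* Sketch only: error checking omitted. */
#include <GLES3/gl32.h>
#include <stdio.h>

static const char *kComputeSrc =
    "#version 320 es\n"
    "precision highp float;\n"
    "layout(local_size_x = 8, local_size_y = 8) in;\n"
    "shared float tile[8][8];\n"                /* using shared memory is what seems to trigger the limit */
    "layout(std430, binding = 0) buffer Counter { uint executed; };\n"
    "void main() {\n"
    "    tile[gl_LocalInvocationID.y][gl_LocalInvocationID.x] = 1.0;\n"
    "    barrier();\n"
    "    if (gl_LocalInvocationIndex == 0u)\n"
    "        atomicAdd(executed, 1u);\n"        /* one increment per workgroup that actually ran */
    "}\n";

void count_executed_workgroups(GLuint program, GLuint counterSsbo, GLuint nx, GLuint ny)
{
    GLuint zero = 0;
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, counterSsbo);
    glBufferData(GL_SHADER_STORAGE_BUFFER, sizeof(GLuint), &zero, GL_DYNAMIC_READ);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, counterSsbo);

    glUseProgram(program);
    glDispatchCompute(nx, ny, 1);               /* e.g. 64 x 64 x 1 */
    glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT);   /* make shader writes visible before mapping */

    GLuint *count = (GLuint *)glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0,
                                               sizeof(GLuint), GL_MAP_READ_BIT);
    printf("requested %u workgroups, shader counted %u\n", nx * ny, *count);
    glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
}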

Also, it seems that compute shaders that run for a long time are terminated before completing their computation.
I would like more information about this behaviour as well, for example what the maximum allowed execution time is, or under what conditions a thread/workgroup is terminated before completion.

Best regards,
Antoine

Hi Antoine,

Thank you for your message. To help you further, could I please get some information on which PVR device you are using, along with the DDK version and Android version? I will then get back to you with the information you need.

Kind regards,

Stephen

I’m not sure where I can obtain the DDK version, but I can tell you that I am working on a Rogue GE8430 with full OpenGL ES 3.2, build 1.10@5371573.
The program runs under Linux 5.4.74 (aarch64).

Best regards,
Antoine

Hi Antoine,

Thanks for the information. The DDK version I was referring to is the number after the @ symbol in the build string, so that is fine.

Regarding your issue with workgroups, from speaking with the DDK team this is not an issue they have seen before. I would suggest using the OpenGL ES query API to query all of the compute limits, such as the maximum workgroup count (a short example follows the commands below). Please also check the debug dump for any relevant information, using:

cat /sys/kernel/debug/pvr/debug_dump

or

debug_dump -d
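As a rough example, the compute limits can be queried like this (a minimal sketch, assuming a current OpenGL ES 3.1 or later context; the function name print_compute_limits is just for illustration):

#include <GLES3/gl32.h>
#include <stdio.h>

/* Print the compute-related limits reported by the driver. */
void print_compute_limits(void)
{
    GLint count[3], size[3], invocations, sharedBytes;

    /* Maximum number of workgroups per dispatch, per dimension. */
    for (int i = 0; i < 3; ++i)
        glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_COUNT, i, &count[i]);

    /* Maximum local workgroup size, per dimension. */
    for (int i = 0; i < 3; ++i)
        glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_SIZE, i, &size[i]);

    /* Maximum invocations per workgroup and shared memory per workgroup. */
    glGetIntegerv(GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS, &invocations);
    glGetIntegerv(GL_MAX_COMPUTE_SHARED_MEMORY_SIZE, &sharedBytes);

    printf("max workgroup count: %d x %d x %d\n", count[0], count[1], count[2]);
    printf("max workgroup size : %d x %d x %d\n", size[0], size[1], size[2]);
    printf("max invocations    : %d\n", invocations);
    printf("max shared memory  : %d bytes\n", sharedBytes);
}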

Regarding the early termination of compute shaders: on Linux there is a watchdog daemon that detects when a program hangs for too long. It does this by periodically writing to the watchdog device (/dev/watchdog) at an interval defined by the watchdog registers. If possible, see if you can install a program called “wdt”, which lets you display and set the watchdog timer parameters. You may also need to install “watchdog”.
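For reference, the watchdog parameters can also be read directly through the standard Linux watchdog ioctl interface. This is only a rough sketch: the device is single-open, so the open() will fail if a watchdog daemon already holds it, and it assumes the driver supports the “magic close” feature:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/watchdog.h>

int main(void)
{
    /* Opening the device arms the watchdog, so only do this if no daemon owns it. */
    int fd = open("/dev/watchdog", O_RDWR);
    if (fd < 0) {
        perror("open /dev/watchdog");
        return 1;
    }

    int timeout = 0, timeleft = 0;
    if (ioctl(fd, WDIOC_GETTIMEOUT, &timeout) == 0)
        printf("watchdog timeout : %d s\n", timeout);
    if (ioctl(fd, WDIOC_GETTIMELEFT, &timeleft) == 0)
        printf("time left        : %d s\n", timeleft);

    /* "Magic close": tell the driver we intend to stop the watchdog cleanly. */
    write(fd, "V", 1);
    close(fd);
    return 0;
}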

Kind regards,

Stephen

Hello Stephen,

For now, regarding the early termination of compute shaders: the idea is that I am trying to do as much (useless) computation as possible in each thread, with few workgroups, so as to get an empirical estimate of raw peak Gflops.

Put simply, I am multiplying mat4 variables in a for loop with a large number of iterations. It works well up to a certain number of iterations, yet even then the overall computation time is still less than a second. Since I have already run much longer, more compute-intensive matrix multiplications on the CPU side and my process completed those without problems, I am not convinced the watchdog is relevant here.
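The shader is essentially of the following shape (simplified; the exact local size, buffer layout and iteration count differ in my real code):

/* Simplified version of the benchmark kernel, shown as the GLSL source string
 * passed to glShaderSource(); ITERATIONS is set much higher in the runs that fail. */
static const char *kFlopsKernel =
    "#version 320 es\n"
    "precision highp float;\n"
    "layout(local_size_x = 32) in;\n"
    "layout(std430, binding = 0) buffer Out { vec4 result[]; };\n"
    "const int ITERATIONS = 1000000;\n"
    "void main() {\n"
    "    mat4 a = mat4(1.000001);\n"
    "    mat4 b = mat4(0.999999);\n"
    "    for (int i = 0; i < ITERATIONS; ++i)\n"
    "        a = a * b;\n"                          /* the (useless) work being measured */
    "    result[gl_GlobalInvocationID.x] = a[0];\n" /* depends on 'a' so the loop is not optimised away */
    "}\n";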

My guess is that there is some absolute limit on the number of cycles a single thread can execute, defined by the driver, and I would like to know this value, or at least where to obtain it.

With the help of the debug_dump binary, I noticed that the runs that give me wrong results (an inflated estimate, due to early termination and therefore an incorrect Gflops figure) show up as:

Recovery 1: PID = 1358, frame = 0, HWRTData = 0x00000000, EventStatus = 0x00000400, Guilty Overrun
              CRTimer = 0x00003873E1BA, OSTimer = 332.622666312, CyclesElapsed = 1003083264
              PreResetTimeInCycles = 51456, HWResetTimeInCycles = 28160, TotalRecoveryTimeInCycles = 79616

Here is the full debug_dump log:
debug_dump_log.txt (11.9 KB)

So there, I launched my program twice: once with a reasonable amount of computation (PID 1357), which does not appear in this debug log, and once with a much higher number of iterations (PID 1358). This “Guilty Overrun” entry makes me think the driver is responsible for the early termination of the compute shaders, and that is what I would like more information about.

To be clear, I think it makes sense that a GPU program must not be able to stall the GPU, and that it therefore has a limit on its execution time or total cycle count; I simply want to be fully aware of what that limit is.

Best regards,
Antoine

Hi Antoine,

Thank you for the debug dump. The “Guilty Overrun” you see happens when a shader runs for too long and we kill it. The worst-case overrun deadline we know of is 5 seconds, but termination can happen sooner than that.

You may be issuing a very time-expensive glDispatchCompute. On Windows 10, such a dispatch causes the whole GPU to be reset and the GPU driver to be restarted when the GPU takes more than a few seconds to finish its work. Our hardware has fewer Gflops, so the number of operations it can complete within 5 seconds is lower. Instead of rebooting the driver, we seamlessly kill the workload and resume.

If you want to get around this (on our GPU or any other), call glDispatchCompute with fewer workgroups at a time. One way to do this is to use glUniform2ui to set an offset that is added to gl_WorkGroupID.xy, updating it before each dispatch, as in the sketch below.
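As an illustration only, the dispatch loop could look something like this (the uniform name uGroupOffset and the tile sizes are placeholders):

#include <GLES3/gl32.h>

/* Shader side (for reference), the kernel would combine the offset with the
 * built-in workgroup ID, e.g.
 *     uniform uvec2 uGroupOffset;
 *     ...
 *     uvec2 group = gl_WorkGroupID.xy + uGroupOffset;
 */

/* Split one large totalX x totalY dispatch into smaller tiles so that no
 * single glDispatchCompute runs long enough to trigger the recovery path. */
void dispatch_tiled(GLuint program, GLuint totalX, GLuint totalY,
                    GLuint tileX, GLuint tileY)
{
    glUseProgram(program);
    GLint offsetLoc = glGetUniformLocation(program, "uGroupOffset");

    for (GLuint y = 0; y < totalY; y += tileY) {
        for (GLuint x = 0; x < totalX; x += tileX) {
            GLuint nx = (totalX - x < tileX) ? (totalX - x) : tileX;
            GLuint ny = (totalY - y < tileY) ? (totalY - y) : tileY;

            glUniform2ui(offsetLoc, x, y);   /* offset applied to gl_WorkGroupID.xy */
            glDispatchCompute(nx, ny, 1);
        }
    }
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
}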

I hope this helps.

Kind regards,

Stephen