Most Optimized glReadPixels readback

Hi,





I’m working with one of the more popular SGX 535 devices, trying to read back from the OpenGL ES render target.





Unfortunately, I can’t seem to hit acceptable performance (~3/4 ms for a 480x320 RGBA8 image), no matter what tricks I try. Does anyone have an explicit example?





From instrumentation, it seems that most of the time is spent in a memcpy inside glgProcessPixelsWithProcessor. At this point, even an empty render loop that issues no GL commands after a flush still incurs high utilization. I’ve been told to try a cycled buffer, but I can’t get an implementation that comes close to my original performance.
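(For context, a “cycled buffer” here would be something along the lines of the sketch below, assuming OpenGL ES 2.0 on iOS: two offscreen FBOs are alternated each frame, so the readback always targets the FBO finished on the previous frame. All names and sizes are placeholders, and the actual draw calls are elided.)

```c
/* Sketch of a ping-pong ("cycled") readback, assuming OpenGL ES 2.0 on iOS.
 * Frame N renders into fbo[N % 2] while the CPU reads back the other FBO,
 * i.e. the frame submitted previously, hoping to hide the glReadPixels
 * stall behind one frame of latency. Names and sizes are placeholders. */
#include <OpenGLES/ES2/gl.h>

#define W 480
#define H 320

static GLuint  fbo[2], tex[2];
static GLubyte pixels[2][W * H * 4];   /* CPU-side destination buffers */
static int     frame = 0;

void init_fbos(void)
{
    glGenFramebuffers(2, fbo);
    glGenTextures(2, tex);
    for (int i = 0; i < 2; ++i) {
        glBindTexture(GL_TEXTURE_2D, tex[i]);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, W, H, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, NULL);
        glBindFramebuffer(GL_FRAMEBUFFER, fbo[i]);
        glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                               GL_TEXTURE_2D, tex[i], 0);
        /* A real implementation would check glCheckFramebufferStatus() here. */
    }
}

void render_and_read(void)
{
    int cur  = frame & 1;    /* FBO rendered this frame             */
    int prev = cur ^ 1;      /* FBO rendered last frame, read back  */

    glBindFramebuffer(GL_FRAMEBUFFER, fbo[cur]);
    /* ... issue draw calls into fbo[cur] ... */

    glBindFramebuffer(GL_FRAMEBUFFER, fbo[prev]);
    glReadPixels(0, 0, W, H, GL_RGBA, GL_UNSIGNED_BYTE, pixels[prev]);

    ++frame;
}
```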





Any explicit examples? Does anyone know what the expected performance should be?





Any help appreciated.

Just out of curiosity: on which platform did you measure your “original performance”?
Also: my understanding of “glFlush” is that it only ensures all GL commands have been issued, but doesn’t wait for them to complete. (For that you would need “glFinish”.) Maybe this explains your observation.

Sorry, by flush I meant finish. I tested both, to make sure commands were issued/completed.





The platform is iPhone 4. I can’t get anything close to acceptable performance from glReadPixels. I’ve tried as many variations on the theme as I can think of; nothing breaks 15 ms, and I could only get that by dividing a single glReadPixels into four smaller ones. That is a sad commentary on the state of the implementation.

BTW, it’s a 480x320 BGRA8 read.

Hi piezas,





The first thing to note is that the iPhone is VSync-limited to 60 fps by Apple’s drivers, so you’ll never get a frame time below ~16.67 ms because of this.





The other thing to note is that doing glReadPixels is generally discouraged as it stalls the hardware. From our performance recommendations:


"Any access to the framebuffer will cause the driver to flush queued rendering commands and wait for rendering to finish, removing all parallelism between the CPU and the different modules in the graphics core."


I suggest reading over our OGLES2 Application Programming Recommendations, which can be found in the SDK or online HERE, as it has several other good tips on keeping your frame rates high :)





As our recommendations suggest, you should favour framebuffer objects (FBOs) over glReadPixels. A good example of this can be seen in our Render to Texture training course in the SDK package.
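(For illustration only, not the training course code itself: the basic render-to-texture pattern keeps the intermediate image on the GPU and samples it in a later pass, roughly as sketched below, assuming OpenGL ES 2.0. drawScene and drawFullscreenQuad are hypothetical helpers, and the FBO/texture are assumed to be created as in the earlier sketch.)

```c
/* Sketch: keep the intermediate image on the GPU instead of reading it back.
 * Pass 1 renders the scene into an FBO-attached texture; pass 2 samples that
 * texture while drawing to the on-screen framebuffer. */
#include <OpenGLES/ES2/gl.h>

void drawScene(void);           /* hypothetical: draws the offscreen content */
void drawFullscreenQuad(void);  /* hypothetical: draws a textured quad       */

void render_frame(GLuint fbo, GLuint tex, GLuint postProcessProgram)
{
    /* Pass 1: offscreen render into the texture attached to 'fbo'. */
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glViewport(0, 0, 480, 320);
    glClear(GL_COLOR_BUFFER_BIT);
    drawScene();

    /* Pass 2: use the result as a texture; no CPU readback involved.
     * (On iOS, the "default" framebuffer is the CAEAGLLayer-backed FBO,
     * not object 0; bind whichever framebuffer presents to screen.) */
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
    glUseProgram(postProcessProgram);
    glActiveTexture(GL_TEXTURE0);
    glBindTexture(GL_TEXTURE_2D, tex);
    drawFullscreenQuad();
}
```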


Is there a specific effect you’re trying to achieve? Perhaps we can help recommend a different approach?





Thanks,





Tobias

Thanks, Tobias.





Unfortunately, since I need to do further processing on the CPU (things that currently can’t be done without an OpenCL-type API), I have to perform the readback. So I guess I have to take the hit.





Is that 16 ms performance floor enforced for activities that don’t require the screen renderbuffer? For instance, if I am rendering to an FBO (possibly buffer-cycled), can I read out from a texture not currently being drawn to more frequently than 60 Hz?





BTW, on the OpenCL path: I’d love to start getting into OpenCL development with PowerVR, but I don’t know where that path begins (with imgtec, a consumer device vendor, or a dev device integrator). Can we discuss further? Offline, if necessary?

Hi piezas,





You’ve certainly got an interesting problem! We believe that offscreen renders to an FBO should be fine; doing multiple offscreen renders should allow you to bypass the iPhone’s frame rate limit.





However, the number of glReadPixels calls you’ll be doing could mean there is little benefit in doing this, as you will be stalling the hardware so frequently. It might be possible to do some or all of your work in a fragment shader rather than reading the data back with glReadPixels; we can offer some advice on this if you wish to discuss it further.
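(As a purely illustrative example of per-pixel work that can stay on the GPU, a luminance conversion written as a GLSL ES fragment shader could look like the snippet below, stored as a C string for glShaderSource(); the uniform and varying names are placeholders.)

```c
/* Hypothetical GLSL ES fragment shader doing per-pixel work on the GPU. */
static const char *luma_fs =
    "precision mediump float;                                \n"
    "varying vec2 v_texCoord;                                \n"
    "uniform sampler2D u_image;                              \n"
    "void main()                                             \n"
    "{                                                       \n"
    "    vec4 c = texture2D(u_image, v_texCoord);            \n"
    "    float luma = dot(c.rgb, vec3(0.299, 0.587, 0.114)); \n"
    "    gl_FragColor = vec4(vec3(luma), 1.0);               \n"
    "}                                                       \n";
```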





While our SGX hardware is capable of supporting OpenCL, there are currently no platforms that implement the OpenCL embedded profile. Once devices that support it enter the market we will support it in the SDK, but we can’t say when this may be as it depends wholly on our vendors.





If you do wish to explore other options, either on here or via email, I’ll be happy to answer any other questions you have!

Tobias

1. Piezas, I think you need to explain at a higher level what you are actually trying to achieve.





2. glFinish/glFlush: I don’t think you need to call these.





3. glReadPixels implies a CPU stall until the render has finished. Therefore, if you have weighty CPU tasks to carry out, you should go multithreaded: split your app into a GL/rendering thread and a work thread (see the sketch below). That way the work thread can carry on working while the render thread is stalled in glReadPixels.
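(A minimal sketch of that split, assuming POSIX threads and OpenGL ES 2.0: the GL thread does the glReadPixels and hands the buffer to a worker thread, so the heavy CPU work overlaps the next frame’s rendering. Buffer names and the processing step are placeholders, and the hand-off assumes the worker keeps up with the frame rate; a real implementation would add back-pressure or a third buffer.)

```c
/* Sketch of a GL-thread / work-thread split using POSIX threads.
 * Two buffers are swapped at hand-off so the worker never touches the
 * buffer glReadPixels is currently writing to. */
#include <pthread.h>
#include <OpenGLES/ES2/gl.h>

#define W 480
#define H 320

static GLubyte buf_a[W * H * 4], buf_b[W * H * 4];
static GLubyte *gl_buf  = buf_a;   /* written by the GL thread            */
static GLubyte *cpu_buf = buf_b;   /* handed to the worker for processing */
static int frame_ready = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

/* GL thread: call once per frame after rendering has been submitted. */
void readback_and_handoff(void)
{
    glReadPixels(0, 0, W, H, GL_RGBA, GL_UNSIGNED_BYTE, gl_buf);

    pthread_mutex_lock(&lock);
    GLubyte *tmp = gl_buf;          /* swap: worker gets the fresh frame */
    gl_buf  = cpu_buf;
    cpu_buf = tmp;
    frame_ready = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

/* Worker thread: heavy CPU processing runs here, off the GL thread. */
void *worker_main(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (!frame_ready)
            pthread_cond_wait(&cond, &lock);
        frame_ready = 0;
        GLubyte *frame = cpu_buf;
        pthread_mutex_unlock(&lock);

        /* process_frame(frame);  -- hypothetical CPU-side processing */
        (void)frame;
    }
    return NULL;
}
```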

I call them at the end of a frame to mark the end of GL use, and perform the readback at the beginning of the next frame’s draw. Since the frame rate is 15 fps and the utilization is 30%, I figured that should give things enough time.





It’s already multithreaded, or should be. This render block is called by the system automatically through DisplayLink.





The real disappointment is what I can’t access. I’ve found, among other things, that performing N calls of size/N is more efficient than one call of the full size. I understand why this might be, but it points to a pretty big lack of optimization in some of these functions (in particular glgProcessPixelsWithProcessor, which I believe is in Apple’s library).
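(Concretely, the N-calls-of-size/N variant is just reading the target back in horizontal strips, roughly as sketched below. STRIPS is an arbitrary example value, and GL_RGBA/GL_UNSIGNED_BYTE is assumed to be an accepted read format for the target.)

```c
/* Sketch: read a 480x320 RGBA framebuffer back in N horizontal strips
 * instead of one full-size glReadPixels call. */
#include <stddef.h>
#include <OpenGLES/ES2/gl.h>

#define W      480
#define H      320
#define STRIPS 4

void read_in_strips(GLubyte *dst /* W * H * 4 bytes */)
{
    const int strip_h = H / STRIPS;   /* assumes H divides evenly by STRIPS */
    for (int i = 0; i < STRIPS; ++i) {
        glReadPixels(0, i * strip_h, W, strip_h,
                     GL_RGBA, GL_UNSIGNED_BYTE,
                     dst + (size_t)i * strip_h * W * 4);
    }
}
```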