What's the best way to micro-benchmark my render on PowerVR hardware?

I’d like to accurately measure the cost of my render so I can figure out where I should focus my optimisation work. Do you have any tips for creating a micro-benchmark?

Benchmarking the performance of applications running on PowerVR GPUs isn’t as simple as collecting timestamps at various points in your render. The reason for this is that the graphics driver submits work to the GPU independently of the API calls that an application has made, i.e. very few graphics API calls are blocking operations. Additionally, PowerVR GPUs will try to process as much vertex and fragment work in parallel as possible to keep idle time to a minimum and make optimal use of the available resources. This means that to accurately measure performance you will need to render a number of frames and calculate the average frame time to understand the true cost of your render.
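To illustrate why this matters, consider the naive approach below (a sketch; now_ms() is a hypothetical wrapper around your platform's high-resolution clock). Because draw calls return as soon as the driver has queued the work, the interval measured this way mostly reflects CPU-side submission cost, not the GPU's rendering time:

    /* Naive timing: this measures how long the driver takes to queue the
     * work on the CPU, NOT how long the GPU takes to render it.
     * now_ms() is a hypothetical high-resolution clock wrapper. */
    double start = now_ms();
    glDrawArrays(GL_TRIANGLES, 0, vertexCount); /* returns almost immediately */
    double elapsed = now_ms() - start;          /* mostly CPU submission cost */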







Because of this behaviour, it’s best to write micro-benchmarks to accurately measure performance. The aim of a micro-benchmark is to understand the cost of rendering a static frame. Writing a benchmark for a dynamic scene (e.g. a fly-through mode in a game) is beyond the scope of this guide.







The following sections describe a simple, generic micro-benchmark. For the sake of simplicity, it assumes that the benchmark is written using the OpenGL ES 2.0 graphics API.





NOTE: This benchmark guide makes use of glReadPixels() to force renders to complete. This is a very expensive operation, as it removes all parallelism between the CPU and GPU. Please only use glReadPixels() when it’s absolutely necessary.



If you are not already familiar with the PowerVR GPU architecture, you should check out our PowerVR Series 5 Architecture Guide for Developers document.


Platform setup


Before you begin benchmarking your application, you need to ensure that your target platform is set up appropriately.

Disabling V-Sync


V-Sync is a feature enabled on most platforms that synchronises the display’s refresh rate with the GPU’s frame rate to avoid tearing (an artefact caused by the GPU updating a surface the display is still reading from). As V-Sync limits the number of frames that the GPU will process, it prevents you from accurately calculating the cost of your render. You have two options:



1. Disable V-Sync: If possible, you should disable V-Sync on your platform. This will remove the limit and allow the GPU to render frames as fast as possible (see the sketch after this list)

2. Rendering off-screen: If you cannot disable V-Sync on your platform, you should repeatedly render to off-screen surfaces (e.g. OpenGL ES FBOs) to keep the GPU busy. Rendering off-screen is beyond the scope of this guide
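On EGL-based platforms, a swap interval of zero requests unsynchronised buffer swaps. The following is a minimal sketch, assuming display is your initialised EGLDisplay; note that some drivers clamp the interval and will ignore the request:

    #include <EGL/egl.h>

    /* Request that eglSwapBuffers() no longer waits for V-Sync.
     * Not all platforms honour an interval of 0. */
    if (eglSwapInterval(display, 0) != EGL_TRUE)
    {
        /* V-Sync could not be disabled; fall back to off-screen rendering. */
    }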

Ensure no other application is using the GPU


When benchmarking, you must ensure that the GPU is only processing work submitted by your application. If you’re unsure which processes are utilising the GPU, you can use PVRTune to profile the GPU and identify the processes that are submitting work to it.




If you cannot disable other processes that are using the GPU but they have a fixed cost (for example, the SurfaceFlinger compositor on some Android devices), you can factor this cost into your calculations and still run your benchmark on the device. Keep in mind that even with a fixed cost, your benchmark will be less accurate when other applications are using the GPU’s resources.
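As a purely illustrative example: if the compositor is known to add a fixed 2 ms to every frame and your benchmark measures an average frame time of 12 ms, the true cost of your render is roughly 12 - 2 = 10 ms per frame.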



You should not run your benchmark on the device if other processes using the GPU have a varying cost, as this will severely impact the accuracy of your tests.

What Should I Be Benchmarking?


Static scenes


A micro-benchmark should render a static scene so that the results are well defined. There should be no dynamic parts to the render. Additionally, the graphics API calls made in each frame should be consistent. Ideally, the benchmark should render identical frames over and over again to understand their average cost.

Asset Warm-Up


When writing a benchmark, the first thing to remember is that drivers don't have to upload textures or compile shaders at the point that they are submitted to the graphics API. The graphics driver may, instead, defer this work until the first time that the resource is referenced by a draw call (this allows the driver to avoid uploading redundant resources that are never actually used in the render).




An asset warm-up phase allows you to force the driver to upload the resources that you will be using in your micro-benchmark.


How can I make the driver do that?


As the driver will upload and compile assets at the point that they are first used, the easiest way to force these operations is to do the following (a sketch of this warm-up loop follows the list):




1. Render your static scene a number of times (~10 frames should do)

2. Call glReadPixels() before the final eglSwapBuffers(). This will force the driver to complete all renders that have been submitted so far. Reading back a 1x1 region is sufficient, as you don’t need the returned data

a. NOTE: You only need to call glReadPixels() once here
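As a rough sketch, the warm-up phase could look like the following. renderStaticScene() is a hypothetical function that binds your resources and issues the scene's draw calls; display and surface are your EGL handles:

    #include <EGL/egl.h>
    #include <GLES2/gl2.h>

    #define WARM_UP_FRAMES 10

    for (int i = 0; i < WARM_UP_FRAMES; ++i)
    {
        renderStaticScene(); /* hypothetical: issues the scene's draw calls */

        if (i == WARM_UP_FRAMES - 1)
        {
            /* Force the driver to complete all renders submitted so far.
             * A 1x1 readback is enough; the pixel data is discarded. */
            unsigned char pixel[4];
            glReadPixels(0, 0, 1, 1, GL_RGBA, GL_UNSIGNED_BYTE, pixel);
        }

        eglSwapBuffers(display, surface);
    }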

Benchmarking the scene


Now that the driver has warmed up the required assets and we’re happy with the platform’s setup, it’s time to start benchmarking!




To get an accurate measure of the cost of your render, you should send a large number of frames to the GPU between your first timestamp and your last (the more frames, the better!). Processing a large number of frames allows the GPU to keep multiple frames in flight at a time (as it would in a standard, well-written application) and it also reduces the impact of any setup and shutdown costs caused by the benchmark. Here’s an overview of how this should be implemented (a sketch putting the steps together follows the list):




1. Collect a timestamp after the final asset warm-up frame has completed (eglSwapBuffers() has returned)

2. Render your static scene a large number of times (at least 10 seconds’ worth of rendering)

3. Call glReadPixels() before the last eglSwapBuffers() to force the render to complete.

a. NOTE: You only need to call glReadPixels() at the end of the benchmark. Do not call this every frame, as it will severely impact the benchmark performance!

4. Collect a timestamp after glReadPixels() returns

5. Calculate the average frame time (the elapsed time divided by the number of frames that were rendered)
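Putting the steps above together, a minimal sketch might look like this (reusing the hypothetical now_ms(), renderStaticScene(), display and surface from the earlier sketches; pick BENCHMARK_FRAMES so that the loop covers at least 10 seconds of rendering on your target):

    #define BENCHMARK_FRAMES 1000 /* tune so the loop runs for >= 10 seconds */

    /* 1. Timestamp after the final warm-up eglSwapBuffers() has returned. */
    double start = now_ms();
    double end = start;

    for (int i = 0; i < BENCHMARK_FRAMES; ++i)
    {
        /* 2. Render the same static scene every frame. */
        renderStaticScene();

        if (i == BENCHMARK_FRAMES - 1)
        {
            /* 3. Force the outstanding renders to complete (final frame only!). */
            unsigned char pixel[4];
            glReadPixels(0, 0, 1, 1, GL_RGBA, GL_UNSIGNED_BYTE, pixel);

            /* 4. Timestamp once glReadPixels() has returned. */
            end = now_ms();
        }

        eglSwapBuffers(display, surface);
    }

    /* 5. Average frame time in milliseconds. */
    double averageFrameTimeMs = (end - start) / BENCHMARK_FRAMES;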





…and you’re done!

My benchmark is more complex than this. What should I do?


Post on our forum! :)




We're more than happy to help anyone who would like to accurately measure the performance of their graphics rendering on PowerVR hardware. If you would rather discuss your benchmark privately, you can email devtech@imgtec.com.