Understanding PBO texture uploads

Hello @Joe-2 , I was wondering how the GPU manages texture uploads while using PBO.
You said that
[blockquote]We expect PBOs would further improve performance as the driver would not need to perform a memcpy from application memory to driver memory.[/blockquote]
Does it mean that GPU will be accessing that mapped memory directly while rendering? And GPU will not need to load this into driver’s memory?

It would be very helpful if you could answer the following questions:

  1. When we call glTexImage2D or glTexSubImage2D, doesn’t DMA handle the operation?
  2. When using a PBO, after memcpy’ing the texture data into the mapped address, doesn’t the same DMA handle the operation?
  3. If so, then how can PBO be faster? Is it because of that memcpy handled by CPU instead of GPU?
  4. But if the same operations happen after that memcpy, wouldn’t the overall cost be the same?
  5. Which of the following is the way that GPU handles texture uploads?

[pre]1. WITH_PBO
   WORKER : |CPU|GPU|
   PBO : |memcpy|DMA_|

2. NO_PBO
   WORKER : |GPU|
   NO PBO : |memcpy|DMA_|

3. NO_PBO_2
   WORKER : |_CPU|GPU|
   NO PBO : |memcpy|_UPLOAD|[/pre]

Thanks

Hi Zulkarnine,

Apologies for the delayed response here. Since my last post, I have found out that I had been misinformed about how PBOs work. I’m discussing the performance implications of using and not using PBOs with the driver team now. I should be able to answer your questions later this week.

Thank you @Joe-2 ,
Looking forward to your reply.

Hello @Joe-2 , Is there any update regarding the performance implications of using/not using PBO for texture uploads?

Hi Zulkarnine,

I’ve continued investigating the issue today, but have encountered some strange behaviour in our driver where TexStorage doesn’t work with PBO uploads. I’m discussing this with the driver team now - it’s possible I made a mistake in my test application. I should be able to give you an update this afternoon.

Hello Joe,
Thank you for your reply.

I am also having some difficulty using PBOs on the target (not the PVRFrame emulator) with both RGBA and PVRTC format textures. (Maybe I am doing something wrong too.) Strangely, the sample programs work fine on the PVRFrame emulator.

That’s why I wanted to check first whether using PBOs to upload textures is really worth the extra hassle.

Looking forward to your reply.

Here’s an overview of the driver’s standard texture upload process (without PBO or glTexStorage*):

Application –> driver texture copy

  • The modifying function call blocks (e.g. glTexSubImage2D)
  • Validation is performed for the supplied parameters. If the combination is invalid, an error will be returned (check out the glTexSubImage2D reference page for more information)
  • If the parameters are valid, the driver allocates a temporary buffer of the requested resolution and format
  • The driver memcpy’s the application-supplied buffer to the temporary buffer on the CPU
  • The driver then returns from the function so the application can continue submitting GL calls

Graphics driver –> GPU texture copy

  • For most texture formats, the GPU will perform the copy via DMA. In PVRTune these tasks are shown as Transfer Tasks. In some cases a CPU copy will be used as a fallback
  • The data is twiddled during the transfer. This improves texture cache efficiency when the data is sampled by draws
  • Once the copy is complete, the driver can free the temporary buffer

Note: The process explained above is simplified. To reduce the number of memory allocations and frees performed, the driver may keep temporary buffers around for some time in case they can be reused.
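To make the "twiddled" step more concrete, here is a minimal sketch of a linear-to-twiddled copy. It assumes a Morton (Z-order) layout on a square, power-of-two, 32-bit-per-texel image. The thread does not specify the layout the PowerVR hardware actually uses, so treat this as an illustration of the idea only. The function names are hypothetical and this is not driver code.

```c
#include <stddef.h>
#include <stdint.h>

/* Spread the low 16 bits of v so that a zero bit sits between each pair. */
static uint32_t spread_bits16(uint32_t v)
{
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

/* Morton (Z-order) index of texel (x, y): x bits in even positions,
 * y bits in odd positions. ASSUMPTION: stand-in for the real layout. */
uint32_t morton_index(uint32_t x, uint32_t y)
{
    return spread_bits16(x) | (spread_bits16(y) << 1);
}

/* Copy a size x size image (size is a power of two, one uint32_t per
 * texel) from row-major order into Morton order. */
void twiddle_rgba8(const uint32_t *linear, uint32_t *twiddled, uint32_t size)
{
    for (uint32_t y = 0; y < size; ++y) {
        for (uint32_t x = 0; x < size; ++x) {
            twiddled[morton_index(x, y)] = linear[(size_t)y * size + x];
        }
    }
}
```

With this layout, the four texels of any aligned 2x2 block end up next to each other in memory, which is the kind of locality that helps a texture cache when neighbouring texels are sampled together.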

glTexStorage*
With glTexStorage2D and glTexStorage3D, the driver can perform validation of all levels of a texture up front. Unlike standard (mutable) textures where the GPU DMA copy is deferred until the first draw using the texture, immutable format textures also kick asynchronous GPU texture uploads when glTexSubImage2D() is called. This means that - with immutable format textures - you don’t have to perform texture warm ups to avoid run-time stutters.

[blockquote]Does it mean that GPU will be accessing that mapped memory directly while rendering? And GPU will not need to load this into driver’s memory?
[/blockquote]
The GPU’s texture is stored in a twiddled layout to improve texture cache efficiency. PBOs require the data to be stored linearly in the driver’s buffer. If the glMapBufferRange() GL_MAP_INVALIDATE_RANGE_BIT or GL_MAP_INVALIDATE_BUFFER_BIT flags are used, the driver will allocate a PBO buffer without preserving the contents of the GPU side texture. If neither flag is specified, the GPU will DMA the contents of its texture to the PBO.

[blockquote] 3. […] how can PBO be faster? Is it because of that memcpy handled by CPU instead of GPU?[/blockquote]
In the case of uploading texture data from a file, PBO uploads are only faster than standard glTexSubImage2D() uploads if the data is loaded directly from the file into the PBO. If the file is loaded into application memory before it is memcpy’d to the PBO buffer, as many memcpy’s are performed as with a glTexSubImage2D() upload.

Standard texture upload
File–>Application buffer–>Driver buffer (e.g. copy performed by glTexSubImage2D)
Optimized PBO texture upload
File–>PBO driver buffer

(As mentioned above, there will be an asynchronous DMA from the driver’s temporary buffer to GPU memory sometime after the application’s buffer update completes)
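The difference between the two paths can be sketched in plain C. This is a rough model under stated assumptions rather than real upload code: `driver_buf` and `mapped_pbo` are ordinary pointers standing in for the driver's temporary buffer and for the pointer that glMapBufferRange() would return, and both function names are hypothetical. Each function returns how many times the texel data is written on the CPU, or -1 on failure.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Standard path: file -> application buffer -> driver buffer.
 * The memcpy models the copy glTexSubImage2D() performs internally. */
int upload_via_app_buffer(FILE *file, void *driver_buf, size_t size)
{
    void *app_buf = malloc(size);
    if (app_buf == NULL) {
        return -1;
    }
    if (fread(app_buf, 1, size, file) != size) {   /* CPU write #1 */
        free(app_buf);
        return -1;
    }
    memcpy(driver_buf, app_buf, size);             /* CPU write #2 */
    free(app_buf);
    return 2;
}

/* Optimized PBO path: file -> mapped PBO memory.
 * ASSUMPTION: in a real application mapped_pbo would come from
 * glMapBufferRange() on a buffer bound to GL_PIXEL_UNPACK_BUFFER. */
int upload_direct(FILE *file, void *mapped_pbo, size_t size)
{
    if (fread(mapped_pbo, 1, size, file) != size) { /* CPU write #1 */
        return -1;
    }
    return 1;
}
```

In both cases the later DMA from the driver-side buffer to GPU memory is unchanged. The saving in the second function is only the intermediate application buffer and its memcpy.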

[blockquote]it’s possible I made a mistake in my test application[/blockquote]
Turns out I had made a mistake in my test, which I’ve now fixed. You can find my modified version of the SDK’s Texturing demo (PBO upload + TexStorage) in the commit history of the JoeDavisIMG/Native_SDK repository on GitHub.

Hope this helps :)

Joe

Thank you Joe for the detailed explanation.

That helped a lot to understand how the texture uploads are handled by the driver. Also, your example code helped me fix the problem in my code. It appears that I was calling glTexSubImage2D() too early and also using the wrong enum for the internal format.

After that, I carried out some tests with RGBA texture uploads. Even when using
File–>Application buffer–>Driver buffer
for both the PBO and non-PBO uploads, the PBO path was around 15-20% faster than a plain glTexSubImage2D() upload on the actual target.

One question though: can we not use glTexImage2D() instead of glTexSubImage2D(), and glCompressedTexImage2D() instead of glCompressedTexSubImage2D()?
I ran into a problem when using glCompressedTexImage2D() instead of glCompressedTexSubImage2D().

Looking forward to your reply.
Thanks,
Zulkarnine

Hi @Joe,
Could you check if the problem that I mentioned in my last post regarding glTexImage2D and glTexSubImage2D can be solved?

What might be the reason for those 2 functions to act differently?

Looking forward to your reply.

Thanks,
Zulkarnine

Hi Zulkarnine,

When TexStorage is used, the SubImage functions must be used to upload new texel data. From section 3.8.4 of the OpenGL ES 3.0 spec:

[blockquote]After a successful call to any TexStorage* command, no further changes to
the dimensions or format of the texture object may be made. Other commands
may only alter the texel values and texture parameters. Using any of the following
commands with the same texture will result in an INVALID_OPERATION error
being generated, even if it does not affect the dimensions or format:
• TexImage*
• CompressedTexImage*
• CopyTexImage*
• TexStorage*
[/blockquote]
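The rule quoted above can be illustrated with a toy state model. This is only a sketch of the behaviour the spec text describes, not GL code: the `ModelTexture` type and the `model_*` functions are hypothetical, and the error constants simply reuse the numeric values of GL_NO_ERROR, GL_INVALID_VALUE and GL_INVALID_OPERATION.

```c
#include <stdbool.h>

/* Same numeric values as the GL error enums of the same name. */
enum {
    MODEL_NO_ERROR          = 0,
    MODEL_INVALID_VALUE     = 0x0501,
    MODEL_INVALID_OPERATION = 0x0502
};

typedef struct {
    bool immutable;      /* set by a successful TexStorage*-style call */
    int  width, height;
} ModelTexture;

/* Models TexStorage*: fixes size and format once, then locks them. */
int model_tex_storage(ModelTexture *t, int w, int h)
{
    if (t->immutable) {
        return MODEL_INVALID_OPERATION;   /* TexStorage* is on the list too */
    }
    t->immutable = true;
    t->width = w;
    t->height = h;
    return MODEL_NO_ERROR;
}

/* Models TexImage*: rejected on immutable textures even when the
 * requested dimensions match the existing ones. */
int model_tex_image(ModelTexture *t, int w, int h)
{
    if (t->immutable) {
        return MODEL_INVALID_OPERATION;
    }
    t->width = w;
    t->height = h;
    return MODEL_NO_ERROR;
}

/* Models TexSubImage*: only alters texel values, so it stays legal
 * as long as the region lies inside the texture. */
int model_tex_sub_image(const ModelTexture *t, int x, int y, int w, int h)
{
    if (x < 0 || y < 0 || w < 0 || h < 0 ||
        x + w > t->width || y + h > t->height) {
        return MODEL_INVALID_VALUE;
    }
    return MODEL_NO_ERROR;
}
```

In this model, a texture that has gone through `model_tex_storage` accepts `model_tex_sub_image` but rejects `model_tex_image`, which matches the symptom described earlier in the thread when glCompressedTexImage2D() was used in place of glCompressedTexSubImage2D().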

Thank you Joe for the clarification.