Fast upload of textures on OMAP3

A slightly different issue to my earlier thread…





Is there some faster way to blit to SGX texture memory on OMAP3 than glTexImage2D? Since the GPU memory is shared this seems like a good candidate for a platform specific extension. If not, is it expected that glTexImage2D should be extremely slow? Uploading a 256x256 RGBA (native) texture at 60fps is using almost all the CPU bandwidth. My application is all about dynamic textures. I guess folks doing video must have run into this?





Thanks





Rob





I instrumented my application a little to confirm where the huge dynamic texture overhead is and got some unexpected results. First, I confirmed that my frame update loop runs much too slowly when uploading a new texture every frame:





Texture Size   Frame Update Time
128x128        3.78ms
256x256        17.5ms
512x512        38.2ms





So then I timed the call to glTexImage2D which uploads the texture:





Texture Size   Upload Time
128x128        0.4ms
256x256        1.8ms
512x512        10.5ms





Not fast, but nothing like as bad as I was expecting. I timed the code that generates the texture for each frame:





Texture Size   Render Time
128x128        0.18ms
256x256        0.7ms
512x512        2.7ms





No problem there. So then I timed the geometry plotting code, which plots a vertex array containing two triangles. All the calls were immeasurably fast except for glDrawArrays:





Texture Size   Draw Array Time
128x128        3.2ms
256x256        15ms
512x512        25ms





If I don’t upload a new texture this call is immeasurably fast.
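As a quick sanity check on the measurements above, the three timed stages can be summed and compared against the frame update time. This is only an illustrative sketch: the struct and function names are made up for the example, and the figures are copied straight from the tables.

```c
#include <stddef.h>

/* Timings in milliseconds, copied from the tables above. */
typedef struct {
    int    size;       /* texture is size x size texels */
    double frame_ms;   /* whole frame update */
    double upload_ms;  /* glTexImage2D call */
    double render_ms;  /* CPU-side generation of the texture data */
    double draw_ms;    /* glDrawArrays call */
} FrameTiming;

const FrameTiming timings[] = {
    {128,  3.78,  0.4, 0.18,  3.2},
    {256, 17.5,   1.8, 0.7,  15.0},
    {512, 38.2,  10.5, 2.7,  25.0},
};

/* Frame time that is NOT covered by the three measured stages. */
double unaccounted_ms(const FrameTiming *t)
{
    return t->frame_ms - (t->upload_ms + t->render_ms + t->draw_ms);
}

/* Fraction of the frame spent inside glDrawArrays. */
double draw_fraction(const FrameTiming *t)
{
    return t->draw_ms / t->frame_ms;
}
```

On these figures the three stages account for essentially the whole frame, and glDrawArrays alone is roughly 65% to 86% of it.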





I’d assumed that the big performance hit was uploading the texture, but it looks like it’s actually in plotting the geometry. Can anyone give me some guidance on what’s going on, and what I could do about it? I’d guess the texture upload has messed up the pipeline, but I didn’t expect to see such a massive hit proportional to the size of the texture. Possibly this is my ignorance of OGL-ES. I’m using the 1.1 vertex shader by the way.





One final thought. I’m forced to use an R5G6B5 framebuffer due to issues between the SGX driver and the latest omap DSS driver. The textures are RGBA - there’s no way to upload in R5G6B5 that I’m aware of. Could there be some issue with translation?





Any suggestions appreciated.





Cheers





Rob





OpenGL loads textures in a lazy fashion - that is, the actual loading of the texture to be used doesn’t occur until the texture is used. In this case it doesn’t happen until the call to glDrawArrays that you’ve measured. When you call this again with the same texture there is no load overhead and so the call is, as you say, immeasurably fast.





During the texture load the data is optimized for the GPU by the drivers, which is why your CPU bandwidth is being used up. Sadly, there is no way to bypass this through GL. It may be worth contacting Texas Instruments about an alternative path for this issue.





I’m not sure what you mean by there being no way to upload RGB565. There is code in our PVRTools that does just that - it’s just another texture format supported by GL. If you’re producing RGBA8888 data you will need to convert this to RGB565 first (I posted about this here: https://www.imgtec.com/forum/forum_posts.asp?TID=227). PVRTexTool will generate all of the OpenGL texture formats so you can use this to examine the quality of image you’ll be achieving on your development machine and see which formats are available. Using lower bandwidth texture formats is generally a performance win, but I’m not certain you will see a great improvement here, however.
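For illustration, a CPU-side RGBA8888 to RGB565 conversion could look roughly like the sketch below. This is not the PVRTools code; the function names are hypothetical, the source bytes are assumed to be in R, G, B, A order, and the low bits are simply truncated (no dithering or rounding).

```c
#include <stddef.h>
#include <stdint.h>

/* Pack one 8-bit-per-channel colour into RGB565 by dropping the low bits. */
uint16_t pack_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Convert pixel_count RGBA8888 pixels (bytes in R,G,B,A order, an assumption
 * of this sketch) into RGB565. The alpha channel is discarded. */
void rgba8888_to_rgb565(const uint8_t *src, uint16_t *dst, size_t pixel_count)
{
    for (size_t i = 0; i < pixel_count; ++i) {
        dst[i] = pack_rgb565(src[4 * i + 0], src[4 * i + 1], src[4 * i + 2]);
    }
}

/* The converted buffer would then be uploaded with format GL_RGB and
 * type GL_UNSIGNED_SHORT_5_6_5, e.g.
 *   glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, width, height, 0,
 *                GL_RGB, GL_UNSIGNED_SHORT_5_6_5, dst);
 */
```

Pure red (255, 0, 0) packs to 0xF800, pure green to 0x07E0 and pure blue to 0x001F.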





Hi Gordon, I wonder if you could answer a couple of questions for me…





We have a project which we are writing for the entire iPhone range. We have texture uploading pretty much about where we want it on the 3GS, and are getting there with the earlier iPhone / iPods.





I have read pretty much all the info out there with respect to texture uploads, and the problems people are facing, so am fairly familiar with the problems / methods you all describe above.





At the moment we are uploading 16x16-pixel texture tiles in 4x4 batches, so 64x64-pixel blocks. These are either in RGB888 or RGBA8888 format. By staging uploads and batching etc. we have managed to smooth this on the GLES1.1 iPhone / iPods.





What I was wondering is if there is any advantage (other than the small reduction in bandwidth) to using a different format for the data, say RGB565?





Are there any formats that are considered “more native”, and therefore take less swizzling at upload? We have plenty of CPU time to spare on formatting the data better for the GPU.





One thing I did notice but have not investigated fully is that RGB seems to take more swizzling time when we profile the app, than RGBA. Are we imagining this, or is this expected behaviour?





Or is there any other trick such as somehow drawing to a separate framebuffer and using the texture from there the next frame to render on the main framebuffer? That last one is pretty out there, but we came up with it when wondering if anything can go on asynchronously when it’s not affecting the main view (or contexts) tiling…





Any hints / tips / suggestions welcome.


Thanks.

FWIW I have obtained a 25% increase in overall texture upload speed using the rough method I outlined in my last post. If anyone is interested in how to do this let me know.

I am calling glCompressedTexImage2D for PVRTC textures (either 4bpp or 2bpp) of 1024x1024 pixels on MBX (iPhone 3G). There are thousands of tiles. The performance is very good: the user won’t see any delay from loading tiles during panning and zooming operations.
I’m not sure how it will go on OMAP3’s SGX, but I guess it would be better.

I am aware of compressed textures… What I was talking about was uploading textures to the GPU which have been procedurally generated on the CPU.





I am pretty sure that generating compressed textures on the fly for upload is not viable. :)

scratt wrote:
What I was wondering is if there is any advantage (other than the small reduction in bandwidth) to using a different format for the data, say RGB565?

Cutting bandwidth requirements in half - not only at upload time but also at render time - is hardly a "small" thing.
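To put rough numbers on the "half", the raw data size per texture and per second can be worked out as below. This is only back-of-envelope arithmetic with made-up helper names; it counts texel data alone and ignores any driver-side reformatting cost.

```c
#include <stddef.h>

/* Raw size in bytes of one width x height texture at the given bits per pixel. */
size_t texture_bytes(size_t width, size_t height, size_t bits_per_pixel)
{
    return width * height * bits_per_pixel / 8;
}

/* Raw bytes moved per second if the whole texture is re-uploaded every frame. */
size_t upload_bytes_per_second(size_t width, size_t height,
                               size_t bits_per_pixel, size_t fps)
{
    return texture_bytes(width, height, bits_per_pixel) * fps;
}
```

For a 256x256 texture re-uploaded at 60fps this gives 15,728,640 bytes/s at 32bpp (RGBA8888) against 7,864,320 bytes/s at 16bpp (RGB565).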



Quote:
Are there any formats that are considered "more native", and therefore take less swizzling at upload? We have plenty of CPU time to spare on formatting the data better for the GPU.

On MBX platforms it may be preferable to use GL_BGRA/GL_UNSIGNED_BYTE instead of GL_RGBA/GL_UNSIGNED_BYTE. Apart from that, the best strategy is to use the smallest format that matches your quality requirements, i.e. try PVRTC first, then 16bpp, then 32bpp. And ignore palette formats (except for storage reasons) as they are not supported natively.



Quote:
One thing I did notice but have not investigated fully is that RGB seems to take more swizzling time when we profile the app, than RGBA. Are we imagining this, or is this expected behaviour?

This isn't surprising, as RGB888 needs to be converted for the hardware. RGB888 pixels are not 32-bit aligned which makes them harder to handle.

Xmas, 2009-07-13 13:05:21

Xmas wrote:
scratt wrote:
What I was wondering is if there is any advantage (other than the small reduction in bandwidth) to using a different format for the data, say RGB565?

Cutting bandwidth requirements in half - not only at upload time but also at render time - is hardly a "small" thing.



Agreed. The delay I was seeing was actually in swizzling when I profiled.

But even then whatever the GPU does with texture uploads after that seems to play a part too.



When you upload a sub-image does the entire texture need to be worked on, or just the portion uploaded?

i.e. Do you "unswizzle" the whole texture and then update the sub image and then reformat, or do you simply update the subimage?



I was worried that the RGB565 formats may also require swizzling (re. your comment about alignment below).

When I profile, the swizzling I see seems to be related to 16bpp / 32bpp, so I presumed the same might happen with RGB565, but on reflection that's a straight 16bpp copy, so in theory it should not get swizzled? Is that correct? (re. your similar comment about BGRA v RGB or RGBA)



From reading elsewhere I also understand that this swizzling is done when you try to draw geometry with the texture, rather than at upload time. Is that correct?



Xmas wrote:
Quote:
Are there any formats that are considered "more native", and therefore take less swizzling at upload? We have plenty of CPU time to spare on formatting the data better for the GPU.

On MBX platforms it may be preferable to use GL_BGRA/GL_UNSIGNED_BYTE instead of GL_RGBA/GL_UNSIGNED_BYTE. Apart from that, the best strategy is to use the smallest format that matches your quality requirements, i.e. try PVRTC first, then 16bpp, then 32bpp. And ignore palette formats (except for storage reasons) as they are not supported natively.



I was not aware that BGRA was available. I'll look into that. Thanks.

In my experience on desktop platforms this usually garners 10% or so.

If it actually removes the requirement for swizzling this could be quite a major improvement.



Xmas wrote:
Quote:
One thing I did notice but have not investigated fully is that RGB seems to take more swizzling time when we profile the app, than RGBA. Are we imagining this, or is this expected behaviour?

This isn't surprising, as RGB888 needs to be converted for the hardware. RGB888 pixels are not 32-bit aligned which makes them harder to handle.



Ok, the alignment thing makes sense. Thanks.
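If spare CPU time is available, one way to avoid handing unaligned 24-bit pixels to the driver would be to pad RGB888 out to 32 bits per pixel before the upload. The sketch below shows the idea; the function name is hypothetical, and whether this is actually a net win on a given device is an assumption that would need profiling.

```c
#include <stddef.h>
#include <stdint.h>

/* Expand tightly packed RGB888 pixels to RGBA8888 so that every pixel starts
 * on a 32-bit boundary. Alpha is set to fully opaque. */
void rgb888_to_rgba8888(const uint8_t *src, uint8_t *dst, size_t pixel_count)
{
    for (size_t i = 0; i < pixel_count; ++i) {
        dst[4 * i + 0] = src[3 * i + 0]; /* R */
        dst[4 * i + 1] = src[3 * i + 1]; /* G */
        dst[4 * i + 2] = src[3 * i + 2]; /* B */
        dst[4 * i + 3] = 0xFF;           /* A: opaque padding byte */
    }
}
```

The padded buffer would then be uploaded as GL_RGBA / GL_UNSIGNED_BYTE, at the cost of a third more data per pixel.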



What we are actually doing at the moment is pushing point arrays in texture tile grids into an FBO bound texture. This is noticeably faster than uploading the identical RGB or RGBA data via glTexSubImage2D. Of course when we do it this way we can work with just RGB on the GPU side (saving some texture space GPU side, and some prep time CPU side), but still have to "upload" 4 bytes for the colour vertices. :)

As a side benefit we can even do some post processing FBO effects on the data-sets held in the texture.



Profiling the app, it garners a performance increase of about 3 to 5 fps, and swizzling is non-existent in the profile. It makes me chuckle that such a convoluted approach is definitively faster!



Someone with more experience than I suggested that this is because we are getting the GPU to do the "swizzle". Apparently it's a "form of GPGPU processing". By having the GPU do that for us we also seem to be skipping any pipeline flushes.



It certainly works, and is the fastest solution we have found so far. I have a theory that if we stuck with 8-bit data on the CPU side for colour but changed the FBO texture to a smaller format, we may save there too, without putting any "swizzling load" on the CPU to prep the 565 data.



Any comments on any of that?



Thanks for your feedback. It is appreciated. :)

scratt, 2009-07-15 05:23:33

scratt wrote:
When you upload a sub-image does the entire texture need to be worked on, or just the portion uploaded?

i.e. Do you "unswizzle" the whole texture and then update the sub image and then reformat, or do you simply update the subimage?

Subimage uploads generally require more than a simple copy, especially if the texture object has been used in the current or previous frame. The exact behaviour can be different between platforms, though.



Quote:
From reading elsewhere I also understand that this swizzling is done when you try to draw geometry with the texture, rather than at upload time. Is that correct?

Swizzling, i.e. reordering channels, is only one part of the reformatting work that may have to be done. Since texture objects can be modified at any point in GL, some of this work may be deferred until draw time.



Quote:
I was not aware that BGRA was available. I'll look into that. Thanks.

Before using it you should check for the presence of either GL_EXT_texture_format_BGRA8888 or GL_IMG_texture_format_BGRA8888 in the extension string.
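A minimal sketch of such a check is shown below. The helper names are made up for this example; the extension list is passed in as a plain string so the matching logic can be tried without a GL context. It matches whole space-separated tokens only, so a longer extension name that merely contains the wanted name is not reported as a hit.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if `name` appears as a complete space-separated token in
 * `ext_list`, 0 otherwise. */
int has_extension(const char *ext_list, const char *name)
{
    size_t len;
    const char *p;

    if (ext_list == NULL || name == NULL) {
        return 0;
    }
    len = strlen(name);
    if (len == 0) {
        return 0;
    }
    p = ext_list;
    while ((p = strstr(p, name)) != NULL) {
        int starts_token = (p == ext_list) || (p[-1] == ' ');
        int ends_token = (p[len] == ' ') || (p[len] == '\0');
        if (starts_token && ends_token) {
            return 1;
        }
        p += len; /* keep searching after this partial match */
    }
    return 0;
}

/* True if either of the two BGRA8888 extensions named above is listed. */
int supports_bgra8888(const char *ext_list)
{
    return has_extension(ext_list, "GL_EXT_texture_format_BGRA8888")
        || has_extension(ext_list, "GL_IMG_texture_format_BGRA8888");
}
```

In an application the list would come from the driver, e.g. supports_bgra8888((const char *)glGetString(GL_EXTENSIONS)), called with a current GL context.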



Quote:
What we are actually doing at the moment is pushing point arrays in texture tile grids into an FBO bound texture. This is noticeably faster than uploading the identical RGB or RGBA data via glTexSubImage2D. Of course when we do it this way we can work with just RGB on the GPU side (saving some texture space GPU side, and some prep time CPU side), but still have to "upload" 4 bytes for the colour vertices. :)

How many pixels do you generally update this way?



Rendering to an RGB565 texture should generally be as fast or faster (if you're bandwidth limited) than rendering to an RGBA8888 texture.
Xmas wrote:

How many pixels do you generally update this way?



It varies from frame to frame.

Using glTexSubImage2D we would do them in 18x18 tiles, anything from 1 - 50 per frame.

Obviously at the top end of that range things get slow.



Using our method we batch all the individual points into an interleaved vertex array of short vertices and 4 byte colours. We then setup the FBO and draw the whole lot in one batch, so anything from 324 to 16,200 points.
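A rough sketch of what such an interleaved layout might look like is given below. It is a guess at the general shape, not the poster's actual code: the struct, the function name and the assumption of RGB888 source tiles are all hypothetical, and the FBO and projection setup is left out entirely.

```c
#include <stddef.h>
#include <stdint.h>

/* One point: two 16-bit coordinates plus a 4-byte colour, 8 bytes in total. */
typedef struct {
    int16_t x, y;
    uint8_t r, g, b, a;
} PointVertex;

enum { TILE_SIZE = 18 }; /* 18x18 tiles, as described above */

/* Writes one vertex per pixel of a TILE_SIZE x TILE_SIZE tile whose top-left
 * texel is at (origin_x, origin_y). `rgb` holds the tile as packed RGB888
 * (an assumption of this sketch). Returns the number of vertices written. */
size_t append_tile_points(PointVertex *out, int16_t origin_x, int16_t origin_y,
                          const uint8_t *rgb)
{
    size_t n = 0;
    for (int row = 0; row < TILE_SIZE; ++row) {
        for (int col = 0; col < TILE_SIZE; ++col) {
            const uint8_t *px = rgb + 3 * (row * TILE_SIZE + col);
            out[n].x = (int16_t)(origin_x + col);
            out[n].y = (int16_t)(origin_y + row);
            out[n].r = px[0];
            out[n].g = px[1];
            out[n].b = px[2];
            out[n].a = 0xFF; /* fourth colour byte, set to opaque */
            ++n;
        }
    }
    return n;
}

/* With the FBO bound, the batch could then be drawn in OpenGL ES 1.1 with:
 *   glVertexPointer(2, GL_SHORT, sizeof(PointVertex), &verts[0].x);
 *   glColorPointer(4, GL_UNSIGNED_BYTE, sizeof(PointVertex), &verts[0].r);
 *   glDrawArrays(GL_POINTS, 0, (GLsizei)count);
 */
```

One 18x18 tile yields 324 points and 50 tiles yield 16,200, which matches the range quoted above. OpenGL ES 1.1 only accepts 4-component colour arrays, hence the fourth colour byte per point.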



I first came up with the idea when I looked at the point draw rate we could achieve with normal particle systems, when compared to the same number of effective pixels pushed via glTexSubImage2D.



Xmas wrote:
Rendering to an RGB565 texture should generally be as fast or faster (if you're bandwidth limited) than rendering to an RGBA8888 texture.



Obviously I will go and check both the glTexSubImage2D and our "pixel gun" method against this format.



Thanks once again for the quick and concise reply.

scratt, 2009-07-15 13:17:42

Sorry to bump such an old post, but Scratt, I am interested in your technique - could you elaborate on it please (if you still remember how you did it, and if you still check these forums)?