Can UBOs be stored in registers?

Yes, that compiler compiles for one specific GPU. I wouldn’t expect the vertex and fragment stages to be consistent, as the two workloads might require different setups. I can ask around about what causes the difference and get back to you :slight_smile:

I’d say the difference is that you can expect the common store to be set up and good to go when the program starts executing. Texel fetches are dynamically served and have to go through the cache hierarchy.

That would be nice.

Basically, I’m faced with the decision of putting a lookup table either in a UBO (which resides in the sh registers / common store) or in a texture (whose accesses might also be backed by a cache), and I need to determine which would be faster.
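Concretely, the two options look roughly like this (an ES 3.0-style sketch; `LUT_SIZE`, `lut_data` and the binding point are placeholder names of mine):

```c
#include <GLES3/gl3.h>

#define LUT_SIZE 256  /* number of vec4 entries, placeholder value */
extern const float lut_data[LUT_SIZE * 4];

/* Option A: the lookup table as a UBO, which should end up in the
 * sh registers / common store if it fits. */
static GLuint make_lut_ubo(void)
{
    GLuint ubo;
    glGenBuffers(1, &ubo);
    glBindBuffer(GL_UNIFORM_BUFFER, ubo);
    glBufferData(GL_UNIFORM_BUFFER, sizeof(lut_data), lut_data, GL_STATIC_DRAW);
    glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);  /* binding point 0 */
    return ubo;
}

/* Option B: the same table as a LUT_SIZE x 1 float texture, so every
 * lookup becomes a texel fetch through the cache hierarchy. */
static GLuint make_lut_texture(void)
{
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, LUT_SIZE, 1, 0,
                 GL_RGBA, GL_FLOAT, lut_data);
    /* RGBA32F isn’t filterable in core ES 3.0, so fetch with NEAREST. */
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    return tex;
}
```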

So there’s not much explanation for the difference; it’s simply how it works.

How big is that lookup table?
I’d test performance in either case!
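If it helps, here’s a rough sketch of how I’d time the two variants with GL timer queries. This assumes a current desktop GL 3.3+ context and a function loader such as glad (on ES you’d need GL_EXT_disjoint_timer_query and its EXT-suffixed entry points); `draw_with_ubo_lut()` and `draw_with_texture_lut()` are placeholders for your two code paths:

```c
#include <stdio.h>
#include <glad/glad.h>  /* or whatever GL function loader you use */

/* Placeholder helpers: each draws the same lookup-heavy scene, with the
 * table bound as a UBO or as a texture respectively. */
extern void draw_with_ubo_lut(void);
extern void draw_with_texture_lut(void);

static GLuint64 time_pass(void (*draw)(void))
{
    GLuint query;
    GLuint64 ns = 0;

    glGenQueries(1, &query);
    glBeginQuery(GL_TIME_ELAPSED, query);
    draw();
    glEndQuery(GL_TIME_ELAPSED);

    /* GL_QUERY_RESULT blocks until the GPU has finished the pass,
     * which is fine for a benchmark. */
    glGetQueryObjectui64v(query, GL_QUERY_RESULT, &ns);
    glDeleteQueries(1, &query);
    return ns;
}

/* Call with a current GL context, ideally averaging over many frames. */
static void compare_lut_paths(void)
{
    printf("UBO LUT:     %llu ns\n", (unsigned long long)time_pass(draw_with_ubo_lut));
    printf("texture LUT: %llu ns\n", (unsigned long long)time_pass(draw_with_texture_lut));
}
```

One caveat on a tile-based GPU like Rogue: the elapsed time covers the whole deferred render pass, so make sure the lookup is the dominant cost in the shader, otherwise the difference will drown in everything else.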

Does that mean the common store is large enough to hold more than 256 vec4s of UBO data on PowerVR Series6 chips? Is that actually possible?

OK, this spec here for the G6230:
http://www.actions-semi.com/upload/file/20170728/6363680778367191661795441.pdf
lists a 128-bit × 1024 common store. Is there a rule in the compiler to fall back when this is exhausted?

@MartonTamas: Btw, can you find out what latency the common store has compared to the L0 texture cache?

Thanks!

When it says it has space for 512 or something, then you should be able to utilise that.

lists a 128-bit × 1024 common store. Is there a rule in the compiler to fall back when this is exhausted?

I think if you use more than the available common store, your SH reads will turn into DMA (LD) operations, which are quite a bit slower.

Btw, can you find out what latency the common store has compared to the L0 texture cache?

I don’t think I can find that out, but registers being registers, you can read them in the same cycle, i.e. no cost or 1 cycle.
Texture data needs to go through the cache hierarchy, so you’d potentially have to wait for it to arrive from system memory. If it’s already in the cache, L0 should be around 3-5 cycles, L1 around 10-20 cycles, and L2 around 20-40 cycles.
DRAM should be around 100 cycles or more, but I’d definitely write a test and run it on the target GPU that you’d use.

What puzzles me is that the number of available common store entries (1024) seems to be much larger than what is returned by GL_MAX_VERTEX_UNIFORM_VECTORS (256).

Also, my assumption was that sh registers are not actual registers but pseudo-registers allocated from the common store. That’s why I wanted to know the latency of the common store, since I assume that is also the latency of the sh “registers”.

That’s 1024 × 128 bits, precisely what the hardware has. We probably need to suballocate so that we can have multiple tiles in flight, etc. I don’t think it’s wasted :slight_smile:
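To put rough numbers on it (my own back-of-the-envelope arithmetic, not something from the spec sheet):

```
common store:       1024 entries × 128 bit = 16 KiB
GL uniform limit:    256 vec4    × 128 bit =  4 KiB
```

So the GL-visible uniform space is only about a quarter of the physical store; the remainder is presumably what gets suballocated so that multiple tiles can be in flight, as described above.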

no, they are registers :slight_smile:

OK, could you please explain this sentence to me again:

“Shared Registers are allocated from the Common Store and may be indexed.”

This is what gave me the idea that sh registers are not actual registers. Are you still sticking with the 1-cycle latency?

Also, is there a definitive way to calculate the available sh register memory using glGet commands?

Temporary and Attribute registers, for example, are allocated from the Unified Store, but they are still registers :slight_smile:
So yes, 1-cycle latency.
I’d say just grab the value that GL gives you. As you can see, you might not be able to use all 512 for some reason (e.g. the driver/compiler needs some space), but you should be able to use most of them.
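For completeness, grabbing those values is just a couple of glGet calls (standard ES 2.0+ queries, nothing PowerVR-specific; needs a current context):

```c
#include <stdio.h>
#include <GLES2/gl2.h>

/* Query the uniform storage the driver guarantees will fit in
 * registers / the common store; the hardware may physically have more,
 * as discussed above. */
static void print_uniform_limits(void)
{
    GLint vert_vec4 = 0, frag_vec4 = 0;
    glGetIntegerv(GL_MAX_VERTEX_UNIFORM_VECTORS, &vert_vec4);
    glGetIntegerv(GL_MAX_FRAGMENT_UNIFORM_VECTORS, &frag_vec4);

    /* Each vec4 is 128 bits = 16 bytes. */
    printf("vertex:   %d vec4 (%d bytes)\n", vert_vec4, vert_vec4 * 16);
    printf("fragment: %d vec4 (%d bytes)\n", frag_vec4, frag_vec4 * 16);
}
```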
