I'm sorry to ask, but after reading many imgtec documents and the performance guide, I still have a question about the performance of alpha test versus alpha blend on PowerVR.
With HSR, opaque pixels get the best performance. However, alpha test and alpha blend still have to process all fragments because their visibility is unknown, so I would expect them to perform almost the same.
In this topic: https://www.imgcommunity.local/forums/topic/question-about-alpha-test-performance/ Joe said: “Because the hardware does not have to re-determine fragment visibility when alpha blending”. I think this “re-determine” refers to the alpha value comparison in alpha test. That should be fast, and shouldn't be much of an overhead even if it happens twice (once in the HSR stage, once in the real fragment processing), should it?
Apologies that this point wasn’t clear enough in our current documentation. We will aim to improve our overview for a future release.
When the GPU processes a blended object, it knows that every pixel in that primitive will be processed. This means that any depth/stencil writes that need to be performed by the ISP happen immediately for the entire primitive.
When discard is used, pixel visibility isn’t known until the corresponding fragment shader executes. Because of this, depth and stencil writes must be deferred until pixel visibility is known. This reduces performance, as the pixel visibility information has to be fed back to the ISP unit after shader execution to determine which depth/stencil positions need to be written to. The cost of this can vary, but in the worst case the entire fragment processing pipeline will stall until the deferred depth/stencil writes complete.
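To illustrate why the writes have to wait, here is a minimal Python sketch of a two-pixel depth buffer. It is a toy model, not how the ISP is implemented: the `draw` and `render` helpers, the depth values, and the use of alpha 0.0 to stand in for a discarded fragment are all assumptions made for the example. The wall is deliberately drawn after the leaf so the effect of an early write is visible.

```python
FAR = 1.0  # cleared depth value; smaller means closer to the camera


def draw(depth_buffer, fragments, write_depth_before_shading):
    """Toy depth-tested draw. Each fragment is (pixel, depth, alpha).

    A fragment with alpha == 0.0 stands in for one the shader discards.
    Returns the pixels that end up shaded and kept.
    """
    kept = []
    for pixel, depth, alpha in fragments:
        if depth >= depth_buffer[pixel]:
            continue  # fails the depth test, so it is never shaded
        if write_depth_before_shading:
            depth_buffer[pixel] = depth  # early write: visibility assumed
        if alpha == 0.0:
            continue  # "discard": visibility is only known at this point
        if not write_depth_before_shading:
            depth_buffer[pixel] = depth  # deferred write: kept fragments only
        kept.append(pixel)
    return kept


def render(early_write_for_leaf):
    depth_buffer = [FAR, FAR]
    # A leaf quad at depth 0.3: pixel 0 is leaf, pixel 1 is a hole in the texture.
    leaf = [(0, 0.3, 1.0), (1, 0.3, 0.0)]
    # A wall behind the leaf at depth 0.6, drawn afterwards.
    wall = [(0, 0.6, 1.0), (1, 0.6, 1.0)]
    draw(depth_buffer, leaf, early_write_for_leaf)
    return draw(depth_buffer, wall, True)


print(render(early_write_for_leaf=True))   # [] : the hole wrongly hides the wall
print(render(early_write_for_leaf=False))  # [1]: the wall shows through the hole
```

With the early write, the discarded fragment still leaves its depth behind and hides the wall at pixel 1. Deferring the write until the discard result is known gives the correct image, and that deferral is the feedback step described above.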
Thanks for your detailed explanation. I made this diagram to confirm my current understanding.
So for opaque, it goes through HSR and saves real fragment processing time;
for alpha test, it goes through HSR and still has to process all fragments;
for alpha blend, it skips HSR and processes all fragments. (I misunderstood this and thought it went through HSR as well.)
Am I right?
All primitives, regardless of blend/discard state, will go through the HSR process.
Consider a case where a small town with buildings and trees is rendered. A well optimized application will submit all of the opaque draws first (buildings and tree trunks). The leaves of the tree may then be drawn as a textured quad, where the discard keyword is used to punch through transparent regions of the texture so only the leaves are drawn. If a building is partially obscuring the leaves of a tree, then the obscured leaves can be rejected from the render by the HSR unit when depth testing is performed.
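As a rough sketch of that submission order in Python: the `Draw` class, the category constants, and the scene contents below are hypothetical names made up for the example. Placing alpha-tested draws between the opaque and blended groups is an assumption based on commonly given guidance, not something stated in this thread.

```python
from dataclasses import dataclass

# Hypothetical pass categories; a lower number is submitted earlier.
OPAQUE, ALPHA_TEST, BLENDED = 0, 1, 2


@dataclass
class Draw:
    name: str
    kind: int


def submission_order(draws):
    """Group draws by kind, keeping the original order inside each group.

    sorted() is stable, so blended draws keep their relative order.
    """
    return sorted(draws, key=lambda d: d.kind)


scene = [
    Draw("glass window", BLENDED),
    Draw("tree leaves", ALPHA_TEST),
    Draw("building", OPAQUE),
    Draw("smoke", BLENDED),
    Draw("tree trunk", OPAQUE),
]
print([d.name for d in submission_order(scene)])
# ['building', 'tree trunk', 'tree leaves', 'glass window', 'smoke']
```

Because the sort is stable, whatever order the application chose for its blended draws is preserved within that group.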
The statue in figure 10 of our PowerVR Hardware Architecture Overview for Developers is a good example of the GPU’s ability to discard blended fragments that are obscured by opaque fragments.
The reason alpha test/discard is more costly is that it goes through the HSR stage twice: once to perform depth/stencil tests to see if any of the fragments are obscured, and a second time to write depth/stencil values for any fragments that were not discarded when the fragment shader executed.
I got it. As I understand it now, the following two figures show the flow through the architecture: the upper one is alpha blend, the other is alpha test.
In the preprocess stage, the fragment shader isn't run, so it's fast, but it also can't know which pixels are discarded. So alpha test writes depth after fragment shader processing, and alpha blend writes depth before it. Is this right?
Your description isn’t quite right. In your diagrams, you’ve shown two passes through the HSR stage for blended objects and three passes for alpha test. This should actually be one pass through it for blended primitives and two passes through it for alpha test.
In the blended primitive case, depth tests and writes can be performed before the shader executes (one pass through HSR). For alpha tested primitives, depth testing can be performed on the initial pass, but a second pass through HSR is required after shader execution to update the on-chip depth buffer with data for the visible fragments.
Hope this helps,
The start point of the flow line should not be counted as a “pass-through”; that was my drawing mistake.
I think my understanding is closer to your idea now:
Alpha blend:
- It preprocesses the fragments (no preprocessing involves fragment shaders, so is that why it's fast?)
- Then it obtains depths (since no discard is used), goes back to the HSR unit, does HSR, and writes the depth buffer. (PASS 1, as you said)
- It processes the fragment shader in the traditional way, then blends and outputs.
Alpha test:
- It preprocesses the fragments.
- Then it obtains depths and goes back to the HSR unit (but the depths can't be written since discarding exists). (PASS 1)
- It processes the fragment shader in the traditional way.
- The real depths are obtained, then it goes back to the HSR unit and writes the depth buffer. (PASS 2)
- Then it outputs.
Is this right?
>1. it preprocess the fragment [all preprocessing won’t involve fragment shaders, so that’s why it’s fast?]
There is no preprocessing stage. The ISP unit (where HSR is performed) has access to the position data of all primitives within the tile. With this information, it can perform depth and stencil reads/writes immediately. Blended primitives go through the exact same stages as opaque primitives. As all fragments of a blended object are considered to be visible, depth tests and writes can be performed up front. The only difference is that blended primitives must be sent down the pipeline one at a time to ensure they are processed in the submission order specified by the application.
Blended primitives will do the following:
ISP HSR: Depth and stencil tests and writes
Shading: Colours are calculated for fragments that pass the tests
For alpha testing, there isn’t any preprocessing either. The only difference between alpha test and the blended path is that depth and stencil writes must be deferred until the shader has executed.
Alpha tested primitives will do the following:
ISP HSR: Depth and stencil tests (no writes)
Shading: Colours are calculated for fragments that pass the tests
Visibility feedback to ISP: After the shader has executed, the GPU knows which fragments were discarded and which were kept. Visibility information is fed back to the ISP so depth and stencil writes can be performed for the fragments that passed the alpha test.
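The two stage lists above can be written down as a small Python sketch to make the pass count explicit. The function names `isp_stages` and `hsr_passes` are invented for this illustration, and the strings simply restate the stages listed above. It is a bookkeeping aid, not a description of hardware internals.

```python
def isp_stages(uses_discard):
    """Return the ordered stages for one primitive in this simplified model."""
    if not uses_discard:
        return [
            "ISP HSR: depth/stencil test + write",
            "Shading: colour fragments that passed",
        ]
    return [
        "ISP HSR: depth/stencil test (no write)",
        "Shading: colour fragments that passed, some may be discarded",
        "Feedback to ISP: depth/stencil write for kept fragments",
    ]


def hsr_passes(uses_discard):
    """Count how many stages involve the ISP."""
    return sum("ISP" in stage for stage in isp_stages(uses_discard))


print(hsr_passes(uses_discard=False))  # 1: blended primitive
print(hsr_passes(uses_discard=True))   # 2: alpha tested primitive
```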
I think I’m clear now, but is going through HSR very expensive?
alphatest: [HSR1(depth tests) + HSR2(depth writes)] > alphablend: [HSR1(depth tests + depth writes)]? I don’t have much knowledge of the chip hierarchy.
The HSR process itself is very cheap. The overhead of alpha test comes from the ISP being blocked until alpha tested primitive visibility information has been fed back to the ISP. This blocking has to occur so depth and stencil buffers can be updated with the alpha tested primitive before subsequent draws are processed.
Hi Joe, I finally got the answer. Thanks a lot for your patience in explaining the technical background of this question.
I still want to ask one more thing: when is the ISP unit available for the next primitive? After depth writing of the current primitive ends? So in the pipeline, the next primitive can enter the ISP earlier for alpha blend than for alpha test?
[blockquote]after depth writing of current primitive ends?[/blockquote]
As soon as the ISP has finished processing a primitive, it can begin processing another.
[blockquote]so in the pipeline for alphablend the next primitive can enter ISP earlier than alphatest?[/blockquote]
That’s correct. In the blend scenario, the ISP can continue to process new primitives while the USCs are calculating colours for primitives that have already propagated through the pipeline.
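A back-of-the-envelope Python sketch of that difference, under loud assumptions: the cost values are made-up units, shading is assumed to start the moment the ISP hands a primitive over, and the alpha test case models the worst case where the ISP is fully blocked until the feedback write completes. The function name `isp_free_times` and its parameters are hypothetical. Real hardware can behave better than this, since the cost of the stall varies.

```python
def isp_free_times(primitives, isp_cost=1, shade_cost=4, feedback_cost=1):
    """Return the time at which the ISP finishes with each primitive.

    primitives is a list of booleans: True if the primitive uses discard.
    All costs are made-up units, and shading is assumed to start as soon
    as the ISP hands a primitive over (no shader queueing is modelled).
    """
    now = 0
    free_at = []
    for uses_discard in primitives:
        now += isp_cost  # depth/stencil test (plus write, if blended)
        if uses_discard:
            # The ISP waits for the shader, then performs the deferred write.
            now += shade_cost + feedback_cost
        free_at.append(now)
    return free_at


print(isp_free_times([False, False, False]))  # [1, 2, 3]
print(isp_free_times([True, True, True]))     # [6, 12, 18]
```

In the blended case the ISP moves on after each test and write, so shading overlaps with ISP work on later primitives. In the modelled alpha test case each primitive holds the ISP until its visibility has been fed back.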