Hi
I will describe my experience.
I know that the GE8320 supports fp16, so I think the GE8300 should as well.
The problem is that it is not easy to force the compiler to generate efficient fp16 code that runs 2x faster than fp32.
The first thing you should do is enable -cl-fast-relaxed-math in the compiler options; without it the compiler will not generate fp16 code in OpenCL.
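For reference, a minimal host-side sketch of passing that flag when building the program (just a fragment, assuming the program and device handles already exist in your setup):

// The only important part here is the options string.
const char *options = "-cl-fast-relaxed-math";
cl_int err = clBuildProgram(program, 1, &device, options, NULL, NULL);
if (err != CL_SUCCESS) {
    // Query the build log to see what the compiler complained about.
    size_t log_size = 0;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
}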
Then comes the more interesting part: you have to write very specific code to get efficient fp16 asm. A PVR GPU can in general do 2x f32 MADs per clock in a single thread, so our goal is 4x f16 MADs. However, some of the arguments of those 4x f16 MADs must be shared.
In practice you should use paired instructions like these:
r0 += s0 * data[0]; // r0, r1, s0, s1 are private registers; data is local memory
r1 += s1 * data[0];
The easy strategy for getting this pattern is to process 2 pixels or elements per thread.
Under some conditions this code will be translated into 4x fp16 SOPMADs per cycle in the asm. You need enough of these instructions in the kernel to push the compiler into generating 4x fp16 SOPMADs.
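Here is a minimal OpenCL sketch of that two-elements-per-thread idea, assuming a simple dot-product-style inner loop; the kernel name, buffer layout, and indexing are just illustrations, not the actual TFLite kernel:

#pragma OPENCL EXTENSION cl_khr_fp16 : enable

// Hypothetical kernel: each work item produces two outputs, and every
// multiply-add pair shares the same weight value, which is the pairing
// the compiler needs to emit 4x fp16 SOPMADs.
__kernel void dot_two_per_thread(__global const half *src,
                                 __global const half *weights,
                                 __global half *dst,
                                 int n)
{
    int gid = get_global_id(0);
    half r0 = 0.0h;
    half r1 = 0.0h;
    for (int i = 0; i < n; ++i) {
        half w = weights[i];                  // common operand for both MADs
        r0 += src[(2 * gid) * n + i] * w;
        r1 += src[(2 * gid + 1) * n + i] * w;
    }
    dst[2 * gid]     = r0;
    dst[2 * gid + 1] = r1;
}

The important part is that r0 and r1 are independent accumulators that both multiply by the same w; the rest of the indexing depends on your data layout.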
PVRShaderEditor probably won't help you with OpenCL fp16 because you cannot pass the -cl-fast-relaxed-math compiler flag; at least I did not find a way to do that half a year ago. But you can get fp16 asm out of this editor for GLSL.
You can try generating a convolution kernel from here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/delegates/gpu/cl/kernels/conv_powervr.cc
This kernel has ~1.8x better performance in fp16 and in general can achieve ~70% of the theoretical fp16 ALU performance on the GE8320.
I have described my own experience, and I may be wrong or imprecise on some points.
P.S. Apple A11, A12, A13 GPUs have similar behavior.
P.P.S. A7-A10 GPUs do not support 4x fp16 SOPMADs (8 flops) per cycle. They can only do 2x fp16 instructions of this type:
r0 = s0 * data[0] + t0 * data[1];
r1 = s1 * data[0] + t1 * data[1];
That is effectively about 6 flops per cycle, but in practice not many algorithms can make use of these instructions.
I had a look for you: the GE8300 (and GE8320) does support the SOPMAD instruction.
Generating efficient code is not so straightforward, but I think you’ve got a good strategy there (seems to be working). It depends on a number of things, usually whether you can route your data (which you guessed correctly) through the internal datapaths so that you can use all “phases” of the ALU.
I don’t have much experience with OpenCL on PVR but with GLSL the key is using mediump.
Besides the 4x fp16 SOPMADs that make computation faster, float16 also reduces data fetching, since float16 is half the size of float32. I thought most of the improvement would come from reduced data/buffer read time, but my results did not support that idea.
The result can depend on what kind of algorithm you want to implement. Some tasks are very compute heavy, some are memory bound, or limited by something else. I don't know what the limiting factor is in your case.
I just know that in convolutions (and probably in GEMM) you can fully utilize the fp16 ALU. Even in these cases it is not easy and requires the right data representation: reading data in chunks (a few half4 or half8 values, for example), uniform reads across the work group for the weights or the second matrix, and so on.
For uniform reads you should use the same address for all work items in the work group. The work group size should be 32 for best performance in the majority of cases. At the beginning of the kernel you can add __attribute__((reqd_work_group_size(X, Y, Z))) to help the compiler.
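For example, a sketch that combines the fixed work group size with a uniform weights read; the kernel name, 8x4x1 shape (32 work items), and half4 chunking are illustrative choices, not required values:

#pragma OPENCL EXTENSION cl_khr_fp16 : enable

// Hypothetical kernel: an 8x4 work group (32 items in total), half4 chunked
// reads, and a weights read that uses the same address in every work item,
// so it stays uniform across the work group.
__attribute__((reqd_work_group_size(8, 4, 1)))
__kernel void uniform_read_example(__global const half4 *src,
                                   __global const half4 *weights,
                                   __global half4 *dst,
                                   int n)
{
    int gid = get_global_id(1) * get_global_size(0) + get_global_id(0);
    half4 acc = (half4)(0.0h);
    for (int i = 0; i < n; ++i) {
        acc += src[gid * n + i] * weights[i];   // weights[i] is uniform for the group
    }
    dst[gid] = acc;
}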
In some cases it can be profitable to upload data to local memory with async_work_group_copy.
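A minimal sketch of that, assuming the weights fit into a small local buffer (the 256-element size and the kernel name are arbitrary illustrations):

#pragma OPENCL EXTENSION cl_khr_fp16 : enable

// Hypothetical kernel: stage the weights into local memory once per work
// group, then read them from there in the inner loop.
__kernel void staged_weights(__global const half *src,
                             __global const half *weights,
                             __global half *dst,
                             int n)                      // assume n <= 256 here
{
    __local half w_local[256];
    event_t e = async_work_group_copy(w_local, weights, n, 0);
    wait_group_events(1, &e);

    int gid = get_global_id(0);
    half acc = 0.0h;
    for (int i = 0; i < n; ++i) {
        acc += src[gid * n + i] * w_local[i];
    }
    dst[gid] = acc;
}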
But I cannot help more specifically because I don't know what your bottleneck is.
As for fp16, I explained that in my first response.