Hi
I will describe my experience.
I know that the GE8320 supports fp16, so I think the GE8300 should as well.
The problem is that it is not easy to force the compiler to generate efficient fp16 code that runs 2x faster than fp32.
The first thing you should do is enable -cl-fast-relaxed-math in the compiler options; without it the compiler will not generate fp16 code in OpenCL.
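For reference, a minimal host-side sketch of passing that flag when building the program (just a fragment, assuming the program and device handles already exist in your setup):

// The only important part here is the options string.
const char *options = "-cl-fast-relaxed-math";
cl_int err = clBuildProgram(program, 1, &device, options, NULL, NULL);
if (err != CL_SUCCESS) {
    // Query the build log to see what the compiler complained about.
    size_t log_size = 0;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
}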
Then comes the more interesting part: you have to write very specific code to get efficient fp16 asm. A PVR GPU can in general do 2x f32 MADs per clock in a single thread, so our goal is 4x f16 MADs. However, some of the arguments of those 4x f16 MADs must be shared.
In practice you should use paired instructions like these:
r0 += s0 * data[0]; // r0, r1, s0, s1 are private registers; data is local memory
r1 += s1 * data[0];
The easy strategy for getting this pattern is to process 2 pixels or elements per thread.
Under some conditions this code will be translated into 4x fp16 SOPMADs per cycle in the asm. You need enough of these instructions in the kernel to push the compiler into generating 4x fp16 SOPMADs.
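Here is a minimal OpenCL sketch of that two-elements-per-thread idea, assuming a simple dot-product-style inner loop; the kernel name, buffer layout, and indexing are just illustrations, not the actual TFLite kernel:

#pragma OPENCL EXTENSION cl_khr_fp16 : enable

// Hypothetical kernel: each work item produces two outputs, and every
// multiply-add pair shares the same weight value, which is the pairing
// the compiler needs to emit 4x fp16 SOPMADs.
__kernel void dot_two_per_thread(__global const half *src,
                                 __global const half *weights,
                                 __global half *dst,
                                 int n)
{
    int gid = get_global_id(0);
    half r0 = 0.0h;
    half r1 = 0.0h;
    for (int i = 0; i < n; ++i) {
        half w = weights[i];                  // common operand for both MADs
        r0 += src[(2 * gid) * n + i] * w;
        r1 += src[(2 * gid + 1) * n + i] * w;
    }
    dst[2 * gid]     = r0;
    dst[2 * gid + 1] = r1;
}

The important part is that r0 and r1 are independent accumulators that both multiply by the same w; the rest of the indexing depends on your data layout.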
PVRShaderEditor probably won't help you with OpenCL fp16 because you cannot pass the -cl-fast-relaxed-math compiler flag; at least I did not find a way to do that half a year ago. But you can get fp16 asm out of this editor for GLSL.
You can try generating a convolution kernel from here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/delegates/gpu/cl/kernels/conv_powervr.cc
This kernel has ~1.8x better performance in fp16 and in general can achieve ~70% of the theoretical fp16 ALU performance on the GE8320.
I have described my own experience, and I may be wrong or imprecise on some points.
P.S. Apple A11, A12, A13 GPUs have similar behavior.
P.P.S. A7-A10 GPUs do not support 4x fp16 SOPMADs (8 flops) per cycle. They can only do 2x fp16 instructions of this type:
r0 = s0 * data[0] + t0 * data[1];
r1 = s1 * data[0] + t1 * data[1];
That is effectively about 6 flops per cycle, but in practice not many algorithms can make use of these instructions.
I had a look for you: the GE8300 (and GE8320) does support the SOPMAD instruction.
Generating efficient code is not so straightforward, but I think you’ve got a good strategy there (seems to be working). It depends on a number of things, usually whether you can route your data (which you guessed correctly) through the internal datapaths so that you can use all “phases” of the ALU.
I don’t have much experience with OpenCL on PVR but with GLSL the key is using mediump.
Besides the 4x fp16 SOPMADs that make computation faster, float16 also reduces data fetching, since float16 is half the size of float32. I thought most of the improvement would come from reduced data/buffer read time, but my results did not support that idea.
The result can depend on what kind of algorithm you want to implement. Some tasks are very compute heavy, some are memory bound, or limited by something else. I don't know what the limiting factor is in your case.
I just know that in convolutions (and probably in GEMM) you can fully utilize the fp16 ALU. Even in these cases it is not easy and requires the right data representation: reading data in chunks (a few half4 or half8 values, for example), uniform reads across the work group for the weights or the second matrix, and so on.
For uniform reads you should use the same address for all work items in the work group. The work group size should be 32 for best performance in the majority of cases. At the beginning of the kernel you can add __attribute__((reqd_work_group_size(X, Y, Z))) to help the compiler.
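For example, a sketch that combines the fixed work group size with a uniform weights read; the kernel name, 8x4x1 shape (32 work items), and half4 chunking are illustrative choices, not required values:

#pragma OPENCL EXTENSION cl_khr_fp16 : enable

// Hypothetical kernel: an 8x4 work group (32 items in total), half4 chunked
// reads, and a weights read that uses the same address in every work item,
// so it stays uniform across the work group.
__attribute__((reqd_work_group_size(8, 4, 1)))
__kernel void uniform_read_example(__global const half4 *src,
                                   __global const half4 *weights,
                                   __global half4 *dst,
                                   int n)
{
    int gid = get_global_id(1) * get_global_size(0) + get_global_id(0);
    half4 acc = (half4)(0.0h);
    for (int i = 0; i < n; ++i) {
        acc += src[gid * n + i] * weights[i];   // weights[i] is uniform for the group
    }
    dst[gid] = acc;
}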
In some cases it can be profitable to upload data to local memory with async_work_group_copy.
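A minimal sketch of that, assuming the weights fit into a small local buffer (the 256-element size and the kernel name are arbitrary illustrations):

#pragma OPENCL EXTENSION cl_khr_fp16 : enable

// Hypothetical kernel: stage the weights into local memory once per work
// group, then read them from there in the inner loop.
__kernel void staged_weights(__global const half *src,
                             __global const half *weights,
                             __global half *dst,
                             int n)                      // assume n <= 256 here
{
    __local half w_local[256];
    event_t e = async_work_group_copy(w_local, weights, n, 0);
    wait_group_events(1, &e);

    int gid = get_global_id(0);
    half acc = 0.0h;
    for (int i = 0; i < n; ++i) {
        acc += src[gid * n + i] * w_local[i];
    }
    dst[gid] = acc;
}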
But I cannot help more specifically because I don't know what your bottleneck is.
As for fp16, I explained that in my first response.