I have a RISC-V dev platform with an IMG GPU. I was able to successfully build the Vulkan-NCNN framework and run Yolo object detection. What we noticed is that the CPU performs better: it takes around ~3.0 seconds to run Yolo v8 object detection, while the GPU takes ~6.2 seconds for the same detection.
Can you please help me with this and let me know if there is a way to optimise or improve the GPU performance so that it outperforms the CPU? It would also be highly appreciated if you could share more details on this issue.
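For reference, this is roughly how the GPU path is selected in ncnn. It is only a minimal sketch, not our exact code; the model file names and blob names are placeholders for whatever the Vulkan-NCNN example actually uses:

    // Minimal sketch of enabling ncnn's Vulkan path (placeholder file/blob names).
    #include <cstdio>
    #include "net.h"   // ncnn
    #include "gpu.h"   // ncnn Vulkan device management

    int main()
    {
        ncnn::Net net;
        net.opt.use_vulkan_compute = true;   // run supported layers on the GPU
        net.opt.num_threads = 4;             // CPU threads for layers that fall back to the CPU

        if (net.load_param("yolov8.param") || net.load_model("yolov8.bin"))
        {
            fprintf(stderr, "failed to load model\n");
            return -1;
        }

        // ... fill an ncnn::Mat with the preprocessed image, then:
        // ncnn::Extractor ex = net.create_extractor();
        // ex.input("images", in);
        // ex.extract("output", out);

        ncnn::destroy_gpu_instance();        // release the Vulkan instance at exit
        return 0;
    }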
I would like to ask: when you run the Yolo v8 object detection on the GPU, is the binary truly running its tasks on the GPU, or is it being emulated on the CPU? Please also verify that you have the PowerVR GPU drivers installed.
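One quick way to check, if it helps, is to enumerate the Vulkan physical devices and look at the reported device type, which distinguishes a real PowerVR GPU from a CPU implementation such as llvmpipe. Below is a minimal sketch, assuming the Vulkan loader and headers are available on the board:

    // Sketch: list Vulkan devices and their type (GPU vs. CPU-side implementation).
    #include <cstdio>
    #include <vector>
    #include <vulkan/vulkan.h>

    int main()
    {
        VkApplicationInfo app = { VK_STRUCTURE_TYPE_APPLICATION_INFO };
        app.apiVersion = VK_API_VERSION_1_0;

        VkInstanceCreateInfo ici = { VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO };
        ici.pApplicationInfo = &app;

        VkInstance instance;
        if (vkCreateInstance(&ici, nullptr, &instance) != VK_SUCCESS)
        {
            fprintf(stderr, "vkCreateInstance failed\n");
            return -1;
        }

        uint32_t count = 0;
        vkEnumeratePhysicalDevices(instance, &count, nullptr);
        std::vector<VkPhysicalDevice> devices(count);
        vkEnumeratePhysicalDevices(instance, &count, devices.data());

        for (uint32_t i = 0; i < count; i++)
        {
            VkPhysicalDeviceProperties props;
            vkGetPhysicalDeviceProperties(devices[i], &props);
            const char* type =
                props.deviceType == VK_PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU ? "integrated GPU" :
                props.deviceType == VK_PHYSICAL_DEVICE_TYPE_DISCRETE_GPU   ? "discrete GPU"   :
                props.deviceType == VK_PHYSICAL_DEVICE_TYPE_CPU            ? "CPU (software)" : "other";
            printf("device %u: %s (%s)\n", i, props.deviceName, type);
        }

        vkDestroyInstance(instance, nullptr);
        return 0;
    }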
Yes, the Vulkan-NCNN binary we executed is invoking the PowerVR GPU to perform the computation for object detection. We can also see that the reported GPU HW version matches our IP.
As for drivers, we have loaded our GPU DDK modules along with the necessary libraries in the setup.
Let me know if you need any additional details. Thank you!
I would suggest taking a recording with our profiling tool, PVRTuneDeveloper, which reads the GPU counters, to make sure there is actual GPU utilization. You can download the tool from the Downloads section of our Developer Portal (Login - Imagination Developers); please note that you will need to create a new account in order to access the Downloads section.
The tool is easy to use: install PVRPerfServer on the PowerVR device and start it, then connect to the device from the PVRTuneDeveloper GUI application on a desktop computer (over adb, ssh, etc.) and start a recording. You will see, in real time, task information for any workloads running on the GPU (so if no GPU workloads are running at that moment, no tasks will be shown). Feel free to ask any questions in case you need any help.
Also, please make sure it is a PowerVR driver that is running, and not something different such as a hardware-accelerated Mesa Vulkan driver using the GPU.
Additionally, I would like to ask some details about the device being used:
What exact RISC-V Dev platform are you using? Please mention model, CPU and GPU.
What OS and version?
Can you get information from the running PowerVR driver? You can try the command sudo cat /sys/kernel/debug/pvr/version
Hi @AlejandroC, thanks for your suggestion. We verified this with the PVRTune application and found that computation is indeed being performed on the GPU for the Yolo tests provided with the Vulkan-NCNN example sources. Please find the respective images obtained from PVRTune. However, what is still unclear is that, even going by the performance data PVRTune reports for the Yolo computation on the GPU, the time is still slightly higher than the CPU time.
Below are the Device Details as you requested:
CPU:
Architecture: riscv64
CPU(s): 4 (0-3)
Could you please provide the output of the following command? sudo cat /sys/kernel/debug/pvr/version
I would like to understand the details of the driver you are using (perhaps you have a PowerVR open-source driver, which might not provide the fastest performance possible).
Also, based on the screenshots provided, I can see the GPU workloads take between 2.0 s and 3.2 s. Are those the same workloads initially mentioned, which were taking about 6.2 s on the GPU? Maybe the Yolo object detection is performing other operations which slow down the whole process.
We are using the GPU DDK v1.15 driver to initialise the PowerVR Rogue GE8300 hardware, which is essential for enabling GPU acceleration in our development environment. Our goal is to leverage this setup to run the Vulkan-NCNN framework efficiently on the GPU and reduce the CPU overhead.
Yes, they are the same workloads. Since the Yolo application performs various other functions, we added code to measure the time for each of them (loading the model, detection, and drawing bounding boxes), and we noticed that the detection step alone took ~6 seconds to execute on the GPU (but ~2.0 s on the CPU). With PVRTune, however, we see the total GPU computation is about ~2.3 seconds, which we found is still slightly higher than the CPU.
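For reference, our per-stage timing is along these lines. It is only a rough sketch, not our exact code, and the three stage functions are placeholders for the example's actual calls:

    // Rough sketch of per-stage timing; the three stages are placeholders.
    #include <chrono>
    #include <cstdio>

    static void load_model()    { /* net.load_param(...); net.load_model(...); */ }
    static void detect_yolov8() { /* ex.input(...); ex.extract(...); */ }
    static void draw_objects()  { /* draw the bounding boxes on the image */ }

    template <typename F>
    static double time_ms(F&& f)
    {
        auto t0 = std::chrono::steady_clock::now();
        f();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    }

    int main()
    {
        printf("load      : %.1f ms\n", time_ms([] { load_model(); }));
        printf("detection : %.1f ms\n", time_ms([] { detect_yolov8(); }));
        printf("draw      : %.1f ms\n", time_ms([] { draw_objects(); }));
        return 0;
    }

Since the detection timer wraps the whole extract call, it presumably also includes the host-device transfers and the wait for the GPU to finish, which may account for part of the gap between the ~6 s wall-clock figure and the ~2.3 s of GPU tasks PVRTune shows.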
I am not familiar with the Vulkan-NCNN framework. There are many factors that could be affecting performance and leading to GPU underutilization: too many registers in long shaders, too many instructions per shader due to long loops, a workgroup count that does not suit the PowerVR architecture, or bandwidth pressure from intensive load/store operations.
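For completeness, frameworks like ncnn usually expose precision options that affect exactly these factors, bandwidth in particular. The sketch below assumes ncnn's Option API; depending on the ncnn version some of these flags may already be the default, and whether they actually help on a GE8300 with DDK 1.15 is something to confirm with a PVRTune recording rather than assume:

    // Sketch: ncnn Option flags that commonly influence the Vulkan path.
    #include "net.h"   // ncnn

    int main()
    {
        ncnn::Net net;
        net.opt.use_vulkan_compute  = true;
        net.opt.use_fp16_packed     = true;   // packed fp16 reduces load/store bandwidth
        net.opt.use_fp16_storage    = true;   // keep intermediate blobs in fp16
        net.opt.use_fp16_arithmetic = true;   // fp16 maths where the driver supports it
        // ... load the model and create the extractor as usual
        return 0;
    }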
Please take a PVRTune recording and check these counters: