At the media computing group I was asked if it would be possible to detect 3D objects in the video stream of the Kinect camera. There is previous work by Chavdar Papazov and Darius Burschk on detecting 3D objects in noisy and occulded scenes, which uses a RANSAC based algorithm to detect objects in depth images.![]()
Identifying the bottleneck
This can be applied to the depth data provided by the Kinect, however a pure CPU based implementation is far too slow for real-time application. After some profiling we found out that most steps of the online recognition phase are very efficient, except for step 6: Testing and accepting the hypothesis. Which can take up to 70 seconds in high quality recognition settings.
Implementation on CUDA
For one recognition phase there are about 100.000 hypotheses that need to be tested, each of them independently. So this is a scenario which fits very well with the SIMD structure of GPUs.
So my last project for the media computing group was to port this algorithm to utilize the GPU with Nvidia CUDA. Actually the CUDA part was very comfortable, the hypothesis testing algorithm does not require complicated data structures so I could just upload all information to the GPU and execute nearly the original C code on the GPU.
Eigendecomposition on CUDA
But there was one big challenge for me. The algorithm requires a polar decomposition of a real 3×3 matrix. Which is based on a singular value decomposition, which then can be implemented using eigen decompositions of real, symmetric 3×3 matrices. In the original code this is done using OpenCV, which cannot be ported to CUDA easily.
So I had to implement this from scratch. To be honest it was not very easy for me as I haven’t been using any linear algebra after the exam ~5 years ago. But as always with math, Wikipedia is a great help!
I based my implementation on an example by S. Melax, and implemented some Matrix, Vector and Quaternion classes to get this running. A few days and nights later I was testing my CUDA based algorithm on millions of matrices and compared them to the OpenCV results. It works perfectly. Interestingly only in ~99.9975% of the cases, but for this application this is sufficient.
Speedup
Long story short. After putting it all together I achieved an approximate speedup of factor 7. Please remark that the test machine was using an very old Geforce 8600GT, one of the first CUDA capable devices. I believe much higher speedups can be expected on more modern GPUs.
The groundwork for real-time recognition is done, and I already have a hell lot of ideas how to further improve performance of this algorithm. Unfortunately my contract ends and my job here is over, but I am curious to see which further improvements will be made on this technique.
PS: You can download my C++ decomposition code, it does not have any requirements except CUDA and the standard C math library.