CPU 10x faster than GPU: recommendations for speeding up my GPU implementation

Hello,

I am coding a drone control algorithm (using modern control theory, not reinforcement learning) and have been testing Pytorch as a replacement for Numpy. The algorithm receives as inputs the state of the drone and a desired trajectory, and computes the control inputs for the drone to follow that trajectory. It must do this at a rate of at least 100Hz.

The main purpose of trying Pytorch is to see whether there would be any gains from using the GPU, since most of the operations are matrix-vector operations.

Links to source code of both controllers: numpy implementation, pytorch implementation and the script where I call each of them.

During my testing I found that the same control algorithm written with numpy and running on the CPU is at least 10x faster than the Pytorch implementation running on the GPU (using only torch functions). I tried both on a desktop computer and on a Jetson Nano, with quite similar and interesting results:

Two tests on desktop:

  • Intel Core i5
  • GeForce GTX 1050
  • CUDA 10
  • Pytorch 1.2

Two tests on Jetson Nano:

  • ARMv8
  • Nvidia Tegra X1
  • Pytorch 1.2

The graphs, one for each test, show the distribution of computation times for the code running on a) numpy on CPU (blue), b) Pytorch on CPU (green), and c) Pytorch on GPU (red).

In both hardware configurations, numpy on CPU was at least 10x faster than Pytorch on GPU. Also, Pytorch on CPU is faster than Pytorch on GPU. On the desktop, Pytorch on CPU can even be, on average, faster than numpy on CPU. Finally (and unluckily for me), Pytorch on GPU running on the Jetson Nano cannot achieve 100Hz throughput.

What I am actually interested in is getting the Pytorch GPU implementation on the Jetson Nano to reach performance similar to its CPU speed on the same board (>=100Hz throughput), since I cannot attach a desktop to a drone.

Reading around, it seems some possible issues are:

  • Data transfer between CPU and GPU can be very expensive
  • Tensor type and dtype used

I am not sure how to dramatically improve on this. Currently all operations are done on torch.FloatTensor; I load all the data onto the GPU at the beginning of each iteration, all computation is done only on the GPU, and I offload from the GPU only at the end, when all the results are ready.
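
Roughly, each control iteration follows this pattern (a minimal sketch; the gain matrix, names and shapes are placeholders, not my actual controller):

        import torch

        device = torch.device("cuda")

        # Placeholder "controller gain", defined once and kept on the GPU.
        K = torch.rand(4, 12, device=device)

        def control_step(state_np, reference_np):
            # Single upload at the beginning of the iteration...
            state = torch.as_tensor(state_np, dtype=torch.float32, device=device)
            ref = torch.as_tensor(reference_np, dtype=torch.float32, device=device)

            # ...all matrix-vector work stays on the GPU (stand-in for the
            # real controller math)...
            u = K @ (ref - state)

            # ...and a single download at the end, once the results are ready.
            return u.cpu().numpy()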

I am aware this is not the main purpose for which Pytorch was created, but I would like advice on how to optimize the performance of Pytorch on the GPU on a smaller platform like the Jetson Nano, and hopefully get a 10x increase in performance.

Any advice will be very welcome!

Juan

For reference, I measured execution times as follows:

For GPU:

        import torch

        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)

        start.record()
        # I put my code here
        end.record()
        torch.cuda.synchronize()
        execution_time = start.elapsed_time(end)  # milliseconds

For CPU:

        import time

        start = time.time()
        # I put my code here
        end = time.time()
        execution_time = end - start  # seconds

It is practically impossible to help you without taking a look at the code.
Also, are you utilizing batches for inference on the GPU?

@dambo, thanks for your reply. I just edited the post to include links to the source code at the top.

Regarding batches, I am actually not using them. I have used Pytorch for machine learning in vision, where a dataset is ready beforehand. However, in this case the data arrives at the instant it should be processed, so I am not sure whether batch processing is possible.

Using the GPU typically involves extra latency, especially when done via friendly APIs in Python. For a large NN that latency is typically dwarfed by the number of FLOPs, so it is no big deal. For real-time control, with minimal FLOPs and a low-latency / high-frequency requirement, it is likely not the best fit.

I’d also question why Python? If you implemented that loop in C or C++ and re-plotted your graph, it would be hard to fit any of the existing measurements on the same scale. There is so much extra overhead under the covers of any Python app that you’d likely get a 100x+ speedup on the CPU by writing an algorithm like that in C++.

@rwightman Thanks for your reply.

When you mention “implement that loop in C or C++”, do you mean purely C++ running on the CPU, or using the torch C++ frontend on the GPU? If the latter, I totally agree that a C++ implementation will be faster than Python running on the CPU, but since I have not tried the torch C++ frontend, I am not sure whether an implementation of the controller using it would have a significant impact on execution speed on the GPU.

About Python, I simply wanted to try it. I am used to it for machine learning and have also seen some use cases of Pytorch+Python in optimal control (and general optimization) with interesting results; optimization in particular benefits from Pytorch’s automatic differentiation. Also, coming from a C background, coding in Python is much easier.

@dambo @rwightman I performed some more timing tests to try to understand whether the bottleneck could be in the CPU-to-GPU transfer (a rough sketch of the timing code is right after this list):

  • Move 6x 3-element np.arrays from CPU to GPU
  • Move a single 6x3 np.array from CPU to GPU
  • Create 6x 3-element torch.tensor directly on GPU whose elements are np.float variables
  • Create 6x 3-element torch.tensor directly on GPU whose elements are constant floats (hard coded)
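
For reference, this is roughly how I timed the four variants (a minimal sketch; the actual contents of the arrays and tensors do not matter):

        import numpy as np
        import torch

        device = torch.device("cuda")
        dtype = np.float32  # repeated with np.float16 and np.float64 as well

        def gpu_time_ms(fn):
            # Same measurement scheme as above: CUDA events around a single call.
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            fn()
            end.record()
            torch.cuda.synchronize()
            return start.elapsed_time(end)

        six_small = [np.random.rand(3).astype(dtype) for _ in range(6)]
        one_big = np.random.rand(6, 3).astype(dtype)
        six_floats = [float(v) for v in np.random.rand(6)]

        # 1) six separate 3-element arrays -> six transfers
        t1 = gpu_time_ms(lambda: [torch.from_numpy(a).to(device) for a in six_small])
        # 2) one 6x3 array -> a single transfer
        t2 = gpu_time_ms(lambda: torch.from_numpy(one_big).to(device))
        # 3) six 3-element tensors created on the GPU from float variables
        t3 = gpu_time_ms(lambda: [torch.tensor([v, v, v], device=device) for v in six_floats])
        # 4) six 3-element tensors created on the GPU from hard-coded floats
        t4 = gpu_time_ms(lambda: [torch.tensor([0.1, 0.2, 0.3], device=device) for _ in range(6)])

        print(t1, t2, t3, t4)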

I ran these 4 tests using np.float16, np.float32 and np.float64, on both my desktop and the Jetson Nano. The findings are quite interesting and counter-intuitive:

  • The fastest operation is moving a single 6x3 np.array from CPU to GPU, both on the desktop and on the Jetson Nano. It is around 4x faster than the other operations, and even faster than creating 6x 3-element torch.tensors directly on the GPU.
  • Using np.float64 is slightly faster than both np.float32 and np.float16
  • Transfers are about 10x faster on the desktop than on the Jetson Nano
  • In effect, transferring 6x 3-element np.arrays takes as much as 5.5ms - 6.5ms, which is already too much for calculations at 100Hz, where the whole budget is 10ms.

It seems that one approach to improve performance is to transfer all the data in a single np.array rather than in small pieces spread across several individual np.arrays.

However, does this make sense at all? Is it really possible that transferring a 6x3 np.array is faster than creating 6 torch.tensors directly on the GPU? It sounds too good to be true. I understand that batch processing is one of the key points of using a GPU… but “batch transferring”… I am not really sure.
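
Concretely, the “batch transferring” idea would look something like this (the input names here are just placeholders for my actual state and trajectory vectors):

        import numpy as np
        import torch

        device = torch.device("cuda")

        # Placeholder per-iteration inputs, each a 3-element vector.
        pos, vel, acc = np.random.rand(3), np.random.rand(3), np.random.rand(3)
        pos_d, vel_d, acc_d = np.random.rand(3), np.random.rand(3), np.random.rand(3)

        # Instead of six separate .to(device) calls, stack everything into one
        # 6x3 array and pay the host-to-device transfer overhead only once.
        packed = np.stack([pos, vel, acc, pos_d, vel_d, acc_d]).astype(np.float32)
        gpu_packed = torch.from_numpy(packed).to(device)

        # Rows can then be sliced out on the GPU; slicing gives views, so no
        # extra copies from the host are needed.
        pos_g, vel_g, acc_g = gpu_packed[0], gpu_packed[1], gpu_packed[2]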

Any further comments are appreciated.

Below are the graphs:

I added a few more tests:

  • Creating a 6x3 torch.tensor using torch.rand(6,3)
  • Creating a 6x3 torch.tensor directly on GPU using variables
  • Creating a 6x3 torch.tensor directly on GPU using constants

It seems that

  • Creating a torch.rand(6,3) directly on the GPU is the fastest operation
  • However, creating a 6x3 np.array and then transferring it to the GPU is still faster than creating a 6x3 torch.tensor directly on the GPU from variables or constants. This is counter-intuitive.

I’m not sure what you mean by “directly on GPU using constants”. There are no constants in Python, and any value in Python exists on the CPU and has to be transferred to the GPU at some point. You can make sure those tensors are defined and moved to the GPU outside of your high-frequency loop and never modified; that is as close to a constant as you get.
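
Something like this is what I have in mind (a rough sketch, the gain values are made up):

        import torch

        device = torch.device("cuda")

        # Define the "constant" tensors once, outside the high-frequency loop,
        # and move them to the GPU a single time.
        K_p = torch.eye(3, device=device)
        K_d = 0.5 * torch.eye(3, device=device)

        def control_step(err, err_dot):
            # Only err and err_dot change per iteration; K_p and K_d are
            # never re-created or re-transferred.
            return K_p @ err + K_d @ err_dot

        u = control_step(torch.rand(3, device=device), torch.rand(3, device=device))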

I’m not an expert on the nitty-gritty details of CUDA kernels or the specific Pytorch mechanics around their handling. I believe a typical kernel launch latency is around 10us, going down to about 5us and up to I’m not sure where; I’ve seen traces with 50us, etc.

So, if you are doing convolutions or matrix multiplications with tens or hundreds of thousands of elements, you don’t notice that. But for a tight loop of small operations that can only partially leverage that parallelization, a modern CPU could probably crank out thousands of iterations of your (C/C++ optimized) loop in the time it takes to launch a single kernel on the GPU.
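
You can see that fixed per-call cost even from Python by timing the same tiny matrix-vector product on CPU and GPU (a rough sketch; the absolute numbers will vary a lot by machine):

        import time
        import torch

        def per_op_seconds(device, n=1000):
            A = torch.rand(6, 3, device=device)
            x = torch.rand(3, device=device)
            if device == "cuda":
                torch.cuda.synchronize()
            start = time.time()
            for _ in range(n):
                y = A @ x
            if device == "cuda":
                torch.cuda.synchronize()  # wait for all launched kernels to finish
            return (time.time() - start) / n

        print("CPU per op:", per_op_seconds("cpu"))
        print("GPU per op:", per_op_seconds("cuda"))  # dominated by launch overhead at this size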

If you want to keep it a little higher level, maybe use Eigen; you’d get a bit of cross-platform ability in the sense that it may be able to leverage some NEON or SSE depending on which platform you compile for.

You are right, I meant to say ‘hard-coded floats’. Thanks for the insights on CUDA kernels and Eigen suggestion.