How can I speed up my custom layer in PyTorch?


I have written a custom layer using NumPy, but it is very slow during training because the code contains several nested for loops (over batch_size, out_channels, and in_channels).

Is there any method I can try to speed up the training process?

I tried to use joblib.Parallel, but it failed in the PyTorch forward and backward passes.
I also tried replacing the NumPy operations with torch operations and then calling .cuda(), but it is still very slow.

I saw the article about C++ and CUDA extensions, but I am still confused and don’t know where to start.

And if I want to write the custom layer in C++, do I need to know C++ very well?

Thanks in advance.

You probably want to look into vectorizing your function better, i.e. ideally eliminate the loops over the in and out channels (the batch loop might be less urgent).
Unless you go all the way to custom CUDA kernels (and that’s probably quite some effort), you’ll have a hard time solving this, even using C++.
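
To make the vectorizing idea concrete, here is a minimal sketch. The actual operation of your layer isn't shown in this thread, so the code assumes a hypothetical stand-in: a weighted sum over input channels, with x of shape (batch, in_channels, length) and w of shape (out_channels, in_channels). The names loop_forward and vectorized_forward are made up for illustration. The point is only that the three nested loops collapse into a single torch.einsum call that autograd can differentiate.

```python
import torch

def loop_forward(x, w):
    # Hypothetical slow version: explicit loops over batch, out and in channels.
    # x: (batch, in_channels, length), w: (out_channels, in_channels)
    batch, in_ch, length = x.shape
    out_ch = w.shape[0]
    out = torch.zeros(batch, out_ch, length, dtype=x.dtype, device=x.device)
    for b in range(batch):
        for o in range(out_ch):
            for i in range(in_ch):
                out[b, o] += w[o, i] * x[b, i]
    return out

def vectorized_forward(x, w):
    # Same computation as one tensor contraction over the in_channels index.
    return torch.einsum("oi,bil->bol", w, x)

torch.manual_seed(0)
x = torch.randn(4, 3, 8)
w = torch.randn(5, 3)

# Both versions should agree up to floating-point rounding.
same = torch.allclose(loop_forward(x, w), vectorized_forward(x, w), atol=1e-5)

# Autograd works through the vectorized version without a custom backward.
w_param = w.clone().requires_grad_()
vectorized_forward(x, w_param).sum().backward()
grad_shape = tuple(w_param.grad.shape)
```

If your real operation does not reduce to an einsum, broadcasting (for example x[:, None] against w[None, :, :, None]) is often another way to express a per-channel-pair computation without Python loops, at the cost of a larger intermediate tensor.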

Best regards


Thanks for your reply. First, I will try to vectorize my function. If it is still slow, I will take some time to learn CUDA.