Hello,
I have written a custom layer using NumPy, but it is very slow during training because the code contains several nested for loops (over batch_size, out_channels, and in_channels).
Is there any method I can try to speed up the training process?
I tried to use joblib.Parallel, but it failed in the PyTorch forward and backward passes.
I also tried replacing the NumPy operations with torch operations and calling .cuda(), but it is still very slow.
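For illustration only, here is a minimal sketch of the kind of loop pattern described above, next to a vectorized version. It is not the actual layer: the shapes, the function names `looped` and `vectorized`, and the multiply-accumulate body are all hypothetical stand-ins. The point is that moving tensors to the GPU may not help while the Python loops remain, since each iteration still launches a tiny operation; a single batched call such as `torch.einsum` avoids that.

```python
import torch

def looped(x, w):
    # Hypothetical stand-in for the slow layer.
    # x: (batch_size, in_channels), w: (out_channels, in_channels)
    batch_size, in_channels = x.shape
    out_channels = w.shape[0]
    out = torch.zeros(batch_size, out_channels, dtype=x.dtype, device=x.device)
    for b in range(batch_size):
        for o in range(out_channels):
            for i in range(in_channels):
                # One tiny multiply-add per iteration: slow on CPU and GPU alike.
                out[b, o] += x[b, i] * w[o, i]
    return out

def vectorized(x, w):
    # Same computation as a single batched operation, no Python loops.
    return torch.einsum("bi,oi->bo", x, w)

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4, 3, device=device)
w = torch.randn(5, 3, device=device)

# Both versions should agree up to floating-point rounding.
assert torch.allclose(looped(x, w), vectorized(x, w), atol=1e-5)
```

Because the vectorized version uses only built-in torch operations, autograd can differentiate it without a hand-written backward. Whether a real layer can be rewritten this way depends on what happens inside its loops.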
I have seen the article on C++ and CUDA extensions, but I am still confused and don’t know where to start.
Also, if I want to write the custom layer in C++, do I need to know C++ very well?
Thanks in advance.