I created my own custom loss. In both forward and backward, I need to iterate over each sample using a Python for loop. I don't see a way to vectorize the operation, since each sample has different properties.
This approach seems to make the computation very slow, and using a GPU does not really help. My guess is that the for loop is not parallelized when running on the GPU.
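To make the setup concrete, here is a minimal sketch of the kind of loss described above. It assumes PyTorch (the framework is not named here), and the loss itself is hypothetical: a squared error where each sample only uses its first `lengths[i]` entries. The names `PerSampleLoss`, `pred`, `target`, and `lengths` are made up for illustration and are not the actual loss.

```python
import torch


class PerSampleLoss(torch.autograd.Function):
    """Hypothetical loss: sample i only uses its first lengths[i] entries."""

    @staticmethod
    def forward(ctx, pred, target, lengths):
        ctx.save_for_backward(pred, target, lengths)
        batch = pred.shape[0]
        total = pred.new_zeros(())
        # One Python iteration per sample; each iteration works on a small slice.
        for i in range(batch):
            n = int(lengths[i])  # reads one value back to the host
            diff = pred[i, :n] - target[i, :n]
            total = total + (diff ** 2).sum() / n
        return total / batch

    @staticmethod
    def backward(ctx, grad_output):
        pred, target, lengths = ctx.saved_tensors
        batch = pred.shape[0]
        grad = torch.zeros_like(pred)
        # Same per-sample loop for the gradient with respect to pred.
        for i in range(batch):
            n = int(lengths[i])
            grad[i, :n] = 2.0 * (pred[i, :n] - target[i, :n]) / (n * batch)
        # No gradients for target and lengths.
        return grad * grad_output, None, None


# Usage: runs on CPU or GPU, but the loop body is executed sample by sample.
pred = torch.randn(4, 5, requires_grad=True)
target = torch.randn(4, 5)
lengths = torch.tensor([5, 3, 1, 2])
loss = PerSampleLoss.apply(pred, target, lengths)
loss.backward()
```

In a loop like this, every iteration issues a few small operations one after another, and `int(lengths[i])` pulls a value back to the host each time, so the GPU likely has very little work to do in parallel.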
How can I write a for loop in forward and backward so that it can be parallelized on the GPU?
Thanks