How to make a for loop inside a custom loss parallelizable on the GPU?

I created my own custom loss. In both forward and backward, I need to enumerate the samples using a for loop in Python. There is no way to vectorize the operation, since each sample has different properties.
This approach makes the computation very slow, and using a GPU does not really help. My guess is that the for loop is not being parallelized on the GPU.
How can I write a for loop in forward and backward that can be parallelized by the GPU?
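For concreteness, the loop looks roughly like this (the names and the per-sample branches here are made up; my real loss is more involved):

```python
import torch

def my_loss(pred, target, sample_props):
    # sample_props: one "property" per sample that changes how
    # that sample's loss term is computed (hypothetical example)
    total = pred.new_zeros(())
    for i in range(pred.size(0)):
        if sample_props[i] == 0:
            total = total + (pred[i] - target[i]).pow(2).sum()
        else:
            total = total + (pred[i] - target[i]).abs().sum()
    return total / pred.size(0)
```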

Thanks

There is no easy way around this yet. We are working on a JIT compiler that should give automatic batching, but it won't be ready in the near future.
For now, the only (hard) paths to making it faster are likely to either partly vectorize your computation or to learn GPU programming.
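As a rough sketch of what partial vectorization can look like, assuming the per-sample differences reduce to a small number of cases (as in the hypothetical example above): group the samples that share a property and run one batched op per group, so the Python loop runs once per distinct property value instead of once per sample.

```python
import torch

def my_loss_grouped(pred, target, sample_props):
    # sample_props: 1-D LongTensor of per-sample "types" (assumed)
    total = pred.new_zeros(())
    for prop in sample_props.unique():
        mask = sample_props == prop        # all samples of this type
        diff = pred[mask] - target[mask]   # one vectorized op per group
        if prop == 0:
            total = total + diff.pow(2).sum()
        else:
            total = total + diff.abs().sum()
    return total / pred.size(0)
```

If you can express the per-sample logic entirely with built-in tensor ops like this, autograd will also derive the backward pass for you, so the Python loop in backward disappears as well.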