Weird behavior observed with Autograd

@ptrblck I am trying to run the TRPO algorithm, which uses a batch size of 15000. Autograd takes 27 seconds to backpropagate through the full batch of 15000, but only 19 seconds to compute the individual gradients sequentially (looping over the batch with batch size 1). What I find strange is that the batched, parallel computation is slower than the sequential loop (I agree that the latter is space inefficient).
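A minimal sketch of the kind of comparison described above, so the timing can be reproduced. The network, input sizes, and helper names (`batched_grad`, `looped_grad`, `timed`) are hypothetical stand-ins rather than the actual TRPO code, and the batch is kept small so it runs quickly on CPU:

```python
import time

import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the policy network (assumption, not the real model).
model = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 1))
batch = torch.randn(256, 8)  # far smaller than 15000, just for illustration


def batched_grad(model, x):
    """Single backward pass over the summed output of the whole batch."""
    model.zero_grad()
    model(x).sum().backward()
    return [p.grad.clone() for p in model.parameters()]


def looped_grad(model, x):
    """One backward pass per sample; gradients accumulate in .grad."""
    model.zero_grad()
    for i in range(x.shape[0]):
        model(x[i:i + 1]).sum().backward()
    return [p.grad.clone() for p in model.parameters()]


def timed(fn, *args):
    """Return the function result and its wall-clock time in seconds."""
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start


g_batched, t_batched = timed(batched_grad, model, batch)
g_looped, t_looped = timed(looped_grad, model, batch)
print(f"batched: {t_batched:.4f}s, looped: {t_looped:.4f}s")
```

Both paths should produce the same accumulated gradients up to floating-point error, so any difference is purely in run time. If the model runs on a GPU, calling `torch.cuda.synchronize()` before reading the timer is needed, because CUDA kernels launch asynchronously and the measured time can otherwise be misleading.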

What kind of batch norms have you implemented?