Is PyTorch suitable for speeding up a large number of independent optimizations?

For my application, I need to run several million independent nonlinear optimizations. Each problem is three-dimensional, but the cost function takes very different shapes and has several local minima. The only approach that works in all cases is BFGS with a starting point determined by brute-force minimization. At the moment, I am using SciPy and MPI to run the computations in parallel. As far as I can see, a further significant speedup can only be achieved by using the GPU. The brute-force minimization can easily be implemented in CUDA, but a mature GPU library for the BFGS algorithm is lacking (there is a three-year-old one, which is not suitable for my application as far as I can see).
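
For reference, here is a minimal sketch of what one of these optimizations looks like on the CPU. The cost function, the parameter `a`, the bounds, and the grid size are made-up stand-ins, not my real problem; only `scipy.optimize.brute` and `scipy.optimize.minimize` are the actual SciPy calls.

```python
import numpy as np
from scipy import optimize


def cost(x, a):
    # Hypothetical 3-D cost with several local minima (stand-in for the real one).
    d = x - a
    return np.sum(d ** 2 + 2.0 * (1.0 - np.cos(4.0 * d)))


def solve_one(a, bounds=((-2.0, 2.0),) * 3, grid_points=9):
    # Stage 1: brute-force grid search for a starting point (no polishing step).
    x0 = optimize.brute(cost, bounds, args=(a,), Ns=grid_points, finish=None)
    # Stage 2: local refinement with BFGS (gradient from finite differences).
    res = optimize.minimize(cost, x0, args=(a,), method="BFGS")
    return res.x, res.fun


# One problem instance; in practice MPI distributes millions of these.
a = np.array([0.3, -0.7, 1.1])
x_min, f_min = solve_one(a)
```

Each call is cheap on its own; the cost comes purely from the number of problems.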

Therefore, my question is whether PyTorch is a good choice for this problem. It offers GPU support and a quasi-Newton optimizer (L-BFGS, torch.optim.LBFGS, rather than full BFGS), but it seems to be geared toward deep learning applications.
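
To make the question concrete, this is roughly what I imagine a batched PyTorch version would look like. It is only a sketch under assumptions: the cost function, problem count, and grid are hypothetical, and all problems are stacked into one tensor and minimized through the sum of their costs with a single `torch.optim.LBFGS` instance.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"


def cost(x, a):
    # Hypothetical batched cost; x and a broadcast to (..., 3), one value per problem.
    d = x - a
    return (d ** 2 + 2.0 * (1.0 - torch.cos(4.0 * d))).sum(dim=-1)


torch.manual_seed(0)
n_problems = 1000  # illustrative size, not the real workload
a = torch.rand(n_problems, 3, device=device) * 2.0 - 1.0  # per-problem parameters

# Stage 1: brute force, evaluating every problem on a shared 9x9x9 grid.
axis = torch.linspace(-2.0, 2.0, 9, device=device)
grid = torch.cartesian_prod(axis, axis, axis)        # (729, 3)
values = cost(grid[None, :, :], a[:, None, :])       # (n_problems, 729)
start_cost = values.min(dim=1).values
x = grid[values.argmin(dim=1)].clone().requires_grad_(True)

# Stage 2: one L-BFGS run over the summed cost of all problems.
opt = torch.optim.LBFGS([x], max_iter=200, line_search_fn="strong_wolfe")


def closure():
    opt.zero_grad()
    loss = cost(x, a).sum()
    loss.backward()
    return loss


opt.step(closure)
final_cost = cost(x.detach(), a)  # per-problem cost after refinement
```

What I am unsure about is the second stage: the gradient of the summed cost separates per problem, but the optimizer keeps one curvature history and one line-search step size for the whole stacked vector, so this is not equivalent to running an independent BFGS on each problem.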

Thanks for your help!