Solving many independent optimization problems with `torch.optim`

I’m interested in doing the torch equivalent of something like this in Scipy:

import numpy as np
from scipy.optimize import minimize

def objective(x):
    return np.mean(x ** 2)

# a batch of initial conditions
# batch size is 101, problem dimension is 3
x0s = np.random.normal(size=(101, 3))
# use BFGS (minimize's default method here) to solve these problems independently
xs = [minimize(objective, x0).x for x0 in x0s]

In PyTorch, I could write

import torch
from torch.optim import LBFGS
xs = torch.tensor(x0s, requires_grad=True)
opt = LBFGS([xs])
def closure():
    opt.zero_grad()
    # objective is the MSE, and we sum the independent objectives
    # xs.shape is (101, 3) as above
    loss = torch.mean(xs ** 2, 1).sum()
    loss.backward()
    return loss
opt.step(closure)

and this works, in the sense that the code runs; however, it treats the batch as a single (101*3)-dimensional problem rather than as 101 independent 3-dimensional problems. That should be much more expensive for a method like L-BFGS that builds an approximation to the (inverse) Hessian, and it would presumably also change the stopping behaviour relative to the per-problem Scipy loop above.
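
(A quick way to see the dimensionality the optimizer is actually working with, using the xs and opt from the snippet above:)

# LBFGS sees one flat 303-dimensional parameter vector, not 101 separate problems
print(sum(p.numel() for p in opt.param_groups[0]["params"]))  # 303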

I’m wondering: is my understanding here correct, and if so, is there any way to run the optimizations in a truly independent way with torch? Maybe some vmap strategy?
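
For concreteness, this is the kind of thing I have in mind: a minimal sketch using torch.func.vmap and torch.func.grad (available in PyTorch 2.x), with plain gradient descent standing in for L-BFGS, since I don't see how to batch the L-BFGS state itself:

import torch
from torch.func import grad, vmap

def objective(x):
    # one 3-dimensional problem
    return torch.mean(x ** 2)

# gradient of a single problem, mapped over the batch dimension
batched_grad = vmap(grad(objective))

xs = torch.randn(101, 3)
for _ in range(100):
    # plain gradient descent step, applied to all 101 problems at once
    xs = xs - 0.1 * batched_grad(xs)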

After some experiments scaling up the problem size (both the number of problems and the problem dimension), I’ve concluded that I don’t understand the full picture: torch is much faster than scipy even on huge problems. I’m happy about that, but I’d still welcome any perspectives on the questions above.

For reference, the scaling benchmark I ran was:

  • scipy, in a notebook so that %timeit works:

    import numpy as np
    from scipy.optimize import minimize
    
    def objective(x):
        return np.mean(np.diff(x) ** 2) + x[9] ** 2
    
    # a batch of initial conditions
    # batch size is 10001, problem dimension is 10
    x0s = np.random.normal(size=(10001, 10))
    # use L-BFGS-B to solve each problem independently
    %timeit xs1 = np.array([minimize(objective, x0, method="L-BFGS-B").x for x0 in x0s])
    

    [30.8 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)]

  • torch:

    %%timeit
    xs2 = torch.tensor(x0s, requires_grad=True)
    opt = LBFGS([xs2], max_iter=100)
    def closure():
        opt.zero_grad()
        # sum the independent objectives
        loss = (torch.mean(torch.diff(xs2, 1) ** 2, 1) + xs2[:, 9] ** 2).sum()
        loss.backward()
        return loss
    opt.step(closure)
    

    [80.1 ms ± 2.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)]

The objective values are comparable (torch’s is slightly smaller).
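
(A sketch of the comparison I mean, assuming both cells are re-run once without the %timeit magic so that xs1 and xs2 persist:)

# per-problem objective values at the solutions found by each library
f_scipy = np.mean(np.diff(xs1, axis=1) ** 2, axis=1) + xs1[:, 9] ** 2
f_torch = (torch.mean(torch.diff(xs2, dim=1) ** 2, 1) + xs2[:, 9] ** 2).detach().numpy()
print(f_scipy.mean(), f_torch.mean())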

This is all on CPU. Watching htop, the torch run doesn’t exceed 100% CPU (i.e. a single core), so I think it’s fair to compare against scipy running serially.
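
(If I wanted to force a strictly single-threaded torch run for the comparison, I believe setting the thread count before the benchmark would do it; I haven't re-run with this:)

import torch
torch.set_num_threads(1)  # limit intra-op parallelism to a single thread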