L-BFGS memory increasing with number of classes, not parameters?

I have a model that takes as input a tensor of shape (c, n, n) and transforms it into a tensor of shape (c, m, m) by multiplying each n-by-n matrix on both sides with a parameter `filters` of shape (m, n), where m << n.
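
To make that concrete, here is roughly what the transform looks like (a minimal sketch with made-up sizes, not the exact code from the repo):

```python
import torch

c, n, m = 1700, 200, 10                          # illustrative sizes only
inputs = torch.randn(c, n, n)                    # input tensor of shape (c, n, n)
filters = torch.randn(m, n, requires_grad=True)  # learnable parameter of shape (m, n)

# Multiply each n-by-n matrix on both sides: (m,n) @ (n,n) @ (n,m) -> (m,m)
outputs = torch.einsum('ij,cjk,lk->cil', filters, inputs, filters)  # (c, m, m)
```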

For the loss, the output tensor of shape (c,m,m) is used to compute a pairwise distance matrix of size c-by-c (in this problem, c is the number of classes, each having an associated m-by-m matrix).
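
The actual distance in the repo is more elaborate, but the structure of the loss is something like this (sketch with a plain Frobenius distance and a placeholder reduction standing in for the real loss):

```python
# Pairwise distances between the c output matrices of shape (m, m).
# Frobenius distance as a stand-in for the distance actually used in the repo.
flat = outputs.reshape(c, -1)        # (c, m*m)
pairwise = torch.cdist(flat, flat)   # (c, c) pairwise distance matrix
loss = -pairwise.mean()              # placeholder scalar reduction, not the real loss
```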

I am using the L-BFGS algorithm to optimize this, and it works great for most problems. However, in one problem where c is large (1700), I run into memory problems. What confuses me is that, according to the L-BFGS documentation, memory usage should be determined by the size of the parameter. Here, however, shrinking the parameter `filters` by reducing n or m doesn't prevent the memory crash; only reducing the number of classes c does. And the parameter sizes (that is, m and n) are comparable to those in other problems where the model worked fine, which had fewer classes c.
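
For reference, the optimization is the standard closure-based L-BFGS setup (simplified sketch; the real training loop is in the repo):

```python
optimizer = torch.optim.LBFGS([filters], history_size=100,
                              line_search_fn='strong_wolfe')

def closure():
    optimizer.zero_grad()
    out = torch.einsum('ij,cjk,lk->cil', filters, inputs, filters)
    flat = out.reshape(c, -1)
    loss = -torch.cdist(flat, flat).mean()  # placeholder loss, see above
    loss.backward()
    return loss

for _ in range(50):
    optimizer.step(closure)
```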

Is there some reason why this might happen? And is there a possible work-around, since it doesn't seem to be the number of parameters that's causing the issue?

The code is in this repo: GitHub - dherrera1911/blur-spectral-sqfa. The model and data seem a bit too complicated to boil down into a minimal example, but I can try if needed.