I am interested in computing the diagonal values of the Hessian of the loss function with respect to the parameters of an
nn.Module to use for statistical inference (standard errors depend on the Hessian). The complication is that the model is estimated using mini-batches, because the full data doesn’t fit into memory, and existing solutions I’ve come across only give you the Hessian for the loss evaluated on a single batch of data.
Does anyone have ideas for how to compute the Hessian evaluated on the full dataset? For a single batch, a (not safe) method like the code below works. My current thinking is that I can just compute the Hessian for a loss function with `sum’ reduction on each batch separately and then sum across all batches, but I haven’t yet convinced myself this is valid so figured I’d see if someone else had thoughts
_input=torch.randn(32,3) layer = nn.Linear(3,4) criterion=nn.CrossEntropyLoss() weight = layer.weight def func(weight): del layer.weight layer.weight=weight return criterion(layer(_input), torch.zeros(len(_input),dtype=torch.long)) torch.autograd.functional.hessian(func,weight)