Numerical stability differences between PyTorch 0.4.1 and 1.0.0

I was developing a new model on machines with PyTorch 0.4.1 and then moved the code to machines with 1.0.0, where the models performed significantly worse. I controlled for changes to the default inits in nn.Linear and nn.Conv2d by forcing all layers to initialize using the 0.4.1 defaults. I also pulled the 0.4.1 implementation of Adam into my codebase so that all models used the same optimizer.
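The re-init I apply is roughly the following (a minimal sketch; init_like_041 is just an illustrative name, and it assumes the 0.4.1 defaults were the uniform(-1/sqrt(fan_in), +1/sqrt(fan_in)) scheme for both weights and biases):

import math
import torch.nn as nn

def init_like_041(module):
    # re-apply (what I believe were) the 0.4.1 default inits for Linear/Conv2d:
    # weights and biases drawn from uniform(-1/sqrt(fan_in), +1/sqrt(fan_in))
    if isinstance(module, nn.Linear):
        fan_in = module.weight.size(1)
    elif isinstance(module, nn.Conv2d):
        fan_in = module.in_channels * module.kernel_size[0] * module.kernel_size[1]
    else:
        return
    stdv = 1. / math.sqrt(fan_in)
    module.weight.data.uniform_(-stdv, stdv)
    if module.bias is not None:
        module.bias.data.uniform_(-stdv, stdv)

# usage: model.apply(init_like_041)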

The performance difference seems to be related to numerical stability, particularly in F.log_softmax and nn.CrossEntropyLoss (which I think relies on F.log_softmax). I believe this for the following reasons.

I set up a pair of conda environments on the same machine, differing only in their PyTorch version, and ran two models with the same code in each env. Within a given env, the only difference between the models was that one cast the “logit tensors” to float64 before computing losses based on F.log_softmax and nn.CrossEntropyLoss, while the other kept them in float32. After computing the losses, I always cast back to float32 before summing the losses and doing the backward pass. I.e., I ran four models: (0.4.1, float32), (0.4.1, float64), (1.0.0, float32), and (1.0.0, float64). Both 0.4.1 models performed the same across all values I am monitoring during training. The float64 1.0.0 model significantly outperformed the float32 1.0.0 model. Both 0.4.1 models significantly outperformed the float64 1.0.0 model. The gap between the 1.0.0 and 0.4.1 models was BIG.
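To be concrete, the casting pattern around the loss computations looks roughly like this (a minimal sketch; loss_with_dtype is just an illustrative name, not the actual function in my code):

import torch
import torch.nn.functional as F

def loss_with_dtype(logits, targets, dtype=torch.float32):
    # cast the "logit tensors" to the requested dtype before the loss,
    # then cast the loss back to float32 before it is summed and backprop'd
    loss = F.cross_entropy(logits.to(dtype), targets)
    return loss.to(torch.float32)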

I am running more tests to see how controlling the magnitude/variance of the inputs to F.log_softmax and nn.CrossEntropyLoss differentially affects performance across PyTorch versions, and these tests seem to support my belief that numerical instabilities are the main cause of the differences across versions.

What changes were made to the backend numerics of these functions? And do you have any advice for more conclusively determining what’s causing the performance differences across PyTorch versions?


> The performance difference seems to be related to numerical stability, particularly in F.log_softmax and nn.CrossEntropyLoss (which I think relies on F.log_softmax). I believe this for the following reasons.

Actually, I had the opposite experience. I have had numerical stability issues before (mostly due to a custom implementation of a loss function) and thought that 1.0 might fix them, because there was some bugfix related to F.log_softmax.

However, I noticed zero difference between running the code in 0.4.1 and 1.0.

Potential explanation 1:

A potential explanation (assuming that by “models significantly outperformed the float64 1.0.0 model” you mean the predictive performance of the models) is that you previously used hyperparameters that were optimal for the previously “buggy” log_softmax, and these hyperparameters are no longer the best choice in 1.0 given that bug fix?

Potential explanation 2:

The bugfix mentioned above was not actually a bug fix and made things worse.

Hello Philip,

thank you for reporting this and your analysis!

  • Is this on CPU or GPU?
  • What is the dimension of the tensor you do the log_softmax on?
  • Ideally, we’d get some specific inputs demonstrating a regression. Maybe the following works:
    If you take a somewhat pre-trained model and run it on 0.4.1 and 1.0.0, can you spot any differences between the log_softmax output, the loss, or the gradients of the logits (use logits.retain_grad() between forward and backward to keep them around)? A rough sketch of such a check is below.
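Something along these lines, run in both envs on the same saved tensor, would do (just a sketch; the file names and the softmax dim are placeholders):

import torch
import torch.nn.functional as F

# save the problematic logits once, e.g. torch.save(logits.detach().cpu(), 'logits.pt'),
# then run this in both the 0.4.1 and the 1.0.0 env and diff the saved outputs
logits = torch.load('logits.pt').cuda().requires_grad_()
out = F.log_softmax(logits, dim=-1)
loss = out.mean()   # any scalar of the output will do for comparing gradients
loss.backward()
torch.save({'log_softmax': out.detach().cpu(),
            'loss': loss.detach().cpu(),
            'grad': logits.grad.cpu()}, 'log_softmax_check.pt')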

Best regards

Thomas

I’m developing an extended/augmented version of the “Deep InfoMax” model described in this paper: https://openreview.net/forum?id=Bklr3j0cKX

While looking for the root cause of differences between 0.4.1 and 1.0.0, I stripped the model down until it was approximately the same as the “local DIM” model in the ICLR paper, modulo some data augmentation tricks.

I’m measuring model “performance” during training in terms of two tasks – (i) a log-likelihood loss from a noise-contrastive estimation objective that drives self-supervised representation learning, and (ii) a classifier which is trained at the same time using the features learned by the self-supervised objective. The classifier is a simple MLP with one ReLU hidden layer and does not backprop its loss into the features provided by the encoder. The classifier is for monitoring the quality of features learned via self-supervised learning.
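Concretely, the monitoring classifier is just a small probe on detached encoder features, roughly like this (a sketch; the sizes and tensor names are illustrative, not my real ones):

import torch
import torch.nn as nn
import torch.nn.functional as F

n_rkhs, n_classes = 256, 10          # illustrative sizes
clf = nn.Sequential(nn.Linear(n_rkhs, 512), nn.ReLU(), nn.Linear(512, n_classes))

features = torch.randn(64, n_rkhs)   # stand-in for encoder features
labels = torch.randint(0, n_classes, (64,))
# detach so the classifier loss does not backprop into the encoder
clf_loss = F.cross_entropy(clf(features.detach()), labels)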

The NLL for the classifier and the encoder’s NCE cost are both significantly worse when trained in 1.0.0 than in 0.4.1, whether I compute the losses in fp32 or fp64. If I cast the relevant tensors to fp64 prior to computing the NCE and classifier losses, and then cast back to fp32 after computing the losses, the performance in 0.4.1 is unaffected, but the performance in 1.0.0 improves relative to the case where I use fp32 inside the loss computations. I believe everything is computed on the GPU, but it’s possible that the PyTorch backend silently shifts some things off GPU.

The main step in computing the NCE cost is to apply F.log_softmax to a large tensor of pair-wise “scores” between “global” and “local” features taken from all the inputs in a minibatch. The softmax normalization is over a set of 1k-10k (potentially) high-variance logits. The classifier cost is pretty standard, and just passes the MLP’s output logits to an nn.CrossEntropyLoss instance.

Training in 0.4.1 has always been very stable, including while developing the models described in the ICLR paper. I did not encounter any instability until I started working with 1.0.0. One weird guess for what could cause the problem is that “improvements” in numerical stability may be causing some grads/losses which underflow to 0 in 0.4.1 to no longer underflow, and these grads are the ones that explode in backprop and cause the instability. But, idk.

I provided more info in my reply to rasbt.

I believe all computations are on GPU. I’m using P100s in an Azure VM.

The tensor I apply log_softmax to has shape (64, 49, 64*49), and the normalization is along dim=2.
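As a self-contained check along those lines, I can compare fp32 and fp64 log_softmax on random scores with the same shape (a sketch; the scale factor is just a stand-in for my real, potentially high-variance scores):

import torch
import torch.nn.functional as F

scores = 5.0 * torch.randn(64, 49, 64 * 49, device='cuda')   # stand-in for my real scores
out32 = F.log_softmax(scores, dim=2)
out64 = F.log_softmax(scores.double(), dim=2)
print((out32.double() - out64).abs().max().item())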

When tracking the [50, 90, 95] percentiles of grad norms observed in each epoch, the difference between training with 0.4.1 and 1.0.0 is stark. It is clear from the first epoch, and not at all subtle.

It is possible that something weird is going on with my conda envs or the VM environment. For now I am just going to work in 0.4.1. If I have time I will try and record a simpler and more precise set of conditions in which the observed differences arise.

This is the code for computing the main NCE-based loss, which seems most strongly affected by the change from 0.4.1 to 1.0.0. In 1.0.0, if I explicitly cast the input to F.log_softmax to torch.float64 and then cast back to torch.float32 before returning the loss, performance changes compared to using torch.float32 throughout the computation. In 0.4.1 this casting seems to have no effect.

import torch
import torch.nn.functional as F

def compute_loss_glb2lcl_nce(rkhs_glb, rkhs_lcl, stabilize=True):
    # get batch size, feature dim, and number of spatial locations for the local features
    n_batch = int(rkhs_glb.size(0))
    n_rkhs = int(rkhs_glb.size(1))
    n_locs = int(rkhs_lcl.size(2) * rkhs_lcl.size(3))
    mask_mat = torch.eye(n_batch).unsqueeze(dim=2).cuda()              # (n_batch, n_batch, 1)
    # compute info cost for global features -> local features
    rkhs_pos = rkhs_lcl.reshape(n_batch, n_rkhs, -1).permute(0, 2, 1)  # (n_batch, n_locs, n_rkhs)
    rkhs_lcl_flat = rkhs_pos.reshape(-1, n_rkhs)                       # (n_batch * n_locs, n_rkhs)
    # compute scores on positive samples only
    pred_pos = torch.matmul(rkhs_pos, rkhs_glb.unsqueeze(dim=2))       # (n_batch, n_locs, 1)
    # compute scores on all samples
    all_scores = torch.mm(rkhs_glb, rkhs_lcl_flat.t())                 # (n_batch, n_batch*n_locs)
    all_scores = all_scores.reshape(n_batch, n_batch, n_locs)          # (n_batch, n_batch, n_locs)
    if stabilize:
        lgt_reg = 1e-3 * (all_scores**2.).mean()
    else:
        lgt_reg = 0. * (all_scores**2.).mean()
    # ...
    pred_neg = all_scores.reshape(n_batch, -1).unsqueeze(dim=1)        # (n_batch, 1, n_batch*n_locs)
    pred_neg = pred_neg.expand(-1, n_locs, -1)                         # (n_batch, n_locs, n_batch*n_locs)
    mask_mat = 1. - mask_mat.expand(-1, -1, n_locs)                    # (n_batch, n_batch, n_locs)
    mask_mat = mask_mat.reshape(n_batch, -1).unsqueeze(dim=1)          # (n_batch, 1, n_batch*n_locs)
    mask_mat = mask_mat.expand(-1, n_locs, -1)                         # (n_batch, n_locs, n_batch*n_locs)
    # do not include scores for negative sampler buffer (none was given)
    pred_msk = torch.cat([torch.ones_like(pred_pos), mask_mat], dim=2)  # (n_batch, n_locs, 1 + n_batch*n_locs)
    pred_lgt = torch.cat([pred_pos, pred_neg], dim=2)                   # (n_batch, n_locs, 1 + n_batch*n_locs)
    pred_lgt = (pred_msk * pred_lgt) + (10. * (pred_msk - 1.))          # (n_batch, n_locs, 1 + n_batch*n_locs)
    if stabilize:
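        # tanh_clip is a small helper defined elsewhere in my codebase; it soft-clips
        # the logits into roughly [-10, 10] using a scaled tanh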
        pred_lgt = tanh_clip(pred_lgt, 10.)
    # negative log-softmax over positive + negative scores; column 0 holds the positives
    pred_nll = -F.log_softmax(pred_lgt, dim=2)                         # (n_batch, n_locs, 1 + n_batch*n_locs)
    loss_glb2lcl = pred_nll[:, :, 0].mean()                            # scalar :-)
    loss_glb2lcl = loss_glb2lcl + lgt_reg
    return loss_glb2lcl, None
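For reference, the fp64 variant I described just swaps the log_softmax line above for something like this (sketch):

    # fp64 variant: cast up for the log_softmax, then cast the result back to fp32
    pred_nll = -F.log_softmax(pred_lgt.double(), dim=2).float()        # (n_batch, n_locs, 1 + n_batch*n_locs)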

Hi, this looks interesting. Could you do this for me: find an input tensor in your model where log_softmax (or its backward, or its double backward) gives different results in 0.4.1 and 1.0.1? That would be very helpful for reproducing the issue. Thank you!

Another thing you can try is to replace log_softmax with softmax + log and see if the perf is better.
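i.e., swapping the F.log_softmax call in the loss above for something like this (a sketch; a small eps inside the log may be needed to avoid log(0)):

    # explicit softmax followed by log, in place of F.log_softmax(pred_lgt, dim=2)
    pred_nll = -torch.log(F.softmax(pred_lgt, dim=2))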