Understanding cdist() function

What does this new_cdist() function actually do?

It seems to be related to a new type of back-propagation equation and an adaptive learning rate.

import numpy as np
import torch

def new_cdist(p, eta):
    class cdist(torch.autograd.Function):
        @staticmethod
        def forward(ctx, W, X):
            ctx.save_for_backward(W, X)
            out = -torch.cdist(W, X, p)
            return out

        @staticmethod
        def backward(ctx, grad_output):
            W, X = ctx.saved_tensors
            grad_W = grad_X = None
            if ctx.needs_input_grad[0]:
                _temp1 = torch.unsqueeze(X, 2).expand(X.shape[0], X.shape[1], W.shape[0]).permute(1, 0, 2)
                _temp2 = torch.unsqueeze(W.transpose(0, 1), 1)
                _temp = torch.cdist(_temp1, _temp2, p).squeeze().transpose(0, 1)
                grad_W = torch.matmul(grad_output, _temp)
                # print('before norm: ', torch.norm(grad_W))
                grad_W = eta * np.sqrt(grad_W.numel()) / torch.norm(grad_W) * grad_W
                print('after norm: ', torch.norm(grad_W))
            if ctx.needs_input_grad[1]:
                _temp1 = torch.unsqueeze(W, 2).expand(W.shape[0], W.shape[1], X.shape[0]).permute(1, 0, 2)
                _temp2 = torch.unsqueeze(X.transpose(0, 1), 1)
                _temp = torch.cdist(_temp1, _temp2, p).squeeze().transpose(0, 1)
                _temp = torch.nn.functional.hardtanh(_temp, min_val=-1., max_val=1.)
                grad_X = torch.matmul(grad_output.transpose(0, 1), _temp)
            return grad_W, grad_X
    return cdist.apply

I think the authors of the repository will give you a better answer, but based on the code it seems the backward method was reimplemented for the negated cdist method.

From the docs of cdist:

Computes batched the p-norm distance between each pair of the two collections of row vectors.

I’m not familiar with the implementation of the repository and would recommend creating an issue there.
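As a small illustration of what the forward pass returns, here is a sketch with made-up shapes and p=1 (W and X are random placeholders, not the repository's tensors). Each entry of the output is the negated p-norm distance between one row of W and one row of X:

```python
import torch

torch.manual_seed(0)
W = torch.randn(4, 3)   # 4 row vectors of length 3 (hypothetical sizes)
X = torch.randn(6, 3)   # 6 row vectors of length 3

# Same expression as in forward(), here with p=1
out = -torch.cdist(W, X, p=1.0)
print(out.shape)        # one entry per (row of W, row of X) pair

# Entry (i, j) is the negated L1 distance between W[i] and X[j]
i, j = 1, 2
manual = -(W[i] - X[j]).abs().sum()
print(torch.allclose(out[i, j], manual, atol=1e-5))
```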


Yes, see https://github.com/huawei-noah/AdderNet/issues/6#issuecomment-613362421

Do you have any idea what _temp = torch.cdist(_temp1, _temp2, p).squeeze().transpose(0, 1) actually does?

Same for _temp1 and _temp2

_temp1 = torch.unsqueeze(X, 2).expand(X.shape[0], X.shape[1], W.shape[0]).permute(1, 0, 2)

_temp2 = torch.unsqueeze(W.transpose(0, 1), 1)

Let’s go through all operations separately in the calls:

  • torch.cdist(a, b, p) calculates the p-norm distance between each pair of the two collections of row vectors, as explained above
  • .squeeze() will remove all dimensions of the result tensor where tensor.size(dim) == 1
  • .transpose(0, 1) will permute dim0 and dim1, i.e. it’ll “swap” these dimensions
  • torch.unsqueeze(tensor, dim) will add a new dimension specified by dim
  • expand() will manipulate the metadata to create a view with the new shape (no copy of the data)
  • permute() is similar to transpose() for multiple dimensions
  1. What is the actual purpose of _temp1 and _temp2 in the calculation for backward propagation?

  2. How is needs_input_grad[] used?

  3. Why is there a sqrt() operation when the AdderNet paper does not use it in the backward propagation equation?

  4. How is this new_cdist() function different from the implementation of forward() and backward() in https://pytorch.org/docs/master/notes/extending.html ?

  5. How can I eliminate the GPU out-of-memory runtime error reported in new_cdist()?
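On questions 2 and 4, a minimal sketch may help. The class below (NegL1Dist is a hypothetical name) follows the pattern from the extending notes and computes the plain analytic gradient of the negated L1 cdist. It is not the AdderNet gradient and has no eta rescaling. ctx.needs_input_grad[i] is True only when the i-th input to forward() requires a gradient, so the corresponding branch can be skipped otherwise.

```python
import torch

class NegL1Dist(torch.autograd.Function):
    """Hypothetical example: -cdist(W, X, p=1) with a hand-written backward."""

    @staticmethod
    def forward(ctx, W, X):
        ctx.save_for_backward(W, X)
        return -torch.cdist(W, X, p=1.0)               # (m, n)

    @staticmethod
    def backward(ctx, grad_output):
        W, X = ctx.saved_tensors
        grad_W = grad_X = None
        # sign(W[i, k] - X[j, k]) for every pair, shape (m, n, k)
        sign = torch.sign(W.unsqueeze(1) - X.unsqueeze(0))
        weighted = grad_output.unsqueeze(2) * sign     # (m, n, k)
        if ctx.needs_input_grad[0]:                    # W requires grad
            grad_W = -weighted.sum(dim=1)              # (m, k)
        if ctx.needs_input_grad[1]:                    # X requires grad
            grad_X = weighted.sum(dim=0)               # (n, k)
        return grad_W, grad_X

torch.manual_seed(0)
W = torch.randn(4, 3, requires_grad=True)
X = torch.randn(6, 3)                                  # no gradient requested

NegL1Dist.apply(W, X).sum().backward()
print(W.grad.shape, X.grad)                            # X.grad stays None

# Compare with autograd's own gradient for the same expression
W_ref = W.detach().clone().requires_grad_(True)
(-torch.cdist(W_ref, X, p=1.0)).sum().backward()
print(torch.allclose(W.grad, W_ref.grad, atol=1e-5))
```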

Have a look at https://stackoverflow.com/a/61229485/6422632

Some of the questions have been answered.

Any comments on https://www.reddit.com/r/pytorch/comments/g31hmd/replacing_torchcdist_function_to_eliminate_gpu/ ?

I was told that _temp1 = torch.unsqueeze(X, 2).expand(X.shape[0], X.shape[1], W.shape[0]).permute(1, 0, 2) seems to use up a lot of GPU memory during the intermediate calculation.

But how can I modify this line to use less GPU memory?

All mentioned operations manipulate the metadata (shape, stride) of the tensor and will not use more memory in this particular line of code.
However, the result will be a non-contiguous tensor, and the next _temp1.contiguous() call (either manually or in a function) will trigger the copy and use more memory.
You cannot avoid this copy, if a method needs a contiguous tensor to operate on.

Here is a small example showing the memory usage:

print(torch.cuda.memory_allocated() / 1024**2)
> 0.0

x = torch.randn(256, 256, device='cuda')
print(torch.cuda.memory_allocated() / 1024**2)
> 0.25

x = x.unsqueeze(2).expand(x.size(0), x.size(1), 256).permute(1, 0, 2)
print(torch.cuda.memory_allocated() / 1024**2)
> 0.25

x = x.contiguous()
print(torch.cuda.memory_allocated() / 1024**2)
> 64.0
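The same behaviour can be checked without a GPU by looking at the strides directly. This is only a sketch that reuses the sizes from the example above; it shows that the expanded dimension has stride 0 (no data behind it) until contiguous() materializes the copy.

```python
import torch

x = torch.randn(256, 256)
view = x.unsqueeze(2).expand(x.size(0), x.size(1), 256).permute(1, 0, 2)

print(view.shape)                        # 256 x 256 x 256 elements, logically
print(view.stride())                     # stride 0 in the expanded dimension
print(view.is_contiguous())              # False
print(view.data_ptr() == x.data_ptr())   # True: still the original 0.25 MB buffer

copy = view.contiguous()                 # this is where the real allocation happens
size_mb = copy.numel() * copy.element_size() / 1024**2
print(copy.is_contiguous(), size_mb)
```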

I was told to either

  1. decrease the tensor dimensions, then increase them again, or
  2. increase the tensor dimensions and then use depth-wise convolution, or
  3. use mixed-precision training

I know I can use https://github.com/NVIDIA/apex for the mixed precision training.
However, I am not sure how to modify the code for suggestions 1 and 2 above.

So, are you implying that the torch.cdist() function itself contains a contiguous() call?

I tried to replace torch.cdist() with fast_cdist() as shown below. However, I still get a GPU out-of-memory error.

def fast_cdist(x1, x2):
    adjustment = x1.mean(-2, keepdim=True)
    x1 = x1 - adjustment
    x2 = x2 - adjustment  # x1 and x2 should be identical in all dims except -2 at this point

    # Compute squared distance matrix using quadratic expansion
    # But be clever and do it with a single matmul call
    x1_norm = x1.pow(2).sum(dim=-1, keepdim=True)
    x1_pad = torch.ones_like(x1_norm)
    x2_norm = x2.pow(2).sum(dim=-1, keepdim=True)
    x2_pad = torch.ones_like(x2_norm)
    x1_ = torch.cat([-2. * x1, x1_norm, x1_pad], dim=-1)
    x2_ = torch.cat([x2, x2_pad, x2_norm], dim=-1)
    res = x1_.matmul(x2_.transpose(-2, -1))

    # Clamp negative values caused by rounding, then take the square root to get distances
    res = res.clamp_min_(1e-30).sqrt_()
    return res

Probably yes, and these lines are probably the right ones.

In your code other methods, such as torch.cat will create contiguous tensors as seen here:

a = torch.randn(1, 1).expand(10, 10)
print(a.is_contiguous())
> False

b = torch.randn(10, 10)
print(b.is_contiguous())
> True

c = torch.cat((a, b), dim=1)
print(c.is_contiguous())
> True

The main issue is that your data is too large for the applied operations, as at least some of them work on contiguous tensors, which causes the memory increase.

For mixed-precision training, I would recommend installing the nightly build and using native amp as described here.


See the runtime error where I try to use another my_cdist() function.

def my_cdist(x1, x2):
    x1_norm = x1.pow(2).sum(dim=-1, keepdim=True)
    x2_norm = x2.pow(2).sum(dim=-1, keepdim=True)
    res = torch.addmm(x2_norm.transpose(-2, -1), x1, x2.transpose(-2, -1), alpha=-2).add_(x1_norm)
    res = res.clamp_min_(1e-30).sqrt_()
    return res

For your reference, please also see this addernet github issue

As for replacing torch.cat(), there is no way to replace the function yet.

As for using pytorch native amp library, I have the following runtime error with train.py

Traceback (most recent call last):
  File "train.py", line 112, in <module>
    loss, outputs = model(imgs, targets)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 563, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rog/Downloads/PyTorch-YOLOv3/models.py", line 266, in forward
    x, layer_loss = module[0](x, targets, img_dim)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 563, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rog/Downloads/PyTorch-YOLOv3/models.py", line 203, in forward
    loss_conf_obj = self.bce_loss(pred_conf[obj_mask], tconf[obj_mask])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 563, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/loss.py", line 520, in forward
    return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 2417, in binary_cross_entropy
    input, target, weight, reduction_enum)
RuntimeError: torch.nn.functional.binary_cross_entropy and torch.nn.BCELoss are unsafe to autocast.
Many models use a sigmoid layer right before the binary cross entropy layer.
In this case, combine the two layers using torch.nn.functional.binary_cross_entropy_with_logits
or torch.nn.BCEWithLogitsLoss.  binary_cross_entropy_with_logits and BCEWithLogits are
safe to autocast.

As the error message states, you should replace the usage of sigmoid + nn.BCELoss with logits + nn.BCEWithLogitsLoss, as the former approach is unsafe for autocasting.
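A minimal sketch of that swap, using random placeholder tensors rather than the YOLOv3 model from the traceback: the model should return raw logits (no final sigmoid), and nn.BCEWithLogitsLoss applies the sigmoid internally. Both variants give the same loss value in float32.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(8, 1)                        # raw model output, no sigmoid applied
target = torch.randint(0, 2, (8, 1)).float()      # binary labels as floats

# Old approach: sigmoid in the model + BCELoss (rejected under autocast)
loss_old = nn.BCELoss()(torch.sigmoid(logits), target)

# New approach: pass the logits directly to BCEWithLogitsLoss
loss_new = nn.BCEWithLogitsLoss()(logits, target)

print(torch.allclose(loss_old, loss_new, atol=1e-6))

# Probabilities are still available at inference time
probs = torch.sigmoid(logits)
```

If other code depends on probabilities (for example thresholding at 0.5), apply torch.sigmoid to the logits outside of the loss computation.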

I have already tried Nvidia Apex, but it still gives me an OOM issue.

I will re-try pytorch native amp later.

By the way, is there a way to replace torch.cdist(), which contains contiguous() calls?

See the comments about cdist() with regards to contiguous() call

Thanks for following up.
In that case, these calls are no-ops and the method apparently just uses this much memory.

Now back to square one. So, torch.cdist() is not the OOM culprit.

And what do you think about the dimensions of the tensor variables?

I am also getting this error. The output of my model uses F.sigmoid, and I use BCELoss when I compute the loss. If I decide to use autocast, should I just remove the sigmoid and use BCEWithLogitsLoss?

I don’t understand how to solve it.