I think the authors of the repository will give you a better answer, but based on the code it seems the backward method was reimplemented for the negated cdist method.
From the docs of cdist:
Computes batched the p-norm distance between each pair of the two collections of row vectors.
I’m not familiar with the implementation of the repository and would recommend creating an issue there.
Let’s go through all operations separately in the calls:
torch.cdist(a, b, p) calculates the p-norm distance between each pair of the two collections of row vectors, as explained above
.squeeze() will remove all dimensions of the result tensor where tensor.size(dim) == 1
.transpose(0, 1) will permute dim0 and dim1, i.e. it’ll “swap” these dimensions
torch.unsqueeze(tensor, dim) will add a new dimension specified by dim
expand() will manipulate the metadata to create a view with the new shape (no copy of the data)
permute is similar to transpose, but for multiple dimensions (all of these are combined in the example below)
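A short example combining these operations (the shapes are made up, purely for illustration):

import torch

a = torch.randn(1, 3, 5)
b = torch.randn(1, 4, 5)

d = torch.cdist(a, b, p=2)         # pairwise 2-norm distances -> [1, 3, 4]
d = d.squeeze()                    # removes the size-1 batch dim -> [3, 4]
d = d.transpose(0, 1)              # swaps dim0 and dim1 -> [4, 3]

x = torch.unsqueeze(d, 2)          # adds a new dimension at index 2 -> [4, 3, 1]
x = x.expand(4, 3, 7)              # view that repeats data, no copy -> [4, 3, 7]
x = x.permute(1, 0, 2)             # reorders multiple dims at once -> [3, 4, 7]
print(x.shape, x.is_contiguous())  # torch.Size([3, 4, 7]) False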
I was told that _temp1 = torch.unsqueeze(X, 2).expand(X.shape[0], X.shape[1], W.shape[0]).permute(1, 0, 2) seems to use a lot of GPU memory during the intermediate calculation.
But how can I modify this line to use less GPU memory?
All mentioned operations manipulate the metadata (shape, stride) of the tensor and will not use more memory in this particular line of code.
However, the result will be a non-contiguous tensor, and the next _temp1.contiguous() call (either manually or in a function) will trigger the copy and use more memory.
You cannot avoid this copy if a method needs a contiguous tensor to operate on.
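To illustrate (a minimal sketch with made-up shapes): a permuted tensor is only a view into the same storage, and it is the contiguous() call that allocates new memory:

import torch

t = torch.randn(32, 64, 128)
v = t.permute(2, 0, 1)                # view: new strides, same storage
print(v.is_contiguous())              # False
print(v.data_ptr() == t.data_ptr())   # True, no extra memory used yet

c = v.contiguous()                    # this triggers the actual copy
print(c.data_ptr() == t.data_ptr())   # False, new allocation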
You could alternatively:
1. decrease the tensor's dimensions, then increase them again, or
2. increase the tensor's dimensions and then use a depth-wise convolution (suggestions 1 and 2 are sketched after this list), or
3. use mixed-precision training
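A rough, generic sketch of suggestions 1 and 2 (hypothetical channel sizes, not adapted to the cdist code above): a 1x1 "bottleneck" convolution reduces the channel dimension before the expensive operation and restores it afterwards, while a depth-wise convolution (groups=in_channels) processes each channel independently and is much cheaper than a dense convolution:

import torch
import torch.nn as nn

in_channels, reduced = 64, 16
block = nn.Sequential(
    nn.Conv2d(in_channels, reduced, kernel_size=1),    # reduce dims (suggestion 1)
    nn.Conv2d(reduced, reduced, kernel_size=3,
              padding=1, groups=reduced),              # depth-wise conv (suggestion 2)
    nn.Conv2d(reduced, in_channels, kernel_size=1),    # restore dims
)
x = torch.randn(1, in_channels, 32, 32)
print(block(x).shape)  # torch.Size([1, 64, 32, 32])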
I know I can use https://github.com/NVIDIA/apex for mixed-precision training.
However, I am not sure how to modify the code for suggestions 1 and 2 above.
So, are you implying that the torch.cdist() function itself contains a contiguous() call?
I tried to replace torch.cdist() with fast_cdist() as shown below. However, I still get a GPU out-of-memory error.
import torch

def fast_cdist(x1, x2):
    adjustment = x1.mean(-2, keepdim=True)
    x1 = x1 - adjustment
    x2 = x2 - adjustment  # x1 and x2 should be identical in all dims except -2 at this point
    # Compute the squared distance matrix using the quadratic expansion,
    # but be clever and do it with a single matmul call
    x1_norm = x1.pow(2).sum(dim=-1, keepdim=True)
    x1_pad = torch.ones_like(x1_norm)
    x2_norm = x2.pow(2).sum(dim=-1, keepdim=True)
    x2_pad = torch.ones_like(x2_norm)
    x1_ = torch.cat([-2. * x1, x1_norm, x1_pad], dim=-1)
    x2_ = torch.cat([x2, x2_pad, x2_norm], dim=-1)
    res = x1_.matmul(x2_.transpose(-2, -1))
    # Clamp small negative values (numerical error) before taking the square root
    res.clamp_min_(1e-30).sqrt_()
    return res
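For reference, a quick sanity check of fast_cdist against torch.cdist on small random inputs (the shapes are made up):

x1 = torch.randn(4, 100, 16)
x2 = torch.randn(4, 250, 16)   # same dims as x1 except dim -2
res = fast_cdist(x1, x2)
ref = torch.cdist(x1, x2, p=2)
print(torch.allclose(res, ref, atol=1e-3))  # expected: True (up to floating-point error)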
Probably yes, and these lines are probably the right ones.
In your code, other methods such as torch.cat will create contiguous tensors, as seen here:
a = torch.randn(1, 1).expand(10, 10)
print(a.is_contiguous())
> False
b = torch.randn(10, 10)
print(b.is_contiguous())
> True
c = torch.cat((a, b), dim=1)
print(c.is_contiguous())
> True
The main issue is that your data is too large for the applied operations, as at least some of them work on contiguous tensors, which causes the memory increase.
For mixed-precision training, I would recommend installing the nightly and using native amp, as described here.
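A minimal native amp sketch (the model, optimizer, and data loader are hypothetical; only the amp calls matter here):

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for imgs, targets in loader:              # hypothetical DataLoader
    optimizer.zero_grad()
    with autocast():                      # forward pass runs in mixed precision
        loss, outputs = model(imgs, targets)
    scaler.scale(loss).backward()         # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)                # unscales gradients, then calls optimizer.step()
    scaler.update()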
As for using the PyTorch native amp library, I get the following runtime error with train.py:
Traceback (most recent call last):
File "train.py", line 112, in <module>
loss, outputs = model(imgs, targets)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 563, in __call__
result = self.forward(*input, **kwargs)
File "/home/rog/Downloads/PyTorch-YOLOv3/models.py", line 266, in forward
x, layer_loss = module[0](x, targets, img_dim)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 563, in __call__
result = self.forward(*input, **kwargs)
File "/home/rog/Downloads/PyTorch-YOLOv3/models.py", line 203, in forward
loss_conf_obj = self.bce_loss(pred_conf[obj_mask], tconf[obj_mask])
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 563, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/loss.py", line 520, in forward
return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 2417, in binary_cross_entropy
input, target, weight, reduction_enum)
RuntimeError: torch.nn.functional.binary_cross_entropy and torch.nn.BCELoss are unsafe to autocast.
Many models use a sigmoid layer right before the binary cross entropy layer.
In this case, combine the two layers using torch.nn.functional.binary_cross_entropy_with_logits
or torch.nn.BCEWithLogitsLoss. binary_cross_entropy_with_logits and BCEWithLogits are
safe to autocast.
As the error message states, you should replace the usage of sigmoid + nn.BCELoss with logits + nn.BCEWithLogitsLoss, as the former approach is unsafe for autocasting.
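A minimal sketch of the replacement (the shapes are made up):

import torch
import torch.nn as nn

logits = torch.randn(8, 1)   # raw model output, no sigmoid applied
target = torch.rand(8, 1)

# unsafe under autocast: sigmoid + nn.BCELoss
# loss = nn.BCELoss()(torch.sigmoid(logits), target)

# safe under autocast: pass the raw logits to nn.BCEWithLogitsLoss,
# which applies the sigmoid internally in a numerically stable way
criterion = nn.BCEWithLogitsLoss()
loss = criterion(logits, target)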
I am also getting this error. The output of my model uses F.sigmoid, and when I compute the loss I use BCELoss. If I decide to use autocast, should I just remove the sigmoid and use BCEWithLogitsLoss?