Visual Explanation of Torch Pad

picklerick · September 21, 2020, 12:25pm

I’m having a hard time visualizing how nn.functional.pad works. The docs about pad say the following:

For example, to pad only the last dimension of the input tensor, then pad has the form (padding_left, padding_right); to pad the last 2 dimensions of the input tensor, then use (padding_left, padding_right, padding_top, padding_bottom); to pad the last 3 dimensions, use (padding_left, padding_right, padding_top, padding_bottom, padding_front, padding_back).

But I can’t understand how (padding_left, padding_right, padding_top, padding_bottom) is the same as padding the last 2 dimensions or how (padding_left, padding_right, padding_top, padding_bottom, padding_front, padding_back) translates to padding the last three dimensions. And what “words” does one use continuing in the same fashion for padding the last 4 dimensions? (…, padding_front, padding_back, padding_???, padding_???)

However, my main request is an intuitive explanation of how pad works. A visualization would be nice, but any intuition that you can impart about pad would be welcome as well!

FYI, my doubts about pad came up after reading torch’s implementation of local_response_norm:

def local_response_norm(input, size, alpha=1e-4, beta=0.75, k=1.):
    # type: (Tensor, int, float, float, float) -> Tensor
    r"""Applies local response normalization over an input signal composed of
    several input planes, where channels occupy the second dimension.
    Applies normalization across channels.
    See :class:`~torch.nn.LocalResponseNorm` for details.
    """
    if not torch.jit.is_scripting():
        if type(input) is not Tensor and has_torch_function((input,)):
            return handle_torch_function(
                local_response_norm, (input,), input, size, alpha=alpha, beta=beta, k=k)
    dim = input.dim()
    if dim < 3:
        raise ValueError('Expected 3D or higher dimensionality \
                         input (got {} dimensions)'.format(dim))
    div = input.mul(input).unsqueeze(1)
    if dim == 3:
        div = pad(div, (0, 0, size // 2, (size - 1) // 2))
        div = avg_pool2d(div, (size, 1), stride=1).squeeze(1)
    else:
        sizes = input.size()
        div = div.view(sizes[0], 1, sizes[1], sizes[2], -1)
        div = pad(div, (0, 0, 0, 0, size // 2, (size - 1) // 2))
        div = avg_pool3d(div, (size, 1, 1), stride=1).squeeze(1)
        div = div.view(sizes)
    div = div.mul(alpha).add(k).pow(beta)
    return input / div

If anyone could explain why padding is done in this way for the above function, I would be very grateful! (Apologies for the loaded question, I’m a noob)

ptrblck · September 23, 2020, 8:21am

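The “left”, “right”, “top”, and “bottom” names are easiest to understand if you think about an image tensor of shape [N, C, H, W]: left/right pad the width (the last dimension) and top/bottom pad the height (the second-to-last dimension).
For tensors with more dimensions you can simply think about the “front” and the “end” of each dimension.
Each dimension uses two values, one for the “front” and one for the “end”, and the pairs are read starting from the last dimension and moving backwards.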

Here is a small code snippet to show the behavior:

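import torch
import torch.nn.functional as F

# The pad tuple is read in pairs, starting from the LAST dimension:
# (last_front, last_end, dim2_front, dim2_end, dim1_front, dim1_end, dim0_front, dim0_end)
x = torch.ones(1, 1, 1, 1)
print(x)

# pad last dimension "at the end" -> shape [1, 1, 1, 2]
x1 = F.pad(x, (0, 1, 0, 0, 0, 0, 0, 0))
print(x1)

# pad last dimension "at the front" -> shape [1, 1, 1, 2]
x2 = F.pad(x, (1, 0, 0, 0, 0, 0, 0, 0))
print(x2)

# pad last dimension on both sides with two zeros -> shape [1, 1, 1, 5]
x3 = F.pad(x, (2, 2, 0, 0, 0, 0, 0, 0))
print(x3)

# pad dim1 "at the front" with 4 values -> shape [1, 5, 1, 1]
x4 = F.pad(x, (0, 0, 0, 0, 4, 0, 0, 0))
print(x4)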
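
To make the pair-to-dimension rule easier to check, here is a minimal sketch that predicts the output shape from the pad tuple and compares it with what F.pad returns. The helper name padded_shape is hypothetical (it is not part of PyTorch), and the input shape and pad values are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F


def padded_shape(shape, pad):
    """Hypothetical helper: predict the shape F.pad would return."""
    out = list(shape)
    # Pair i, i.e. (pad[2*i], pad[2*i + 1]), applies to dimension -(i + 1):
    # pair 0 -> last dim, pair 1 -> second-to-last dim, and so on.
    for i in range(len(pad) // 2):
        front, end = pad[2 * i], pad[2 * i + 1]
        out[-(i + 1)] += front + end
    return tuple(out)


x = torch.ones(2, 3, 4, 5)

# (left, right, top, bottom, front, back) pads dims -1, -2 and -3.
pad3 = (1, 2, 3, 4, 5, 6)
y3 = F.pad(x, pad3)
print(tuple(y3.shape), padded_shape(x.shape, pad3))

# A fourth pair simply continues the pattern and pads dim 0.
pad4 = (0, 0, 0, 0, 0, 0, 1, 1)
y4 = F.pad(x, pad4)
print(tuple(y4.shape), padded_shape(x.shape, pad4))
```

There are no special names beyond “front” and “back” for a fourth pair; it is just the next (front, end) pair, applied to the next dimension counting from the right.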

picklerick · September 23, 2020, 9:31am

Ah, that’s cool! The “front” and “end” intuition for high dimensional tensors is exactly what I needed. Thanks a lot for this!
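
Applying the same rule to the local_response_norm code from the first post, the padding there can be read as follows. This is only a sketch of the dim == 3 branch, assuming an input of shape (N, C, L); the concrete sizes are made up for illustration. After unsqueeze(1) the tensor is (N, 1, C, L), so in (0, 0, size // 2, (size - 1) // 2) the first pair leaves L alone and the second pair pads the channel dimension. The two amounts always add up to size - 1, which is what lets a pooling window of height size with stride 1 return exactly C channels again.

```python
import torch
import torch.nn.functional as F

size = 5                       # number of neighbouring channels to average over
x = torch.randn(2, 7, 10)      # assumed layout: (N, C, L)

div = x.mul(x).unsqueeze(1)    # (N, 1, C, L): channels now act as "height"

# Pair 0 -> L (untouched), pair 1 -> C (front: size // 2, end: (size - 1) // 2).
padded = F.pad(div, (0, 0, size // 2, (size - 1) // 2))

# A (size, 1) window slides over the channel axis only.
pooled = F.avg_pool2d(padded, (size, 1), stride=1).squeeze(1)

print(tuple(div.shape), tuple(padded.shape), tuple(pooled.shape))

# Finish the computation with the default alpha, beta and k and compare
# against the library function.
out = x / pooled.mul(1e-4).add(1.0).pow(0.75)
print(torch.allclose(out, F.local_response_norm(x, size)))
```

The dim > 3 branch follows the same idea with one more (0, 0) pair, because there is one more trailing dimension to skip before reaching the channels.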