PyTorch: a process similar to reverse pooling and replicate padding?

I have a tensor A that has shape (batch_size, width, height). Assume that it has these values:

A = torch.tensor([[
                  [0, 1],
                  [1, 0]]])

I am also given a number K that is a positive integer. Let K=2 in this case. I want to do a process that is similar to reverse pooling and replicate padding. This is the expected output:

B = torch.tensor([[
                [0, 0, 1, 1],
                [0, 0, 1, 1],
                [1, 1, 0, 0],
                [1, 1, 0, 0]]])

Explanation: each element of A is expanded into a block of shape (K, K) and placed in the result tensor. We do this for every element, with the stride between blocks equal to the kernel size (that is, K), so the blocks do not overlap.

How can I do this in PyTorch? Currently, A is a binary mask, but it would be better if the approach also worked for the non-binary case.

I’m not sure if it would perfectly fit your use case, as I didn’t fully understand the desired behavior regarding the stride and kernel size, but the replication padding would match your output:

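import torch
import torch.nn.functional as F

# Float values, matching the floating-point output shown below
A = torch.tensor([[
                  [0., 1.],
                  [1., 0.]]])

B = F.pad(A.unsqueeze(1), (1, 1, 1, 1), mode='replicate')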
> tensor([[[[0., 0., 1., 1.],
            [0., 0., 1., 1.],
            [1., 1., 0., 0.],
            [1., 1., 0., 0.]]]])

You could squeeze(1) the added dimension afterwards, if needed.
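Note that the padding result above lines up with the expected output only because A is 2x2 and K=2; for larger inputs, replicate padding repeats just the border values. A minimal sketch of a more general alternative, assuming the goal is to repeat every element K times along both spatial dimensions, could use repeat_interleave. The helper name expand_blocks is made up for illustration and is not a PyTorch function:

```python
import torch

def expand_blocks(a: torch.Tensor, k: int) -> torch.Tensor:
    """Expand each element of a (batch, width, height) tensor into a k x k block.

    Hypothetical helper for illustration only.
    """
    # Repeat every element k times along dim 1, then k times along dim 2.
    return a.repeat_interleave(k, dim=1).repeat_interleave(k, dim=2)

A = torch.tensor([[[0, 1],
                   [1, 0]]])
B = expand_blocks(A, 2)
print(B)
```

Since the values are only copied, this should also work for non-binary tensors and keeps the input dtype.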

Hi, after some testing I think F.interpolate(a.float(), scale_factor=k, mode="nearest") suits best. But when I use timeit, it reports that the worst run is ~90x slower than the best run, and sometimes only 8x or 10x slower. Is this a problem?
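For reference, here is a minimal sketch of how that call could be applied to the (batch_size, width, height) mask from the original question. It assumes a channel dimension has to be added first, because F.interpolate treats a 3D input as (batch, channels, length). The function name upsample_nearest is made up for illustration:

```python
import torch
import torch.nn.functional as F

def upsample_nearest(a: torch.Tensor, k: int) -> torch.Tensor:
    """Nearest-neighbour upsampling of a (batch, width, height) tensor by factor k.

    Hypothetical helper for illustration only.
    """
    # Add a channel dimension so the input is (batch, 1, width, height).
    x = a.unsqueeze(1).float()
    out = F.interpolate(x, scale_factor=k, mode="nearest")
    # Drop the channel dimension and restore the original dtype.
    return out.squeeze(1).to(a.dtype)

A = torch.tensor([[[3, 7],
                   [7, 3]]])
B = upsample_nearest(A, 2)
print(B)
```

The non-binary values 3 and 7 are used here only to show that nearest-neighbour mode copies values rather than blending them.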

Is the 90x slowdown reported by timeit for the same method?
If so, I would assume it should remove outliers. When in doubt, you could profile it manually by adding proper warmup iterations and measuring the time as an average over multiple iterations.
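A rough sketch of such a manual measurement might look like the following. The tensor shape, the warmup count, and the iteration count are arbitrary choices for illustration, and the code falls back to the CPU when no GPU is available:

```python
import time

import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.rand(100, 1, 16, 16, device=device)

def sync() -> None:
    # CUDA kernels run asynchronously, so wait for them before reading the clock.
    if device == "cuda":
        torch.cuda.synchronize()

# Warmup iterations absorb one-time costs such as CUDA context creation.
for _ in range(10):
    F.interpolate(x, scale_factor=2, mode="nearest")
sync()

n_iters = 100
start = time.perf_counter()
for _ in range(n_iters):
    F.interpolate(x, scale_factor=2, mode="nearest")
sync()
avg_seconds = (time.perf_counter() - start) / n_iters
print(f"{avg_seconds * 1e6:.1f} us per call on {device}")
```

The input tensor is created once outside the timed loop, so only the interpolation itself is measured.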

Yes, it is reported by timeit for the same piece of code. Tensor is initialized and transformed through that single line.

As a side note, I am still having this issue, and I think this inconsistency makes the code's runtime unpredictable, which is very bad for benchmarking.

To reproduce the result:

import torch
import torch.nn.functional as F

%%timeit -n 1000000  # Jupyter
F.interpolate(torch.rand(100, 1, 16, 16).to("cuda").float(), scale_factor=2, mode="nearest")

This is expected, since the first iteration would initialize the CUDA context etc. and is thus slow.
Also, to properly profile CUDA operations, you would have to synchronize the code manually, since CUDA ops are executed asynchronously. This is why we recommend using torch.utils.benchmark to profile workloads (it adds warmup iterations, synchronizes, and averages the runtimes).
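As a minimal sketch of what that could look like for the snippet above: the tensor is created once and passed in through globals, so only the interpolation is timed. The iteration count of 100 is an arbitrary choice, and the code falls back to the CPU if CUDA is unavailable:

```python
import torch
import torch.nn.functional as F
from torch.utils import benchmark

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.rand(100, 1, 16, 16, device=device)

timer = benchmark.Timer(
    stmt='F.interpolate(x, scale_factor=2, mode="nearest")',
    globals={"F": F, "x": x},  # names made visible to the timed statement
)

# Run the statement 100 times and collect the timing statistics.
measurement = timer.timeit(100)
print(measurement)
```

The returned measurement object exposes summary statistics such as mean and median, which should vary far less between runs than the best and worst times of individual iterations.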