How to optimize code to run on the GPU?

I found an example of “wrap padding” in

https://github.com/pytorch/pytorch/issues/3858

and I modified the code a little so that some dimensions use “wrap padding” and others are padded with zeros.

import torch

def pad_circular_nd2(x: torch.Tensor, pad: int, dim, dim0) -> torch.Tensor:
    """
    :param x: shape [H, W]
    :param pad: int >= 0
    :param dim: the dimension(s) padded circularly ("wrap padding")
    :param dim0: the dimension(s) padded with zeros
    :return: the padded tensor, on the same device and with the same dtype as x
    """
    if isinstance(dim, int):
        dim = [dim]
    if isinstance(dim0, int):
        dim0 = [dim0]

    for d in dim:
        if d >= len(x.shape):
            raise IndexError(f"dim {d} out of range")

        # Append the first `pad` entries along d to the end.
        idx = tuple(slice(0, None if s != d else pad, 1) for s in range(len(x.shape)))
        x = torch.cat([x, x[idx]], dim=d)

        # Prepend the last `pad` original entries, which now sit at [-2 * pad, -pad).
        idx = tuple(slice(None if s != d else -2 * pad, None if s != d else -pad, 1) for s in range(len(x.shape)))
        x = torch.cat([x[idx], x], dim=d)

    for d in dim0:
        if d >= len(x.shape):
            raise IndexError(f"dim {d} out of range")

        # Build the zero block from x's current shape so that several zero-padded
        # dimensions stay consistent; new_zeros inherits x's dtype and device.
        shape = list(x.shape)
        shape[d] = pad
        zeros = x.new_zeros(shape)
        x = torch.cat([zeros, x, zeros], dim=d)

    return x

However, this “wrap padding” seems to run on the CPU, although I expected it to run on the GPU, and it makes training much slower. Is there any way to fix it?

It should run on the GPU, if you pass x as a CUDATensor.
How did you check this operation runs on the CPU?
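
One quick way to check is to print the device of the tensor going in and of the tensor coming out. The snippet below is only a minimal sketch: the tensor shape and the variable names are made up for illustration, and it falls back to the CPU when no GPU is available.

```python
import torch

# Use the GPU when one is available; fall back to the CPU so the sketch still runs.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.arange(12, dtype=torch.float64, device=device).reshape(3, 4)

# torch.cat on tensors that share a device produces its result on that device.
wrapped = torch.cat([x[:, -1:], x, x[:, :1]], dim=1)

print(x.device, x.is_cuda)
print(wrapped.device, wrapped.is_cuda)
```

If both lines report a CUDA device, the slicing and concatenation themselves are running on the GPU.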

There also seems to be a mode='circular' now in F.pad.
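
As a rough sketch of how that could replace the custom function, the example below wraps the last dimension and zero-pads the one before it with two F.pad calls. It assumes a 4D [N, C, H, W] input and a reasonably recent PyTorch release; the helper name pad_wrap_w_zero_h is hypothetical and not part of PyTorch.

```python
import torch
import torch.nn.functional as F


def pad_wrap_w_zero_h(x: torch.Tensor, pad: int) -> torch.Tensor:
    """Hypothetical helper: wrap-pad W and zero-pad H of an [N, C, H, W] tensor."""
    # The pad tuple runs from the last dimension backwards:
    # (W_left, W_right, H_top, H_bottom).
    x = F.pad(x, (pad, pad, 0, 0), mode="circular")
    x = F.pad(x, (0, 0, pad, pad), mode="constant", value=0.0)
    return x


# Fall back to the CPU when no GPU is available so the sketch still runs.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.arange(12, dtype=torch.float64, device=device).reshape(1, 1, 3, 4)

out = pad_wrap_w_zero_h(x, 1)
print(out.shape)   # torch.Size([1, 1, 5, 6])
print(out.device)  # same device as x
```

Both calls keep the result on the device of the input, so no explicit .cuda() call is needed inside the helper.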

Hi ptrblck,
CPU usage is about 70% on an i9 CPU, and without this “wrap padding” it is close to 0. So I guess I did something wrong and the code passes data between the CPU and GPU.

The F.pad circular mode works. Thank you!
