Is there any faster way to round a tensor?

Run:
torch.round(torch.tensor([-0.6, -0.5, -0.4, 0.4, 0.5, 0.6]))

Got:
tensor([-1., -0., -0., 0., 0., 1.])

Expect:
tensor([-1., -1., -0., 0., 1., 1.])
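For context, this is the documented behavior, not a bug: torch.round breaks ties by rounding half to even ("banker's rounding"), just like Python's built-in round. A minimal sketch showing the tie-breaking:

import torch

# torch.round sends ties to the nearest even integer
print(torch.round(torch.tensor([0.5, 1.5, 2.5, 3.5])))  # tensor([0., 2., 2., 4.])
# Python's built-in round does the same
print([round(v) for v in (0.5, 1.5, 2.5, 3.5)])         # [0, 2, 2, 4]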

I tried:

x = torch.tensor([-0.6, -0.5, -0.4, 0.4, 0.5, 0.6])
x[x > 0] = torch.floor(x[x > 0] + 0.5)
x[x < 0] = torch.ceil(x[x < 0] - 0.5)

But it is too slow.
Running this code on a 2080 Ti:

import torch
import time

x = torch.rand(3, 64, 128, 128).float() * 10 - 5
x = x.cuda()

torch.cuda.synchronize()  # CUDA ops run asynchronously; sync before starting the clock
tic = time.time()
x[x > 0] = torch.floor(x[x > 0] + 0.5)
x[x < 0] = torch.ceil(x[x < 0] - 0.5)
torch.cuda.synchronize()  # make sure the kernels have finished before stopping it
toc = time.time()
print(toc - tic)  # 0.008876 (with torch.round it is 0.0006)

So, is there a faster way to round a tensor my way (0.5 -> 1 and -0.5 -> -1)?
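One Python-only candidate (a sketch, not benchmarked here): torch.where stays fully element-wise and avoids the boolean-mask indexing above, which is likely the expensive part since masked assignment launches extra gather/scatter work.

import torch

def round_half_away(x):
    # round half away from zero: 0.5 -> 1, -0.5 -> -1
    return torch.where(x >= 0, torch.floor(x + 0.5), torch.ceil(x - 0.5))

x = torch.tensor([-0.6, -0.5, -0.4, 0.4, 0.5, 0.6])
print(round_half_away(x))  # tensor([-1., -1., -0., 0., 1., 1.])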

OK, in the end I built a C++ extension to solve this problem.
But I am also interested in a Python solution :slight_smile:

You can always add 1e-6 to your original Tensor :slight_smile:
That should only add a single element-wise operation over the Tensor and won’t have such a bad overhead.

Thanks for your reply. But

x = torch.tensor([-0.6, -0.5, -0.4, 0.4, 0.5, 0.6])
y = torch.round(x + 1e-6)  # [-1., -0., -0., 0., 1., 1.]
y = torch.floor(x + 1e-6)  # [-1., -1., -1., 0., 0., 0.]
y = torch.ceil(x + 1e-6)  # [-0., -0., -0., 1., 1., 1.]

None of them gives the expected [-1., -1., -0., 0., 1., 1.].

Regarding “element-wise operation won’t have such a bad overhead”:

I tried x = torch.sign(x) * torch.floor(torch.abs(x) + 0.5). It is indeed a bit faster.
Thank you.
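For reference, a quick check that this formula gives the wanted tie-breaking on the example tensor (note that torch.sign(0.) is 0., so exact zeros stay 0):

import torch

x = torch.tensor([-0.6, -0.5, -0.4, 0.4, 0.5, 0.6])
y = torch.sign(x) * torch.floor(torch.abs(x) + 0.5)
print(y)  # tensor([-1., -1., -0., 0., 1., 1.])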

Oh right, you have negative values as well :confused:
I’m afraid this is going to be hard to do as fast as round without reimplementing an efficient C kernel for it.

In [12]: x = torch.tensor([-0.6, -0.5, -0.4, 0.4, 0.5, 0.6])

In [13]: %timeit x.round()
56.9 µs ± 7.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [14]: %timeit torch.sign(x) * torch.floor(torch.abs(x) + 0.5)
349 µs ± 4.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [16]: %timeit (x + x.sign().float() / 2.).round()
305 µs ± 3.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Note that these timings may not be representative for larger Tensors, though, or if you’re running on the GPU.
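These CPU timings on a six-element tensor mostly measure Python dispatch overhead. A sketch of a fairer GPU comparison on the original workload, with explicit synchronization since CUDA calls are asynchronous (round_half_away is the torch.where variant sketched earlier; the 100-iteration average is just one way to measure it):

import time
import torch

def round_half_away(x):
    return torch.where(x >= 0, torch.floor(x + 0.5), torch.ceil(x - 0.5))

x = torch.rand(3, 64, 128, 128, device="cuda") * 10 - 5

for fn in (torch.round, round_half_away):
    torch.cuda.synchronize()  # start from an idle GPU
    tic = time.time()
    for _ in range(100):
        y = fn(x)
    torch.cuda.synchronize()  # wait for all kernels before reading the clock
    print(fn.__name__, (time.time() - tic) / 100)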

(x + x.sign().float() / 2.).round() is also incorrect: for example, 0.4 becomes 0.4 + 0.5 = 0.9, which rounds to 1. instead of 0.
Thanks again; as I mentioned above, I already have a working solution.