Faster way to partially add tensors (+ for only some channels)

It seems that working with slices of a Tensor like this

x[:,:channels] += y

is very time-consuming for some reason.
For example,

x += x

works much faster, even though it requires much more GPU computation.
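
For context, here is a minimal way to time the two variants (just a sketch with arbitrary shapes; the explicit torch.cuda.synchronize() calls are needed for meaningful GPU timings, since CUDA kernels launch asynchronously):

import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(64, 256, 32, 32, device=device)
y = torch.randn(64, 128, 32, 32, device=device)

def timed(fn, iters=100):
    # Synchronize before and after so we measure the kernels, not just the launches.
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

print("sliced add:", timed(lambda: x[:, :128].add_(y)))
print("full add:  ", timed(lambda: x.add_(x)))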
Is there a faster, more optimal way to modify and index Tensors like this, or maybe some “compilation”-type speedup? I looked at PyTorch implementations of PyramidNet, which relies heavily on exactly this kind of operation, but everyone seems to just

zero-pad y
x += y

as in the paper, which is obviously a wasteful hack around crappy frameworks.
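That workaround looks roughly like this (a sketch; nn.functional.pad specifies padding starting from the last dimension, so the final pair of the tuple pads the channel dimension of an NCHW tensor):

import torch
import torch.nn.functional as F

x = torch.randn(64, 256, 32, 32)
y = torch.randn(64, 128, 32, 32)

# Zero-pad y along the channel dimension up to x's channel count,
# then add the full tensors instead of addressing a slice of x.
missing = x.shape[1] - y.shape[1]
y_padded = F.pad(y, (0, 0, 0, 0, 0, missing))  # (W_left, W_right, H_top, H_bottom, C_front, C_back)
x += y_padded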

Oops, it seems the delay was mostly caused by me forgetting to remove

torch.autograd.set_detect_anomaly(True)

after a debugging session. I say “mostly” because this

import torch
import torch.nn as nn

class HalfReLU(nn.Module):
    """Applies ReLU to the first half of the channels and passes the rest through."""
    def __init__(self):
        super(HalfReLU, self).__init__()

    def forward(self, x):
        half = x.shape[1] // 2
        # ReLU only the first half of the channels (in place on the slice view),
        # then concatenate with the untouched second half.
        y = nn.functional.relu(x[:, :half], inplace=True)
        return torch.cat((y, x[:, half:]), dim=1)

still runs on the order of 20x slower than a plain ReLU, but I’ll only post back if I find it has a severe impact on what I’m doing.
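
In case it helps anyone else, one variant I may try is dropping the torch.cat copy and clamping the first half of the channels in place (a hypothetical HalfReLUInplace module; just a sketch, not benchmarked, and the usual caveats about in-place ops on views and autograd apply):

import torch
import torch.nn as nn

class HalfReLUInplace(nn.Module):
    def forward(self, x):
        half = x.shape[1] // 2
        # clamp_(min=0) is an in-place ReLU applied directly to the slice view,
        # so no new tensor is allocated and no concatenation is needed.
        x[:, :half].clamp_(min=0)
        return x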
