It seems that updating part of a tensor in place, like this
x[:,:channels] += y
is very time-consuming for some reason.
For example,
x += x
runs much faster, even though it requires much more GPU computation.
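For reference, here is a minimal sketch of how the two operations could be timed against each other. The tensor shapes, `channels = 32`, the helper `time_op`, and the separate zero tensor `z` (used so that repeated doubling stays finite) are all made up for illustration and are not taken from any real model. CUDA kernels launch asynchronously, so the sketch synchronizes before reading the clock; without that, the numbers mostly reflect launch overhead.

```python
import time
import torch

def time_op(fn, iters=100):
    """Return average wall-clock seconds per call of fn (hypothetical helper)."""
    fn()  # one warm-up call so one-time setup cost is not measured
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for pending GPU work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the timed kernels to actually finish
    return (time.perf_counter() - start) / iters

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed shapes: batch 8, 64 channels, 16x16 feature maps.
channels = 32
x = torch.zeros(8, 64, 16, 16, device=device)
y = torch.ones(8, channels, 16, 16, device=device)
z = torch.zeros(8, 64, 16, 16, device=device)  # stand-in tensor for the "x += x" case

def sliced_add():
    # In-place add into the first `channels` channels only.
    x[:, :channels] += y

def full_add():
    # In-place add over the whole tensor.
    z.add_(z)

t_slice = time_op(sliced_add)
t_full = time_op(full_add)
print(f"sliced add: {t_slice * 1e6:.1f} us/iter")
print(f"full add:   {t_full * 1e6:.1f} us/iter")
```

The absolute numbers will depend on the device, the tensor sizes, and the PyTorch version, so this is only a way to reproduce the comparison, not a claim about which one wins.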
Is there a faster way to index and update tensors like this, or some way to get “compilation”-type speedups? I looked at PyTorch implementations of PyramidNet, which relies heavily on this kind of operation, but everyone seems to just
zero-pad y
x += y
as in the paper, which is obviously a wasteful hack around crappy frameworks.
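To make the comparison concrete, here is a small sketch of the variants I mean, with a check that they produce the same result. The shapes and `channels = 4` are arbitrary, and the `narrow` variant is just another way of writing the sliced update that I am assuming behaves the same; I have not measured whether any of these is faster.

```python
import torch
import torch.nn.functional as F

# Assumed shapes: x has 6 channels, y has only the first 4.
channels = 4
x = torch.randn(2, 6, 4, 4)
y = torch.randn(2, channels, 4, 4)

# Variant A: in-place add into a channel slice.
a = x.clone()
a[:, :channels] += y

# Variant B: zero-pad y along the channel dimension, then add the full tensors.
# F.pad takes pairs starting from the last dimension:
# (W_left, W_right, H_top, H_bottom, C_front, C_back).
y_padded = F.pad(y, (0, 0, 0, 0, 0, x.size(1) - channels))
b = x + y_padded

# Variant C: same as A, but through an explicit narrow() view.
c = x.clone()
c.narrow(1, 0, channels).add_(y)

print(torch.allclose(a, b), torch.allclose(a, c))
```

Variant B does arithmetic on the padded zeros as well, which is the extra work I am complaining about.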