How to accelerate the process of tensor index_select and index_copy?

I have defined a custom conv2d function whose forward pass consists of a tensor slice and a tensor copy:

def forward(self, input):
	self.output.index_copy_(1, self.k_out_mask,
	                        F.conv2d(torch.index_select(input, 1, self.k_in_mask), self.weight))
	return self.output
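For reference, here is a minimal standalone version of that forward pattern. The tensor sizes and the index masks are made up for illustration, since the original post does not state them:

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes -- the original post does not give them.
N, C_in, C_out, H, W = 8, 64, 64, 32, 32
k = 32  # number of selected channels

input = torch.randn(N, C_in, H, W)
weight = torch.randn(k, k, 3, 3)
output = torch.zeros(N, C_out, H - 2, W - 2)  # 3x3 conv, no padding

k_in_mask = torch.randperm(C_in)[:k]    # channels to read from the input
k_out_mask = torch.randperm(C_out)[:k]  # channels to write in the output

# slice -> conv -> scatter back, as in the forward above
out = F.conv2d(torch.index_select(input, 1, k_in_mask), weight)
output.index_copy_(1, k_out_mask, out)
```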

I tested the running time of this forward pass; it takes about 40 ms.
But when I tested another forward function as follows,

def forward(self, input):

	return F.conv2d(input, self.weight)

This function doesn't have the tensor slice and copy operations, but the conv2d input and weight sizes are the same as in the function above, and it takes about 14 ms.
Why do the tensor slice and tensor copy operations add so much time?
Can anyone help me accelerate the tensor slice and tensor copy operations?
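One thing that can help, assuming your channel indices happen to form a contiguous range (which may not hold for your masks): plain slicing returns a view, so the input copy made by `index_select` disappears, and assigning into a slice of the output replaces `index_copy_`. A sketch with made-up sizes:

```python
import torch
import torch.nn.functional as F

input = torch.randn(8, 64, 32, 32)
weight = torch.randn(32, 32, 3, 3)
output = torch.zeros(8, 64, 30, 30)

# If k_in_mask / k_out_mask are contiguous ranges, e.g. channels 0..31,
# input[:, :32] is a view (no copy); conv2d may still make it contiguous
# internally, but the explicit index_select copy is gone.
out = F.conv2d(input[:, :32], weight)
output[:, :32] = out  # in-place slice assignment instead of index_copy_
```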
Thanks so much !

Are you running your code on the CPU or GPU?
In the latter case, note that CUDA calls are asynchronous, so you would need to synchronize before starting and stopping the timer:

torch.cuda.synchronize()
t0 = time.perf_counter()
# your code
torch.cuda.synchronize()
t1 = time.perf_counter()
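A complete runnable sketch of this timing pattern (tensor sizes are made up; it falls back to CPU when no GPU is available, in which case the synchronization is a no-op):

```python
import time
import torch
import torch.nn.functional as F

device = 'cuda' if torch.cuda.is_available() else 'cpu'
input = torch.randn(8, 32, 64, 64, device=device)
weight = torch.randn(32, 32, 3, 3, device=device)

def timed(fn, iters=20):
    if device == 'cuda':
        torch.cuda.synchronize()  # flush pending kernels before starting the clock
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    if device == 'cuda':
        torch.cuda.synchronize()  # wait for the timed kernels to finish
    return (time.perf_counter() - t0) / iters

elapsed = timed(lambda: F.conv2d(input, weight))
print(f"conv2d: {elapsed * 1e3:.2f} ms/iter")
```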

Thanks for your reply. I understand the tensor slice and copy shouldn't be included in the timing, but is there a more solid method to speed up the slice and copy process? They add a lot of delay on top of the conv2d call.

I mean that maybe you are timing the code incorrectly and the majority of the time is indeed spent in the convolution.
Could you time your code again using the synchronization and post the timings?

Oh, I actually ran the code on the CPU to measure the elapsed time.

Hi Tony,

I have the same problem. Have you found any solution to it (accelerating tensor index_select)?

Best regards,