I am dealing with a huge matrix, but my GPU memory is limited.
So I chunk the huge matrix on the CPU and use a for-loop to transfer the smaller chunks to the GPU one by one. However, the CPU-to-GPU transfer becomes slower after the 1st iteration.
sample code:
chunks = torch.chunk(huge_matrix, chunks=10, dim=0)  # on CPU
for i in range(len(chunks)):
    x = chunks[i].to(device)  # device: cuda
    # 1st: 0.75s, 2nd: 3.49s, 3rd: 3.65s, ...
    ...
How are you profiling the code?
Since CUDA operations are executed asynchronously, you would have to synchronize the device before starting and stopping the timers.
Otherwise your profiles will be wrong, as the next blocking operation (here the CPU-to-GPU copy) will accumulate the runtime of previously launched kernels.
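A minimal sketch of the synchronized timing (the `huge_matrix` shape and the helper `timed_transfer` are made up for illustration; the original post doesn't show how the timers were placed):

```python
import time
import torch

def timed_transfer(chunk, device):
    """Time a CPU->GPU copy correctly by synchronizing around the timers."""
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for all previously launched work
    start = time.perf_counter()
    x = chunk.to(device)
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait until the copy itself has finished
    return x, time.perf_counter() - start

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
huge_matrix = torch.randn(1000, 100)  # stand-in for the real matrix
chunks = torch.chunk(huge_matrix, chunks=10, dim=0)

for i, chunk in enumerate(chunks):
    x, elapsed = timed_transfer(chunk, device)
    print(f"chunk {i}: {elapsed:.4f}s")
```

With the synchronizations in place, each iteration should report a similar transfer time instead of the first iteration absorbing the CUDA context initialization and later iterations absorbing queued work.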