Hi @ptrblck, I appreciate your reply. Do you think that starting with the largest size of my input will result in fewer calls to `cudaMalloc` than initially allocating a large chunk of memory (e.g. 1 GB) up front, like in this thread?
How exactly does the caching allocator reuse memory? If I have a call like `t_repeat = torch.repeat_interleave(t, n, dim=0)` in my `forward` method, does the caching allocator try to reuse the same memory for the output of `repeat_interleave` each time? That is, if the dimensions of `t` stay constant and `n` is largest on the very first call, does each subsequent call necessarily reuse that same chunk of memory?