Profiling: occasional slow cudaMalloc calls

Hi @ptrblck, I appreciate your reply. Do you think that starting with the largest size of my input will result in fewer calls to cudaMalloc than initially allocating a large chunk of memory (e.g. 1 GB) like in this thread?
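
To make the comparison concrete, here is a minimal sketch of the two warmup strategies I have in mind (the model, shapes, and sizes are placeholders, not my actual code):

```python
import torch

device = torch.device("cuda")
model = torch.nn.Linear(1024, 1024).to(device)  # placeholder model

# Strategy A: one forward pass at the largest input size the model will
# ever see, so the caching allocator grows to its peak working set once.
largest = torch.randn(4096, 1024, device=device)  # assumed max batch size
with torch.no_grad():
    model(largest)
del largest  # the blocks go back to the cache, not to cudaFree

# Strategy B: pre-allocate one big chunk (e.g. 1 GB) up front and release
# it; the segment stays cached and can be split for later allocations.
buf = torch.empty(1024 ** 3, dtype=torch.uint8, device=device)
del buf

print(torch.cuda.memory_reserved() / 1024 ** 2, "MiB reserved")
```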

How exactly does the caching allocator reuse memory? If I have a call like t_repeat = torch.repeat_interleave(t, n, dim=0) in my forward method, does the caching allocator try to reuse the same memory for the output of repeat_interleave each time? That is, if the dimensions of t stay constant and n is largest on the very first call, will each subsequent call necessarily reuse that same chunk of memory?
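
This is roughly how I would try to check that empirically; a small sketch (t, the repeat counts, and the loop are made up, not my real forward method) that compares output pointers across calls and watches the allocator's segment count:

```python
import torch

device = torch.device("cuda")
t = torch.randn(64, 128, device=device)

ptrs = []
for n in (8, 4, 2):  # largest repeat count on the very first call
    t_repeat = torch.repeat_interleave(t, n, dim=0)
    ptrs.append(t_repeat.data_ptr())
    del t_repeat  # hand the block back to the caching allocator

# Identical pointers would suggest the same cached block is being reused;
# the segment count shows how often cudaMalloc was actually hit.
print(ptrs)
print(torch.cuda.memory_stats()["segment.all.current"], "segments")
```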