Choose a different conv algorithm

Hi there,

Is it at all possible to use a different conv algorithm than the one PyTorch/cuDNN selects by default?

Say we have this piece of code:

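import torch
import torch.nn.functional as F

def run_conv2d_memtest_pytorch():
    with torch.no_grad():
        i = torch.rand((1, 192, 512, 512))  # input, N x C x H x W
        w = torch.rand((64, 192, 7, 7))     # weights, out x in x kH x kW
        i = i.cuda()
        w = w.cuda()
        # peak allocation in MiB before any conv has run
        print(torch.cuda.max_memory_allocated() // 1024 // 1024)
        while True:  # runs until interrupted
            o = F.conv2d(i, w, stride=1, padding=3)
            # peak allocation in MiB, including the conv workspace
            print(torch.cuda.max_memory_allocated() // 1024 // 1024)
            torch.cuda.empty_cache()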

The problem is that this conv2d requires over 9 GB of memory even at batch size 1. Using a 6x6 kernel instead brings the requirement down to a much more reasonable 320 MB or so, which is close to what you would expect from computing the conv with the naive method. My guess is that cuDNN/PyTorch picks something like Winograd for the 7x7 case and expands the data in memory for faster computation.
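For reference, here is the back-of-the-envelope estimate for the naive case. It assumes float32 tensors (4 bytes per element, the default for torch.rand) and counts only the input, the weights and the output, with no workspace. The helper name tensor_mib is just for illustration.

```python
BYTES_PER_ELEM = 4   # assumption: float32
MIB = 1024 * 1024


def tensor_mib(*shape):
    """Size in MiB of a dense tensor with the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n * BYTES_PER_ELEM / MIB


input_mib = tensor_mib(1, 192, 512, 512)   # i
weight_mib = tensor_mib(64, 192, 7, 7)     # w
output_mib = tensor_mib(1, 64, 512, 512)   # o, same H x W because padding=3
total_mib = input_mib + weight_mib + output_mib

print(f"input {input_mib:.1f} MiB, weights {weight_mib:.1f} MiB, "
      f"output {output_mib:.1f} MiB, total {total_mib:.1f} MiB")
```

That is 192 MiB for the input plus 64 MiB for the output and a couple of MiB for the weights, so roughly 258 MiB in total. The ~320 MB measured with the 6x6 kernel is in the same ballpark, while 9 GB clearly is not.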

My issue is that the model was trained on a GPU with sufficient memory, but now I’d like to run it for inference on GPUs with 6-8 GB. I’m fine with it being slower due to a suboptimal algorithm, but as it is, I can’t figure out a way to run even a single example, because the model has this convolution layer inside it.

Do I have any options other than compiling a naive CUDA conv2d implementation, writing a Python wrapper for it and calling it explicitly?

Ok, one way to do this is to run two convolutions with a stride of e.g. (2, 1), adjusting the padding accordingly, and combine the resulting tensors, which cuts the memory requirement roughly in half.
Interestingly, with stride (2, 2) the heuristic reverts to the memory-cheap algorithm, so you can run 4 convs and combine their outputs.
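
In case it helps anyone, here is a minimal sketch of that workaround. The function name conv2d_split and its split argument are made up for illustration. It assumes a stride-1 conv with the same padding on all sides, and it pads the input up front with F.pad (which costs one extra copy of the input) instead of fiddling with conv2d's own padding argument, since each shifted conv needs a different, asymmetric padding. I have only checked that the result matches plain F.conv2d; which algorithm cuDNN picks for the strided convs, and so how much memory you actually save, will depend on your setup.

```python
import torch
import torch.nn.functional as F


def conv2d_split(x, w, padding, split=(2, 1)):
    """Stride-1 conv2d computed as sy * sx strided convs, interleaved.

    Hypothetical helper: `split` is the (sy, sx) stride used internally.
    """
    sy, sx = split
    kh, kw = w.shape[2], w.shape[3]
    # Pad once so every shifted view below sees the same zero border.
    x_pad = F.pad(x, (padding, padding, padding, padding))
    out_h = x.shape[2] + 2 * padding - kh + 1
    out_w = x.shape[3] + 2 * padding - kw + 1
    out = x.new_empty((x.shape[0], w.shape[0], out_h, out_w))
    for ry in range(sy):
        for rx in range(sx):
            # Starting at offset (ry, rx) with stride (sy, sx) yields exactly
            # the output rows ry, ry+sy, ... and columns rx, rx+sx, ...
            part = F.conv2d(x_pad[:, :, ry:, rx:], w, stride=(sy, sx))
            out[:, :, ry::sy, rx::sx] = part
    return out


if __name__ == "__main__":
    with torch.no_grad():
        x = torch.rand((1, 3, 9, 10))
        w = torch.rand((4, 3, 7, 7))
        ref = F.conv2d(x, w, stride=1, padding=3)
        for split in [(2, 1), (2, 2)]:
            ok = torch.allclose(conv2d_split(x, w, 3, split), ref,
                                rtol=1e-4, atol=1e-4)
            print(split, ok)
```

For the layer above you would call conv2d_split(i, w, padding=3, split=(2, 2)) inside torch.no_grad() in place of F.conv2d(i, w, stride=1, padding=3).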

In any case, this is really dumb and there should be a way in torch to do this painlessly.
