Check this out; it looks like the same error you're seeing.
To summarize, setting cudnn.benchmark to False works for some people, so you could try that too. (You should also be doing this anyway if your batch sizes are randomized, since benchmark mode only pays off when input shapes stay fixed.) Others have fixed the error by changing which CUDA and PyTorch versions they pair together.
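For reference, this is roughly where that flag would go (a minimal sketch; assumes you set it once near the top of your training script):

```python
import torch

# Disable cuDNN autotuning; the benchmarked kernels only help
# when input/batch shapes stay constant across iterations.
torch.backends.cudnn.benchmark = False
```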
Something you're trying that's different is pushing to the GPU inside your collate_fn. (I'm sure this sparks a discussion about multiple worker processes copying data to the GPU at the same time; if you know of such a thread, do link it here.) As a test, you could return the batch from custom_collate_fn without the .to('cuda') call, and in your for batch_num, train_batch... loop add train_batch = train_batch.to('cuda', non_blocking=True). Note that .to() is not in-place, so you need the reassignment. A sketch of what I mean is below.
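Something like this (a rough sketch; MyDataset, custom_collate_fn, and the batch structure are placeholders standing in for your own code):

```python
import torch
from torch.utils.data import DataLoader

def custom_collate_fn(samples):
    # Build the batch on the CPU only; no .to('cuda') here,
    # so DataLoader workers never touch the GPU.
    inputs = torch.stack([s[0] for s in samples])
    targets = torch.stack([s[1] for s in samples])
    return inputs, targets

loader = DataLoader(
    MyDataset(),                 # placeholder for your dataset
    batch_size=32,
    collate_fn=custom_collate_fn,
    num_workers=4,
    pin_memory=True,             # lets non_blocking=True actually overlap the copy
)

for batch_num, train_batch in enumerate(loader):
    inputs, targets = train_batch
    # .to() returns a new tensor, so reassign the results
    inputs = inputs.to('cuda', non_blocking=True)
    targets = targets.to('cuda', non_blocking=True)
    # ... forward/backward pass goes here ...
```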
If it still fails with the same error, try posting a reply on that thread. If it does work, still post a reply on that thread; it will greatly help debug the issue.