How can I release the unused GPU memory?

I would recommend adding debug statements such as print(torch.cuda.max_memory_allocated()) at a few points in the code to narrow down which operations are wasting memory.
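A minimal sketch of such a debug helper is shown here. The name `report_peak` is made up for illustration; it resets the peak counter after each report so that every call covers only the code since the previous call, and it does nothing useful on a machine without CUDA.

```python
import torch


def report_peak(tag):
    """Print and return the peak allocated GPU memory (in MiB) since the last call.

    Hypothetical helper for illustration; returns None if CUDA is unavailable.
    """
    if not torch.cuda.is_available():
        print(f"{tag}: CUDA not available")
        return None
    # Wait for pending kernels so their allocations are counted.
    torch.cuda.synchronize()
    peak_mib = torch.cuda.max_memory_allocated() / 1024**2
    print(f"{tag}: peak allocated = {peak_mib:.1f} MiB")
    # Start a fresh measurement window for the next section of code.
    torch.cuda.reset_peak_memory_stats()
    return peak_mib


if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")
    report_peak("after allocating x")
    y = x @ x
    report_peak("after matmul")
```

Placing a call after each suspicious block (data loading, forward pass, post-processing) should show which section drives the peak.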

Just from skimming the code, it seems that some lists and dicts are used temporarily and freed later. This might increase the peak memory, e.g. if you store the complete feature maps first and delete them one by one afterwards.
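To illustrate the difference, here is a rough sketch that is not based on your actual code: the function names are hypothetical and `torch.relu` is only a stand-in for a real feature extractor. The first version keeps every feature map alive until the end, while the second reduces each one right away so that only a single feature map is on the device at a time.

```python
import torch


def pooled_features_store_all(batches, device):
    """Keeps all feature maps alive at once, which raises the peak memory."""
    fmaps = [torch.relu(b.to(device)) for b in batches]  # stand-in extractor
    pooled = [f.mean(dim=(2, 3)).cpu() for f in fmaps]
    return torch.cat(pooled)


def pooled_features_streaming(batches, device):
    """Reduces each feature map immediately, so only one is alive at a time."""
    pooled = []
    for b in batches:
        fmap = torch.relu(b.to(device))  # stand-in extractor
        pooled.append(fmap.mean(dim=(2, 3)).cpu())
        del fmap  # drop the reference before the next iteration allocates
    return torch.cat(pooled)


device = "cuda" if torch.cuda.is_available() else "cpu"
batches = [torch.ones(2, 3, 4, 4) for _ in range(3)]
out = pooled_features_streaming(batches, device)
print(out.shape)
```

Both versions return the same result; comparing torch.cuda.max_memory_allocated() for each should show whether the temporary containers are responsible for the peak.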