RuntimeError: CUDA out of memory. Tried to allocate 1.12 MiB (GPU 0; 11.91 GiB total capacity; 5.52 GiB already allocated; 2.06 MiB free; 184.00 KiB cached)

Could you check if wrapping the training step in a function helps? Python uses function scoping as described here, so tensors created inside the function (the output, the loss, and the computation graph they keep alive) go out of scope when the function returns and can be freed, instead of staying referenced until the next iteration of the training loop.
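
A minimal sketch of what I mean; the model, criterion, optimizer, and the dummy data are just placeholders for your setup:

```python
import torch
import torch.nn as nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'

def train_step(model, criterion, optimizer, data, target):
    # output and loss are local to this function, so they (and the attached
    # computation graph) go out of scope when the function returns.
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
    return loss.item()  # return a Python float, not the tensor

model = nn.Linear(10, 2).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(100):
    data = torch.randn(8, 10, device=device)          # dummy batch
    target = torch.randint(0, 2, (8,), device=device)  # dummy labels
    loss_value = train_step(model, criterion, optimizer, data, target)
```

Also make sure you are not storing the loss tensor itself (e.g. in a list) across iterations, as that would keep the whole graph alive; returning `loss.item()` as above avoids this.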