Finding sources of non-determinism for reproducibility

I have applied every reproducibility measure I am aware of, so that differences between runs with different hyperparameters are primarily due to the parameters themselves and not to variance from random initialization:

    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.backends.cudnn.deterministic = True
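
For comparison, a fuller seeding setup typically also covers Python's `random` module, NumPy, and PyTorch's own generators. The following is a minimal sketch under those assumptions, not a guaranteed fix: `seed_everything` is a hypothetical helper name, and the NumPy and PyTorch parts are skipped when the package is not installed. Note that `PYTHONHASHSEED` is read when the interpreter starts, so assigning it inside a running script only affects child processes.

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Seed the common RNGs (hypothetical helper, for illustration only)."""
    # Only inherited by child processes; the current interpreter has
    # already fixed its hash seed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed.

    try:
        import torch
    except ImportError:
        return  # PyTorch not installed; nothing more to do.

    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # Safe to call without a GPU.
    torch.backends.cudnn.deterministic = True
    # Autotuning may pick different kernels from run to run.
    torch.backends.cudnn.benchmark = False


seed_everything(1234)
```

Calling the helper once at the very start of the main module, before any model or data loader is created, keeps the order of random draws the same across runs. Data loader worker processes have their own RNG state and may need separate seeding.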

However, even with those options set in my main module, I still get substantially different losses in every epoch across multiple runs. I also tried removing all CUDA calls and training on the CPU to rule out GPU non-determinism as a cause, but that didn't help.
Since I could not reproduce this behavior with a small toy example, I assume it is due to a mistake somewhere in my fairly large project. Is there a technique or tool that could help me debug this, e.g. by recording parts of the training process so that I can detect where the non-determinism comes from?
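
To illustrate the kind of recording meant here, one possible approach is to log a hash of key values (input batch, loss, selected weights) at every step of two runs and then look for the first step where the hashes differ. This is only a sketch using the standard library: `fingerprint` and `first_divergence` are hypothetical names, and the values are assumed to arrive as plain Python floats (for a tensor, something like `tensor.detach().cpu().flatten().tolist()`).

```python
import hashlib
import struct
from typing import Dict, List, Optional, Sequence, Tuple


def fingerprint(values: Sequence[float]) -> str:
    """Hash the exact bit patterns of a sequence of floats."""
    packed = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(packed).hexdigest()[:16]


def first_divergence(
    run_a: List[Dict[str, str]], run_b: List[Dict[str, str]]
) -> Optional[Tuple[int, str]]:
    """Return (step, name) of the first mismatching fingerprint, or None."""
    for step, (rec_a, rec_b) in enumerate(zip(run_a, run_b)):
        for name in rec_a:
            if rec_a[name] != rec_b.get(name):
                return step, name
    return None


# Two simulated runs: the second one sees a shuffled batch at step 1.
run_a = [
    {"batch": fingerprint([1.0, 2.0]), "loss": fingerprint([0.5])},
    {"batch": fingerprint([3.0, 4.0]), "loss": fingerprint([0.25])},
]
run_b = [
    {"batch": fingerprint([1.0, 2.0]), "loss": fingerprint([0.5])},
    {"batch": fingerprint([4.0, 3.0]), "loss": fingerprint([0.3])},
]
print(first_divergence(run_a, run_b))  # → (1, 'batch')
```

If the batches already differ at the first divergent step, the data pipeline (shuffling, augmentation, workers) is the likely source. If the batches match but the loss or weights differ, the cause is more likely in initialization, dropout, or a non-deterministic operation.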