Loading data from an HDF5 file takes longer when preceded by memory-intensive GPU operations

Why might loading data (with random access) from a large HDF5 file be slow, and get progressively slower as training goes on, if, right before my custom Dataset's __getitem__ is called, I temporarily fill GPU memory almost completely with intermediate torch.nn.functional.cosine_similarity computations? I am also doing several dist.gather and dist.broadcast operations, in case those could be the culprit.
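For concreteness, here is a minimal, self-contained sketch of the kind of setup I mean. The file name data.h5, the dataset key features, the tensor shapes, and the single-process gloo initialization are all placeholders for illustration, not my actual training code:

```python
import os
import h5py
import numpy as np
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

class H5Dataset(Dataset):
    """Random-access reads from a large HDF5 file."""
    def __init__(self, path, key):
        self.path, self.key = path, key
        self.file = None  # opened lazily so each DataLoader worker gets its own handle
        with h5py.File(path, "r") as f:
            self.length = len(f[key])

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.file is None:
            self.file = h5py.File(self.path, "r")
        return torch.from_numpy(self.file[self.key][idx])  # random access per index

def all_pairs_cosine(x):
    # The broadcasted (N, N, D) intermediate is what nearly fills GPU memory.
    return F.cosine_similarity(x.unsqueeze(0), x.unsqueeze(1), dim=-1)

def main():
    # Stand-in for a torchrun launch; here world_size == 1 on the gloo backend.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)
    rank, world = dist.get_rank(), dist.get_world_size()

    path, key = "data.h5", "features"  # placeholder file/dataset names
    if rank == 0 and not os.path.exists(path):
        with h5py.File(path, "w") as f:
            f.create_dataset(key, data=np.random.randn(512, 128).astype("float32"))
    dist.barrier()

    device = "cuda" if torch.cuda.is_available() else "cpu"
    loader = DataLoader(H5Dataset(path, key), batch_size=64,
                        shuffle=True, num_workers=2)

    for batch in loader:
        emb = batch.to(device)
        sims = all_pairs_cosine(emb)  # memory-heavy GPU step right before the next read
        stat = sims.mean().unsqueeze(0).cpu()
        # Collect per-rank stats on rank 0, then broadcast a value back out.
        bucket = [torch.zeros_like(stat) for _ in range(world)] if rank == 0 else None
        dist.gather(stat, bucket, dst=0)
        dist.broadcast(stat, src=0)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The slowdown I see is on the HDF5 reads inside __getitem__, not on the GPU step itself, and it grows the longer training runs.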