Finding the cause of high memory bandwidth between CPU and GPU

I have implemented some models for generating text with LSTM layers etc., mostly following the typical encoder-decoder scheme with autoregressive training. Training one of the more complex models now shows a fairly constant GPU VRAM usage of about 30 percent, while the bandwidth of data being copied between CPU and GPU sits at about 40 GB/s.

In the training loop, I copy the batches to the GPU and do the computations (forward step etc.).
All layers and tensors are initialized directly on the GPU.
The result is horribly slow training, and I am unsure where this copying overhead comes from, because I do not perform a single explicit copy from GPU to CPU in the code, nor do I get any errors telling me that not all tensors are on the same device.
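
To make the setup concrete, here is a stripped-down sketch of the loop; the LSTM, loss, and random batches are just toy stand-ins for my actual model and data loader:

```python
import torch
import torch.nn as nn

device = torch.device("cuda")

# toy stand-ins for my actual encoder-decoder model and data loader
model = nn.LSTM(input_size=32, hidden_size=32, batch_first=True).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())
batches = [torch.randn(16, 10, 32) for _ in range(5)]   # batches live on the CPU

for batch in batches:
    batch = batch.to(device)        # the only explicit host2device copy I do
    optimizer.zero_grad()
    output, _ = model(batch)        # forward step
    loss = criterion(output, batch)
    loss.backward()
    optimizer.step()
```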

Does anyone have an idea how to identify the issue and/or what the reason might be?

Thank you in advance

I don’t entirely understand your issue. It seems you are describing a slow host2device copy, which I assume you can reproduce with standalone CUDA code. If so, you should check whether the measured bandwidth matches what your system specs promise.
Later you describe that you only copy the data to the GPU and then train the model. Assuming the data is not huge, even a slow host2device copy should not slow down the entire training unless your model is tiny.
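
Something along these lines (a rough sketch rather than a careful benchmark, with an arbitrarily chosen 1 GB buffer) would show the raw host2device bandwidth you actually achieve:

```python
import torch

size_mb = 1024
x = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8, pin_memory=True)  # pinned host buffer
y = torch.empty_like(x, device="cuda")

# warm-up copy so the timed run does not include one-time setup costs
y.copy_(x, non_blocking=True)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
y.copy_(x, non_blocking=True)
end.record()
torch.cuda.synchronize()

ms = start.elapsed_time(end)
print(f"host2device bandwidth: {size_mb / 1024 / (ms / 1000):.1f} GB/s")
```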

The issue was that a function call copied all tensors etc. from the GPU to the CPU and back again once per forward step, which led to a bandwidth usage of 40 GB/s between GPU and CPU and slowed down the training. Effectively, I had overlooked that I used the built-in “any” instead of “torch.any”, the former being executed on the CPU, the latter on the GPU. Now it’s fixed and training is lightning fast. Is there a way to alert users, e.g. with a warning, when a pythonic function used between PyTorch operations implicitly copies data back to the CPU? I simply did not find the error because I wasn’t aware of the copying that happened under the hood.
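
For anyone running into the same thing, a minimal reproduction of that kind of mistake (tensor size and threshold made up for illustration):

```python
import torch

mask = torch.rand(10_000, device="cuda") > 0.5

# Built-in any() iterates over the tensor in Python: each element becomes a
# 0-dim CUDA tensor whose bool() forces a device-to-host copy and a sync.
slow = any(mask)

# torch.any() runs as a single kernel on the GPU, with no hidden transfers.
fast = torch.any(mask)
```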

I don’t know if the copy itself is slow, but since the op is synchronizing, you would indeed expect a performance hit. You could use torch.cuda.set_sync_debug_mode to raise a warning or error on synchronizations.
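
As a quick sketch of how that could have caught the hidden sync in your case (reusing the built-in any() example from above):

```python
import torch

# "warn" emits a warning for every CUDA-synchronizing call, "error" raises instead
torch.cuda.set_sync_debug_mode("warn")

x = torch.rand(1_000, device="cuda") > 0.5

# the implicit bool()/item() conversions inside Python's built-in any()
# synchronize the device, so each one now emits a warning
any(x)
```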