Hi! I am trying to train a neural network composed of two “sub-networks”, each of which outputs an embedding. I use the cosine similarity of these embeddings as the loss function (trying to maximize it for positive samples). However, I do not understand why my loss.backward() is so slow.

The forward pass takes 0.008ms on average while the backward pass takes about 0.9ms, which seems far too slow for such a small network (two Linear layers for the first part and a single Embedding layer for the second). I can’t figure out what’s wrong.

In addition, if I pause my Python script in the Python debugger and then continue the training, loss.backward() sometimes speeds up from 0.9ms to 0.02ms. I do not understand this behavior either. Maybe a memory access issue?

I guess that you’re running things on the GPU, and that you do not call torch.cuda.synchronize() before measuring time?
Keep in mind that the CUDA API is asynchronous. If you don’t sync, you only measure the time it takes to ask the GPU to do the work, not how long the work actually takes. In this case, by the time you get to the backward, the async queue is most likely full and you start waiting for computations to actually happen.
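To make the point above concrete, here is a minimal sketch of timing with explicit synchronization. The helper name `timed` and the stand-in model (two Linear layers with arbitrary sizes) are assumptions for illustration, not the original poster's code. It falls back to the CPU when no GPU is available.

```python
import time
import torch

def timed(fn, device):
    """Run fn() and return (result, seconds), waiting for the GPU before and after."""
    if device.type == "cuda":
        torch.cuda.synchronize(device)  # drain any work queued earlier
    start = time.perf_counter()
    result = fn()
    if device.type == "cuda":
        torch.cuda.synchronize(device)  # wait until fn's kernels have finished
    return result, time.perf_counter() - start

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in model: two Linear layers, sizes chosen arbitrarily.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 64)
).to(device)
x = torch.randn(512, 128, device=device)

loss, forward_s = timed(lambda: model(x).pow(2).mean(), device)
_, backward_s = timed(loss.backward, device)
print(f"forward {forward_s * 1e3:.3f} ms, backward {backward_s * 1e3:.3f} ms")
```

Without the two synchronize calls, the forward number on a GPU would mostly reflect how long it took to enqueue the kernels, and the waiting would show up in whichever later call happens to block.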

You were right, thank you very much! However, I do not understand why the synchronization takes so long. Both embeddings take only 0.008 ms to compute and I call loss.backward() right after. Is there a copy from shared to global memory on the GPU, or something similar, that takes so much time (approx. 0.8ms)? How could I speed up the training in such a case?

Yes, this is with synchronization. I may look at the “sparse” argument, but my Embedding is not that large (100k vocabulary with an embedding size of 768).
Thank you for your help.

Hi, using sparse gradients for the Embedding did not actually solve the problem. I still have a very long synchronization time between the forward and backward passes.
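For reference, this is roughly what enabling sparse gradients on an Embedding looks like. It is a sketch under assumptions: the sizes are scaled down from the 100k x 768 table mentioned in this thread so that it runs quickly, and the batch shape, placeholder loss, and choice of SparseAdam are illustrative only.

```python
import torch

# Scaled-down sizes for illustration; the thread's table is 100k x 768.
vocab_size, dim = 1_000, 64
emb = torch.nn.Embedding(vocab_size, dim, sparse=True)

# Not every optimizer accepts sparse gradients; SparseAdam is one that does.
opt = torch.optim.SparseAdam(emb.parameters(), lr=1e-3)

token_ids = torch.randint(0, vocab_size, (512, 20))  # hypothetical batch: 512 texts, 20 tokens each
text_vec = emb(token_ids).mean(dim=1)                # average of the token embeddings
loss = text_vec.pow(2).mean()                        # placeholder loss

opt.zero_grad()
loss.backward()
print(emb.weight.grad.is_sparse)  # the gradient only stores rows that were looked up
opt.step()
```

With sparse=True the gradient only covers the rows used in the batch, which mainly helps when the table is very large compared to the number of distinct tokens per batch.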

To recap:

I use a single GPU remotely.

I implemented a lightweight two-part neural network: the first part is just the average of the embeddings (text input), the second part is two linear layers (other input). I end up with two embeddings and compute the cosine similarity between them. To get negative samples, I compute the similarity between all pairs of samples in the batch. For example, with a batch size of 512, I get a cosine similarity matrix of size 512 x 512. This step takes 0.3ms to compute.

I use the softmax loss over the cosine similarities (along the first dim).

Synchronizing all kernels on all streams of my GPU takes around 0.8s for a batch size of 512, which is very long. It depends on the batch size (with a batch size of 16 it takes only 0.02s).

The forward and backward steps themselves are quite fast: 0.008s each.

Does anyone have an idea how to reduce the synchronization time between the forward and backward passes? What could be wrong?
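The loss described in the recap above can be sketched roughly as follows. This is one reading of the description, not the poster's actual code: the function name `in_batch_softmax_loss`, the embedding size of 64, and the random inputs are assumptions.

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(text_vec, other_vec):
    # Cosine similarity between every pair in the batch: a (B, B) matrix.
    sim = F.normalize(text_vec, dim=1) @ F.normalize(other_vec, dim=1).T
    # Softmax along the first dim; the diagonal holds the positive pairs.
    log_prob = F.log_softmax(sim, dim=0)
    return -log_prob.diagonal().mean()

torch.manual_seed(0)
a = torch.randn(512, 64, requires_grad=True)  # hypothetical text embeddings
b = torch.randn(512, 64, requires_grad=True)  # hypothetical embeddings of the other input
loss = in_batch_softmax_loss(a, b)
loss.backward()
print(loss.item(), a.grad.shape)
```

Note that the similarity matrix has B x B entries, so the work in both the forward and the backward pass grows with the square of the batch size.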

As I said above, the CUDA API is fully asynchronous. So it is expected that measuring time without synchronization looks “very fast” while the synchronization itself looks “slower”.
That’s because the first measurement only captures how long it takes to queue work for your GPU, while the second waits for the computations to actually happen.

Depending on your GPU and the exact computations you do, 800ms to compute a batch of 512 samples isn’t that surprising really.
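One way to check how the real cost scales is to time a full forward + backward step at several batch sizes with CUDA events, which measure time on the GPU itself. The sketch below is only an illustration: the helper `step_ms`, the random inputs, and the batch sizes are assumptions, and it falls back to a wall clock on CPU.

```python
import time
import torch
import torch.nn.functional as F

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

def step_ms(batch_size, dim=768):
    """Time one forward + backward pass in milliseconds for a given batch size."""
    a = torch.randn(batch_size, dim, device=device, requires_grad=True)
    b = torch.randn(batch_size, dim, device=device, requires_grad=True)
    if use_cuda:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
    else:
        t0 = time.perf_counter()
    sim = F.normalize(a, dim=1) @ F.normalize(b, dim=1).T  # (B, B) cosine similarities
    loss = -F.log_softmax(sim, dim=0).diagonal().mean()
    loss.backward()
    if use_cuda:
        end.record()
        torch.cuda.synchronize()  # events are only readable once they have completed
        return start.elapsed_time(end)
    return (time.perf_counter() - t0) * 1e3

step_ms(16)  # warm-up: the first call pays one-time initialization costs
timings = {bs: step_ms(bs) for bs in (16, 128, 512)}
for bs, ms in timings.items():
    print(f"batch {bs:4d}: {ms:.3f} ms")
```

If the numbers grow with the batch size in the same way as the “synchronization time” reported earlier, that time is simply the computation itself finishing on the GPU.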