I am observing some confusing performance issues when profiling using emit_nvtx and Nvidia Visual Profiler.
For example, compare the performance of MSELoss in the following two scenarios:
1. A simple 3-layer BLSTM network
1.1 Output is contiguous
1.2 Time spent on MSELoss varies from ~800 microseconds to ~5 milliseconds
2. A complex network like MMDenseLSTM
2.1 Output is not contiguous, and many other tensors within the network are non-contiguous as well
2.2 Time spent on MSELoss varies from ~150 milliseconds to ~200 milliseconds
The outputs have the same shape for both the scenarios.
This is a very simple example; similar differences are observed in the backward computations, gradient accumulation, and parameter updates as well.
I tried making the output contiguous just before calculating MSELoss, but without much effect (only a ~5 millisecond reduction).
So my question is: what impact does the contiguity of the data have on the performance of the different steps, especially MSELoss, gradient accumulation, and Optimizer.step()? Or are there other considerations I should take into account?
@albanD, @apaszke It would be great if you could shed some light on this.
Yes! Contiguity of data matters a lot for performance reasons.
In particular, many operations are written in such a way that they only work if the inputs are contiguous. In this case, the operation will copy its inputs to be contiguous and then send the inputs through the operation. Copying tensors to make them contiguous (tensor.contiguous()) gets very time-consuming for large tensors.
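As a minimal sketch of the copy described above (the tensor names and shapes are made up for illustration), the following shows that a transposed view is non-contiguous and that calling contiguous() on it allocates new memory, while calling it on an already contiguous tensor returns the same tensor:

```python
import torch

x = torch.randn(4, 6)   # freshly allocated tensors are contiguous
xt = x.t()              # transpose is a view: same storage, swapped strides

print(x.is_contiguous())    # True
print(xt.is_contiguous())   # False

# contiguous() on a non-contiguous tensor allocates new memory and copies.
xc = xt.contiguous()
print(xc.is_contiguous())                # True
print(xc.data_ptr() != xt.data_ptr())    # True: the data was copied

# contiguous() on an already contiguous tensor is a no-op (no copy).
same = x.contiguous()
print(same.data_ptr() == x.data_ptr())   # True
```

The copy cost grows with the number of elements, which is why it only becomes noticeable for large tensors.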
I see. But my question is: does the MSELoss calculation depend only on the output and the targets, or also on the steps that produced the output? Even if I explicitly make the output contiguous before calling MSELoss, I don't see much of a difference; 140 ms is still far more than the 2 ms in the BLSTM case.
Also, when you say that some operations only work if the inputs are contiguous, so the operation copies its inputs to be contiguous and then operates, does that mean something like
inputs -> copy/contiguous -> operate -> copy back? In other words, does the copy happen twice for these ops?
Usually just inputs -> copy/contiguous -> operate. The operation in question, MSELoss, does not care whether its inputs are contiguous or not (but the code goes through different code paths depending on the contiguity of the inputs).
The MSELoss backward pass depends on the gradOutput, input, and target. It does not depend on the output, but depending on what happens with the output, the gradOutput passed to MSELoss could be non-contiguous.
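A small check of this point, as a sketch with made-up shapes: MSELoss accepts a non-contiguous input directly and produces the same value as it does for a contiguous copy of that input. Only the internal code path differs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical tensors: `output` is a non-contiguous transposed view.
output = torch.randn(8, 16).t()   # shape (16, 8), non-contiguous
target = torch.randn(16, 8)       # contiguous

loss_fn = nn.MSELoss()
loss_strided = loss_fn(output, target)               # non-contiguous input
loss_dense = loss_fn(output.contiguous(), target)    # contiguous copy

print(output.is_contiguous())                    # False
print(torch.allclose(loss_strided, loss_dense))  # True
```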
Can you point me to the relevant code in the PyTorch source? Right now I am only worried about the forward pass; I will think about the backward pass later. I need to speed up the MSELoss forward path. I added a lot of contiguous() calls in my model's forward, but MSELoss still takes 150 ms in its forward pass.
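One way to check the forward cost of the loss in isolation is to time it on standalone tensors of the same shape, as in the rough sketch below. The helper name time_forward and the tensor shapes are assumptions made for illustration, not something taken from the models above. Because CUDA kernels are launched asynchronously, the helper synchronizes before reading the clock. Without that, time spent in earlier queued work may end up attributed to whichever call happens to wait for it.

```python
import time

import torch
import torch.nn as nn


def time_forward(fn, *args, repeats=20):
    """Average wall-clock seconds per call of fn(*args)."""
    use_cuda = any(isinstance(a, torch.Tensor) and a.is_cuda for a in args)
    fn(*args)  # warm-up call, excluded from the measurement
    if use_cuda:
        torch.cuda.synchronize()  # wait for queued GPU work before timing
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    if use_cuda:
        torch.cuda.synchronize()  # wait for the timed kernels to finish
    return (time.perf_counter() - start) / repeats


device = "cuda" if torch.cuda.is_available() else "cpu"
loss_fn = nn.MSELoss()

# Hypothetical shapes; substitute the real output shape of the model.
output = torch.randn(32, 256, 256, device=device)
target = torch.randn(32, 256, 256, device=device)

t_dense = time_forward(loss_fn, output, target)
# Same data viewed through swapped strides, so both inputs are non-contiguous.
t_strided = time_forward(loss_fn, output.transpose(1, 2), target.transpose(1, 2))

print(f"contiguous:     {t_dense * 1e3:.3f} ms")
print(f"non-contiguous: {t_strided * 1e3:.3f} ms")
```

If the isolated timings are far below what the profiler reports inside the full model, the extra time is probably not spent in the loss computation itself.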
Here’s the code for MSELoss.
Also of interest is the THTensorApply file. MSELoss is implemented with a “pointwise apply” operation: depending on the contiguity of the inputs, it’ll pick a different code path.
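To illustrate the idea of a pointwise apply that picks a code path based on contiguity, here is a pure-Python sketch. It is not the actual TH implementation: the function names are hypothetical, tensors are modeled as flat lists plus strides, and a storage offset of zero is assumed.

```python
from itertools import product


def is_contiguous(shape, strides):
    """True if the strides describe a dense row-major layout for shape."""
    expected = 1
    for size, stride in zip(reversed(shape), reversed(strides)):
        if size != 1 and stride != expected:
            return False
        expected *= size
    return True


def mse_pointwise(a, a_strides, b, b_strides, shape):
    """Mean squared error of two flat storages viewed through strides.

    Returns (value, path), where path names the branch that was taken.
    """
    n = 1
    for size in shape:
        n *= size
    if is_contiguous(shape, a_strides) and is_contiguous(shape, b_strides):
        # Fast path: both storages can be walked linearly.
        path = "contiguous"
        total = sum((a[i] - b[i]) ** 2 for i in range(n))
    else:
        # Slow path: compute a storage offset from the strides per element.
        path = "strided"
        total = 0.0
        for index in product(*(range(s) for s in shape)):
            ia = sum(i * st for i, st in zip(index, a_strides))
            ib = sum(i * st for i, st in zip(index, b_strides))
            total += (a[ia] - b[ib]) ** 2
    return total / n, path


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # storage of a 2x3 row-major tensor
b = [0.0] * 6

dense = mse_pointwise(a, (3, 1), b, (3, 1), (2, 3))
# The same storage viewed as its 3x2 transpose has strides (1, 3).
strided = mse_pointwise(a, (1, 3), b, (2, 1), (3, 2))

print(dense[1], strided[1])     # contiguous strided
print(dense[0] == strided[0])   # True: same result, different path
```

The strided branch does extra index arithmetic per element, which is the kind of overhead a contiguous layout avoids.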