Hello,
I am training a Conv-TasNet-like network for an application that works on time-series data. My training data consists of two vectors of roughly 20 million points each: the time data and a dependent variable relevant to an experiment; the latter is what I feed through the network. What I have found is that if I load my entire dataset and pass it to my training function, the loss.backward() call takes significantly longer to run than if I first chunk the data into segments of ~10,000 points. To be clear, I am not passing the full dataset through the network on each iteration: the training function calls another function that builds batches, and my GPU could not hold the full set and the model in memory anyway. Even though the batch size is the same whether or not I chunk the data, there is still a bottleneck at loss.backward().
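For context, my batching scheme looks roughly like this (a simplified sketch, not my exact code; data, target, model, criterion, optimizer, and batch_size stand in for my actual objects):

import torch

device = torch.device("cuda")

# data: CPU tensor of shape (num_segments, segment_len), built by chunking
# the ~20M-point series; target has the same layout. The index permutation
# is computed once per training run.
indices = torch.randperm(data.shape[0])

for start in range(0, data.shape[0], batch_size):
    batch_idx = indices[start:start + batch_size]
    batch = data[batch_idx].to(device)   # move only this batch to the GPU
    output = model(batch)
    loss = criterion(output, target[batch_idx].to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

I have profiled this with torch.utils.bottleneck, which returns the following: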
Environment Summary
PyTorch 1.7.0+cu110 DEBUG compiled w/ CUDA 11.0
Running with Python 3.7 and
pip3 list
truncated output:
numpy==1.19.3
torch==1.7.0+cu110
torchaudio==0.7.0
torchsummary==1.5.1
torchvision==0.8.1+cu110
cProfile output
961120 function calls (935907 primitive calls) in 6.169 seconds
Ordered by: internal time
List reduced from 4861 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
1 2.472 2.472 2.472 2.472 {method 'run_backward' of 'torch._C._EngineBase' objects}
109 1.117 0.010 1.117 0.010 {built-in method conv1d}
226 0.921 0.004 0.921 0.004 {method 'to' of 'torch._C._TensorBase' objects}
1 0.351 0.351 0.351 0.351 {built-in method randperm}
2 0.332 0.166 0.332 0.166 {built-in method _pickle.load}
1 0.127 0.127 0.551 0.551 C:\Users\Dawson\PycharmProjects\deep_learning_denoiser\TasNet_train.py:13(load_eqp_data)
661 0.074 0.000 0.088 0.000 <frozen importlib._bootstrap_external>:914(get_data)
3183 0.070 0.000 0.070 0.000 {built-in method nt.stat}
4 0.059 0.015 0.059 0.015 {built-in method tensor}
661 0.049 0.000 0.049 0.000 {built-in method marshal.loads}
3 0.034 0.011 0.034 0.011 {method 'float' of 'torch._C._TensorBase' objects}
1 0.025 0.025 0.025 0.025 {built-in method batch_norm}
218 0.023 0.000 0.023 0.000 {method 'uniform_' of 'torch._C._TensorBase' objects}
1954/1761 0.017 0.000 0.106 0.000 {built-in method builtins.__build_class__}
1 0.016 0.016 6.169 6.169
autograd profiler output (CUDA mode)
top 15 events sorted by cpu_time_total
Because the autograd profiler uses the CUDA event API,
the CUDA time column reports approximately max(cuda_time, cpu_time).
Please ignore this output if your code does not use CUDA.
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
aten::randperm 25.22% 273.718ms 25.22% 273.718ms 273.718ms 6.125us 0.79% 6.125us 6.125us 1
aten::randperm 25.22% 273.688ms 25.22% 273.688ms 273.688ms 4.125us 0.53% 4.125us 4.125us 1
struct torch::autograd::CopyBackwards 9.50% 103.102ms 9.50% 103.102ms 103.102ms 223.875us 28.83% 223.875us 223.875us 1
aten::to 9.50% 103.096ms 9.50% 103.096ms 103.096ms 221.875us 28.58% 221.875us 221.875us 1
aten::copy_ 9.50% 103.083ms 9.50% 103.083ms 103.083ms 219.500us 28.27% 219.500us 219.500us 1
aten::add 2.77% 30.076ms 2.77% 30.076ms 30.076ms 4.250us 0.55% 4.250us 4.250us 1
aten::to 2.34% 25.408ms 2.34% 25.408ms 25.408ms 5.062us 0.65% 5.062us 5.062us 1
aten::copy_ 2.34% 25.387ms 2.34% 25.387ms 25.387ms 3.125us 0.40% 3.125us 3.125us 1
SliceBackward 2.23% 24.180ms 2.23% 24.180ms 24.180ms 19.250us 2.48% 19.250us 19.250us 1
aten::slice_backward 2.23% 24.148ms 2.23% 24.148ms 24.148ms 17.000us 2.19% 17.000us 17.000us 1
aten::add 1.89% 20.535ms 1.89% 20.535ms 20.535ms 5.750us 0.74% 5.750us 5.750us 1
SliceBackward 1.85% 20.120ms 1.85% 20.120ms 20.120ms 19.500us 2.51% 19.500us 19.500us 1
aten::slice_backward 1.85% 20.110ms 1.85% 20.110ms 20.110ms 16.750us 2.16% 16.750us 16.750us 1
aten::to 1.77% 19.251ms 1.77% 19.251ms 19.251ms 6.125us 0.79% 6.125us 6.125us 1
aten::copy_ 1.77% 19.230ms 1.77% 19.230ms 19.230ms 4.125us 0.53% 4.125us 4.125us 1
Self CPU time total: 1.085s
CUDA time total: 776.438us
The cProfile output points to the 'run_backward' method as the culprit of my slowdown, while the autograd profiler shows 'aten::randperm' as the bottleneck. To randomize my batches I permute the dataset indices and use them to select the order of my data, but this is only done once per training run, so I am not concerned about randperm itself. I don't know enough about the subtleties of PyTorch's autograd to understand why the total size of my dataset should matter as long as the batch sizes are manageable. I have also put my training script in a GitHub repository in case it helps shed light on any mistakes I may be making. Could there be some copy operation over the full dataset that I don't recognize which is causing the slowdown?
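To illustrate the kind of hidden copy I am worried about: if a batch were taken as a slice of a tensor that is itself attached to the autograd graph, the backward pass would have to materialize a gradient buffer the size of the entire source tensor rather than just the batch. A minimal, hypothetical example of that pattern (not my actual code):

import torch

full = torch.randn(20_000_000, requires_grad=True)  # whole dataset attached to the graph
batch = full[:10_000]       # a view, still connected to full
loss = batch.sum()
loss.backward()             # SliceBackward allocates a zero gradient the size
                            # of full and scatters the batch gradient into it

The SliceBackward and aten::slice_backward entries in the autograd profile above are what made me wonder whether something like this is happening in my code.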