Increasing data set size slows loss.backward() even though batch size is constant


I am training a Conv-TasNet-like network on time series data. My training data consists of two vectors of ~20 million points each: the time values and a dependent variable measured in an experiment. Only the latter is fed through the network. I have found that if I load the entire data set and pass it to my training function, the loss.backward() call takes significantly longer than if I first chunk the data into sets of, say, ~10,000 points. I am not passing the full data set through the network on each iteration: the training function calls another function that builds batches, and my GPU couldn’t hold the full set and the model in memory anyway. Even though the batch size is the same whether I chunk the data or not, there is still a bottleneck at loss.backward(). I profiled this with torch.utils.bottleneck, which returns the following:

Environment Summary

PyTorch 1.7.0+cu110 DEBUG compiled w/ CUDA 11.0
Running with Python 3.7 and

pip3 list truncated output:

cProfile output

     961120 function calls (935907 primitive calls) in 6.169 seconds

Ordered by: internal time
List reduced from 4861 to 15 due to restriction <15>

ncalls tottime percall cumtime percall filename:lineno(function)
1 2.472 2.472 2.472 2.472 {method ‘run_backward’ of ‘torch._C._EngineBase’ objects}
109 1.117 0.010 1.117 0.010 {built-in method conv1d}
226 0.921 0.004 0.921 0.004 {method ‘to’ of ‘torch._C._TensorBase’ objects}
1 0.351 0.351 0.351 0.351 {built-in method randperm}
2 0.332 0.166 0.332 0.166 {built-in method _pickle.load}
1 0.127 0.127 0.551 0.551 C:\Users\Dawson\PycharmProjects\deep_learning_denoiser\
661 0.074 0.000 0.088 0.000 :914(get_data)
3183 0.070 0.000 0.070 0.000 {built-in method nt.stat}
4 0.059 0.015 0.059 0.015 {built-in method tensor}
661 0.049 0.000 0.049 0.000 {built-in method marshal.loads}
3 0.034 0.011 0.034 0.011 {method ‘float’ of ‘torch._C._TensorBase’ objects}
1 0.025 0.025 0.025 0.025 {built-in method batch_norm}
218 0.023 0.000 0.023 0.000 {method ‘uniform_’ of ‘torch._C._TensorBase’ objects}
1954/1761 0.017 0.000 0.106 0.000 {built-in method builtins.build_class}
1 0.016 0.016 6.169 6.169

autograd profiler output (CUDA mode)

    top 15 events sorted by cpu_time_total

    Because the autograd profiler uses the CUDA event API,
    the CUDA time column reports approximately max(cuda_time, cpu_time).
    Please ignore this output if your code does not use CUDA.

                                 Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls

                       aten::randperm        25.22%     273.718ms        25.22%     273.718ms     273.718ms       6.125us         0.79%       6.125us       6.125us             1
                       aten::randperm        25.22%     273.688ms        25.22%     273.688ms     273.688ms       4.125us         0.53%       4.125us       4.125us             1
struct torch::autograd::CopyBackwards         9.50%     103.102ms         9.50%     103.102ms     103.102ms     223.875us        28.83%     223.875us     223.875us             1
                             aten::to         9.50%     103.096ms         9.50%     103.096ms     103.096ms     221.875us        28.58%     221.875us     221.875us             1
                          aten::copy_         9.50%     103.083ms         9.50%     103.083ms     103.083ms     219.500us        28.27%     219.500us     219.500us             1
                            aten::add         2.77%      30.076ms         2.77%      30.076ms      30.076ms       4.250us         0.55%       4.250us       4.250us             1
                             aten::to         2.34%      25.408ms         2.34%      25.408ms      25.408ms       5.062us         0.65%       5.062us       5.062us             1
                          aten::copy_         2.34%      25.387ms         2.34%      25.387ms      25.387ms       3.125us         0.40%       3.125us       3.125us             1
                        SliceBackward         2.23%      24.180ms         2.23%      24.180ms      24.180ms      19.250us         2.48%      19.250us      19.250us             1
                 aten::slice_backward         2.23%      24.148ms         2.23%      24.148ms      24.148ms      17.000us         2.19%      17.000us      17.000us             1
                            aten::add         1.89%      20.535ms         1.89%      20.535ms      20.535ms       5.750us         0.74%       5.750us       5.750us             1
                        SliceBackward         1.85%      20.120ms         1.85%      20.120ms      20.120ms      19.500us         2.51%      19.500us      19.500us             1
                 aten::slice_backward         1.85%      20.110ms         1.85%      20.110ms      20.110ms      16.750us         2.16%      16.750us      16.750us             1
                             aten::to         1.77%      19.251ms         1.77%      19.251ms      19.251ms       6.125us         0.79%       6.125us       6.125us             1
                          aten::copy_         1.77%      19.230ms         1.77%      19.230ms      19.230ms       4.125us         0.53%       4.125us       4.125us             1

Self CPU time total: 1.085s
CUDA time total: 776.438us

The cProfile output points to the ‘run_backward’ method as the culprit of my slowdown, while the autograd profiler shows ‘aten::randperm’ as the bottleneck. To randomize my batches I permute the data set indices, which are then used to select my data in random order, but this is done only once per training run, so I am not concerned about it. I don’t know enough about the subtleties of PyTorch’s autograd to understand why the total size of my data set should matter as long as the batch sizes are manageable. Here is a GitHub repository with my training script in it as well, in case it helps shed some light on any mistakes I may be making. Could there be some copy operation on the full data set that I am not recognizing and that is causing the slowdown?
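For reference, the index-permutation batching described above can be sketched roughly as follows. This is only an illustration, not the code from my repository: the function name make_batches and the parameters segment_len and batch_size are made up for the example, and it assumes the data is a 1-D tensor that is cut into fixed-length segments.

```python
import torch

def make_batches(data, segment_len, batch_size, generator=None):
    """Yield batches of fixed-length segments of `data` in random order.

    Hypothetical helper for illustration; names and shapes are assumptions.
    """
    n_segments = data.numel() // segment_len
    # One permutation per call, so its cost is paid once per pass over the data.
    order = torch.randperm(n_segments, generator=generator)
    # View the 1-D data as (n_segments, segment_len) without copying it.
    segments = data[: n_segments * segment_len].view(n_segments, segment_len)
    for i in range(0, n_segments, batch_size):
        idx = order[i:i + batch_size]
        # Indexing with a tensor of indices copies only the rows of this batch.
        yield segments[idx]

data = torch.arange(10_000, dtype=torch.float32)
batches = list(make_batches(data, segment_len=100, batch_size=8))
```

Only the small batch tensor would then be moved to the GPU, so the permutation itself should not scale the cost of each backward pass.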

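I think I have found the issue. I had wrongly assumed that the input data tensors needed requires_grad=True for proper training. After experimenting a little and setting requires_grad=False for the input data, everything runs much faster and the network still learns. I guess only model.parameters() need requires_grad=True. That would also explain the SliceBackward and CopyBackwards entries in the profile: with requires_grad=True on the full data tensor, autograd presumably has to build a gradient the size of the whole tensor on every backward pass, even though only one batch contributes to it.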
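Here is a minimal sketch of the effect, assuming a small 1-D tensor as a stand-in for the full data set and a single Conv1d layer as a stand-in for the network. The names full_data and backward_on_batch are invented for this example and are not from my training script.

```python
import torch

# Hypothetical stand-ins: a small "full data set" and a tiny model.
full_data = torch.randn(100_000)
model = torch.nn.Conv1d(1, 1, kernel_size=3, padding=1)

def backward_on_batch(data, start=0, length=1_000):
    """Run one forward/backward pass on a slice of `data`."""
    model.zero_grad()
    batch = data[start:start + length].view(1, 1, -1)  # (batch, channels, time)
    loss = model(batch).pow(2).mean()
    loss.backward()
    return loss

# Case 1: the full data tensor requires grad. Backpropagating through the
# slice produces a gradient as large as the whole data set.
data_with_grad = full_data.clone().requires_grad_(True)
backward_on_batch(data_with_grad)
grad_shape_full = tuple(data_with_grad.grad.shape)
nonzero_in_grad = int((data_with_grad.grad != 0).sum())

# Case 2: the data is a plain tensor. No input gradient is built, but the
# model parameters still receive gradients, so training works as before.
data_no_grad = full_data.clone()
backward_on_batch(data_no_grad)
input_grad = data_no_grad.grad
params_have_grad = all(p.grad is not None for p in model.parameters())
```

In the first case the gradient has one entry per data point, but at most the 1,000 entries belonging to the batch can be nonzero. In the second case the input has no gradient at all, and the parameters are still updated as usual.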