I have a model that computes a time-varying covariance matrix for time-series data. The data has about 40 features per time point and a few hundred thousand time points. The forward pass and loss calculation are quite quick, roughly 1–2 seconds, but the backwards pass takes upwards of 5 minutes for each minibatch.
I’ve tried using the profiler to diagnose the issue, but it hasn’t been especially helpful. I’ve uploaded the exported profile to Dropbox here.
I’m struggling to glean any useful information from this profile. Could someone take a look at it and see if I’m missing anything, or whether it indicates something I’m doing fundamentally wrong?
In your training code, do you wrap everything that does not need to be included in the computation graph in with torch.no_grad()? I experienced a similar issue, and it was because my evaluation code was being included in the computation graph.
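Something along these lines is what I mean – the whole evaluation loop sits inside no_grad, so none of it gets recorded for backward (net, val_loader and loss_fn here are just placeholders for your own model, data loader and loss):

```python
import torch

# Evaluation kept out of the computation graph: no intermediate
# buffers are saved for backward while this block runs.
net.eval()
with torch.no_grad():
    total_loss = 0.0
    for x, y in val_loader:            # placeholder validation DataLoader
        pred = net(x)
        total_loss += loss_fn(pred, y).item()
net.train()
```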
All variables that don’t need a gradient have the requires_grad flag set to False, and I construct the optimiser with torch.optim.Adam(filter(lambda p: p.requires_grad, net.parameters()), lr).
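Roughly, the setup looks like this (the parameter-name check is just a placeholder for however the frozen parts of the model are identified):

```python
import torch

# Freeze the parameters that should not receive gradients
for name, p in net.named_parameters():
    if name.startswith("frozen_"):     # placeholder naming convention
        p.requires_grad = False

# Only the trainable parameters are handed to the optimiser
optimiser = torch.optim.Adam(
    filter(lambda p: p.requires_grad, net.parameters()),
    lr=lr,
)
```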
I’ve gone back and had a look at how the number of spatial dimensions affects the time taken for the backwards pass, and it seems to scale quadratically, which I suppose makes sense as the covariance matrix is D×D. The question is: does a backwards pass taking 200x longer than the forwards pass make sense?
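For reference, the sort of timing check I’ve been doing looks roughly like this (make_model is a stand-in for however the model is actually built, and the .sum() stands in for the real loss):

```python
import time
import torch

def time_fwd_bwd(model, x, device):
    """Return (forward seconds, backward seconds) for one minibatch."""
    x = x.to(device)
    if device.type == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    loss = model(x).sum()              # stand-in for the real loss
    if device.type == "cuda":
        torch.cuda.synchronize()
    t1 = time.perf_counter()
    loss.backward()
    if device.type == "cuda":
        torch.cuda.synchronize()
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for D in (10, 20, 40):
    model = make_model(D).to(device)            # placeholder constructor
    x = torch.randn(1000, D, device=device)     # dummy minibatch
    fwd, bwd = time_fwd_bwd(model, x, device)
    print(f"D={D}: forward {fwd:.3f}s, backward {bwd:.3f}s, ratio {bwd / fwd:.1f}x")
```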
Is there any way of debugging the backwards pass? If I try to step through it in PyCharm, for example, I get as far as Variable._execution_engine.run_backward, after which I can no longer step through it – I assume this part is implemented in C++? Is there any way I can look specifically at the operations mentioned in the profiler?
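The closest I’ve got is registering hooks on intermediate tensors during the forward pass, so I can at least see when each gradient arrives and how long the gaps between them are (a rough sketch – compute_covariance, loss_fn, x and target are placeholders for my actual code):

```python
import time
import torch

t_last = time.perf_counter()

def timing_hook(name):
    def hook(grad):
        # Called by autograd when the gradient for this tensor is ready
        global t_last
        now = time.perf_counter()
        print(f"{name}: grad ready after {now - t_last:.3f}s, shape {tuple(grad.shape)}")
        t_last = now
        return grad
    return hook

# Attach a hook to each intermediate tensor of interest, e.g. the
# covariance matrix before it goes into the loss:
cov = compute_covariance(x)            # placeholder intermediate
cov.register_hook(timing_hook("cov"))
loss = loss_fn(cov, target)
loss.backward()
```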
I’m also wondering if the time taken in the backwards pass is simply down to how many operations take place; profiling only the backwards pass shows over 300,000 operations. Printing the full contents of the profiler alone takes ages…
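For what it’s worth, aggregating the events by op name rather than printing the raw list keeps the output manageable – something along these lines (the model, input and loss are stand-ins; newer versions have torch.profiler as well):

```python
import torch
from torch.autograd import profiler

with profiler.profile(use_cuda=torch.cuda.is_available()) as prof:
    loss = net(x).sum()                # stand-in forward pass + loss
    loss.backward()

# Group the ~300k events by operator name and show the 20 most expensive
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
```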
Could anyone throw out some suggestions? I’ve now got it running on a cluster with GPUs, which is faster but still quite slow: on the order of 1 minute 20 seconds per minibatch, which works out to around 1 hour per epoch. It would be nice for training to take less than a week or two!