Memory allocation for backward pass

I am trying to estimate how much GPU memory my model will allocate before I start training. To do that, I count the number of model parameters and the sizes of the input and output tensors, and I track all intermediate computations. When I run model(inputs) under torch.no_grad(), the estimated allocation in bytes is (model_params + input_params + intermediate_params) * 32 / 8. This matches what torch.cuda.memory_allocated() reports.
When I do the same with grad mode enabled, so that the graph is built, I would expect the allocation to be (model_params + input_params + 2 * intermediate_params) * 32 / 8 because of the backward pass. However, torch.cuda.memory_allocated() reports about (model_params + input_params + 3 * intermediate_params) * 32 / 8. Is anything else allocated on the GPU, or am I missing something?
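To make the arithmetic concrete, here is a minimal sketch of the estimate as a small helper. The function name, the `intermediate_factor` argument, and the element counts in the example are hypothetical placeholders, not values from my actual model; the only thing taken from the formulas above is the 32 bits per value and the factor in front of intermediate_params.

```python
def estimated_bytes(model_params, input_params, intermediate_params,
                    intermediate_factor=1, bits_per_value=32):
    """Estimate allocated bytes from element counts (hypothetical helper).

    intermediate_factor is the multiplier in front of intermediate_params:
    1 for the no_grad case, 2 for the expected grad case, 3 for what was
    observed with grad mode enabled.
    """
    total_values = model_params + input_params + intermediate_factor * intermediate_params
    return total_values * bits_per_value // 8


# Placeholder element counts, chosen only for illustration.
counts = dict(model_params=1_000_000, input_params=200_000, intermediate_params=5_000_000)

no_grad_estimate = estimated_bytes(**counts, intermediate_factor=1)
expected_grad_estimate = estimated_bytes(**counts, intermediate_factor=2)
observed_grad_estimate = estimated_bytes(**counts, intermediate_factor=3)

print(no_grad_estimate, expected_grad_estimate, observed_grad_estimate)
```

Each of these numbers is what I compare against torch.cuda.memory_allocated() after the forward pass.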
Thanks for helping out,


All the intermediate states do get allocated, but they should all be freed again when you run under no_grad() (unless you keep a reference to them).
When grad mode is enabled, most of them are kept alive by autograd so that it can compute the backward pass. Nothing else is allocated, though.
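If you want to see what autograd actually keeps alive, one option is to count the tensors it saves for backward using torch.autograd.graph.saved_tensors_hooks. The sketch below is only an illustration: the helper name `saved_bytes` and the toy model are made up, it runs on CPU, and the count is approximate because it also includes saved parameters and deduplicates only by data pointer.

```python
import torch
import torch.nn as nn


def saved_bytes(model, inputs, grad_enabled):
    """Approximate bytes of tensors autograd saves during one forward pass."""
    seen = set()   # data pointers already counted, to avoid double counting
    total = 0

    def pack(t):
        # Called once for every tensor autograd saves for the backward pass.
        nonlocal total
        if t.data_ptr() not in seen:
            seen.add(t.data_ptr())
            total += t.numel() * t.element_size()
        return t

    def unpack(t):
        return t

    with torch.set_grad_enabled(grad_enabled):
        with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
            model(inputs)
    return total


# Toy model and input, chosen only for illustration.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
x = torch.randn(32, 64)

no_grad_saved = saved_bytes(model, x, grad_enabled=False)
grad_saved = saved_bytes(model, x, grad_enabled=True)
print("saved without grad:", no_grad_saved)
print("saved with grad:   ", grad_saved)
```

With grad mode disabled no graph is built, so nothing is saved. With grad mode enabled, the saved tensors are the intermediates (and parameters) that stay alive until backward runs or the output is released. Comparing this count with your own intermediate_params estimate may help show which intermediates you are counting differently.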