Tracking access patterns during training

Hello everyone, this is my first post on the PyTorch forums - I hope you are keeping well!
I understand that what I am going to ask is detailed and/or advanced, but please bear with me - I am willing to go all the way :slight_smile:

I want to track what (layer module) accessed my tensor and when my tensor was accessed.

For example - when training:

  1. A tensor is used in the forward computation of a layer.
    1a) The output produced by that layer is then saved for autograd on the later backwards pass.
  2. Then, during the backward pass of autograd, that same tensor is used to calculate gradients for the layer that produced it.

By recording when steps 1) and 2) start, a tensor access pattern can be observed. The goal is to identify the time difference between the 1st and 2nd accesses - see the sketch below. (I am a grad student - please feel free to ask more about my work)
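
Here is a rough, minimal sketch of what I mean for a single tensor. I am treating torch::Tensor::register_hook as an approximation of the second access: the hook fires when the gradient with respect to the tensor is computed during backward, not when any saved buffer is literally read, and the timestamps are host-side only.

#include <torch/torch.h>

#include <chrono>
#include <iostream>

// Print a label with a host-side timestamp in microseconds.
static void stamp(const std::string& label) {
	auto us = std::chrono::duration_cast<std::chrono::microseconds>(
		std::chrono::steady_clock::now().time_since_epoch()).count();
	std::cout << label << " @ " << us << " us" << std::endl;
}

int main() {
	torch::Tensor x = torch::randn({8, 8}, torch::requires_grad());

	torch::Tensor y = torch::relu(x); // 1) y is produced in the forward
	stamp("1) y produced (forward)");

	// 2) (approximately) the hook fires during backward when the gradient
	//    with respect to y is computed, i.e. roughly when y is needed again.
	y.register_hook([](torch::Tensor /*grad*/) {
		stamp("2) y revisited (backward)");
	});

	y.sum().backward();
	return 0;
}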

Please understand that I have spent some time learning the PyTorch internals (I recommend here or here) and have thought of some methods of implementation.

I am posting here just to hear from some other voices who are more knowledgeable and experienced in PyTorch.

Thank you for your time.

Hi,

1a) The output produced by that layer is then saved for autograd on the later backwards pass.

This is not strictly true, I think. In particular, it depends on what you call a “layer”.
The layers in torch.nn and neural nets are completely independent of the autograd, so this statement would be wrong in that case.
The autograd works at a lower level, where we have a set of specific “elementary ops” for which we save (only when needed, not all the time!) inputs and/or outputs.
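
As a rough illustration, you can see which elementary ops the autograd actually recorded by walking the grad_fn chain of an output. The exact node names (e.g. AddmmBackward0) depend on your PyTorch version, and following only the first edge of each node is a simplification:

#include <torch/torch.h>

#include <iostream>

int main() {
	torch::nn::Linear linear(4, 4);
	torch::Tensor x = torch::randn({2, 4}, torch::requires_grad());
	torch::Tensor y = torch::relu(linear(x));

	// The graph is made of elementary backward nodes (e.g. ReluBackward0,
	// AddmmBackward0, AccumulateGrad), not of the nn::Linear module itself.
	for (auto node = y.grad_fn(); node != nullptr;
	     node = node->next_edges().empty() ? nullptr : node->next_edge(0).function) {
		std::cout << node->name() << std::endl;
	}
	return 0;
}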

Are you interested in the first or second in your case?

I can’t really think of an easy way to hack this in though. Maybe you can use this construct that is used to store Tensors for the backward pass. It is created around the same time as the forward function (sometimes before, sometimes after), and its unpack is called during the backward pass when the Tensor is needed.

Hope this helps!

Hello, thank you for your reply.

I have wrongly used “autograd” to refer to the general backward propagation pass (error and gradient). The autograd is something rather mysterious to me, and will require much more attention.

Are you interested in the first or second in your case?

I am interested in the second case.

When referring to GPU memory usage, we generally have the problem where memory fills up with “intermediate” tensors and an OOM event occurs. This is closely linked to which tensors are generated by forward propagation, stored in memory, and then reused for error and gradient calculation.

That construct looks somewhat promising - I will take a look over the next week.
I was also thinking about potentially using a tensor’s version counter. Each access could trigger a function call here that may be able to record some metadata for my use.

The version counter is only ever bumped when in-place operations happen. So that might not cover all the cases you want.
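
For example, here is a small sketch that reads the counter through internal API (unsafeGetTensorImpl() and version_counter() are internals, so this may change between versions):

#include <torch/torch.h>

#include <iostream>

int main() {
	torch::Tensor t = torch::zeros({2, 2});
	auto version = [&t]() {
		return t.unsafeGetTensorImpl()->version_counter().current_version();
	};

	std::cout << version() << std::endl; // 0
	torch::Tensor u = t + 1;             // out-of-place read: counter unchanged
	std::cout << version() << std::endl; // still 0
	t.add_(1);                           // in-place write: counter is bumped
	std::cout << version() << std::endl; // 1
	return 0;
}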

If you have an example or idea of what exactly you want to get for a given sample code, that might help us guide you more :slight_smile:

Ok, it is rather late here now, so I will reply tomorrow with a code snippet.

Ok, I put together a quick toy network.
It’s a simple model, with the forward function fully expanded for the discussion.

#include <torch/torch.h>

struct ToyImpl : torch::nn::Module {
	ToyImpl(int64_t input_size, int64_t output_size) : 
		linear1(register_module("linear1", torch::nn::Linear(input_size,64))),
		linear2(register_module("linear2", torch::nn::Linear(64, output_size))) {}

	torch::Tensor forward(torch::Tensor input){
		torch::Tensor p = linear1(input);
		torch::Tensor q = torch::relu(p);
		torch::Tensor r = linear2(q);
		return r;
	}
	torch::nn::Linear linear1, linear2;
};

TORCH_MODULE(Toy);

const int64_t input_size  = 16;
const int64_t output_size = 2;
const int64_t N  = 256; // batch size

int main()
{
	Toy toyNet(input_size,output_size);
	toyNet->to(torch::kCUDA);
	// input tensor & desired output
	torch::Tensor input        = torch::rand({N, input_size }).to(torch::kCUDA);
	torch::Tensor output_truth = torch::rand({N, output_size}).to(torch::kCUDA);

	float learning_rate = 1e-4;
	torch::optim::SGD optimizer(
		toyNet->parameters(), torch::optim::SGDOptions(learning_rate));

	// a single epoch
	optimizer.zero_grad();
	auto output_pred = toyNet->forward(input);
	auto loss = torch::mse_loss(output_pred, output_truth.detach());
	loss.backward();
	optimizer.step();

	return 0;
}

So, during toyNet->forward(input) I want to track when tensors p, q, r are accessed and when their computation is finished. It would also be useful to know which of them are saved for the backward pass.

Then, during loss.backward() I want to again track when the saved tensors (amongst p, q, r) are accessed and when their computation is complete.
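
To make the intent concrete, here is roughly how I imagined instrumenting the toy forward, reusing torch::Tensor::register_hook as an approximation of the backward access (the hook fires when the gradient with respect to each tensor is computed; also, since CUDA is asynchronous, these host-side timestamps mark when work is enqueued rather than when kernels finish). This would replace ToyImpl::forward above, with <chrono> and <iostream> included at the top of the file:

	torch::Tensor forward(torch::Tensor input){
		auto stamp = [](const std::string& label) {
			auto us = std::chrono::duration_cast<std::chrono::microseconds>(
				std::chrono::steady_clock::now().time_since_epoch()).count();
			std::cout << label << " @ " << us << " us" << std::endl;
		};
		// Record when a tensor is produced, and when its gradient is computed.
		auto track = [stamp](torch::Tensor& t, const std::string& name) {
			stamp(name + " produced (forward)");
			t.register_hook([stamp, name](torch::Tensor /*grad*/) {
				stamp(name + " grad computed (backward)");
			});
		};

		torch::Tensor p = linear1(input);  track(p, "p");
		torch::Tensor q = torch::relu(p);  track(q, "q");
		torch::Tensor r = linear2(q);      track(r, "r");
		return r;
	}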

Things like linear and relu are at the level of nn. The linear function in particular is composed of subfunctions that may save their inputs/outputs as well. Do you explicitly want to track only p, q, r?

Otherwise, the SavedVariable seems to be the right place to hook in. But it is going to be a big hack and quite fragile :confused:

Ideally, it would be for all tensors, implicitly and explicitly created by the user. That means p, q, r and all the other tensors produced by subfunctions.

Originally, I thought I would be able to augment the TensorImpl structure, so that each unique tensor could track its own access pattern (access count and timestamp). But I couldn’t imagine a way for a TensorImpl member function to be run when a tensor is accessed. I also wondered if tensor accessors could possibly provide similar functionality…

One more question: what is Pytorch’s unit for enqueuing computations to the CUDA stream?

One more question: what is Pytorch’s unit for enqueuing computations to the CUDA stream?

Not sure what you mean by “unit”, but usually each op corresponds to a single kernel, with a few complex ones launching several kernels.
It will depend on the op, I’m afraid.

But I couldn’t imagine a way where a TensorImpl::function could be run when a tensor is accessed.

There is no check for “accessing” a Tensor.
That is why I recommended attaching to the autograd construct that saves/unpacks the Tensors for the backward pass.