PyTorch compile not working

While training I get:

The CUDA Graph is empty. This usually means that the graph was attempted to be captured on wrong device or stream.

When I try to run
explanation, out_guards, graphs, ops_per_graph = dynamo.explain(self.encoder, input_tensor, lens)

I get an error

*** torch._dynamo.exc.TorchRuntimeError: 
[..]
.. line 382 in <graph break in forward>
[..]
x = self.out(x.transpose(1, 2).contiguous().view(b, t, c * f))

I’m doing single-node distributed training, which I presume is what’s causing this warning: skipping cudagraphs due to multiple devices.

I also see lots of skipping incompatible /usr/lib/libcuda.so when searching for -lcuda messages (I presume that’s unrelated and could be fixed by setting LD_LIBRARY_PATH so its CUDA entries match the CUDA version PyTorch was built against).

This is with a Conformer model (it has convolution and attention layers).

Any ideas?

I don’t believe CUDA graphs are supported with distributed training and torch.compile yet. I presume you don’t see this error if you use the default mode?


@marksaroufim good to know. May I ask whether there is an official statement that CUDA graphs is not yet supported with torch.compile, and when support is planned? Thanks!

No official statement, but I can speak on behalf of the team. CUDA graphs is supported if you use mode="reduce-overhead" but only for single nodes. If you’re curious about more granular updates, feel free to open an issue on GitHub and tag @eelison.
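
For reference, here is a minimal sketch of how that mode is selected (MyModel is just a placeholder for your own module, not anything from this thread):

import torch

model = MyModel().cuda()  # placeholder for your own module

# Default mode: no CUDA graphs are used
compiled_default = torch.compile(model)

# "reduce-overhead" asks the compiler to capture CUDA graphs where it can,
# which mainly helps when you are bottlenecked by kernel-launch overhead
compiled_cudagraphs = torch.compile(model, mode="reduce-overhead")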

I put most of the gotchas with compile here: torch.compile — PyTorch master documentation


To be clear, I’m just using torch.compile. I assumed CUDA graphs was used internally by torch.compile; is that not the case? @marksaroufim

CUDA graphs is supported if you use mode="reduce-overhead" but only for single nodes.

Is torch.compile expected to work for single-node multi-GPU training? That’s what I’m doing. (I was using that mode too.)

When is multi-node expected to work: is it weeks, or months/years?

@wconstab can probably give you a more definite timeline.

And to be clear, I misspoke above: I meant single GPU.


Re: the graph break caused here:

x = self.out(x.transpose(1, 2).contiguous().view(b, t, c * f))

I’m not quite sure what causes the graph break without knowing what self.out is. In general, graph breaks aren’t hard errors so this may be OK.
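
If you want to pin down exactly where and why a graph break happens, one option (a sketch, not specific to your model) is to compile with fullgraph=True, which turns graph breaks into hard errors:

import torch

# fullgraph=True makes torch.compile raise on the first graph break instead of
# silently splitting the graph, so the error points at the offending line.
compiled_encoder = torch.compile(encoder, fullgraph=True)  # `encoder`: your module
out = compiled_encoder(input_tensor, lens)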

Re:

The CUDA Graph is empty. This usually means that the graph was attempted to be captured on wrong device or stream

and

skipping cudagraphs due to multiple devices

First, skipping cudagraphs is not generally a hard error either - you should still get some speedup from compilation even if you don’t use cudagraphs. Second, I’ll need more info about what kind of distributed APIs you’re using in your model.

Cudagraphs is not supported with FSDP currently, but I am going to have to check whether it works with DDP + compile. I think it probably should, but haven’t tested it.

Are you using DDP (DistributedDataParallel)? In that case you’d have one process per GPU, not multiple GPUs per process, which is the recommended way to do it. DataParallel is not supported by torch.compile… (but DDP is more or less equivalent or better; you just have to update your code to use it).
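
For reference, a minimal single-node DDP sketch with one process per GPU (this assumes launching with torchrun; MyModel and the exact wrapping order are illustrative, not prescriptive):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK for each spawned process (one per GPU)
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    model = MyModel().cuda()      # placeholder for your own module
    model = DDP(model, device_ids=[local_rank])
    model = torch.compile(model)  # compile the DDP-wrapped model

    # ... training loop ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()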


Thank you for the replies!

I’m not quite sure what causes the graph break without knowing what self.out is. In general, graph breaks aren’t hard errors so this may be OK.

self.out is a linear layer followed by a relative positional encoding.

Are you using DDP (DistributedDataParallel)? In this case you’d have one process per gpu, not multiple gpus per process.

Yes, I’m using standard PyTorch DDP with NCCL.

So if I understand you correctly, I should see a speedup even with the CUDA graph error when doing single-node multi-GPU training? In my case I’m definitely bottlenecked by a complicated model that does many separate kernel launches.

And to clarify: does multi-node DDP work with torch.compile yet, or not?

You could file a GitHub issue on pytorch/pytorch providing a repro (a full script we can run locally to reproduce the error); otherwise it’s hard to know what’s going on in your case.

So if I understand you correctly, I should see a speedup even with the CUDA graph error when doing single-node multi-GPU training? In my case I’m definitely bottlenecked by a complicated model that does many separate kernel launches.

Well, how much cudagraphs helps depends on how much your model is CPU/overhead-bound in the first place. From your description, it sounds like you may benefit from cudagraphs. But you may still see a speedup from compilation even without cudagraphs; it just depends on how much fusion the compiler can do and whether that ends up being significant in your case.
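
If you want a rough check of how overhead-bound you are (a sketch; compiled_model and batch are placeholders), profile a few steps and compare CPU-side time against CUDA kernel time:

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(5):
        out = compiled_model(batch)  # compiled_model / batch are placeholders
        out.sum().backward()
    torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=20))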

And to clarify: does multi-node DDP work with torch.compile yet, or not?

Multi-node DDP does work with torch.compile. See this post for more information.
