Hi!
I can successfully capture a CUDA graph and replay it. I took the API example from this blog and adapted it for my own model. I can run forward and backward passes normally with a batch size of 25, but as soon as I capture the graph, the reserved memory roughly doubles, limiting me to a batch size of 8.
...
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'

train_x, train_y = data_generator(T, seq_len, n_train)
model = TCN(1, n_classes, channel_sizes, kernel_size, dropout=dropout).to(device)
criterion = torch.nn.CrossEntropyLoss().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

print('Before stream s, allocated: ', torch.cuda.memory_allocated('cuda'), '; reserved: ', torch.cuda.memory_reserved('cuda'))

# warmup on a side stream
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for i in range(3):
        inputs = train_x[batch_size*i:batch_size*(i+1)].unsqueeze(1).contiguous().to(device)
        labels = train_y[batch_size*i:batch_size*(i+1)].to(device)
        optimizer.zero_grad(set_to_none=True)
        y_pred = model(inputs)
        loss = criterion(y_pred.view(-1, n_classes), labels.view(-1))
        loss.backward()
        optimizer.step()
torch.cuda.current_stream().wait_stream(s)

print('After stream s before graph g, allocated: ', torch.cuda.memory_allocated('cuda'), '; reserved: ', torch.cuda.memory_reserved('cuda'))
print('Start capturing CUDA graph.')

# capture
g = torch.cuda.CUDAGraph()
# Sets grads to None before capture, so backward() will create
# .grad attributes with allocations from the graph's private pool
optimizer.zero_grad(set_to_none=True)
static_inputs = train_x[:batch_size].unsqueeze(1).contiguous().to(device)
static_labels = train_y[:batch_size].to(device)
with torch.cuda.graph(g):
    static_y_pred = model(static_inputs)
    static_loss = criterion(static_y_pred.view(-1, n_classes), static_labels.view(-1))
    static_loss.backward()
    optimizer.step()

print('After graph g before replay, allocated: ', torch.cuda.memory_allocated('cuda'), '; reserved: ', torch.cuda.memory_reserved('cuda'))

# replay: copy each new batch into the static tensors, then replay the graph
for i in range(30):
    inputs = train_x[batch_size*i:batch_size*(i+1)].unsqueeze(1).contiguous().to(device)
    labels = train_y[batch_size*i:batch_size*(i+1)].to(device)
    static_inputs.copy_(inputs)
    static_labels.copy_(labels)
    g.replay()

print('After replay, allocated: ', torch.cuda.memory_allocated('cuda'), '; reserved: ', torch.cuda.memory_reserved('cuda'))
The output is:
Before stream s, allocated: 2978816 ; reserved: 4194304
After stream s before graph g, allocated: 13397522944 ; reserved: 15101591552
Start capturing CUDA graph.
After graph g before replay, allocated: 13400187392 ; reserved: 30045896704
After replay, allocated: 13400187392 ; reserved: 30045896704
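In case it helps anyone reproduce the measurements without my TCN model and data, here is a minimal standalone sketch of the same warmup/capture pattern on a tiny `torch.nn.Linear` model (the helper names `fmt_mem` and `measure_capture_overhead` are mine, not from the blog). It prints the allocator's reserved memory before and after capture, and exits cleanly on machines without a GPU:

```python
def fmt_mem(num_bytes):
    """Format a byte count as mebibytes for readable logging."""
    return f"{num_bytes / 2**20:.1f} MiB"

def measure_capture_overhead():
    """Warm up on a side stream, capture one training step as a CUDA
    graph, and report how much reserved memory grows during capture."""
    import torch  # imported lazily so fmt_mem stays usable without torch
    if not torch.cuda.is_available():
        print('No CUDA device; skipping measurement.')
        return
    model = torch.nn.Linear(1024, 1024).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    x = torch.randn(64, 1024, device='cuda')

    # warmup on a side stream, as required before capture
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            optimizer.zero_grad(set_to_none=True)
            model(x).sum().backward()
            optimizer.step()
    torch.cuda.current_stream().wait_stream(s)

    before = torch.cuda.memory_reserved()
    g = torch.cuda.CUDAGraph()
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.graph(g):
        model(x).sum().backward()
        optimizer.step()
    after = torch.cuda.memory_reserved()
    print('reserved before capture:', fmt_mem(before),
          '; after capture:', fmt_mem(after))

if __name__ == '__main__':
    measure_capture_overhead()
```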
If I try a larger batch size (9, 25, etc.), everything runs fine until graph capture, which then fails. The error output is:
Traceback (most recent call last):
File "/gpfs/fs1/home/minzhao.liu/TCN/cuda_graph.py", line 123, in <module>
optimizer.step()
File "/home/minzhao.liu/.conda/envs/qtensor-torch/lib/python3.9/site-packages/torch/cuda/graphs.py", line 149, in __exit__
self.cuda_graph.capture_end()
File "/home/minzhao.liu/.conda/envs/qtensor-torch/lib/python3.9/site-packages/torch/cuda/graphs.py", line 71, in capture_end
super(CUDAGraph, self).capture_end()
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Also, can I save the graph and run it on a different machine? If anyone can recommend resources or more examples of CUDA graphs with PyTorch, I would really appreciate it, since I have only found a few minimal examples.