I am looking for help on how to get started with FSDP2. I have an ML experiment on pre-trained models that delivered interesting results. For a paper I need to run it with a very large pre-trained model that does not fit on a single GPU, and I think the best option is to use FSDP2 on two GPUs.
I have converted my script to use fully_shard(); however, the two processes both seem to use the same GPU, so I receive errors. I would like to know: a) how can my different subprocesses get their device (i.e. 'cuda:0' or 'cuda:1')? And b) do you have a good (official?) resource where I can look things up?
The linked tutorial shows that torchrun is used to launch the application, which spawns one process per device. The linked example.py script then uses the rank to create the device.
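The pattern can be sketched as follows. This is a minimal illustration, assuming the standard torchrun behavior of exporting a `LOCAL_RANK` environment variable per process; `device_for_process` is a hypothetical helper name, not part of any API:

```python
import os

def device_for_process(local_rank: int, num_gpus: int) -> str:
    """Hypothetical helper: map a torchrun-spawned process to its own GPU."""
    if num_gpus == 0:
        return "cpu"
    return f"cuda:{local_rank % num_gpus}"

# torchrun exports LOCAL_RANK (plus RANK and WORLD_SIZE) for each process it
# spawns; in a real script you would then call torch.cuda.set_device(device)
# before init_process_group() / fully_shard() so each process pins its own GPU.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
print(device_for_process(local_rank, num_gpus=2))
```

Setting the device early (before any CUDA allocation) is what prevents all processes from defaulting to `cuda:0`.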
Thanks! That helped a lot, now the processes connect to the correct GPUs.
Can I ask further questions here?
I am trying to run an adversarial attack on a huge pre-trained model. Running an inference step without gradients works fine on a single GPU, but after tracking gradients of the input only (requires_grad_(True)), I receive OOM errors (the batch size is already 1, so it cannot be lowered). That's why I wanted to use FSDP2 in the first place. However, I now use four A100 GPUs (with FSDP2) instead of one and still receive OOM errors, which should not be the case. Do I have a fundamental misunderstanding of FSDP? I thought FSDP would shard the model across the GPUs and make my experiment possible by utilizing multiple GPUs (four instead of one)?
I initially thought pipeline parallelism was what I was looking for, but since the original model is not sequential (it has skip connections), this is not an option, if I understood correctly.
FSDP will indeed reduce the memory footprint, but could you explain how you've estimated that 4x A100s should fit the workload? Assuming the workload is still too large for your setup, you could try activation offloading as described here.
Thanks a lot!
Hi! Thank you very much again for the quick response!
My estimate of the required memory was to simply assume that requires_grad_(True) would just double the amount of GPU memory, so 2 GPUs should suffice. I did not take into account that there will be some overhead, of course, but also only some parameters require gradients, not all. So, naively, I would assume that the required memory should no more than quadruple?
I have tested my assumption in a short snippet (measuring the memory consumption of an adversarial attack without FSDP2), and it somehow seems to be very wrong. I ran it with different batch sizes and saw the following, which I don't quite understand.
With d_model = 4096:

| Batch size | No gradients (GB) | Gradients (GB) | Factor |
|---|---|---|---|
| 1 | 6.14 | 10.93 | 1.78 |
| 2 | 6.14 | 10.98 | 1.79 |
| 4 | 6.14 | 11.06 | 1.80 |
| 8 | 6.14 | 11.23 | 1.83 |
| 16 | 6.14 | 11.56 | 1.88 |
| 32 | 6.14 | 12.23 | 1.99 |
| 64 | 6.14 | 13.51 | 2.20 |
| 128 | 6.14 | 14.28 | 2.32 |
| 256 | 6.14 | 18.04 | 2.94 |
| 512 | 6.14 | 25.56 | 4.16 |
| 1024 | ? | OOM | nan |
| 2048 | ? | OOM | nan |
| 4096 | ? | OOM | nan |
With d_model = 1024:

| Batch size | No gradients (GB) | Gradients (GB) | Factor |
|---|---|---|---|
| 1 | 0.56 | 0.99 | 1.75 |
| 2 | 0.56 | 1.00 | 1.77 |
| 4 | 0.56 | 1.02 | 1.81 |
| 8 | 0.56 | 1.07 | 1.89 |
| 16 | 0.56 | 1.16 | 2.06 |
| 32 | 0.56 | 1.36 | 2.42 |
| 64 | 0.56 | 1.72 | 3.05 |
| 128 | 0.56 | 2.48 | 4.39 |
| 256 | 0.59 | 3.95 | 6.65 |
| 512 | 0.70 | 5.10 | 7.28 |
| 1024 | 0.92 | 9.25 | 10.10 |
| 2048 | 1.35 | 17.31 | 12.86 |
| 4096 | 2.21 | 33.66 | 15.27 |
I obtained these results with the following code:
```python
import torch
import torch.nn as nn
import torch.optim as optim

def main():
    # Transformer parameters
    d_model = 8 * 512
    seq_len = 10
    batch_size = 16
    device = 'cuda:0'

    # Initialize out-of-the-box Transformer
    model = nn.Transformer(d_model=d_model, batch_first=True, device=device)
    model.eval()

    # Create random input tensors (src and tgt for transformer)
    src = torch.randn(batch_size, seq_len, d_model, requires_grad=True, device=device)
    tgt = torch.randn(batch_size, seq_len, d_model, device=device)
    target = torch.randn(batch_size, seq_len, d_model, device=device)

    # L-BFGS optimizer for adversarial attack
    optimizer = optim.LBFGS([src], max_iter=50, line_search_fn='strong_wolfe')
    print(f"Input shapes: src={src.shape}, tgt={tgt.shape}, target={target.shape}")

    # Get original prediction and loss
    with torch.no_grad():
        original_output = model(src, tgt)
        original_loss = nn.MSELoss()(original_output, target)
    peak_memory_inference = torch.cuda.max_memory_allocated()
    print(f"Peak memory (inference): {peak_memory_inference / 1024**3:.2f} GB")

    # Reset CUDA memory stats
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    def closure():
        optimizer.zero_grad()
        output = model(src, tgt)
        loss = nn.MSELoss()(output, target)
        loss.backward()
        return loss

    # Perform adversarial attack
    print("Performing L-BFGS adversarial attack...")
    optimizer.step(closure)
    peak_memory_training = torch.cuda.max_memory_allocated()
    print(f"Peak memory (training): {peak_memory_training / 1024**3:.2f} GB")

if __name__ == "__main__":
    main()
```
Regarding activation offloading: as far as I understand, this already requires modifying the forward pass. I cannot change anything in the model, since I (need to) use a pre-trained model.
My original task is quite similar to the code snippet above. Is there any way to run this, let's say with d_model = 4096 and batch size = 4096? Either on multiple GPUs or in a different way? Note that in my original application I cannot change the model's architecture since it is pre-trained (and it is not sequential), and the actual batch size is already one (but the individual samples are large).
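As a sanity check on the first table: the constant ~6.14 GB in the no-gradients column is dominated by the model weights, which rough arithmetic reproduces. This is a sketch assuming the nn.Transformer defaults (6 encoder + 6 decoder layers, dim_feedforward=2048) and ignoring biases, norms, and embeddings; the growing gradients column is then the activations saved for backward, which scale with batch size while the weights do not:

```python
# Rough fp32 parameter-memory estimate for nn.Transformer defaults.
def transformer_param_count(d_model, dim_ff=2048, enc_layers=6, dec_layers=6):
    attn = 4 * d_model * d_model   # in_proj (3*d^2) + out_proj (d^2), biases ignored
    ff = 2 * d_model * dim_ff      # linear1 + linear2
    enc = attn + ff                # one self-attention block per encoder layer
    dec = 2 * attn + ff            # self-attention + cross-attention per decoder layer
    return enc_layers * enc + dec_layers * dec

params = transformer_param_count(4096)
gib = params * 4 / 2**30           # fp32 = 4 bytes per parameter
print(f"{params / 1e6:.0f}M params, about {gib:.2f} GiB")
```

This lands at roughly 5.3 GiB of weights for d_model = 4096, close to the measured 6.14 GB baseline; the remaining growth with batch size in the measured runs is activation memory, not parameters.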
That's not necessarily the case, since the forward activations (which might be needed for the gradient calculation) can increase the memory usage by a significantly larger factor (depending on the model architecture). E.g., take a look at this post describing the memory usage of a ResNet.
This shouldn't be the case, as you should be able to simply wrap your forward pass in the context manager. Here is a small example showing the hooks (which are also used for CPU offloading) for a ResNet:
```python
import torch
import torchvision.models as models

def pack_hook(x):
    print("Packing", x.sum())
    return x

def unpack_hook(x):
    print("Unpacking", x.sum())
    return x

device = "cuda"
model = models.resnet50()
model.to(device)

x = torch.randn(1, 3, 224, 224, device=device)
with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
    out = model(x)
out.sum().backward()
```
Besides that, you could also try activation checkpointing as described here.
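Checkpointing also works without touching the model definition, since torch.utils.checkpoint wraps the call itself: activations inside the wrapped call are discarded in forward and recomputed during backward. A minimal sketch with a toy model standing in for the pre-trained one (the toy nn.Sequential and shapes are illustrative assumptions):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Toy stand-in for the pre-trained model; checkpoint() accepts any callable,
# so a non-sequential model with skip connections works the same way.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
x = torch.randn(8, 64, requires_grad=True)

# Activations inside the wrapped call are not stored; they are recomputed
# during backward, trading compute for memory.
out = checkpoint(model, x, use_reentrant=False)
out.sum().backward()
print(x.grad.shape)
```

For a large model, checkpointing per submodule (rather than the whole forward at once) usually gives a better memory/compute trade-off.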