I am looking for help on how to get started with FSDP2. I have an ML experiment on pre-trained models that delivered interesting results. For a paper I need to run it with a very large pre-trained model that does not fit on a single GPU, and I think the best option is to use FSDP2 on two GPUs.
I have converted my script to use fully_shard(); however, the two processes both seem to use the same GPU, so I receive errors. I would like to know: a) how can my different subprocesses get their device (i.e. 'cuda:0' or 'cuda:1')? And b) do you have a good (official?) resource where I can look things up?
The linked tutorial shows that torchrun is used to launch the application, which spawns one process per device. The linked example.py script then uses the rank to create the device.
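The pattern can be sketched as follows. This is a minimal illustration, assuming the standard torchrun behavior of exporting a `LOCAL_RANK` environment variable per process; `device_for_process` is a hypothetical helper name, not part of any API:

```python
import os

def device_for_process(local_rank: int, num_gpus: int) -> str:
    """Hypothetical helper: map a torchrun-spawned process to its own GPU."""
    if num_gpus == 0:
        return "cpu"
    return f"cuda:{local_rank % num_gpus}"

# torchrun exports LOCAL_RANK (plus RANK and WORLD_SIZE) for each process it
# spawns; in a real script you would then call torch.cuda.set_device(device)
# before init_process_group() / fully_shard() so each process pins its own GPU.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
print(device_for_process(local_rank, num_gpus=2))
```

Setting the device early (before any CUDA allocation) is what prevents all processes from defaulting to `cuda:0`.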
Thanks! That helped a lot, now the processes connect to the correct GPUs.
Can I ask further questions here?
I am trying to run an adversarial attack on a huge pre-trained model. Running an inference step without gradients works fine on a single GPU, but after tracking gradients of the input only (requires_grad_(True)), I receive OOM errors (the batch size is already 1, so it cannot be lowered). That's why I wanted to use FSDP2 in the first place. However, I now use four A100 GPUs (with FSDP2) instead of one and still receive OOM errors, which should not be the case. Do I have a fundamental misunderstanding of FSDP? I thought FSDP would shard the model across the GPUs and make my experiment possible by utilizing multiple GPUs (four instead of one)?
I initially thought pipeline parallelism was what I was looking for, but since the original model is not sequential (it has skip connections), this is not an option, if I understood correctly.
FSDP will indeed reduce the memory footprint, but could you explain how you've estimated that 4x A100s should fit the workload? Assuming the workload is still too large for your setup, you could try activation offloading as described here.
Thanks a lot!
Hi! Thank you very much again for the quick response!
My estimate of the required memory was to simply assume that requires_grad_(True) would just double the amount of GPU memory, so 2 GPUs should suffice. I did not take into account that there will be some overhead, of course, but also only some parameters require gradients, not all. So, naively, I would assume that the required memory should no more than quadruple?
I have tested my assumption in a short snippet (measuring the memory consumption of an adversarial attack without FSDP2), and it somehow seems to be very wrong. I ran it with different batch sizes and saw the following, which I don't quite understand.
With d_model = 4096:

| Batch size | No gradients (GB) | Gradients (GB) | Factor |
|---|---|---|---|
| 1 | 6.14 | 10.93 | 1.78 |
| 2 | 6.14 | 10.98 | 1.79 |
| 4 | 6.14 | 11.06 | 1.80 |
| 8 | 6.14 | 11.23 | 1.83 |
| 16 | 6.14 | 11.56 | 1.88 |
| 32 | 6.14 | 12.23 | 1.99 |
| 64 | 6.14 | 13.51 | 2.20 |
| 128 | 6.14 | 14.28 | 2.32 |
| 256 | 6.14 | 18.04 | 2.94 |
| 512 | 6.14 | 25.56 | 4.16 |
| 1024 | ? | OOM | nan |
| 2048 | ? | OOM | nan |
| 4096 | ? | OOM | nan |
With d_model = 1024:

| Batch size | No gradients (GB) | Gradients (GB) | Factor |
|---|---|---|---|
| 1 | 0.56 | 0.99 | 1.75 |
| 2 | 0.56 | 1.00 | 1.77 |
| 4 | 0.56 | 1.02 | 1.81 |
| 8 | 0.56 | 1.07 | 1.89 |
| 16 | 0.56 | 1.16 | 2.06 |
| 32 | 0.56 | 1.36 | 2.42 |
| 64 | 0.56 | 1.72 | 3.05 |
| 128 | 0.56 | 2.48 | 4.39 |
| 256 | 0.59 | 3.95 | 6.65 |
| 512 | 0.70 | 5.10 | 7.28 |
| 1024 | 0.92 | 9.25 | 10.10 |
| 2048 | 1.35 | 17.31 | 12.86 |
| 4096 | 2.21 | 33.66 | 15.27 |
I obtained these results with the following code:
```python
import torch
import torch.nn as nn
import torch.optim as optim

def main():
    # Transformer parameters
    d_model = 8 * 512
    seq_len = 10
    batch_size = 16
    device = 'cuda:0'

    # Initialize out-of-the-box Transformer
    model = nn.Transformer(d_model=d_model, batch_first=True, device=device)
    model.eval()

    # Create random input tensors (src and tgt for transformer)
    src = torch.randn(batch_size, seq_len, d_model, requires_grad=True, device=device)
    tgt = torch.randn(batch_size, seq_len, d_model, device=device)
    target = torch.randn(batch_size, seq_len, d_model, device=device)

    # L-BFGS optimizer for adversarial attack
    optimizer = optim.LBFGS([src], max_iter=50, line_search_fn='strong_wolfe')
    print(f"Input shapes: src={src.shape}, tgt={tgt.shape}, target={target.shape}")

    # Get original prediction and loss
    with torch.no_grad():
        original_output = model(src, tgt)
        original_loss = nn.MSELoss()(original_output, target)
    peak_memory_inference = torch.cuda.max_memory_allocated()
    print(f"Peak memory (inference): {peak_memory_inference / 1024**3:.2f} GB")

    # Reset CUDA memory stats
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    def closure():
        optimizer.zero_grad()
        output = model(src, tgt)
        loss = nn.MSELoss()(output, target)
        loss.backward()
        return loss

    # Perform adversarial attack
    print("Performing L-BFGS adversarial attack...")
    optimizer.step(closure)
    peak_memory_training = torch.cuda.max_memory_allocated()
    print(f"Peak memory (training): {peak_memory_training / 1024**3:.2f} GB")

if __name__ == "__main__":
    main()
```
Regarding activation offloading: as far as I understand, this already requires modifying the forward pass. I cannot change anything in the model, since I (need to) use a pre-trained model.
My original task is quite similar to the code snippet above. Is there any way to run this, let's say with d_model = 4096 and batch size = 4096? Either on multiple GPUs or in a different way? Note that in my original application I cannot change the model's architecture since it is pre-trained (and it is not sequential), and the actual batch size is already one (but the individual samples are large).
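As a sanity check on the first table: the constant ~6.14 GB in the no-gradients column is dominated by the model weights, which rough arithmetic reproduces. This is a sketch assuming the nn.Transformer defaults (6 encoder + 6 decoder layers, dim_feedforward=2048) and ignoring biases, norms, and embeddings; the growing gradients column is then the activations saved for backward, which scale with batch size while the weights do not:

```python
# Rough fp32 parameter-memory estimate for nn.Transformer defaults.
def transformer_param_count(d_model, dim_ff=2048, enc_layers=6, dec_layers=6):
    attn = 4 * d_model * d_model   # in_proj (3*d^2) + out_proj (d^2), biases ignored
    ff = 2 * d_model * dim_ff      # linear1 + linear2
    enc = attn + ff                # one self-attention block per encoder layer
    dec = 2 * attn + ff            # self-attention + cross-attention per decoder layer
    return enc_layers * enc + dec_layers * dec

params = transformer_param_count(4096)
gib = params * 4 / 2**30           # fp32 = 4 bytes per parameter
print(f"{params / 1e6:.0f}M params, about {gib:.2f} GiB")
```

This lands at roughly 5.3 GiB of weights for d_model = 4096, close to the measured 6.14 GB baseline; the remaining growth with batch size in the measured runs is activation memory, not parameters.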
That's not necessarily the case, since the forward activations (which might be needed for the gradient calculation) can increase the memory usage by a significantly larger factor (depending on the model architecture). E.g., take a look at this post describing the memory usage of a ResNet.
This shouldn't be the case, as you should be able to simply wrap your forward pass in the context manager. Here is a small example showing the hooks (which are also used for CPU offloading) for a ResNet:
```python
import torch
import torchvision.models as models

def pack_hook(x):
    print("Packing", x.sum())
    return x

def unpack_hook(x):
    print("Unpacking", x.sum())
    return x

device = "cuda"
model = models.resnet50()
model.to(device)

x = torch.randn(1, 3, 224, 224, device=device)
with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
    out = model(x)
out.sum().backward()
```
Besides that, you could also try activation checkpointing as described here.
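Checkpointing also works without touching the model definition, since torch.utils.checkpoint wraps the call itself: activations inside the wrapped call are discarded in forward and recomputed during backward. A minimal sketch with a toy model standing in for the pre-trained one (the toy nn.Sequential and shapes are illustrative assumptions):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Toy stand-in for the pre-trained model; checkpoint() accepts any callable,
# so a non-sequential model with skip connections works the same way.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
x = torch.randn(8, 64, requires_grad=True)

# Activations inside the wrapped call are not stored; they are recomputed
# during backward, trading compute for memory.
out = checkpoint(model, x, use_reentrant=False)
out.sum().backward()
print(x.grad.shape)
```

For a large model, checkpointing per submodule (rather than the whole forward at once) usually gives a better memory/compute trade-off.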