FSDP without data parallelism

Is there a way to use FSDP without data parallelism? The tutorial here (Getting Started with Fully Sharded Data Parallel (FSDP) — PyTorch Tutorials 2.2.1+cu121 documentation) seems to use both. In my use case, I want the benefit of sharding large gradients across multiple GPUs during backprop, but I don't want to spawn multiple training processes simultaneously. Is this possible?
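For reference, the part of the tutorial's recipe I'm trying to avoid is the one-process-per-device launch. Below is a minimal, CPU-runnable sketch of that multi-process pattern, using the gloo backend and a plain all_reduce standing in for FSDP's sharded forward/backward (so it runs without GPUs); `WORLD_SIZE`, `worker`, and the file-based rendezvous are illustrative choices, not the tutorial's exact code:

```python
import multiprocessing as mp
import os
import tempfile

import torch
import torch.distributed as dist

WORLD_SIZE = 2  # illustrative; the tutorial uses one process per GPU


def worker(rank: int, world_size: int, init_file: str, queue) -> None:
    # Each spawned process joins the same process group, i.e. the
    # "multiple training processes" pattern the question is about.
    dist.init_process_group(
        backend="gloo",  # CPU-friendly stand-in for nccl
        init_method=f"file://{init_file}",
        rank=rank,
        world_size=world_size,
    )
    # Stand-in for per-rank computation (FSDP would shard params/grads here).
    t = torch.ones(1) * (rank + 1)
    dist.all_reduce(t)  # defaults to SUM across all ranks
    if rank == 0:
        queue.put(t.item())
    dist.destroy_process_group()


# Set up a file-based rendezvous point for the process group.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.close()
os.unlink(tmp.name)  # the FileStore will (re)create it

ctx = mp.get_context("fork")
queue = ctx.SimpleQueue()
procs = [
    ctx.Process(target=worker, args=(r, WORLD_SIZE, tmp.name, queue))
    for r in range(WORLD_SIZE)
]
for p in procs:
    p.start()
result = queue.get()  # rank 0 reports the reduced value
for p in procs:
    p.join()
if os.path.exists(tmp.name):
    os.unlink(tmp.name)

print(f"all_reduce sum across {WORLD_SIZE} ranks: {result}")
```

With two ranks contributing 1.0 and 2.0, the reduced value is 3.0, confirming every rank participated. My question is whether FSDP's parameter/gradient sharding can be driven from a single process instead of this kind of launch.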