How are activations stored when using FSDP?
In a multi-GPU setup, each GPU holds its own shard of every FSDP unit's parameters. During the forward pass, we all-gather the parameters of an FSDP unit and compute its output. Where are those outputs (the activations) stored? Are they kept on the GPU for every FSDP unit, since they are needed again during the backward pass? And the activations themselves are not sharded, right?
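
To make the question concrete, here is a toy single-process sketch of how I currently picture one forward pass. It is plain Python, not the real FSDP API: `all_gather`, `param_shards`, and `saved_activations` are names I made up for illustration, and the behaviour it shows is my assumption, which is exactly what I would like confirmed or corrected.

```python
# Toy, single-process simulation of my mental model (not real FSDP code).
WORLD_SIZE = 2

# One "FSDP unit" with a flat weight of 4 values, split evenly across ranks.
full_weight = [1.0, 2.0, 3.0, 4.0]
shard_len = len(full_weight) // WORLD_SIZE
param_shards = [
    full_weight[r * shard_len:(r + 1) * shard_len] for r in range(WORLD_SIZE)
]

# Each rank sees its own micro-batch (data parallelism).
local_inputs = [
    [1.0, 0.0, 0.0, 0.0],  # input on rank 0
    [0.0, 0.0, 0.0, 1.0],  # input on rank 1
]


def all_gather(shards):
    """Stand-in for the all-gather collective: concatenate every rank's shard."""
    return [w for shard in shards for w in shard]


saved_activations = {}  # rank -> activation kept on that rank for backward

for rank in range(WORLD_SIZE):
    gathered = all_gather(param_shards)  # full weight, held only temporarily
    out = sum(w * x for w, x in zip(gathered, local_inputs[rank]))
    saved_activations[rank] = out  # assumed: stays on this rank, not sharded
    del gathered  # assumed: the unsharded copy is freed after the forward

print(saved_activations)
```

In this picture the parameters go back to their sharded form after each unit's forward, while each rank keeps the full activation for its own micro-batch until the backward pass.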