Hey, thanks for putting together the transformer_auto_wrap_policy for FSDP. I wanted to check whether there are any tips on which layers can be combined when wrapping Conv blocks, or whether wrapping each whole block in its own FSDP unit is good enough.
The basic structure is a few MBConv blocks followed sequentially by Transformer blocks. I wrapped the transformer layers as described for transformer_auto_wrap_policy and tried wrapping the Conv blocks with size_based_auto_wrap_policy, but that felt inefficient.
Is the gist representative of your scaled model? If not and you are planning to scale the model further, would it be by increasing individual nn.Parameter sizes, adding more layers, or both?
I was thinking that something like the following might be good for scaling:
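One way to express this is a small "OR" combinator over auto-wrap policies: wrap a module if any of the constituent policies says to. Below is a minimal, torch-free sketch of the idea. The helpers `wrap_named_blocks` and `wrap_if_large` are hypothetical stand-ins for illustration; in practice you would bind the real `transformer_auto_wrap_policy` and `size_based_auto_wrap_policy` from `torch.distributed.fsdp.wrap` with `functools.partial` and pass the combined callable as `auto_wrap_policy` to FSDP.

```python
from functools import partial

def or_policy(policies, module, recurse, nonwrapped_numel):
    """Wrap a module if ANY of the given auto-wrap policies says to.

    Each policy follows the FSDP callable-policy signature:
    (module, recurse, nonwrapped_numel) -> bool.
    """
    return any(
        policy(module=module, recurse=recurse, nonwrapped_numel=nonwrapped_numel)
        for policy in policies
    )

# Hypothetical stand-in for a transformer_auto_wrap_policy-style check:
# wrap modules whose class name matches one of the target block types.
def wrap_named_blocks(module, recurse, nonwrapped_numel, block_names):
    return type(module).__name__ in block_names

# Hypothetical stand-in for a size_based_auto_wrap_policy-style check:
# wrap once the unwrapped parameter count crosses a threshold.
def wrap_if_large(module, recurse, nonwrapped_numel, min_numel):
    return nonwrapped_numel >= min_numel

# The combined policy wraps Transformer blocks by class and anything
# else (e.g. a large Conv stage) by parameter count.
combined_policy = partial(
    or_policy,
    [
        partial(wrap_named_blocks, block_names={"TransformerBlock", "MBConvBlock"}),
        partial(wrap_if_large, min_numel=100_000_000),
    ],
)
```

The same shape would let you tune the Conv side (size threshold) independently of the Transformer side (class match) while handing FSDP a single policy callable.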
This is interesting, so it should be safe to add non-transformer classes to transformer_auto_wrap_policy? I wasn't sure whether adding Conv layers to it would be safe, so I was considering writing a wrapper around both transformer_auto_wrap_policy and size_based_auto_wrap_policy instead. This is insightful, thanks!