Instead they are using:
from fairscale.nn.model_parallel.layers import (
ParallelEmbedding,
RowParallelLinear,
ColumnParallelLinear,
)
I’m guessing they used FSDP only for training and didn’t think it was necessary for inference.
Any other ideas?