If the model does not use any adaptive pooling layers but ends in linear layers, the number of input features of the first linear layer effectively defines the expected input shape.
E.g. if `in_features` is set to 64, the preceding activation could have the shape `[batch_size, 1, 8, 8]` or any other valid shape that flattens to 64 features per sample.
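A minimal sketch of this constraint (the channel counts and spatial sizes are just example values): the model below works for inputs that flatten to 64 features, but fails for a larger spatial size.

```python
import torch
import torch.nn as nn

# The first linear layer expects exactly 64 input features,
# so the flattened activation must contain 64 values per sample.
model = nn.Sequential(
    nn.Conv2d(3, 1, kernel_size=3, padding=1),  # keeps the spatial size
    nn.Flatten(),                               # [batch_size, 1, 8, 8] -> [batch_size, 64]
    nn.Linear(64, 10),
)

x = torch.randn(2, 3, 8, 8)    # 8x8 input -> 1*8*8 = 64 features
out = model(x)                 # works
print(out.shape)

x = torch.randn(2, 3, 16, 16)  # 16x16 input -> 1*16*16 = 256 features
try:
    model(x)                   # fails: shape mismatch in the linear layer
except RuntimeError as e:
    print("Shape mismatch:", e)
```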
Adaptive pooling layers relax this condition, since they define the output shape and use an adaptive kernel size.
Fully convolutional networks likewise shouldn't have any fixed size requirements.
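To illustrate the adaptive pooling case (again with hypothetical layer sizes): inserting an `nn.AdaptiveAvgPool2d` before flattening makes the feature count independent of the input resolution, so the same linear layer works for any spatial size.

```python
import torch
import torch.nn as nn

# AdaptiveAvgPool2d fixes the *output* size (here 8x8), so the
# flattened feature count is the same for any input resolution.
model = nn.Sequential(
    nn.Conv2d(3, 1, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d((8, 8)),  # any HxW -> 8x8
    nn.Flatten(),                  # -> [batch_size, 64]
    nn.Linear(64, 10),
)

shapes = []
for size in (8, 16, 224):
    x = torch.randn(2, 3, size, size)
    shapes.append(tuple(model(x).shape))
print(shapes)  # the output shape is the same for every input size
```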