I think the computation required for the fully connected layers is so small that the overhead of parallelization outweighs the gain, so parallelizing them actually makes things slower.
Parallelization is only useful when a lot of time is spent waiting for computations to finish.
So the computational cost of the FC layers is not that heavy, but the number of parameters is
extremely large. A lot of time will therefore be spent on parameter/gradient synchronization.
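
To make that concrete, here is a rough back-of-the-envelope sketch in Python. The layer sizes are my own assumption (roughly AlexNet-like), biases are ignored, and it only counts multiply-accumulates (MACs) per input sample, so treat it as an illustration rather than a measurement:

```python
# Rough estimate of compute per parameter for one conv layer and one FC layer.
# All layer sizes are assumed (roughly AlexNet-like); biases are ignored.

def conv_stats(c_in, c_out, k, h_out, w_out):
    """Return (parameter count, MACs per sample) for a k x k convolution."""
    params = c_in * c_out * k * k
    # Every weight is reused at each output position.
    macs = params * h_out * w_out
    return params, macs

def fc_stats(n_in, n_out):
    """Return (parameter count, MACs per sample) for a fully connected layer."""
    params = n_in * n_out
    # Every weight is used exactly once per sample.
    macs = params
    return params, macs

conv_params, conv_macs = conv_stats(c_in=256, c_out=384, k=3, h_out=13, w_out=13)
fc_params, fc_macs = fc_stats(n_in=9216, n_out=4096)

# MACs per parameter: how much work you get for each value you have to sync.
print("conv:", conv_params, "params,", conv_macs // conv_params, "MACs per param")
print("fc:  ", fc_params, "params,", fc_macs // fc_params, "MACs per param")
```

With these assumed sizes, the FC layer has far more parameters to synchronize than the conv layer, yet does only one MAC per parameter per sample, while the conv layer reuses each weight at every output position. That is why the sync cost can dominate for FC layers.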