Can I carry out the matrix multiplications of the fully connected layer on the CPU? For example, I am assuming that if there are 4 GPUs available, the matrix multiplications in the convolution layers might be distributed among 3 GPUs while the matrix multiplications of the fully connected layer are allocated to the 4th GPU.

What if there are only 3 GPUs available - can the matrix multiplication of the fully connected layer be done using the RAM+CPU (assuming enough memory/processing power is available) while the GPUs are used for the convolution layer matrix multiplications? Does PyTorch do this automatically? If not, can I instruct the machine to use the RAM+CPU for the fully connected layer multiplications? Can this help speed up runtime?

No, PyTorch does not apply model sharding automatically; you would need to implement it yourself.
Based on your description it seems you are looking for something like pipeline parallelism?
If so, then you could certainly use the CPU for specific layers too, but you would need to check whether you get the expected performance, as the data transfer between the host and device can be expensive compared to the actual operations.

Maybe something like a data parallel approach would be a good starter?

Your code snippet looks like it will run the FC layer’s forward pass before moving the FC layer to CPU, which means that if the FC layer is already on GPU, then the x = self.fc1(x) computation will happen on GPU. If you want the computation to happen on CPU, then you should call self.fc1.to("cpu") (and move the input with x = x.cpu()) before x = self.fc1(x).
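Since the original snippet isn’t shown, here is a hedged sketch of the pattern being described: a hypothetical model (`ConvThenFC` and its layer sizes are invented for illustration) that keeps the conv layer on GPU when one is available and runs the FC layer on CPU. The key point is to place each module on its target device once, at construction time, rather than moving it inside `forward()`:

```python
import torch
import torch.nn as nn

class ConvThenFC(nn.Module):
    """Illustrative model: conv on GPU (if available), FC head on CPU."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(8 * 8 * 8, 10)
        # Move each module to its target device once, up front.
        # Doing this inside forward() would repeat the transfer every call.
        self.gpu = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
        self.conv.to(self.gpu)
        self.fc1.to("cpu")  # FC multiplications run on CPU

    def forward(self, x):
        x = self.conv(x.to(self.gpu))   # conv runs on the GPU device
        x = torch.flatten(x, 1).cpu()   # D2H transfer: the FC layer must wait for it
        return self.fc1(x)              # FC runs on CPU

model = ConvThenFC()
out = model(torch.randn(2, 3, 8, 8))
print(out.shape)  # torch.Size([2, 10])
```

Note that both the module and its input must be on the same device, which is why the `.cpu()` call on the activations is needed before `self.fc1(x)`.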

However, as @ptrblck mentioned, the host-to-device (H2D) / device-to-host (D2H) transfers may become a bottleneck. More specifically, if there are data dependencies, e.g. the output of an FC layer is the input to a convolutional layer or vice versa, then upcoming computations must wait for either an H2D or D2H transfer of their inputs. Without careful pipelining, this may lead to an overall slowdown compared to simply running all computations on GPU.

Instead, you may want to consider data parallelism to help leverage your available hardware. For example, if you currently have batch size N and have 3 GPUs available, you may wrap your model with DistributedDataParallel (DDP) and use a DistributedSampler to distribute the dataset across the 3 GPUs, where each process now uses a per-GPU batch size of N / 3. This should allow you to run through your dataset up to 3x as fast. The DDP tutorial may be a helpful starting point if you are not already familiar.
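As a rough sketch of that setup, here is a minimal DDP training loop. To keep it runnable anywhere it is shrunk to a single CPU process with the "gloo" backend and a toy dataset (the model, data, and hyperparameters are all placeholders); with 3 GPUs you would instead launch 3 processes (e.g. via torchrun), use backend="nccl", and move the model to each rank’s device:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

# Single-process process group so the sketch runs without multiple GPUs.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

dataset = TensorDataset(torch.randn(12, 4), torch.randn(12, 1))  # toy data
sampler = DistributedSampler(dataset)  # gives each rank a disjoint shard
# With 3 GPUs and global batch size N, each rank would use N // 3 here.
loader = DataLoader(dataset, batch_size=4, sampler=sampler)

model = DDP(nn.Linear(4, 1))  # gradients are all-reduced across ranks
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle each rank's shard per epoch
    for xb, yb in loader:
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()  # backward triggers the all-reduce
        opt.step()

dist.destroy_process_group()
```

The `sampler.set_epoch(epoch)` call matters in the real multi-GPU case: without it, every epoch each rank sees the same shard in the same order.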