CUDA out of memory on multiple GPUs

When I train my model on a single GPU (cuda:0), it works fine with batch_size == 4. However, when I keep batch_size == 4 and train the same model on 4 GPUs, it raises this warning:

RuntimeWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().

and then this error:

RuntimeError: CUDA out of memory. Tried to allocate 98.38 MiB (GPU 1; 15.89 GiB total capacity; 14.56 GiB already allocated; 33.25 MiB free; 150.76 MiB cached)

I find this confusing and would appreciate any help.
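
For reference, here is a minimal sketch of what the warning seems to be asking for: calling flatten_parameters() on the RNN inside forward(), so that each replica created by nn.DataParallel re-compacts its own copy of the weights. The Encoder class, its sizes, and the input shape are hypothetical stand-ins, not my real model, and this only addresses the warning; it may or may not affect the out-of-memory error.

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Hypothetical stand-in for the real model; only the LSTM matters here."""

    def __init__(self, input_size=8, hidden_size=16):
        super().__init__()
        self.rnn = nn.LSTM(input_size, hidden_size, batch_first=True)

    def forward(self, x):
        # Re-compact the RNN weights on whichever device runs this call.
        # On CPU this is effectively a no-op.
        self.rnn.flatten_parameters()
        output, _ = self.rnn(x)
        return output


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = Encoder().to(device)

# Wrap in DataParallel only when more than one GPU is visible.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

# batch_size == 4, sequence length 5, feature size 8 (all illustrative).
batch = torch.randn(4, 5, 8, device=device)
out = model(batch)
out_shape = tuple(out.shape)
```

With batch_size == 4 split across 4 GPUs, each replica would see a single sample, so the per-GPU activation memory should be smaller than in the single-GPU run, not larger.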

Traceback:

  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 142, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 147, in replicate
    return replicate(module, device_ids)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 13, in replicate
    param_copies = Broadcast.apply(devices, *params)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 21, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/torch/cuda/comm.py", line 40, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: CUDA out of memory. Tried to allocate 11.38 MiB (GPU 3; 15.89 GiB total capacity; 13.87 GiB already allocated; 7.25 MiB free; 408.58 MiB cached) (malloc at /opt/conda/conda-bld/pytorch_1544174967633/work/aten/src/THC/THCCachingAllocator.cpp:231)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f59fa323cc5 in /home/user/miniconda/envs/py36/lib/python3.6/site-packages/torch/lib/libc10.so)
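
To make the numbers in the error easier to compare, here is a small sketch that parses the out-of-memory message from the traceback above into MiB. The helper name parse_oom is made up for illustration. The final subtraction assumes that "cached" is reported on top of "already allocated", which may depend on the PyTorch version, so treat that figure as a rough estimate only.

```python
import re

# Error text copied from the traceback above.
message = (
    "CUDA out of memory. Tried to allocate 11.38 MiB (GPU 3; "
    "15.89 GiB total capacity; 13.87 GiB already allocated; "
    "7.25 MiB free; 408.58 MiB cached)"
)

UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}


def parse_oom(text):
    """Hypothetical helper: return the GPU index and each reported size in MiB."""
    gpu = int(re.search(r"GPU (\d+)", text).group(1))
    sizes = {}
    labelled = r"([\d.]+) (KiB|MiB|GiB) (total capacity|already allocated|free|cached)"
    for value, unit, label in re.findall(labelled, text):
        sizes[label] = float(value) * UNIT_TO_MIB[unit]
    requested = re.search(r"Tried to allocate ([\d.]+) (KiB|MiB|GiB)", text)
    sizes["requested"] = float(requested.group(1)) * UNIT_TO_MIB[requested.group(2)]
    return gpu, sizes


gpu, sizes = parse_oom(message)

# The allocation fails because the request exceeds what is free on that GPU.
request_fits = sizes["requested"] <= sizes["free"]

# Assumption: memory not covered by allocated + cached + free is held by
# something outside this process's allocator (CUDA context, other processes).
unaccounted = (
    sizes["total capacity"]
    - sizes["already allocated"]
    - sizes["cached"]
    - sizes["free"]
)

print(f"GPU {gpu}: requested {sizes['requested']:.2f} MiB, free {sizes['free']:.2f} MiB")
print(f"Roughly {unaccounted:.0f} MiB not accounted for by this process")
```

Note that the failure happens inside replicate(), that is, while the parameters are being broadcast to GPU 3, and that GPU already reports 13.87 GiB allocated at that point.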