Memory cost of nn.SyncBatchNorm

When I use nn.parallel.DistributedDataParallel for multi-GPU training on a single node, I use nn.SyncBatchNorm to synchronize batch normalization across GPUs. However, I found that GPU memory usage increased a lot, by at least 1 GB per GPU. When I used the SyncBatchNorm provided by apex (though I cannot successfully compile apex on this server), the GPU memory usage was normal. Can anyone help with this?
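For reference, here is a minimal sketch of how I enable it (the model below is just a placeholder; the usual approach is `convert_sync_batchnorm` followed by the DDP wrap):

```python
import torch.nn as nn

# A small placeholder model with an ordinary BatchNorm layer.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)

# Replace every BatchNorm layer with SyncBatchNorm so that
# statistics are computed across all GPUs during DDP training.
sync_model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# In the actual multi-GPU run, the model is then wrapped with DDP, e.g.:
# torch.distributed.init_process_group(backend="nccl")
# sync_model = nn.parallel.DistributedDataParallel(
#     sync_model.cuda(), device_ids=[local_rank])

print(type(sync_model[1]).__name__)
```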