Multi-GPU training gives no speedup

I am running code similar to the DCGAN example on 4 Tesla M40 GPUs with a batch size of 64. The only difference in the model is that I changed the 2D convolutions/deconvolutions to 3D. However, the speed is the same as, or slightly slower than, running on a single GPU. Does anyone know the reason?
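
To make the setup concrete, here is a rough sketch of how the batch would be divided, assuming the work is split data-parallel style along the batch dimension (an assumption; the exact splitting rule depends on the framework). `per_gpu_batch_sizes` is a hypothetical helper written for illustration, not part of any library.

```python
def per_gpu_batch_sizes(batch_size, num_gpus):
    """Split a batch as evenly as possible across devices.

    Hypothetical helper for illustration only; assumes an even
    data-parallel split where earlier devices absorb any remainder.
    """
    base, extra = divmod(batch_size, num_gpus)
    return [base + (1 if i < extra else 0) for i in range(num_gpus)]


# Setup from the post: batch size 64 on 4 GPUs.
sizes = per_gpu_batch_sizes(64, 4)
print(sizes)  # [16, 16, 16, 16]
```

Under that assumption, each GPU processes only 16 samples per step, compared with all 64 on a single GPU.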