Using DataParallel with a CNN

I'm very new to deep learning and am trying to understand how to use DataParallel for my semantic segmentation training.
My data loader gives me a batch of 16 samples, each containing an input image and its ground-truth image. I understand that I need to transfer the data to the GPU and then train the model. The following code looks right to me, and when I run it I can see three 16 GB GPUs being used. Soon after, though, it crashes with an out-of-memory error. Is the code below right?
One thing I'm confused about is whether, when training the model, the entire batch can be sent in at once or whether we need to send each image one by one. We used to send each image one by one using a custom data loader, and that worked but was very slow. Hence this attempt to speed up training using DataParallel.

These are 3 16GB GPUs.

    for epoch in range(0, num_epochs):
        optimizer = get_optimizer(trainable_model, epoch) # optimizer for current epoch
        total_train_loss = 0

        for i_batch, sample_batched in tqdm(enumerate(training_generator)):
            rgb, mask = sample_batched
            var_rgb = Variable(rgb.float())
            print(var_rgb.shape)  # <-- this is (16, 3, 363, 400)
            var_rgb = var_rgb.cuda()

            var_mask = Variable(mask.float())
            var_mask = var_mask.cuda()

            output = trainable_model(var_rgb)
            loss = criterion(output, var_mask.long()) / settings.opt['batch_size']
            total_train_loss += loss.item()  # accumulate the scalar loss value

        del var_mask,var_rgb

Can you reduce the batch size and check? The image resolution may be large enough that each GPU cannot handle its share of the batch, roughly 5 images (16/3), at once.
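To make that arithmetic concrete, here is a small sketch of how a batch would be divided across GPUs. It assumes the split along the batch dimension behaves like torch.chunk (chunks of ceil(batch / devices), with the last one possibly smaller). The function name split_sizes is made up for illustration and is not part of PyTorch.

```python
import math


def split_sizes(batch_size, num_devices):
    """Per-device chunk sizes for a batch split the way torch.chunk would.

    Hypothetical helper for illustration only.
    """
    chunk = math.ceil(batch_size / num_devices)  # size of every full chunk
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)  # the last chunk may be smaller
        sizes.append(take)
        remaining -= take
    return sizes


print(split_sizes(16, 3))  # [6, 6, 4]
print(split_sizes(6, 3))   # [2, 2, 2]
```

Under that assumption, a batch of 16 puts 6 images on the busiest GPU, while a batch of 6 puts 2 on each. Activation memory grows with the number of images per GPU, so a smaller batch is the first thing to try when one device runs out of memory.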

Hi Innovarul,
I think you're right. After setting the batch size to 6, I am able to run the images through without running out of memory.
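On the earlier question of whether the whole batch can be sent in at once: with nn.DataParallel the full batch is passed to the wrapped model in one call, and the wrapper splits it along the first dimension across the visible GPUs. Below is a minimal sketch of one training step under that setup. TinySegNet, the random tensors, the 64x64 image size, and the SGD settings are placeholders I made up, not the model or data from this thread.

```python
import torch
import torch.nn as nn


class TinySegNet(nn.Module):
    """Hypothetical stand-in for a real segmentation network."""

    def __init__(self, num_classes=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(8, num_classes, kernel_size=1),
        )

    def forward(self, x):
        return self.body(x)  # shape: (N, num_classes, H, W)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# DataParallel replicates the model on each visible GPU; on a CPU-only
# machine it simply calls the wrapped module.
model = nn.DataParallel(TinySegNet()).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Random placeholder batch: 6 RGB images and integer class masks.
rgb = torch.rand(6, 3, 64, 64)
mask = torch.randint(0, 2, (6, 64, 64))

optimizer.zero_grad()
output = model(rgb.to(device))             # the whole batch goes in at once
loss = criterion(output, mask.to(device))  # masks are class indices (int64)
loss.backward()
optimizer.step()
batch_loss = loss.item()                   # plain Python float for logging
```

The outputs are gathered back onto the first GPU before the loss is computed, so that device usually needs a little more memory than the others.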