Batch size and num_workers vs GPU and memory utilization

pramod.srinivasan · February 6, 2019, 6:57am

Experimenting with a 4 GPU AWS instance setup to run batch inference on a segmentation network.

I varied num_workers together with the batch size and found that I could not improve the volatile GPU memory utilization beyond 50%.

When num_workers increased beyond 400, the 4th GPU could not allocate enough memory. More details in the plot below.

Here is the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   41C    P0   193W / 300W |   9116MiB / 16152MiB |     78%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   42C    P0   199W / 300W |   7624MiB / 16152MiB |     79%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   39C    P0   231W / 300W |   7624MiB / 16152MiB |     74%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P0    51W / 300W |    794MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Oli · February 6, 2019, 12:16pm

That seems like an awful lot of num_workers. I haven’t used any AWS GPUs, but when I run locally I typically set num_workers=8, which is plenty for most models to bottleneck elsewhere.

Is it possible that the code is bottlenecked elsewhere than the GPU / data reading?

pramod.srinivasan · February 8, 2019, 7:05pm

Do you have a way to see if the code is bottlenecked? I ran a small experiment to track time taken by each step when processing approx 36k records with batch size of 16.

This step takes 23.6 seconds.

dataloader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=NUM_PREPROCESS_WORKERS, drop_last=False, collate_fn=filtered_collate_fn)

This step takes 145 seconds.

for (cids, img_batch, heatmap_batch, tag_batch) in dataloader:
       output_probs = model.extract_mask(img_batch, heatmap_batch)
       q.put((cids, output_probs, tag_batch))

Oli · February 9, 2019, 10:43am

One very easy way to measure time in your program is the snakeviz package. I don’t think it can answer every question, since it doesn’t really understand GPU work, but it is quick and easy to start with. Nvidia has a GPU profiler tool, but I haven’t got it to work very well.
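As a rough illustration of that workflow: snakeviz visualizes profile files written by Python's built-in cProfile module. In the sketch below, `load_batch`, `run_model`, `inference_loop` and the file name `inference.prof` are hypothetical stand-ins, not code from this thread.

```python
import cProfile
import pstats
import time


def load_batch():
    # Hypothetical stand-in for fetching one batch from the dataloader.
    time.sleep(0.001)
    return list(range(16))


def run_model(batch):
    # Hypothetical stand-in for model.extract_mask(...).
    return [x * 2 for x in batch]


def inference_loop(num_batches=20):
    results = []
    for _ in range(num_batches):
        results.append(run_model(load_batch()))
    return results


profiler = cProfile.Profile()
profiler.enable()
outputs = inference_loop()
profiler.disable()

# Write the stats to disk; running `snakeviz inference.prof` in a shell
# then opens the same data as an interactive chart in the browser.
profiler.dump_stats("inference.prof")

# The profile can also be read in the terminal without snakeviz.
stats = pstats.Stats("inference.prof")
stats.sort_stats("cumulative").print_stats(5)
```

Sorting by cumulative time tends to surface the outermost slow call first, which is usually the quickest way to see whether loading or the model call dominates.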

This step initializes your dataloader but doesn’t actually read your data. It shouldn’t take 23 seconds though. Mine takes like a second but idk.

dataloader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=NUM_PREPROCESS_WORKERS, drop_last=False, collate_fn=filtered_collate_fn)

Here you are reading your data and putting it through your model. The time spent is correlated with how much data you have.

for (cids, img_batch, heatmap_batch, tag_batch) in dataloader:
       output_probs = model.extract_mask(img_batch, heatmap_batch)
       q.put((cids, output_probs, tag_batch))
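To tell whether a loop like this is waiting on the data loader or on the model, the two phases can be timed separately. This is only a sketch: `time_loop`, `slow_batches` and `fake_step` are hypothetical names, and the sleeps stand in for real work. Any iterable, including a DataLoader, could be passed as `batches`.

```python
import time


def time_loop(batches, step):
    """Return (seconds waiting for data, seconds inside step, batch count)."""
    data_time = 0.0
    step_time = 0.0
    count = 0
    iterator = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # time spent waiting for the next batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        step(batch)  # time spent in the model call
        t2 = time.perf_counter()
        data_time += t1 - t0
        step_time += t2 - t1
        count += 1
    return data_time, step_time, count


def slow_batches(n):
    # Stand-in for a dataloader that takes a while to produce each batch.
    for i in range(n):
        time.sleep(0.002)
        yield [i] * 16


def fake_step(batch):
    # Stand-in for the forward pass.
    time.sleep(0.001)


data_s, step_s, n = time_loop(slow_batches(10), fake_step)
print(f"{n} batches: {data_s:.3f}s waiting on data, {step_s:.3f}s in the step")
```

One caveat: CUDA kernels are launched asynchronously, so the step timing is only meaningful if the step ends with something that waits for the GPU, such as copying the output back to the CPU or calling torch.cuda.synchronize().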

The GPU utilization is dependent on your model and input size. Could you try num_workers=8 with batch_size=16, and then batch_size=32, and report the differences?

pramod.srinivasan · February 10, 2019, 10:32am

A few clarifications: during the __init__ call of the Dataloader, I was actually reading, decoding and resizing images, which is why it took 23 seconds. I moved the read/decode/resize steps into the __getitem__ method, and now the dataloader takes just over 1 second to initialize. It no longer reads data at that point; it simply loops through the records to calculate the length of the dataset (each image file can be used as input a variable number of times). (Reference: Inference Code Optimizations+ DataLoader - #2 by ptrblck)
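The change described above might look roughly like the following. This is a sketch and not the actual dataset from this thread: the class name, the `(path, repeat_count)` record layout and `fake_load` are all assumptions. A plain class is used so that it runs without PyTorch; real code would normally subclass torch.utils.data.Dataset.

```python
class LazyImageDataset:
    """Map-style dataset: builds only an index up front, loads in __getitem__."""

    def __init__(self, records, load_fn):
        # records: list of (path, repeat_count) pairs (hypothetical layout).
        # load_fn: callable that reads, decodes and resizes one image.
        self.load_fn = load_fn
        self.index = []
        for path, repeats in records:
            # One entry per time the file is used; no file is opened here.
            self.index.extend((path, k) for k in range(repeats))

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i):
        path, k = self.index[i]
        # The expensive work happens here, so dataloader workers can do it
        # in parallel rather than the main process doing it all in __init__.
        return path, k, self.load_fn(path)


loads = []


def fake_load(path):
    # Stand-in for read/decode/resize; records which files were touched.
    loads.append(path)
    return [0] * 4


dataset = LazyImageDataset([("a.jpg", 2), ("b.jpg", 1)], fake_load)
print(len(dataset), len(loads))  # prints "3 0": nothing loaded during init
item = dataset[2]
```

Only the call to `dataset[2]` triggers a load, which is what lets initialization finish quickly regardless of dataset size.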

Revisiting the dataloader’s initialization parameters here:

dataloader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=NUM_PREPROCESS_WORKERS, drop_last=False, collate_fn=filtered_collate_fn)

With NUM_PREPROCESS_WORKERS = 8 and BATCH_SIZE = 16, and then with the batch size changed to 32, the metrics for GPU/CPU utilization and memory usage are shown here:

Here is the Snakeviz profile with batch size = 16 and num_workers = 8: total batch inference time 164.95 seconds for 2302 batches, at 224 frames per second.

Below is the Snakeviz profile with batch size = 32 and num_workers = 8: total batch inference time 139 seconds for 1151 batches, at 264 frames per second.
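For reference, the frame rates can be approximately reproduced from the batch counts and timings as batches × batch size ÷ seconds. The helper name below is hypothetical, and the calculation assumes every batch is full, so it is a slight upper bound. It lands within about one frame per second of the figures quoted above.

```python
def frames_per_second(num_batches, batch_size, seconds):
    # Upper bound: assumes every batch is full (the last one may not be).
    return num_batches * batch_size / seconds


fps_16 = frames_per_second(2302, 16, 164.95)
fps_32 = frames_per_second(1151, 32, 139.0)
speedup = 164.95 / 139.0

print(f"bs=16: {fps_16:.1f} fps, bs=32: {fps_32:.1f} fps, speedup {speedup:.2f}x")
```

By this estimate, doubling the batch size shortened the run by a factor of roughly 1.19 rather than 2, which suggests the loop is not limited by GPU compute alone.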

Oli · February 11, 2019, 1:52pm

Hello again! I’m out traveling so will be brief. It seems like you are doing some kind of copy in your extract_mask function that takes a long time, and it seems to happen on your CPU. This could be your culprit.

pramod.srinivasan · February 11, 2019, 6:39pm

Thanks – the copy in the extract_mask function moves the current batch of images from CPU to a GPU device. Here is the pseudo code for the function:

def extract_mask(self, image_batch):
     #image_batch is the batch obtained from the dataloader.
     image_var = Variable(torch.Tensor(len(image_batch), 3, 224, 224), requires_grad=False).type(self.dtype_float).cuda(device=self.gpu_id)
     image_var.data.copy_(image_batch)
     output = self.model(image_var).cpu().numpy()
     return output

I probably should have mentioned this earlier: I am using PyTorch 0.3.1. Here’s a workaround which avoids the copy altogether.

def extract_mask(self, image_batch):
     #image_batch is the batch obtained from the dataloader.
     image_batch = Variable(torch.squeeze(image_batch), requires_grad=False).type(self.dtype_float).cuda(device=self.gpu_id)
     output = self.model(image_batch).cpu().numpy()
     return output

This definitely speeds up the dataloader…

Oli · February 12, 2019, 1:45am

I’m not sure about version 0.3, but can’t you wrap your image_batch from the loader in a Variable without the extra copy? Or somehow return the needed tensor from the loader directly?

pramod.srinivasan · February 12, 2019, 3:48am

You mean something like this?

image_batch = Variable(torch.squeeze(image_batch), requires_grad=False).type(self.dtype_float).cuda(device=self.gpu_id)

Oli · February 12, 2019, 2:54pm

Yes that’s what I had in mind. Are you satisfied with the speed now? Do you have better gpu-utilization?

pramod.srinivasan · February 12, 2019, 5:26pm

Thanks for the inputs. I think your suggestions of removing the explicit copy and using the Variable to wrap the tensor largely helped with the speed-up, but I encountered something quite strange about the inference, which I think warranted a new post here: [pytorch0.3.1] Forward pass takes 10x longer time for every 2nd batch inference