Optimizing GPU forward pass time and output packaging time during inference

I am using torch.utils.data.DataLoader to create a dataloader (Phase 0) over a few thousand images and run a forward pass (Phase 1) on batches of 32 images. So far, I have been able to optimize the GPU and memory usage during the inference phase (Phase 1). During this phase, I collect the outputs of the batches into Python lists, and then, in a separate post-processing step (Phase 2), package all of these outputs into a file (before compressing it).

I noticed that, per 30k images, Phase 0 takes negligible time, the forward pass in Phase 1 takes about 120 seconds (approx 250 images per second), and the packaging in Phase 2 takes a comparable amount of time (approx 100 seconds). During packaging the GPU sits idle, and only after the outputs have been saved to disk do we prepare the next dataloader.
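
The numbers above are plain wall-clock measurements around each phase; conceptually they come from something like the sketch below (the `timed` helper is only illustrative, not my exact logging code; torch.cuda.synchronize() is called so that queued GPU work is counted against the phase that launched it):

import time
import torch

def timed(label, fn, *args, **kwargs):
    # Wall-clock timing of one phase; synchronize before and after so pending
    # GPU work is included in this phase rather than leaking into the next one.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    print('{}: {:.1f}s'.format(label, time.perf_counter() - start))
    return result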

Phases 0-2 need to be repeated for millions of images.
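
To give a sense of the overall structure: everything runs as one sequential loop over chunks, roughly like the sketch below (every name here is a placeholder; the real bodies are the Phase 0-2 skeleton further down):

import itertools

def chunks_of(items, size):
    # Yield successive chunks of `size` items (illustrative chunking helper).
    it = iter(items)
    while chunk := list(itertools.islice(it, size)):
        yield chunk

# Placeholder phase functions; their bodies correspond to the skeleton below.
def build_loader(chunk): ...          # Phase 0
def run_forward_pass(loader): ...     # Phase 1 (GPU busy)
def package_outputs(outputs): ...     # Phase 2 (GPU idle, writes to disk)

all_items = []  # placeholder for the full list of millions of items

for chunk in chunks_of(all_items, 30_000):
    loader = build_loader(chunk)
    outputs = run_forward_pass(loader)
    package_outputs(outputs)  # nothing overlaps with this step today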

Here is the rough skeleton of my code:


import json
import torch

### PHASE 0 : PREPARING DATA #####
dataset = CustomDataset(items)
data_loader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=NUM_PREPROCESS_WORKERS, drop_last=False, collate_fn=filtered_collate_fn)
 
cids, features, meta = [], [], []

### PHASE 1 : FORWARD PASS #####
# note each of *_batch is a tuple
for batch_idx, (id_batch, img_batch, tag_batch) in enumerate(data_loader):
  try:
    feat_batch = self.model.forward_pass(img_batch)
    cids.extend(id_batch)
    features.extend(feat_batch)
    meta.extend(tag_batch)
  except Exception:
    # error handling elided in this skeleton; failed batches are skipped
    continue


### PHASE 2 : PACKAGING OUTPUTS OF PHASE 1 #####
filename = 'chunk-{}.json'.format(get_random_string_with_timestamp())
json_lines = []
for cid, feature, tag in zip(cids, features, meta):
  # feature is assumed to be a tensor here; .tolist() makes it JSON-serializable
  json_obj = { "cid" : str(cid), "feature": feature.tolist(), "tag" : str(tag) }
  json_lines.append(json.dumps(json_obj))
file_data = '\n'.join(json_lines)

with open(filename, 'w') as f:
  f.write(file_data)
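
The compression mentioned earlier happens after the chunk file is written. A minimal sketch of that step, assuming it amounts to a plain gzip of the chunk file (the details of my actual compression are omitted here):

import gzip
import shutil

# Gzip the freshly written chunk file (sketch; reuses `filename` from Phase 2 above).
with open(filename, 'rb') as src, gzip.open(filename + '.gz', 'wb') as dst:
    shutil.copyfileobj(src, dst)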