I spent some time tracking down the biggest bottleneck in the training phase, which turned out to be the transforms on the input images. 500-3000 tiles need to be iteratively transformed using the Composition below, which takes 5-20 seconds. I tried a variety of Python tricks to speed things up (pre-allocating lists, generators, chunking), to no avail. I already use multiple workers. The issue seems to be that each individual transform takes some time, and it really adds up:
```python
self.img_finalize = transforms.Compose([
    transforms.ToPILImage(),                 # 1 s
    transforms.RandomRotation(25),           # 2 s
    transforms.RandomResizedCrop(1024, scale=(0.5, 1.5), ratio=(0.8, 1.2), interpolation=2),  # 0.5 s
    transforms.Resize(resolution),           # 0.5 s
    transforms.RandomHorizontalFlip(p=0.5),  # 0.5 s
    transforms.RandomVerticalFlip(p=0.5),    # 0.5 s
    transforms.ColorJitter(brightness=0.2, contrast=0.1, saturation=0.05, hue=0.02),  # 3 s
    transforms.ToTensor(),                   # 2 s
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # 3 s
])
```
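To double-check which steps dominate, it may help to measure each transform systematically rather than relying on one-off timings. A small framework-agnostic sketch (`TimedCompose` is a made-up name) that wraps the same list of transforms and accumulates wall time per step:

```python
import time


class TimedCompose:
    """Apply a sequence of transforms, accumulating wall time per step.

    Illustrative sketch: keys the timing table by class name, so two
    transforms of the same class would share one bucket.
    """

    def __init__(self, transforms):
        self.transforms = transforms
        self.timings = {type(t).__name__: 0.0 for t in transforms}

    def __call__(self, x):
        for t in self.transforms:
            start = time.perf_counter()
            x = t(x)
            self.timings[type(t).__name__] += time.perf_counter() - start
        return x
```

Swapping this in for `transforms.Compose` and printing `self.timings` after an epoch gives a per-transform breakdown across all tiles, not just one sample.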
- I notice that with the dataloader, every worker that was started needs to finish before any data is pushed to the GPU. If I add a print statement `DONE` at the end of the `__getitem__()` method and a print statement `FORWARD` before the forward pass, there is a significant delay between the first `DONE` and the first `FORWARD`; in fact, 16 (= n_workers) `DONE`s need to appear before the first `FORWARD`. This doesn't happen if n_workers=None. I suspect some form of lock is implemented? I am working in the unusual situation where my batch size is 1, but each batch is huge (it's a MIL paradigm…). So it would be useful to process a batch as soon as its `__getitem__()` method is done. This makes sense to me if batch_size > 1, since you need to be sure all items are done before collating, but it would be great if it could be relaxed when batch_size = 1.
E.g. I see output like the following, and I don't think this is the latency from sending data to the GPU (I checked that too; it's much faster than the individual methods). Could it be the latency of sending results from the forked worker processes back to the main process?

```
DONE! DONE! DONE! DONE! DONE! DONE! DONE! DONE! DONE! FORWARD
```
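For what it's worth, the burst of `DONE`s is consistent with the DataLoader's documented prefetching: at startup each worker preloads `prefetch_factor` batches (default 2), and batches are handed back in index order, so many `__getitem__()` calls complete before the first batch reaches the training loop. A minimal sketch to reproduce (`TileDataset` and the tensor sizes are made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, Dataset


class TileDataset(Dataset):
    """Toy stand-in for the real dataset; sizes are illustrative."""

    def __init__(self, n_tiles):
        self.n_tiles = n_tiles

    def __len__(self):
        return self.n_tiles

    def __getitem__(self, idx):
        x = torch.zeros(3, 8, 8)  # pretend this took seconds of transforms
        print("DONE")
        return x


if __name__ == "__main__":
    # Each worker preloads prefetch_factor batches up front, and batches
    # come back in index order, so a burst of DONEs precedes the first
    # FORWARD even with batch_size=1.
    loader = DataLoader(TileDataset(8), batch_size=1,
                        num_workers=2, prefetch_factor=2)
    for batch in loader:
        print("FORWARD")
```

With `num_workers=0` the same loop interleaves `DONE`/`FORWARD` strictly, which matches the observation above.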
- Is there any use in pre-compiling the transform? If there would be significant performance gains (and it's possible), it's something I definitely want to explore. Any advice or helpful links on this? I have some experience with exposing C++ to Python.