I had an issue at work where the question was whether I should stream the data from an S3 bucket via the `Dataset` class, or download it first and simply read it from disk. I was hoping that increasing `prefetch_factor` in the `DataLoader` would speed up streaming via S3, possibly enough to make it an alternative to downloading.

To stream data instead of opening it from disk (as is done in many tutorials), the dataset class was set up as follows:
```python
import requests
from io import BytesIO
from PIL import Image
from torch.utils.data import Dataset

class Data(Dataset):
    def __init__(self, transform, prefix="https://aft-vbi-pds.s3.amazonaws.com/bin-images"):
        self.prefix = prefix
        self.transform = transform

    def __len__(self):
        return 999

    def __getitem__(self, i):
        # Images in the bucket are named 1.jpg, 2.jpg, ... (1-indexed)
        response = requests.get(self.prefix + f"/{i + 1}.jpg")
        img = Image.open(BytesIO(response.content))
        return self.transform(img)
```
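For completeness, the `DataLoader` was presumably constructed along these lines. The dummy dataset below is a stand-in for the S3-backed `Data` class so the snippet runs offline; the exact construction in the kernel may differ:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class DummyData(Dataset):
    """Stand-in for the S3-backed Data class, so this runs without network access."""
    def __len__(self):
        return 64

    def __getitem__(self, i):
        # Shape matches a transformed 224x224 RGB image
        return torch.zeros(3, 224, 224)

ds = DummyData()
# Settings from the experiments: batch_size=32, pin_memory=True,
# with num_workers and prefetch_factor varied per run.
dl = DataLoader(ds, batch_size=32, num_workers=2, prefetch_factor=8, pin_memory=True)
```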
As shown in the experiments done in this Kaggle kernel, the `prefetch_factor` flag did not speed things up in a meaningful way. The results are summarised below. For each iteration the following code snippet was run, where `model` is simply a resnet18:
```python
with torch.inference_mode():
    for img_batch in tqdm(dl):
        out = model(img_batch.to(device))
```
| Settings | Time Elapsed |
|---|---|
| `num_workers=2` | 04:02 |
| `num_workers=2, prefetch_factor=8` | 03:57 |
| `num_workers=8` | 01:01 |
| `num_workers=8, prefetch_factor=8` | 01:01 |
All other parameters, such as `batch_size=32` and `pin_memory=True`, were held constant across all iterations.
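One plausible explanation is that `prefetch_factor` only sizes the buffer of already-fetched batches; it does not add any concurrency. When each worker's `__getitem__` is dominated by network latency, throughput is set by `num_workers` alone. The toy producer/consumer simulation below (pure stdlib, not PyTorch internals; all names and latencies are made up) illustrates this:

```python
import queue
import threading
import time

def run(num_workers, buffer_per_worker, n_items=16, fetch_latency=0.05):
    """Simulate DataLoader-style workers: each worker fetches items
    sequentially (like __getitem__ over the network) into a bounded
    buffer of size num_workers * buffer_per_worker."""
    buf = queue.Queue(maxsize=num_workers * buffer_per_worker)
    indices = queue.Queue()
    for i in range(n_items):
        indices.put(i)

    def worker():
        while True:
            try:
                i = indices.get_nowait()
            except queue.Empty:
                return
            time.sleep(fetch_latency)  # simulated network fetch
            buf.put(i)

    start = time.perf_counter()
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for _ in range(n_items):
        buf.get()  # the consumer is fast; the producers are the bottleneck
    for t in threads:
        t.join()
    return time.perf_counter() - start

t_small = run(num_workers=2, buffer_per_worker=2)
t_big = run(num_workers=2, buffer_per_worker=8)
t_more_workers = run(num_workers=8, buffer_per_worker=2)
```

Enlarging the buffer (`t_big` vs `t_small`) barely changes the total time, while quadrupling the worker count cuts it dramatically, which mirrors the pattern in the table above.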
Note that the reason we used 2 workers was that this was the number given by `multiprocessing.cpu_count()`. However, going past that number in the last two iterations and setting it to 8 still worked, though it gave the following ugly (repeated) warning: `Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__`.
Any thoughts as to why `prefetch_factor` doesn't seem to help here?