Time profiling for input feeding to model?

Hi, I am trying to assess how much time my code takes to load the input (with data augmentation). Currently I am using PyTorch's DataLoader. I record the time at the beginning of the __getitem__ method and stop when it returns a batch. Is this the right way to assess the time?
My code looks something like this:

def __getitem__(self, idx):
    ts = time.time()
    ......
    ......
    print("time elapsed for inputs: {}s".format(time.time() - ts))
    return batch

Your approach could work to profile the loading of a single sample.
However, if you want to profile the data loading via a DataLoader (and thus potentially multiple workers), I would recommend using this code snippet from the ImageNet example.


A single sample or a single batch?

The __getitem__ method is usually used to load a single sample using the passed index.
So your code snippet would profile the loading time of a single sample, while the ImageNet example would profile the complete DataLoader, i.e. the time could approach zero if the workers are preloading the next batches fast enough while your GPU is busy.

Right. So improving the loading speed of a single sample should also speed up loading the whole batch and, in turn, the training? I tried the ImageNet example, but I am confused about the difference between data time and batch time. According to the example:

ts = time.time()
for i, (inputs, labels) in enumerate(train_loader):
    # data_time: how long we waited for the DataLoader to provide this batch
    data_time.update(time.time() - ts)

    outputs = fcn_model(inputs)
    labels = labels.type_as(outputs)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # batch_time: data loading + forward/backward/step for this iteration
    batch_time.update(time.time() - ts)
    ts = time.time()

So data time is the time taken by the DataLoader to load a batch, and batch time is the time taken by the model to process a batch? I think I can only improve the efficiency of the data loading, not the batch processing (batch time), because that depends on the model, right?

Yes, your statement is correct.
The first DataLoader iteration will be slower, since all workers are still loading their first complete batches. If the data loading is not a bottleneck, the data loading time should decrease towards zero in the following iterations.
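To quantify this, you could e.g. average the measured waiting time while skipping the first warm-up iteration. A minimal sketch, assuming train_loader exists as in your snippet and using a hypothetical train_step helper for the model part:

import time

data_times = []
ts = time.time()
for i, batch in enumerate(train_loader):
    data_times.append(time.time() - ts)
    train_step(batch)   # hypothetical helper: forward/backward/optimizer step
    ts = time.time()

# exclude iteration 0, where all workers are still filling their first batches
avg_wait = sum(data_times[1:]) / max(len(data_times) - 1, 1)
print("avg wait for data after warm-up: {:.4f}s".format(avg_wait))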

To accelerate the model, you could use e.g. mixed-precision training and check if this would yield a speedup.
Also, torch.backends.cudnn.benchmark = True would enable cudnn to profile the kernels for each new input shape and could accelerate the training.
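A minimal sketch of both suggestions using torch.cuda.amp, reusing fcn_model, criterion, optimizer and train_loader from the snippet above (names assumed from your code):

import torch

torch.backends.cudnn.benchmark = True  # let cudnn pick the fastest kernels per input shape

scaler = torch.cuda.amp.GradScaler()

for inputs, labels in train_loader:
    inputs = inputs.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # run the forward pass in mixed precision
        outputs = fcn_model(inputs)
        loss = criterion(outputs, labels.type_as(outputs))

    scaler.scale(loss).backward()  # scale the loss to avoid underflow in fp16 gradients
    scaler.step(optimizer)
    scaler.update()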

Regarding the code snippet that you highlighted: isn’t there a potential problem with this way of measuring the data loading time because of non_blocking=True?

Basically, it won’t take the CPU-to-GPU transfer into account, no?

The linked timer is not used to explicitly measure the time it takes to load and process each batch and to send it to the device. Instead, it measures how much time of your training loop is spent waiting for the next batch. If your use case is not suffering from a data loading bottleneck, this timer would decrease towards zero in the optimal case, since the next batch would already be preloaded and you wouldn’t have to wait for it.
In this sense you are correct that we are ignoring the host-to-device copy, but we also don’t want to measure it.
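If you did want to time the host-to-device copy explicitly despite the asynchronous non_blocking=True transfer, you could use CUDA events. A minimal sketch, assuming inputs is a CPU tensor and a GPU is available:

import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
inputs_gpu = inputs.cuda(non_blocking=True)
end.record()

torch.cuda.synchronize()  # wait for the async copy to finish before reading the timer
print("H2D copy took {:.3f} ms".format(start.elapsed_time(end)))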
