I am currently running some experiments in which I measure data loading times and compute times (i.e., forward/backward pass times) for a couple of models, starting off with ResNet-50 on ImageNet. However, I sometimes encounter negative time measurements, and I would like to understand whether this issue is related to PyTorch, to core Python components, or to something else in my code base.
My code is based on the deep learning examples from NVIDIA, which in turn are based on an example from the PyTorch community. The core of my time measurement looks like this:
end = time.time()  # set once before the loop (omitted from the snippet above)
for i, (data, label) in data_iter:
    batch_size = data.size(0)
    lr_policy(optimizer, i, epoch)
    data_time = time.time() - end           # <<< time spent fetching the batch

    optimizer_step = ((i + 1) % batch_size_multiplier) == 0
    # Forward/backward happens in step()
    loss = step(data, label, optimizer_step=optimizer_step)
    iteration_time = time.time() - end      # <<< data loading + compute together

    if logger is not None:
        pass  # <some logging stuff>

    end = time.time()                       # <<< reset the reference point
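For reference, the measurement pattern in isolation looks roughly like the sketch below. Here simulate_work is just a hypothetical stand-in for data loading and for the forward/backward pass (it is not part of my actual code base), and I take compute_time to be iteration_time minus data_time, which is how the logged value is derived:

import time

def simulate_work(seconds=0.01):
    # Stand-in workload; replaces data loading / forward-backward.
    time.sleep(seconds)

end = time.time()
for i in range(5):
    simulate_work()                          # stands in for fetching the next batch
    data_time = time.time() - end            # time since the previous iteration ended
    simulate_work(0.05)                      # stands in for forward/backward in step()
    iteration_time = time.time() - end       # data loading + compute together
    compute_time = iteration_time - data_time
    print(f"iter {i}: data {data_time:.5f} s, compute {compute_time:.5f} s")
    end = time.time()                        # reset the reference point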
My logs then show measurements like this:
Train [1/1492] t.data_time : 0.00018 s t.compute_time : 0.26296 s
Train [1/1493] t.data_time : 0.00018 s t.compute_time : 0.26300 s
Train [1/1494] t.data_time : 0.00018 s t.compute_time : 0.26295 s
Train [1/1495] t.data_time : 0.00018 s t.compute_time : 0.26285 s
Train [1/1496] t.data_time : 0.00018 s t.compute_time : -2.66764 s
Train [1/1497] t.data_time : 0.00018 s t.compute_time : 0.26303 s
Train [1/1498] t.data_time : 0.00017 s t.compute_time : 0.26280 s
Train [1/1499] t.data_time : 0.00017 s t.compute_time : 0.26296 s
I would greatly appreciate some insight into where these negative measurements could come from. If it helps, I can also provide more information on my experiments and code base.