Results of a calculation depend on whether `torch.cuda.synchronize()` is present or not

Hi,
I’ve come across a very peculiar bug in my code. Although I’ve been able to bypass it by adding a torch.cuda.synchronize() in my loop, I really can’t understand why this synchronization is needed to get reliable outputs.

The real code I use is a bit complicated, so I’ve whipped up a dummy version of it that shows the general process:

from typing import List

import torch
from torch import Tensor

class PeculiarBug:
    def __init__(self, model1, model2, dataloader):
        # Two models, both on GPU. Iterating over the dataloader yields batches of images already cast to the GPU.
        self.model1 = model1
        self.model2 = model2
        self.dataloader = dataloader

        # HookManager adds forward hooks to intermediate layers of the input model.
        # Intermediate layer outputs are appended to a list within the HookManager
        self.hook_manager1 = HookManager(self.model1)
        self.hook_manager2 = HookManager(self.model2)

        self.A = torch.zeros(10, 10).cuda()
        self.B = torch.zeros(10, 1).cuda()
        self.C = torch.zeros(1, 10).cuda()

    def calculate(self):
        tempA = torch.zeros(10, 10).cuda()
        tempB = torch.zeros(10, 1).cuda()
        tempC = torch.zeros(1, 10).cuda()
        for imgs in self.dataloader:
            self.model1(imgs)
            self.model2(imgs)

            # Get list of intermediate features from hook manager
            layers1: List[Tensor] = self.hook_manager1.get_features()
            layers2: List[Tensor] = self.hook_manager2.get_features()

            # Calculate the tempA matrix. All matrix operations on GPU.
            # Updates self.A inplace via Tensor.__iadd__ method 
            # (e.g., something like self.A += tempA)
            self.calculate_A(layers1, layers2, tempA)

            # Calculate the tempB and tempC matrix. All operations are on GPU,
            # and updates to self.B and self.C are inplace via Tensor.__iadd__
            # (e.g., something like self.B += tempB)
            self.calculate_BC(layers1, layers2, tempB, tempC)

            # self.A, B, and C have been updated in the functions.
            # Fill tempA, B, C back with zeros
            tempA.fill_(0)
            tempB.fill_(0)
            tempC.fill_(0)
            
            # clear hook features
            self.hook_manager1.clear_features()
            self.hook_manager2.clear_features()
        
        return self.A / (self.B * self.C)

    # Example code for the calculate_BC function
    def calculate_BC(self, layers1, layers2, tempB, tempC):
        # tempB is changed inplace. 
        for start_idx in range(0, 50, group_size):
            end_idx = min(start_idx + group_size, 50)
            X = torch.stack([layers1[i] for i in range(start_idx, end_idx)], dim=0)
            tempB[0, start_idx:end_idx] += some_metric(X, X)
        self.B += tempB

        # similar operation for tempB and self.C
        ...
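
For context, HookManager is just a small helper around forward hooks. Below is a simplified sketch of what it does; the layer selection (hooking every ReLU) is purely illustrative and not the layers I actually hook:

import torch.nn as nn
from typing import List
from torch import Tensor

class HookManager:
    # Simplified sketch: register forward hooks that collect intermediate outputs.
    def __init__(self, model: nn.Module):
        self._features: List[Tensor] = []
        for module in model.modules():
            # The real code targets specific intermediate layers; hooking every
            # ReLU here is only for illustration.
            if isinstance(module, nn.ReLU):
                module.register_forward_hook(
                    lambda _module, _input, output: self._features.append(output)
                )

    def get_features(self) -> List[Tensor]:
        return self._features

    def clear_features(self) -> None:
        self._features.clear()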
                        

The problem

Given two identical models as inputs, the square matrix returned by calculate() should have ones on the diagonal. HOWEVER, when the loop runs quickly, this code returns a matrix with values < 1 on the diagonal. Interestingly, the bug shows the following properties:

  1. The bug seems to depend on how fast the loop runs. For example, it appears when the two models are ResNet-18s, but not when they are ResNet-50s (presumably because the forward pass is slower on ResNet-50). Furthermore, adding a breakpoint inside the for loop and slowly stepping through it returns valid results, BUT quickly stepping over the breakpoint messes up the results again.
  2. torch.cuda.synchronize() can be added at any line within the for loop to bypass the issue (yes, I’ve checked every line).
  3. Following up on point 2, it doesn’t even have to be torch.cuda.synchronize() specifically; for example, adding something like if (self.A / (self.B * self.C)).diagonal().sum() < 50: breakpoint() will not trigger the breakpoint and will return valid results (in this case, the comparison seems to act as a synchronization point; see the snippet after this list). This makes it incredibly hard to debug, since trying to pinpoint the bug makes it disappear entirely. It’s very akin to the “observer effect” in physics, where observing a process affects the process itself.
  4. I have tried adding assertions that len(layers1) == len(layers2) and that len(layers1) == 10; neither raises an AssertionError, and adding them does not fix the bug.
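
To make point 3 concrete, here is a minimal standalone snippet (separate from my code) showing why GPU work runs asynchronously and why a host-side read acts as an implicit synchronization point:

import torch

x = torch.randn(1000, 1000, device="cuda")
y = x @ x                      # the kernel is only enqueued; Python returns immediately
torch.cuda.synchronize()       # explicit sync: wait until all queued GPU work finishes

# Anything that needs the value on the CPU synchronizes implicitly, e.g. .item(),
# .cpu(), printing, or using a 0-dim tensor in a Python `if`:
if y.diagonal().sum().item() < 0:   # .item() copies to the host and waits for the GPU
    print("negative trace")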

In essence, it seems like self.A, self.B, and self.C are not being updated in a timely manner (or are even skipped completely?). I suspected that the self.B += tempB and tempB.fill_(0) operations might be executing in the wrong order, but this doesn’t make sense theoretically, since they are both tensor operations and should be queued up in the correct order. Also, if self.hook_manager1.clear_features() were to run before the calculation functions, the code should raise other errors, such as a dimension mismatch.
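
To illustrate the ordering assumption above, a tiny standalone example (not my real code): operations issued on the same CUDA stream execute in the order they were enqueued, so the += cannot see the zeros written by the later fill_.

import torch

B = torch.zeros(10, 1, device="cuda")
tempB = torch.ones(10, 1, device="cuda")

B += tempB        # enqueued first
tempB.fill_(0)    # enqueued second; it cannot overtake the += on the same stream

torch.cuda.synchronize()
assert B.sum().item() == 10.0   # the += saw the ones, not the zeros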

Has anyone experienced something similar, or can anyone provide insights into what may be going wrong? Although adding synchronization points does solve the problem, I can’t pinpoint the root cause, and I don’t know whether this is just a temporary workaround.

Thanks for reading this long post

All your explanations point to a synchronization issue in the backend.
In case you are not using the latest PyTorch release, could you update to the nightly binary and check if you are still hitting this issue (with all the synchronizations that seem to fix it removed)?
If so, would you be able to create a minimal, executable code snippet which would reproduce the issue and post the output of python -m torch.utils.collect_env here, please?

Hi @ptrblck,
thanks for looking into this.

I’m working on getting a minimal code that can reproduce the error.

Below are my env details:

PyTorch version: 1.10.1
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (GCC) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.27

Python version: 3.9.9 | packaged by conda-forge | (main, Dec 20 2021, 02:41:03)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-4.15.0-163-generic-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 2080 Ti
GPU 1: NVIDIA GeForce RTX 2080 Ti
GPU 2: NVIDIA GeForce RTX 2080 Ti
GPU 3: NVIDIA GeForce RTX 2080 Ti

Nvidia driver version: 465.19.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.3.1
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.5
[pip3] pytorch-pfn-extras==0.5.5
[pip3] torch==1.10.1
[pip3] torchmetrics==0.7.2
[pip3] torchvision==0.11.2
[conda] blas                      2.113                       mkl    conda-forge
[conda] blas-devel                3.9.0            13_linux64_mkl    conda-forge
[conda] cudatoolkit               11.3.1               ha36c431_9    conda-forge
[conda] libblas                   3.9.0            13_linux64_mkl    conda-forge
[conda] libcblas                  3.9.0            13_linux64_mkl    conda-forge
[conda] liblapack                 3.9.0            13_linux64_mkl    conda-forge
[conda] liblapacke                3.9.0            13_linux64_mkl    conda-forge
[conda] mkl                       2022.0.1           h8d4b97c_803    conda-forge
[conda] mkl-devel                 2022.0.1           ha770c72_804    conda-forge
[conda] mkl-include               2022.0.1           h8d4b97c_803    conda-forge
[conda] mypy_extensions           0.4.3            py39hf3d152e_4    conda-forge
[conda] numpy                     1.21.5           py39haac66dc_0    conda-forge
[conda] pytorch                   1.10.1          py3.9_cuda11.3_cudnn8.2.0_0    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] pytorch-pfn-extras        0.5.5                    pypi_0    pypi
[conda] torchmetrics              0.7.2              pyhd8ed1ab_0    conda-forge
[conda] torchvision               0.11.2               py39_cu113    pytorch

While simplifying the code, I realized I can’t seem to reproduce the issue with PyTorch’s native DataLoader. This leads me to believe it might be a bug in the dataloader I’m using (an FFCV dataloader). I’ll see if I can get updates from the developers on the FFCV side and post a reply here.

@ptrblck, the issue seems to be in the custom dataloader library. Your request for a minimal, executable code snippet led me to swap out the dataloader and realize the problem was on the dataloader side. Thanks for the help!

That’s good to hear! Would you mind sharing what the issue was, i.e. was the synchronization issue caused by incorrect usage of FFCV or by an internal FFCV bug?

From what I can tell, it seems like an internal FFCV bug, but I’ll have to wait and see what the developers think.
I’ve posted minimal reproducible code as an issue on their repo.

Basically, the code checks whether the two identical models received the same inputs by looking at a BatchNorm layer’s running-mean values after a certain number of iterations (other methods of checking seem to change the behavior). I found that at some point the inputs the two models receive are actually different, unless a synchronization point is added within the loop.
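
The check is roughly along these lines; this is a rough sketch with placeholder names (e.g. `dataloader` stands in for the FFCV loader), not the exact code from the issue:

import torch
import torchvision

# Sketch of the check described above: feed the same batches to two identical
# models and compare a BatchNorm running mean after a number of iterations.
model1 = torchvision.models.resnet18().cuda().train()
model2 = torchvision.models.resnet18().cuda().train()
model2.load_state_dict(model1.state_dict())    # make the two models identical

for step, imgs in enumerate(dataloader):       # `dataloader` is a placeholder here
    model1(imgs)
    model2(imgs)
    if step == 100:
        # If both models really received the same inputs, their BatchNorm
        # running means (updated during forward passes in train mode) must match.
        print(torch.allclose(model1.bn1.running_mean, model2.bn1.running_mean))
        break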
