How to speed up the back-propagation function of AdderNet

How can I speed up the back-propagation function of AdderNet? Training time is extremely long.

Note: More context could be found at https://github.com/huawei-noah/AdderNet/issues/16#issuecomment-627446321

What is slow in there? The CUDA kernels?

I tested the CUDA kernels separately on their own, and they are not slow compared to the CUDA kernel (unoptimized_cuda.cpp) used together with YOLO.

Why ?

Streaming output truncated to the last 5000 lines.
17.9488s  1.6960us              (1 1 1)        (64 1 1)         8        0B        0B         -           -           -           -     Tesla P4 (0)         1         7  _ZN2at6native6modern29vectorized_elementwise_kernelILi4EZZZNS0_16fill_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlvE_NS_6detail5ArrayIPcLi1EEEEEviT0_T1_ [6598]
17.9488s  3.5840us              (1 1 1)       (256 1 1)        48        0B        0B         -           -           -           -     Tesla P4 (0)         1         7  _ZN89_GLOBAL__N__65_tmpxft_00002210_00000000_11_DistributionNormal_compute_75_cpp1_ii_ccb2fd7d43distribution_elementwise_grid_stride_kernelIfLi4EZZZN2at6native18normal_kernel_cudaERNS1_14TensorIteratorEddPNS1_9GeneratorEENKUlvE_clEvENKUlvE0_clEvEUlP24curandStatePhilox4_32_10E0_ZNS_27distribution_nullary_kernelIffLi4ENS1_13CUDAGeneratorESB_ZZZNS2_18normal_kernel_cudaES4_ddS6_ENKS7_clEvENKS8_clEvEUlfE_EEvS4_PT2_RKT3_T4_EUlifE_EEviSt4pairImmET1_SF_ [6606]
(the same two kernels, an elementwise fill kernel and a normal-distribution sampling kernel, alternate like this for the remainder of the trace; the final entry at 17.9494s is truncated mid-line)

I’m a bit confused about what you’re doing here.
What are these timings here for?

Also, do you mean that using the kernels by themselves is faster than using them embedded in the rest of the code?
Maybe you're not using the same input sizes? Or not testing the backward kernel?

Yes

But the CUDA kernel on its own is only using CIFAR-10, while YOLO together with the CUDA kernel is using VOC2007.

So these two have different input sizes, right? Could it be that the kernel scales very badly with the input size?
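For a rough sense of scale: the work per adder layer grows linearly with the number of output pixels, so the resolution difference alone could explain a large gap. A back-of-the-envelope sketch (the channel counts and the 448 × 448 YOLO-style input below are illustrative assumptions, not values taken from either model):

```python
# Rough per-layer operation count for an adder-style layer: each output
# position computes an L1 distance over a c_in * k * k input patch,
# i.e. roughly 2 * c_in * k * k subtract/abs/accumulate operations.
def adder_layer_ops(h, w, c_in, c_out, k=3):
    # assumes stride 1 and "same" padding, so the output is also h x w
    return h * w * c_out * (2 * c_in * k * k)

cifar_ops = adder_layer_ops(32, 32, 64, 64)    # CIFAR-10-sized input
voc_ops = adder_layer_ops(448, 448, 64, 64)    # YOLO-style input size

print(voc_ops // cifar_ops)  # -> 196, i.e. (448/32)**2 times more work
```

So even with identical kernels, a 448 × 448 input does roughly 200× the work of a 32 × 32 one per layer; if the backward kernel additionally scales worse than linearly in the number of pixels, the gap would be larger still.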

I am checking where exactly the CUDA kernel calculation bottleneck is.
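In case it helps, PyTorch's built-in autograd profiler can attribute time to individual ops in both the forward and backward pass, which is a quick way to localize the bottleneck before reaching for nvprof. A minimal sketch (the Conv2d here is just a stand-in for the adder layer from the repo; on a GPU you would pass use_cuda=True to also record CUDA kernel times):

```python
import torch

model = torch.nn.Conv2d(3, 16, 3, padding=1)  # stand-in for the adder layer
x = torch.randn(1, 3, 64, 64, requires_grad=True)

# use_cuda=True would additionally record CUDA kernel times on a GPU
with torch.autograd.profiler.profile(use_cuda=False) as prof:
    loss = model(x).sum()
    loss.backward()

# sort by self time to see which op dominates; backward ops appear
# alongside forward ops in the same table
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```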

Do you know how to use the VOC2007 dataset, instead of the CIFAR-10 dataset (whose images are only 32 × 32), with the CUDA kernel on its own?

torchvision.datasets does not have VOC2007.

    if args.dataset == 'cifar10':
        train_dataset = datasets.CIFAR10(
            root=args.data_dir,
            train=True,
            transform=transforms.Compose([
                transforms.RandomHorizontalFlip(),
                transforms.RandomCrop(size=32, padding=4),
                transforms.ToTensor(),
                transforms.Normalize(mean=[0.4914, 0.4822, 0.4465],
                                     std=[0.2023, 0.1994, 0.2010]),
            ]), download=True)
        val_dataset = datasets.CIFAR10(
            root=args.data_dir,
            train=False, 
            transform=transforms.Compose([
                transforms.ToTensor(),
                transforms.Normalize(mean=[0.4914, 0.4822, 0.4465],
                                     std=[0.2023, 0.1994, 0.2010]),
            ]), download=True)
    else:
        train_dataset = datasets.ImageFolder(
            root=args.data_dir, 
            transform=transforms.Compose([
                transforms.RandomHorizontalFlip(),
                transforms.Resize(224),
                transforms.ToTensor(),
                transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                         std=[0.229, 0.224, 0.225])
            ]))
        val_dataset = datasets.ImageFolder(
            root=args.data_dir, 
            transform=transforms.Compose([
                transforms.Resize(224),
                transforms.ToTensor(),
                transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                         std=[0.229, 0.224, 0.225])
            ]))  

What does the transform do?

I mean the RandomCrop

transforms.RandomCrop cannot be applied to the VOC2007 dataset, which consists of images of different dimensions.
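The usual workaround for variable-size images is to resize (or pad) every sample to a fixed size first, so the crop always sees an image at least as large as the crop window. Below is a plain-PyTorch sketch of the idea (resize_then_random_crop is a hypothetical helper; with torchvision the equivalent would be Compose([Resize((256, 256)), RandomCrop(224)])):

```python
import torch
import torch.nn.functional as F

def resize_then_random_crop(img, resize_to=256, crop=224):
    """img: C x H x W tensor; resize to a fixed size, then random-crop."""
    img = F.interpolate(img.unsqueeze(0), size=(resize_to, resize_to),
                        mode="bilinear", align_corners=False).squeeze(0)
    top = torch.randint(0, resize_to - crop + 1, (1,)).item()
    left = torch.randint(0, resize_to - crop + 1, (1,)).item()
    return img[:, top:top + crop, left:left + crop]

# two VOC-like images with different dimensions
a = torch.randn(3, 375, 500)
b = torch.randn(3, 333, 450)
print(resize_then_random_crop(a).shape)  # torch.Size([3, 224, 224])
print(resize_then_random_crop(b).shape)  # torch.Size([3, 224, 224])
```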

Do you have a particular question regarding the linked documentation for RandomCrop?

What do you mean by different dimensions and what is not working?
Do you get any error message using this transformation?

Now I have the following draft code for loading the VOC2007 dataset.

However, this only loads the images themselves; what about the labels for classification purposes?

Could anyone advise how to load both the images and the corresponding labels correctly?

        #os.system("mkdir -p /home/rog/Downloads/AdderNetCuda/dataset/voc2007")
        #os.system("cd /home/rog/Downloads/AdderNetCuda/dataset/voc2007")
        
        #os.system("wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar")
        #os.system("wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar")

        os.system("mkdir -p /home/rog/Downloads/AdderNetCuda/dataset/voc2007/test")
        os.system("mkdir -p /home/rog/Downloads/AdderNetCuda/dataset/voc2007/train")

        os.system("tar -C /home/rog/Downloads/AdderNetCuda/dataset/voc2007/test -xf VOCtest_06-Nov-2007.tar")
        os.system("tar -C /home/rog/Downloads/AdderNetCuda/dataset/voc2007/train -xf VOCtrainval_06-Nov-2007.tar")
    
        train_dataset = datasets.ImageFolder(
            root="/home/rog/Downloads/AdderNetCuda/dataset/voc2007/train/VOCdevkit/VOC2007/JPEGImages", #args.data_dir, 
            transform=transforms.Compose([
                transforms.RandomHorizontalFlip(),
                #transforms.Resize(224),
                transforms.ToTensor(),
                transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                         std=[0.229, 0.224, 0.225])
            ]))
        val_dataset = datasets.ImageFolder(
            root="/home/rog/Downloads/AdderNetCuda/dataset/voc2007/test/VOCdevkit/VOC2007/JPEGImages", #args.data_dir, 
            transform=transforms.Compose([
                #transforms.Resize(224),
                transforms.ToTensor(),
                transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                         std=[0.229, 0.224, 0.225])
            ]))        

    train_loader = torch.utils.data.DataLoader(train_dataset,
        batch_size=args.batch_size, shuffle=True,  # training data is normally shuffled
        num_workers=args.workers, pin_memory=True)
    val_loader = torch.utils.data.DataLoader(val_dataset,
        batch_size=args.batch_size, shuffle=False, num_workers=args.workers, pin_memory=True)
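One caveat with the draft above: ImageFolder derives class labels from subdirectory names, and JPEGImages has no such structure, so every image ends up in a single class. The actual labels live in the Annotations/*.xml files next to JPEGImages. Here is a hedged sketch of turning one annotation into a multi-label classification target (the class list follows the VOC devkit; voc_multilabel_target is an illustrative helper, and recent torchvision versions also provide datasets.VOCDetection, which returns the parsed annotation dict for you):

```python
import xml.etree.ElementTree as ET

# the 20 VOC object classes, in the devkit's alphabetical order
VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car",
    "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike",
    "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]

def voc_multilabel_target(xml_text):
    """Parse one VOC annotation and return a 20-dim 0/1 label list."""
    root = ET.fromstring(xml_text)
    target = [0] * len(VOC_CLASSES)
    for obj in root.iter("object"):
        target[VOC_CLASSES.index(obj.find("name").text)] = 1
    return target

# minimal annotation with two objects, as found in Annotations/*.xml
sample = """<annotation>
  <object><name>dog</name></object>
  <object><name>person</name></object>
</annotation>"""

print(voc_multilabel_target(sample))
# 1 at the 'dog' (11) and 'person' (14) positions, 0 elsewhere
```

A custom Dataset could then pair each JPEG with the target parsed from its matching XML file.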

If you are dealing with target images and would like to apply the same random transformations on both the data and target, you could use the functional API as described in this post.
This would make sure to keep the correspondence between the data and target.
