How to speed up the back-propagation function of AdderNet

How can I speed up the back-propagation function of AdderNet? Training time is extremely long.

Note: More context could be found at https://github.com/huawei-noah/AdderNet/issues/16#issuecomment-627446321

What is slow in there? The CUDA kernels?

I tested the CUDA kernels separately on their own, and they are not slow compared to the CUDA kernel (unoptimized_cuda.cpp) used together with YOLO.

Why ?

Streaming output truncated to the last 5000 lines.
17.9488s  1.6960us              (1 1 1)        (64 1 1)         8        0B        0B         -           -           -           -     Tesla P4 (0)         1         7  _ZN2at6native6modern29vectorized_elementwise_kernelILi4EZZZNS0_16fill_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE2_clEvEUlvE_NS_6detail5ArrayIPcLi1EEEEEviT0_T1_ [6598]
17.9488s  3.5840us              (1 1 1)       (256 1 1)        48        0B        0B         -           -           -           -     Tesla P4 (0)         1         7  _ZN89_GLOBAL__N__65_tmpxft_00002210_00000000_11_DistributionNormal_compute_75_cpp1_ii_ccb2fd7d43distribution_elementwise_grid_stride_kernelIfLi4EZZZN2at6native18normal_kernel_cudaERNS1_14TensorIteratorEddPNS1_9GeneratorEENKUlvE_clEvENKUlvE0_clEvEUlP24curandStatePhilox4_32_10E0_ZNS_27distribution_nullary_kernelIffLi4ENS1_13CUDAGeneratorESB_ZZZNS2_18normal_kernel_cudaES4_ddS6_ENKS7_clEvENKS8_clEvEUlfE_EEvS4_PT2_RKT3_T4_EUlifE_EEviSt4pairImmET1_SF_ [6606]
(the same two kernels, an elementwise fill kernel and a normal-distribution sampling kernel, alternate like this for the remainder of the trace; the final entry at 17.9494s is truncated mid-line)

I’m a bit confused about what you’re doing here.
What are these timings here for?

Also, do you mean that using the kernels by themselves is faster than using them embedded in the rest of the code?
Maybe you're not using the same input sizes? Or not testing the backward kernel?

Yes

But the CUDA kernel on its own is only using CIFAR-10, while YOLO together with the CUDA kernel is using VOC2007.

So these two have different input sizes, right? Could it be that the kernel scales very badly with the input size?
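For a rough sense of scale: the work per adder layer grows linearly with the number of output pixels, so the resolution difference alone could explain a large gap. A back-of-the-envelope sketch (the channel counts and the 448 × 448 YOLO-style input below are illustrative assumptions, not values taken from either model):

```python
# Rough per-layer operation count for an adder-style layer: each output
# position computes an L1 distance over a c_in * k * k input patch,
# i.e. roughly 2 * c_in * k * k subtract/abs/accumulate operations.
def adder_layer_ops(h, w, c_in, c_out, k=3):
    # assumes stride 1 and "same" padding, so the output is also h x w
    return h * w * c_out * (2 * c_in * k * k)

cifar_ops = adder_layer_ops(32, 32, 64, 64)    # CIFAR-10-sized input
voc_ops = adder_layer_ops(448, 448, 64, 64)    # YOLO-style input size

print(voc_ops // cifar_ops)  # -> 196, i.e. (448/32)**2 times more work
```

So even with identical kernels, a 448 × 448 input does roughly 200× the work of a 32 × 32 one per layer; if the backward kernel additionally scales worse than linearly in the number of pixels, the gap would be larger still.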

I am checking where exactly the CUDA kernel calculation bottleneck is.
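In case it helps, PyTorch's built-in autograd profiler can attribute time to individual ops in both the forward and backward pass, which is a quick way to localize the bottleneck before reaching for nvprof. A minimal sketch (the Conv2d here is just a stand-in for the adder layer from the repo; on a GPU you would pass use_cuda=True to also record CUDA kernel times):

```python
import torch

model = torch.nn.Conv2d(3, 16, 3, padding=1)  # stand-in for the adder layer
x = torch.randn(1, 3, 64, 64, requires_grad=True)

# use_cuda=True would additionally record CUDA kernel times on a GPU
with torch.autograd.profiler.profile(use_cuda=False) as prof:
    loss = model(x).sum()
    loss.backward()

# sort by self time to see which op dominates; backward ops appear
# alongside forward ops in the same table
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```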

Do you know how to use the VOC2007 dataset, instead of the CIFAR-10 dataset (whose images are only 32 × 32), with the CUDA kernel on its own?

torchvision.datasets does not have VOC2007.

    if args.dataset == 'cifar10':
        train_dataset = datasets.CIFAR10(
            root=args.data_dir,
            train=True,
            transform=transforms.Compose([
                transforms.RandomHorizontalFlip(),
                transforms.RandomCrop(size=32, padding=4),
                transforms.ToTensor(),
                transforms.Normalize(mean=[0.4914, 0.4822, 0.4465],
                                     std=[0.2023, 0.1994, 0.2010]),
            ]), download=True)
        val_dataset = datasets.CIFAR10(
            root=args.data_dir,
            train=False, 
            transform=transforms.Compose([
                transforms.ToTensor(),
                transforms.Normalize(mean=[0.4914, 0.4822, 0.4465],
                                     std=[0.2023, 0.1994, 0.2010]),
            ]), download=True)
    else:
        train_dataset = datasets.ImageFolder(
            root=args.data_dir, 
            transform=transforms.Compose([
                transforms.RandomHorizontalFlip(),
                transforms.Resize(224),
                transforms.ToTensor(),
                transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                         std=[0.229, 0.224, 0.225])
            ]))
        val_dataset = datasets.ImageFolder(
            root=args.data_dir, 
            transform=transforms.Compose([
                transforms.Resize(224),
                transforms.ToTensor(),
                transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                         std=[0.229, 0.224, 0.225])
            ]))  

What does the transform do?

I mean the RandomCrop

transforms.RandomCrop cannot be applied to the VOC2007 dataset, which consists of images of different dimensions.
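The usual workaround for variable-size images is to resize (or pad) every sample to a fixed size first, so the crop always sees an image at least as large as the crop window. Below is a plain-PyTorch sketch of the idea (resize_then_random_crop is a hypothetical helper; with torchvision the equivalent would be Compose([Resize((256, 256)), RandomCrop(224)])):

```python
import torch
import torch.nn.functional as F

def resize_then_random_crop(img, resize_to=256, crop=224):
    """img: C x H x W tensor; resize to a fixed size, then random-crop."""
    img = F.interpolate(img.unsqueeze(0), size=(resize_to, resize_to),
                        mode="bilinear", align_corners=False).squeeze(0)
    top = torch.randint(0, resize_to - crop + 1, (1,)).item()
    left = torch.randint(0, resize_to - crop + 1, (1,)).item()
    return img[:, top:top + crop, left:left + crop]

# two VOC-like images with different dimensions
a = torch.randn(3, 375, 500)
b = torch.randn(3, 333, 450)
print(resize_then_random_crop(a).shape)  # torch.Size([3, 224, 224])
print(resize_then_random_crop(b).shape)  # torch.Size([3, 224, 224])
```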

Do you have a particular question regarding the linked documentation for RandomCrop?

What do you mean by different dimensions and what is not working?
Do you get any error message using this transformation?

Now I have the following draft code for loading the VOC2007 dataset.

However, this only loads the images themselves; what about the labels for classification purposes?

Could anyone advise how to load both the images and the corresponding labels correctly?

        #os.system("mkdir -p /home/rog/Downloads/AdderNetCuda/dataset/voc2007")
        #os.system("cd /home/rog/Downloads/AdderNetCuda/dataset/voc2007")
        
        #os.system("wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar")
        #os.system("wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar")

        os.system("mkdir -p /home/rog/Downloads/AdderNetCuda/dataset/voc2007/test")
        os.system("mkdir -p /home/rog/Downloads/AdderNetCuda/dataset/voc2007/train")

        os.system("tar -C /home/rog/Downloads/AdderNetCuda/dataset/voc2007/test -xf VOCtest_06-Nov-2007.tar")
        os.system("tar -C /home/rog/Downloads/AdderNetCuda/dataset/voc2007/train -xf VOCtrainval_06-Nov-2007.tar")
    
        train_dataset = datasets.ImageFolder(
            root="/home/rog/Downloads/AdderNetCuda/dataset/voc2007/train/VOCdevkit/VOC2007/JPEGImages", #args.data_dir, 
            transform=transforms.Compose([
                transforms.RandomHorizontalFlip(),
                #transforms.Resize(224),
                transforms.ToTensor(),
                transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                         std=[0.229, 0.224, 0.225])
            ]))
        val_dataset = datasets.ImageFolder(
            root="/home/rog/Downloads/AdderNetCuda/dataset/voc2007/test/VOCdevkit/VOC2007/JPEGImages", #args.data_dir, 
            transform=transforms.Compose([
                #transforms.Resize(224),
                transforms.ToTensor(),
                transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                         std=[0.229, 0.224, 0.225])
            ]))        

    train_loader = torch.utils.data.DataLoader(train_dataset,
        batch_size=args.batch_size, shuffle=True,  # training data is normally shuffled
        num_workers=args.workers, pin_memory=True)
    val_loader = torch.utils.data.DataLoader(val_dataset,
        batch_size=args.batch_size, shuffle=False, num_workers=args.workers, pin_memory=True)
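One caveat with the draft above: ImageFolder derives class labels from subdirectory names, and JPEGImages has no such structure, so every image ends up in a single class. The actual labels live in the Annotations/*.xml files next to JPEGImages. Here is a hedged sketch of turning one annotation into a multi-label classification target (the class list follows the VOC devkit; voc_multilabel_target is an illustrative helper, and recent torchvision versions also provide datasets.VOCDetection, which returns the parsed annotation dict for you):

```python
import xml.etree.ElementTree as ET

# the 20 VOC object classes, in the devkit's alphabetical order
VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car",
    "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike",
    "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]

def voc_multilabel_target(xml_text):
    """Parse one VOC annotation and return a 20-dim 0/1 label list."""
    root = ET.fromstring(xml_text)
    target = [0] * len(VOC_CLASSES)
    for obj in root.iter("object"):
        target[VOC_CLASSES.index(obj.find("name").text)] = 1
    return target

# minimal annotation with two objects, as found in Annotations/*.xml
sample = """<annotation>
  <object><name>dog</name></object>
  <object><name>person</name></object>
</annotation>"""

print(voc_multilabel_target(sample))
# 1 at the 'dog' (11) and 'person' (14) positions, 0 elsewhere
```

A custom Dataset could then pair each JPEG with the target parsed from its matching XML file.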

If you are dealing with target images and would like to apply the same random transformations on both the data and target, you could use the functional API as described in this post.
This would make sure to keep the correspondence between the data and target.
