DataParallel application to non-NN models

Hello everyone,

I have a somewhat complicated question. I am trying to run computationally expensive models on a multi-GPU system with DataParallel. These models are built around a variant of the Kalman filter called the Ensemble Kalman Filter, which I abbreviate as EnKF. I’ve managed to express these models in the Module-based architecture expected by DataParallel, but I’ve run into two issues:

  1. If the model is small enough to fit on one of my GPUs, it is placed entirely on cuda:0, even though PyTorch recognizes the presence of both cuda:0 and cuda:1.
  2. If the model is too big to fit on cuda:0, Python aborts the script with an out-of-memory error instead of spilling over onto cuda:1 as I expected.

The out of memory error is as follows:

RuntimeError: CUDA out of memory. Tried to allocate 3.73 GiB (GPU 0; 3.82 GiB total capacity; 38.00 KiB already allocated; 2.99 GiB free; 2.00 MiB reserved in total by PyTorch)

For reference, cuda:0 is an NVIDIA GeForce GTX 1650, cuda:1 is an NVIDIA GeForce GTX 1070, my CPU is an AMD Ryzen 7 3700X, and I am running CUDA 10.2 with PyTorch 1.6.0.
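
Both GPUs are visible to PyTorch; for completeness, a quick check along these lines is how I confirmed that (the comments on the prints are just what I see on my machine):

import torch as pt

print(pt.cuda.is_available())             # True
print(pt.cuda.device_count())             # 2 on this machine
for i in range(pt.cuda.device_count()):
    print(i, pt.cuda.get_device_name(i))  # GTX 1650 on cuda:0, GTX 1070 on cuda:1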

I suspect these issues stem from my not having correctly adapted my EnKF models to the NN-style structure that DataParallel expects under the hood. My hope is that you can help me diagnose the problem. The most relevant parts of my code are the KF class itself, and the code that wraps the class in DataParallel and executes it. Here is the KF class:

import torch as pt
from torch.nn import Module
from typing import Tuple

class KF(Module):
    def __init__(self, Xens: pt.FloatTensor, x_KF: pt.FloatTensor, 
                 y_KF: pt.FloatTensor, y_True: pt.FloatTensor,
                 F: pt.FloatTensor, G: pt.FloatTensor, H: pt.FloatTensor, 
                 processMeans: pt.FloatTensor, processVars: pt.FloatTensor,
                 outputMeans: pt.FloatTensor, outputVars: pt.FloatTensor,
                 onesmat: pt.FloatTensor, U: pt.FloatTensor, Lt: int):
        super(KF, self).__init__() # call the parent class (nn.Module) constructor so the module is set up correctly
        self.Xens: pt.nn.Parameter = pt.nn.Parameter(Xens, requires_grad=False)
        self.x_KF: pt.nn.Parameter = pt.nn.Parameter(x_KF, requires_grad=False)
        self.y_KF: pt.nn.Parameter = pt.nn.Parameter(y_KF, requires_grad=False)
        self.y_True: pt.nn.Parameter = pt.nn.Parameter(y_True, requires_grad=False)
        self.F: pt.nn.Parameter = pt.nn.Parameter(F, requires_grad=False)
        self.G: pt.nn.Parameter = pt.nn.Parameter(G, requires_grad=False)
        self.H: pt.nn.Parameter = pt.nn.Parameter(H, requires_grad=False)
        self.processMeans: pt.nn.Parameter = pt.nn.Parameter(processMeans, requires_grad=False)
        self.processVars: pt.nn.Parameter = pt.nn.Parameter(processVars, requires_grad=False)
        self.outputMeans: pt.nn.Parameter = pt.nn.Parameter(outputMeans, requires_grad=False)
        self.outputVars: pt.nn.Parameter = pt.nn.Parameter(outputVars, requires_grad=False)
        self.onesmat: pt.nn.Parameter = pt.nn.Parameter(onesmat, requires_grad=False)
        self.U: pt.nn.Parameter = pt.nn.Parameter(U, requires_grad=False)
        self.Lt: int = Lt

    def Propagate(self) -> Tuple[pt.FloatTensor, pt.FloatTensor]:
        N=self.Xens.shape[1]
        for tt in range(1,self.Lt):
            ####### predict
            self.Xens.copy_(self.F @ self.Xens + self.G @ self.U[:,tt:tt+1]\
                            @ self.onesmat\
                                + pt.normal(self.processMeans,self.processVars))
            D=self.y_True[:,tt:tt+1] @ self.onesmat\
                + pt.normal(self.outputMeans,self.outputVars)
            Xens_bar = pt.mean(self.Xens, axis=1)
            Xens_bar = pt.reshape(Xens_bar, (-1, 1))
            Dbar = pt.mean(D, axis=1)
            Dbar = pt.reshape(Dbar, (-1, 1))
            Acov = self.Xens - Xens_bar @ self.onesmat  # ensemble deviations from the mean
            Ad = D - Dbar @ self.onesmat                # observation deviations from the mean
            P = (Acov @ (Acov.T)) / (N - 1)             # sample state covariance
            Rest = (Ad @ (Ad.T)) / (N - 1)              # sample observation covariance

            ####### update
            Rtilde = self.H @ P @ (self.H.T) + Rest
            Rtilde = pt.pinverse(Rtilde)
            Kk = P @ (self.H.T) @ Rtilde
            self.Xens.copy_(self.Xens+Kk @ (D - self.H @ self.Xens))
            Xens_bar=pt.mean(self.Xens,axis=1)
            Xens_bar=pt.reshape(Xens_bar,(-1,1)) 
            self.x_KF[:,tt:tt+1]=Xens_bar
            self.y_KF[:,tt:tt+1]=self.H @ self.x_KF[:,tt:tt+1]
            print('tt = '+str(tt)+' of '+str(self.Lt-1))
        return self.x_KF,self.y_KF
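
For reference, Propagate() implements the standard EnKF predict/update recursion. In the notation of the code, each time step performs (this is just a summary of the linear algebra above):

$$
\begin{aligned}
X_t &= F X_{t-1} + G u_t \mathbf{1}^\top + W_t && \text{(predict ensemble, with process noise } W_t\text{)}\\
D_t &= y_t \mathbf{1}^\top + V_t && \text{(perturbed observations, with output noise } V_t\text{)}\\
P_t &= \tfrac{1}{N-1} A_X A_X^\top, \quad \hat R_t = \tfrac{1}{N-1} A_D A_D^\top && \text{(sample covariances of the deviations)}\\
K_t &= P_t H^\top \big(H P_t H^\top + \hat R_t\big)^{+} && \text{(gain, with the pseudoinverse)}\\
X_t &\leftarrow X_t + K_t (D_t - H X_t) && \text{(update ensemble)}
\end{aligned}
$$

where $A_X$ and $A_D$ are the deviations of the ensemble and of the perturbed observations from their column means, and x_KF[:, tt:tt+1] stores the updated ensemble mean.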

I’ve registered all of the EnKF arrays as parameters with requires_grad=False because I don’t want any gradient-based training machinery modifying their values as the linear algebra executes. However, I’ve also discovered that this setup prevents the use of DistributedDataParallel, which requires at least one parameter with requires_grad=True. If this is a mistake, please let me know.
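
For completeness, the alternative I’ve been wondering about is registering these tensors as buffers rather than parameters, since buffers are also moved by .to() and replicated by DataParallel but are never exposed to an optimizer. A minimal sketch of what that would look like (KF_buffers is just an illustrative name, and only two of the tensors are shown):

class KF_buffers(Module):
    def __init__(self, Xens: pt.FloatTensor, F: pt.FloatTensor, Lt: int):
        super(KF_buffers, self).__init__()
        # register_buffer stores the tensor as module state (moved by .to(),
        # included in state_dict) without making it a trainable parameter
        self.register_buffer('Xens', Xens)
        self.register_buffer('F', F)
        self.Lt: int = Lt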

Here is the code that wraps the models in DataParallel and then executes them:

    ####### KF parameters
    N: int = int(5e8)
    Xens: pt.FloatTensor = pt.zeros(numStates,N)
    x_KF: pt.FloatTensor = pt.zeros(numStates, Lt)
    y_KF: pt.FloatTensor = pt.zeros(numOutputs, Lt)
    onesmat: pt.FloatTensor = pt.ones(1,N)
    
    # device=pt.device("cuda:0")

    t = time.time()

    sys_True: pt.nn.Module = True_sys(x_True, y_True, F, G, H, U, Qmat, Rmat,
                                      processMeans, processVars,
                                      outputMeans, outputVars,
                                      Lt)

    # wrap into DataParallel object (allows access to multiple GPUs)
    device_selection="cuda"
    devices=pt.device(device_selection)
    if "cpu" not in device_selection:
        sys_True = pt.nn.DataParallel(sys_True,
                                   device_ids=list(range(pt.cuda.device_count())))
        sys_True.to(devices) # move to GPUs
        [x_True, y_True] = sys_True.module.Propagate() # run true system


        sys_OL = Nonoise_sys(x_noNoise, y_noNoise, F, G, H, U, Lt)

        # wrap into DataParallel object (allows access to multiple GPUs)
        sys_OL = pt.nn.DataParallel(sys_OL,
                                   device_ids=list(range(pt.cuda.device_count())))
        sys_OL.to(devices) # move to GPUs
        [x_noNoise, y_noNoise] = sys_OL.module.Propagate() #run OL system


        
        sys_KF: pt.nn.Module = KF(Xens, x_KF, y_KF, y_True, F, G, H, processMeans,
                                  processVars, outputMeans, outputVars,
                                  onesmat, U, Lt)

        # wrap into DataParallel object (allows access to multiple GPUs)
        sys_KF = pt.nn.DataParallel(sys_KF, 
                                   device_ids=list(range(pt.cuda.device_count())))
        sys_KF.to(devices) # move to GPUs
        [x_KF, y_KF] = sys_KF.module.Propagate() #run KF
    else:
        [x_True, y_True] = sys_True.Propagate() # run true system
        sys_OL = Nonoise_sys(x_noNoise, y_noNoise, F, G, H, U, Lt)
        [x_noNoise, y_noNoise] = sys_OL.Propagate() #run OL system
        sys_KF: pt.nn.Module = KF(Xens, x_KF, y_KF, y_True, F, G, H, processMeans,
                                  processVars, outputMeans, outputVars,
                                  onesmat, U, Lt)
        [x_KF, y_KF] = sys_KF.Propagate() #run KF

The variable N sets the number of particles (ensemble members) used by the EnKF; I have set it unnecessarily high here to inflate the model size and provoke an out-of-memory error. The classes True_sys() and Nonoise_sys() are structured like KF(), but with simpler linear algebra. To my knowledge, this is the correct setup for getting DataParallel to distribute the linear algebra across both cuda:0 and cuda:1; however, the script fails to do so.
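
To make the size concrete, Xens alone is a numStates x N float32 tensor. Assuming numStates = 2 purely for illustration (my actual value isn’t shown above), the back-of-the-envelope arithmetic is:

# rough size of the Xens ensemble tensor (float32 = 4 bytes per element)
numStates = 2                  # illustrative value only
N = int(5e8)
size_gib = numStates * N * 4 / 2**30
print(f"{size_gib:.2f} GiB")   # ~3.73 GiB, which happens to match the failed allocation in the error above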

My instinct is that I am somehow not properly mimicking how NNs execute their linear algebra in batches. I’m not sure where I’m going wrong with this, or whether this instinct is even correct. My thanks in advance for any clarifications and advice you can provide!
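
For what it’s worth, my (possibly incorrect) mental model of the batched pattern DataParallel expects is something like the sketch below, where the wrapped module’s forward() receives a batch and the first dimension of that batch is split across the GPUs. ToyModel and the shapes are made up purely for illustration:

import torch as pt

class ToyModel(pt.nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.linear = pt.nn.Linear(10, 1)

    def forward(self, x):  # x has shape (batch, 10)
        return self.linear(x)

model = pt.nn.DataParallel(ToyModel(), device_ids=list(range(pt.cuda.device_count())))
model.to(pt.device("cuda"))
out = model(pt.randn(64, 10))  # calling the wrapper splits the 64 rows across the GPUs and runs forward() on each replica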