Sparse dataset and dataloader

I am training a set of feed-forward (FF) networks on tabular data. The input is a scipy sparse matrix. Here’s the relevant code:

    # training: densify the sparse matrix up front and keep it on the GPU
    self.data_tr = TensorDataset(
        torch.tensor(train_csc.toarray(), dtype=torch.float32, device=self.device),
        torch.tensor(train_pd['is_case'].values, dtype=torch.float32, device=self.device)   # labels
    )

    # validation
    self.data_va = TensorDataset(
        torch.tensor(valid_csc.toarray(), dtype=torch.float32, device=self.device),
        torch.tensor(valid_pd['is_case'].values, dtype=torch.float32, device=self.device)   # labels
    )

which I then use in the training loop:

    train_ldr = DataLoader(dataset=self.data_tr, batch_size=param['bs'], shuffle=True)
    for X_mb,y_mb in train_ldr:
        yhat_mb = model(X_mb)
        loss = criterion(yhat_mb[:,0], y_mb)
        ...

The dense array is stored on the GPU and sliced as required, so this runs very fast. Unfortunately, a couple of instances are so big that their dense arrays do not fit in GPU memory (a dense float32 matrix costs 4 bytes per element, i.e. roughly 4 · n_rows · n_cols bytes). For those instances I have the following:

    class SparseDataset(Dataset):

        def __init__(self, mat_csc, label, device='cpu'):
            self.dim = mat_csc.shape
            self.device = torch.device(device)

            # keep only the CSR components as tensors; rows are densified on demand
            csr = mat_csc.tocsr(copy=True)
            self.indptr = torch.tensor(csr.indptr, dtype=torch.int64, device=self.device)
            self.indices = torch.tensor(csr.indices, dtype=torch.int64, device=self.device)
            self.data = torch.tensor(csr.data, dtype=torch.float32, device=self.device)

            self.label = torch.tensor(label, dtype=torch.float32, device=self.device)

        def __len__(self):
            return self.dim[0]

        def __getitem__(self, idx):
            # densify one row: scatter its non-zero values into a zero vector
            obs = torch.zeros((self.dim[1],), dtype=torch.float32, device=self.device)
            ind1, ind2 = self.indptr[idx], self.indptr[idx + 1]
            obs[self.indices[ind1:ind2]] = self.data[ind1:ind2]

            return obs, self.label[idx]
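
A quick way to sanity-check the class on a small throwaway matrix (a hypothetical snippet, not from the original code; `scipy.sparse.random` just generates random sparse data):

    import numpy as np
    from scipy.sparse import random as sparse_random

    m = sparse_random(4, 6, density=0.3, format='csc', dtype=np.float32)
    ds = SparseDataset(m, np.zeros(4, dtype=np.float32))
    obs, _ = ds[2]
    assert np.allclose(obs.numpy(), m.toarray()[2])   # the row densifies correctly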

instantiated as

    self.data_tr = SparseDataset(train_csc, train_pd['is_case'].values, device)
    self.data_va = SparseDataset(valid_csc, valid_pd['is_case'].values, device)

and used as

    train_ldr = DataLoader(dataset=self.data_tr, batch_size=param['bs'], shuffle=True, collate_fn=my_collate)
    for X_mb,y_mb in train_ldr:
        yhat_mb = model(X_mb)
        loss = criterion(yhat_mb[:,0], y_mb)
        ...
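
The `my_collate` function isn’t shown above. Since `__getitem__` already returns dense rows, a minimal stand-in (my assumption of what it does; it is equivalent to the default collate here) would just stack the samples:

    def my_collate(batch):
        # batch is a list of (obs, label) pairs from SparseDataset.__getitem__
        xs = torch.stack([obs for obs, _ in batch])   # (bs, n_features)
        ys = torch.stack([lbl for _, lbl in batch])   # (bs,)
        return xs, ys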

While this is VERY memory efficient, it is about 20 times slower than the first approach, even on my smallest instance (which fits in memory). I am looking for ideas to make this faster. Thx.
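
One likely culprit is that `__getitem__` densifies a single row per call, so every sample pays for a full-width zero allocation, a scatter, and the per-sample DataLoader overhead. A sketch of a batch-at-a-time alternative, assuming the matrix can stay on the CPU as a scipy CSR (the class name `SparseBatchDataset` and the sampler wiring below are illustrative, not from the thread):

    import numpy as np
    import torch
    from torch.utils.data import Dataset, DataLoader, BatchSampler, RandomSampler

    class SparseBatchDataset(Dataset):
        """Map-style dataset meant to be indexed with a list of rows at a time."""

        def __init__(self, mat_csc, label, device='cpu'):
            self.csr = mat_csc.tocsr(copy=True)               # CSR slices rows cheaply
            self.label = np.asarray(label, dtype=np.float32)
            self.device = torch.device(device)

        def __len__(self):
            return self.csr.shape[0]

        def __getitem__(self, idx):
            # idx is a *list* of row indices supplied by the BatchSampler below;
            # densify the whole minibatch in one scipy call, then copy it once
            X = torch.as_tensor(self.csr[idx].toarray(), dtype=torch.float32,
                                device=self.device)
            y = torch.as_tensor(self.label[idx], device=self.device)
            return X, y

    # batch_size=None disables automatic batching, so the DataLoader passes each
    # index list from the BatchSampler straight through to __getitem__
    data_tr = SparseBatchDataset(train_csc, train_pd['is_case'].values, device)
    sampler = BatchSampler(RandomSampler(data_tr), batch_size=param['bs'], drop_last=False)
    train_ldr = DataLoader(dataset=data_tr, sampler=sampler, batch_size=None)

The training loop stays unchanged; the per-batch cost drops to one sparse slice, one densification in scipy’s C code, and one host-to-device copy.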

I’m using a matrix of size 5k x 90k. When I treat this as a dense matrix and run an AE network, each epoch takes around 60-65 sec, but using your sparse-matrix dataloader approach it takes only about 15-16 sec. Looks like it works 🙂 🤔?