Matrix slicing makes GPU training slow?

I have a PyTorch Lightning model for tabular data, a mix of categorical and continuous features.

On Google Colab, switching from CPU to GPU only gives about a 2x speedup. (batch_size=1024, so it's not a small-batch issue.)

I suspect this is because I am doing a lot of matrix slicing, which is reducing locality. Here is a snippet from my forward method:

    def forward(self, x):
        # Embed each one-hot categorical block separately, then concatenate
        xcat = []
        for i, (start, end) in enumerate(self.cat_ranges):
            xcat.append(self.embeddings[i](x[:, start:end]))
        xcat = torch.cat(xcat, 1)
        xcat = self.emb_dropout_layer(xcat)

        # Continuous features occupy the remaining column range
        xcont = x[:, self.cont_range[0]:self.cont_range[1]]

Is this sort of slicing the reason I don't see a bigger speedup on the GPU? If so, how do I resolve it? Should I pre-slice the matrices so they are not views?
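
For context, here is how I'm sanity-checking where the time goes: a rough timing sketch, assuming the model and a sample batch x are already on the GPU (the helper name time_forward is just for illustration). The same harness can be pointed at a copy of the model with the slicing loop stubbed out, to see how much the loop itself costs.

    import time
    import torch

    def time_forward(model, x, n_iters=100):
        with torch.no_grad():
            # Warm-up so one-time CUDA setup costs don't skew the measurement
            for _ in range(10):
                model(x)
            torch.cuda.synchronize()

            start = time.perf_counter()
            for _ in range(n_iters):
                model(x)
            torch.cuda.synchronize()  # flush queued GPU work before stopping the clock
            return (time.perf_counter() - start) / n_iters

    # e.g. compare the full forward against a copy with the slicing loop removed:
    # print(time_forward(model.cuda(), x.cuda()))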

I also notice that the datasets are on the CPU. How can I preload them all onto the GPU for faster training? They will fit in GPU memory.

    for i, (x, y, d) in enumerate(model.train_dataloader()):
        print(x.device)
        print(d.device)
        break

which prints:

    cpu
    cpu

If you don't need to preprocess the data, you could push the tensors to the GPU directly inside your Dataset, or you could create the batch directly from the data via indexing.
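
A minimal sketch of the first option, assuming the full x, y, d tensors fit in GPU memory (the class name GPUTensorDataset is just illustrative):

    import torch
    from torch.utils.data import Dataset, DataLoader

    class GPUTensorDataset(Dataset):
        """Holds the full tensors on the GPU; __getitem__ just indexes into them."""
        def __init__(self, x, y, d, device="cuda"):
            self.x = x.to(device)
            self.y = y.to(device)
            self.d = d.to(device)

        def __len__(self):
            return len(self.x)

        def __getitem__(self, idx):
            return self.x[idx], self.y[idx], self.d[idx]

    # CUDA tensors can't be shared with worker processes or pinned,
    # so keep num_workers=0 and pin_memory=False with a dataset like this.
    # loader = DataLoader(GPUTensorDataset(x, y, d), batch_size=1024, shuffle=True)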

The loop might also slow down your code. Whether you can get rid of it depends, of course, on your use case; I don't know how embeddings is defined and whether you could use a single layer for it.

I am pushing all the data onto the GPU.

I am using indexing (views) on the matrix.

I cannot get rid of the for-loop because I have different-sized one-hot categoricals. (I explain in another post why I need one-hot rather than Longs.) One might have 10 categories, another might have 20, so each has its own embedding matrix.

I am concerned that with slices this small, data locality is lost, and that I should pre-slice the matrix by categorical feature so each one is in its own matrix.
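
For what it's worth, a minimal sketch of that pre-slicing idea, reusing the cat_ranges / cont_range attributes from the forward above with the self. prefix dropped (the names xcat_parts, xcont_part, and idx are just illustrative):

    # Done once up front, right after the full data tensor x is loaded / pushed to the GPU:
    # each categorical block and the continuous block become their own contiguous tensor,
    # so forward() never has to slice columns out of one big matrix per batch.
    xcat_parts = [x[:, start:end].contiguous() for (start, end) in cat_ranges]
    xcont_part = x[:, cont_range[0]:cont_range[1]].contiguous()

    # A training batch is then selected by row indices only, e.g.:
    # xcat = torch.cat([emb(part[idx]) for emb, part in zip(embeddings, xcat_parts)], 1)
    # xcont = xcont_part[idx]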