Parallelize embedding process in the forward method


I’m working on a textual modeling problem. There are three steps in the forward method, as shown below.

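import torch
import torch.nn as nn

# tokenize() and generate_embedding() are helpers that are not shown here.
class Model(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()  # required before registering submodules
        self.layers = nn.ModuleDict()
        # `in` is a reserved word in Python, so the sizes are named explicitly
        self.layers['layer_0'] = nn.Linear(in_features, out_features)
        self.layers['relu_0'] = nn.ReLU()

    def forward(self, inp):
        with torch.no_grad():
            # Step 1: tokenize input
            tokenized_inp = tokenize(inp)
            # Step 2: convert tokenizer output to embeddings on the CUDA device
            x = generate_embedding(tokenized_inp)

        # Step 3: run the network that follows
        for layer_name, layer in self.layers.items():
            x = layer(x)

        return x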

Steps 1 and 2 do not include any trainable parameters. I’m using a Hugging Face tokenizer and embedding model. Can I parallelize these steps before executing step 3, which is the real ‘training’ step of the network?

I want to do this because my GPU memory is not fully utilized. With a batch size of 64, I only use 20% of the GPU memory.

Step 1 tokenizes serially on the CPU, and step 2 moves data from the CPU to CUDA. Both steps scale linearly with the batch size, so if I raise the batch size to use more GPU memory, I lose speed at these two steps.

Is it possible to perform tokenization and embedding in parallel to take advantage of the GPU memory capacity?


I think what you describe is part of the reason that people don’t usually do the tokenization in the nn.Module but move it to the dataset. This allows the DataLoader to tokenize in several worker processes (if you pass num_workers > 0). Then you can increase the batch size.
Doing embeddings on the GPU should be fast, so I would not worry about it too much.
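
To make the suggestion concrete, here is a minimal sketch of tokenizing inside the dataset. The names toy_tokenize, VOCAB, TextDataset, pad_collate and make_loader are made up for illustration, and the whitespace tokenizer is only a stand-in for a real Hugging Face tokenizer, so treat this as one possible layout rather than the way to do it.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader

# Hypothetical stand-in for a real tokenizer: maps whitespace-split words to ids.
VOCAB = {"<pad>": 0, "<unk>": 1, "hello": 2, "world": 3, "gpu": 4}


def toy_tokenize(text):
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


class TextDataset(Dataset):
    def __init__(self, texts):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Tokenization happens here, on the CPU, inside a DataLoader worker
        # when num_workers > 0. The returned tensor stays on the CPU.
        return torch.tensor(toy_tokenize(self.texts[idx]), dtype=torch.long)


def pad_collate(batch):
    # Pad variable-length sequences to the longest one in the batch.
    return pad_sequence(batch, batch_first=True, padding_value=VOCAB["<pad>"])


def make_loader(texts, batch_size=2, num_workers=0):
    return DataLoader(
        TextDataset(texts),
        batch_size=batch_size,
        collate_fn=pad_collate,
        num_workers=num_workers,
    )
```

Calling make_loader(texts, batch_size=256, num_workers=4) would spread the tokenization over four worker processes; on platforms that spawn workers, the script's entry point should sit under an `if __name__ == "__main__":` guard.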

Best regards


That makes sense. Thanks for the suggestion!

DataLoader documentation says the following:

It is generally not recommended to return CUDA tensors in multi-process loading because of many subtleties in using CUDA and sharing CUDA tensors in multiprocessing (see CUDA in multiprocessing). Instead, we recommend using automatic memory pinning (i.e., setting pin_memory=True), which enables fast data transfer to CUDA-enabled GPUs.

Does that mean that if the embeddings are generated as CUDA tensors, I shouldn’t return them from the DataLoader? Would I need to copy them back to the CPU before returning them and set pin_memory=True? It seems redundant to copy to the CPU only to copy back to the GPU.

Yes, so the standard idea is to have the parallelized CPU part (the tokenization) in the dataset, move the tokens to the GPU in the training loop (so after the DataLoader has batched them), and do the embedding in the model on the GPU.
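
A rough sketch of that split, under some assumptions: a frozen nn.Embedding stands in for the pretrained embedding model, random ids stand in for tokenizer output, and the sizes (vocab_size, embed_dim, out_features) are arbitrary. It falls back to the CPU when no GPU is available.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class Model(nn.Module):
    def __init__(self, vocab_size=100, embed_dim=16, out_features=4):
        super().__init__()
        # Assumption: a frozen embedding table replaces the pretrained model.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.embedding.requires_grad_(False)
        self.head = nn.Sequential(nn.Linear(embed_dim, out_features), nn.ReLU())

    def forward(self, token_ids):
        with torch.no_grad():
            # Embedding lookup runs on whatever device the model lives on.
            x = self.embedding(token_ids).mean(dim=1)
        return self.head(x)  # the trainable part


# Pretend these ids came from the tokenizing dataset; they are CPU tensors.
token_ids = torch.randint(0, 100, (8, 5))
loader = DataLoader(
    TensorDataset(token_ids),
    batch_size=4,
    pin_memory=torch.cuda.is_available(),  # pinned memory speeds up host-to-GPU copies
)

model = Model().to(device)
outputs = []
for (batch,) in loader:
    # Only the small integer token tensor crosses from CPU to GPU.
    batch = batch.to(device, non_blocking=True)
    outputs.append(model(batch))
```

With this layout the DataLoader only ever returns CPU tensors, so the warning about CUDA tensors in worker processes does not apply, and nothing is copied back from the GPU.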

Best regards


I see, that makes sense. Thanks