How to get more utilization out of my GPU

I’m using libtorch (C++), developing on Windows, and I want to get more utilization out of my GTX 970.

I gauge my GPU usage with MSI Afterburner, which also has a handy GPU memory usage graph.

My little project is a self-learning network à la AlphaZero, so most of the time is spent generating data rather than training on it. During these “generating” phases I’m mainly just calling forward(), and Afterburner reports a usage of around 35-40%, but when I call loss.backward() and optimizer.step() the usage does spike up to 90%.

I tried to increase my usage by spawning more threads that call forward in parallel, but this didn’t increase my GPU usage. Nor did moving extra copies of the network onto the GPU and calling those separately.

I’m not super familiar with the libtorch library just yet, but is there a way to get more parallelism so that I can use more of my GPU during “generation”?

P.S. I once accidentally ran my .exe twice and my GPU usage was 70%, so is there perhaps a per-process lock that’s limiting my GPU usage?

Are you generating the data in your model’s forward method?
If so, are you generating it directly on the GPU?

Not exactly. I have a function that turns a game board encoded in a char[] into a {4, 15, 15} shaped tensor. In a loop that iterates through many, many different states of the game board, I call my function to get a tensor and then call forward on it. (Every tensor passed to forward has shape {1, 4, 15, 15}.)

I save these char[] arrays for the learning phase after the game, where I recreate the tensors, but now as a single {batchSize, 4, 15, 15} batch; only then do I compare with a target, compute the loss, and step my optimizer.
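For concreteness, the conversion looks roughly like this (the plane encoding below is illustrative, not my exact code):

torch::Tensor ConvertBoard(const char* board)
{
    // Illustrative sketch: a {4, 15, 15} float tensor with one plane per feature.
    torch::Tensor planes = torch::zeros({4, 15, 15});
    auto acc = planes.accessor<float, 3>();
    for (int r = 0; r < 15; ++r)
    {
        for (int c = 0; c < 15; ++c)
        {
            const char cell = board[r * 15 + c];
            if (cell == 'X')      acc[0][r][c] = 1.0f;  // current player's stones
            else if (cell == 'O') acc[1][r][c] = 1.0f;  // opponent's stones
            // planes 2 and 3 hold whatever extra features the network needs
        }
    }
    return planes;  // unsqueezed to {1, 4, 15, 15} right before forward()
}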

But my situation is this: one thread calling forward on board states sequentially gives the same GPU utilization as 4 threads calling forward on different board states in parallel. However, if I have 2 instances of my .exe running, my GPU utilization doubles, so I know there’s headroom on the GPU hardware for more expansion.

I’m not sure how you are using different threads, so you might need to explain this use case a bit more.
However, if your GPU has to wait for the next batch of samples, you would have to speed up the data creation process.

If running two separate processes speeds up your training (or yields a higher GPU utilization), this could point towards a not fully utilized CPU, which is what handles the data creation.

The real code has a few more classes it needs to go through, but the idea is the same.

CustomNetwork network;
network.to(torch::kCUDA);
Game game;
while(game.NotFinished())
{
    int move = SelectMoveFromSearchTree();
    game.PlayMove(move);
    torch::Tensor value = network.forward(ConvertBoard(game.GetBoard()));
    // do stuff with value
}

The above has the same GPU utilization as:

void CalculateGame(Game game, CustomNetwork network, int move);

CustomNetwork network[4];
network[0].to(torch::kCUDA);
// Do this for all networks
Game game1;
Game game2;
Game game3;
Game game4;

int move[4] = Select4BestMoves(game1);

auto thread1 = std::thread(&CalculateGame, game1, network[0], move[0]);
auto thread2 = std::thread(&CalculateGame, game2, network[1], move[1]);
auto thread3 = std::thread(&CalculateGame, game3, network[2], move[2]);
auto thread4 = std::thread(&CalculateGame, game4, network[3], move[3]);

//join all threads

void CalculateGame(Game game, CustomNetwork network, int move)
{
    while (game.NotFinished())
    {
        int move = SelectMoveFromSearchTree();
        game.PlayMove(move);
        torch::Tensor value = network.forward(ConvertBoard(game.GetBoard()));
        // do stuff with value
    }
}

So I’m not sure the CPU is the bottleneck here, because none of my 4 threads are waiting on other threads for data.

I finally found out why I wasn’t getting a lot of utilization from my GPU. It looks like the Nvidia driver has its own scheduler, which effectively serializes parallel calls to its API. I did the experiment below:

void Test()
{
    int batchSize = 1;
    // network is the CustomNetwork instance that was already moved to the GPU
    torch::Tensor randomTensor = torch::rand({batchSize, 4, 15, 15}, torch::kCUDA);
    while (true)
    {
        network->forward(randomTensor);
    }
}

int numThreads = 5;
std::vector<std::thread> threads;
for (int i = 0; i < numThreads; i++)
{
    threads.push_back(std::thread(Test));
}
for (int i = 0; i < numThreads; i++)
{
    threads[i].join();
}

No matter how much I increased my numThreads variable, utilization stayed at 30%, but the second I increased batchSize from 1 to 10, it jumped from 30% to 90%. What’s even more interesting is that it stays at 90% even with numThreads == 1.

This doesn’t help me much, because in my case I’m usually only interested in 1-4 different inputs at a time, but hopefully this sheds some light for any future developers wondering why their GPU utilization might be low.
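For anyone who lands here later, here is a minimal sketch of what batching those few positions into one forward call could look like (candidateGames is a placeholder; ConvertBoard, Game, and network are the same pieces as in the snippets above):

// Evaluate all candidate positions in one forward() call instead of one call each.
std::vector<torch::Tensor> boards;
for (const Game& g : candidateGames)  // e.g. the 1-4 states currently being searched
{
    boards.push_back(ConvertBoard(g.GetBoard()));  // each board tensor is {4, 15, 15}
}

torch::NoGradGuard noGrad;  // no autograd needed during self-play
torch::Tensor batch = torch::stack(boards).to(torch::kCUDA);  // {N, 4, 15, 15}
torch::Tensor values = network->forward(batch);               // one batched launch

for (int64_t i = 0; i < values.size(0); ++i)
{
    torch::Tensor value = values[i];  // per-position result, same as before
    // do stuff with value
}

The catch is that this only helps when the search can actually hand over several positions at once.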

I finally found out why I wasn’t getting a lot of utilization from my GPU. It looks like the Nvidia driver has its own scheduler, which effectively serializes parallel calls to its API.

I would love to know where this is documented and whether it is possible to change this behavior. I’m using PyTorch (Python) and I’ve never been able to get GPU utilization over 30%. I’ve played around with the number of worker threads, batch size, and so on. Nothing helps.

Is there example PyTorch code I could look at that results in ~100% GPU usage, to prove that this is even possible?

Profile your code and check if your workload is e.g. CPU-bound (you should see gaps between the CUDA kernels in the profiler timeline). If so, increasing the batch size would increase the GPU utilization, since the CPU would have more time to run ahead with the scheduling.
I don’t know what @DeuS_CaNoN is referring to regarding the driver.

Profile your code and check if your workload is e.g. CPU-bound (you should see gaps between the CUDA kernels in the profiler timeline).

The main problem ended up being the underlying Dataset, but not how you’d expect. The GPU was waiting on data from the DataLoader, but increasing the number of workers didn’t help. Each worker was invoking torch.randn() to generate dummy data, and it turns out this doesn’t generate data fast enough no matter how many workers you throw at it. Upon replacing the call with torch.cuda.FloatTensor(size).normal_(), GPU usage shot up. I experimented with different Dataset lengths, batch_size, and num_workers. Eventually I was able to increase GPU utilization to 50%, and that is the best I could manage.

The GPU memory (10GB) was the limiting factor on the Dataset length and batch_size I could play with.

The profiler shows that the DataLoader is still the bottleneck, but I can’t think of any other way to improve this further. Can you?

Here is my code:

import os

import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer

GPUS = 1  # assumption: a single 10 GB GPU; the original run defines GPUS elsewhere


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.cuda.FloatTensor(length, size).normal_()

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.batch_size = 4096
        self.feature_size = 512
        self.layer1 = torch.nn.Linear(self.feature_size, self.feature_size)
        self.layer2 = torch.nn.Linear(self.feature_size, self.feature_size)
        self.layer3 = torch.nn.Linear(self.feature_size, 2)

    def forward(self, x):
        x = self.layer1(x)
        x = self.layer2(x)
        return self.layer3(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def optimizer_zero_grad(self, epoch, batch_idx, optimizer, optimizer_idx):
        # Set gradients to None instead of zero to improve performance
        optimizer.zero_grad(set_to_none=True)

    def train_dataloader(self) -> DataLoader:
        return DataLoader(RandomDataset(self.feature_size, self.batch_size * 256),
                          batch_size=self.batch_size,
                          pin_memory=False, num_workers=5, persistent_workers=True)

    def val_dataloader(self) -> DataLoader:
        return DataLoader(RandomDataset(self.feature_size, self.batch_size * 6),
                          batch_size=self.batch_size,
                          pin_memory=False, num_workers=2, persistent_workers=True)

    def test_dataloader(self) -> DataLoader:
        return DataLoader(RandomDataset(self.feature_size, self.feature_size * 10),
                          batch_size=self.batch_size,
                          pin_memory=False, num_workers=2, persistent_workers=True)


def run():
    model = BoringModel()
    # profiler = PyTorchProfiler(with_stack=True)
    profiler = None
    trainer = Trainer(
        gpus=GPUS,
        default_root_dir=os.getcwd(),
        num_sanity_val_steps=0,
        max_epochs=10_000,
        enable_model_summary=False,
        detect_anomaly=False,
        auto_select_gpus=True,
        profiler=profiler
    )
    trainer.fit(model)

if __name__ == "__main__":
    run()