VRAM usage incredibly low and not increasing with batch size or worker count

If I run the following code:

import torch
from torch.utils.data import TensorDataset, DataLoader

def main(batchSize, workerCount):

    # cut out df creation (df is a pandas DataFrame with the columns used below)

    X = df[['NDVI', 'EVI', 'NIR', 'MIR','DayOfYear']].to_numpy()
    y = df[['label']].to_numpy().ravel()
    X = torch.from_numpy(X).float()
    y = torch.from_numpy(y).long()
    dataset = TensorDataset(X, y)
    print("Created dataset")

    dataloader = DataLoader(dataset, batch_size=batchSize, shuffle=True, num_workers=workerCount)
    print("Created dataloader")

    class Net(torch.nn.Module):
        def __init__(self):
            
            super(Net, self).__init__()
            self.fc1 = torch.nn.Linear(5, 256)
            self.fc2 = torch.nn.Linear(256, 128)
            self.fc3 = torch.nn.Linear(128, 64)
            self.fc4 = torch.nn.Linear(64, 32)
            self.fc5 = torch.nn.Linear(32, 10)
        def forward(self, x):
            x = torch.nn.functional.relu(self.fc1(x))
            x = torch.nn.functional.relu(self.fc2(x))
            x = torch.nn.functional.relu(self.fc3(x))
            x = torch.nn.functional.relu(self.fc4(x))
            x = self.fc5(x)
            return x
    print("Created net")

    torch.cuda.init()
    print("Initialized CUDA")

    # create device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Print available VRAM
    print(f'Available VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024 ** 3} GB')
    net = Net().to(device)

    for p in net.parameters():
        # print the name of the device each parameter lives on
        print(torch.cuda.get_device_name(p.device.index))

    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(net.parameters(), lr=0.001)
    print("Model training started")

    for epoch in range(50):
        for i, (inputs, labels) in enumerate(dataloader):
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            # print name of gpu in use
            print(torch.cuda.get_device_name(inputs.device.index))
            print(f'Epoch {epoch + 1} loss: {loss.item()}')
            # vram in use
            print(f'VRAM in use: {torch.cuda.memory_allocated(0) / 1024 ** 3} GB')
    # quick sanity-check prediction on every 10th row of the training data
    X_test, y_test = X[::10], y[::10]
    X_test, y_test = X_test.to(device), y_test.to(device)
    with torch.no_grad():
        y_pred = torch.argmax(net(X_test), dim=1).cpu().numpy()

if __name__ == "__main__":
    main(16384, 0)

it reports

VRAM in use: 0.016548633575439453 GB

with a batch size of 16; with a batch size of 16384 (chosen specifically to be ‘too large’), it reports

VRAM in use: 0.01758432388305664 GB
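
(As a side note on the measurement itself: torch.cuda.memory_allocated only counts memory held by live tensors at that moment. The caching allocator's total reservation and the peak allocation can be printed as well; a minimal sketch using standard torch.cuda queries:)

print(f'allocated:      {torch.cuda.memory_allocated(0) / 1024 ** 3:.4f} GB')   # live tensors
print(f'reserved:       {torch.cuda.memory_reserved(0) / 1024 ** 3:.4f} GB')    # allocator cache
print(f'peak allocated: {torch.cuda.max_memory_allocated(0) / 1024 ** 3:.4f} GB')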

I also don’t see a significant increase when increasing the number of workers. Running with 4 workers does make CUDA usage on the Task Manager performance tab spike to ~40%, but it pauses after each epoch, to the point where I’m not sure it’s any faster at all.
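
(The pause at the end of each epoch may just be the DataLoader tearing down and re-spawning its worker processes every epoch; a minimal sketch of keeping them alive between epochs, assuming PyTorch 1.7+ where persistent_workers is available:)

dataloader = DataLoader(dataset, batch_size=batchSize, shuffle=True,
                        num_workers=workerCount,
                        persistent_workers=(workerCount > 0),  # only valid with num_workers > 0
                        pin_memory=True)                       # speeds up host-to-device copies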

An example debug output for the end of an epoch is

Epoch 1 loss: 1.9087867736816406
VRAM in use: 0.01770639419555664 GB
NVIDIA GeForce RTX 3070 Laptop GPU
Epoch 1 loss: 1.9021888971328735
VRAM in use: 0.01702880859375 GB
The thread 0x5 has exited with code 0 (0x0).
The thread 0x4 has exited with code 0 (0x0).
The thread 0x2 has exited with code 0 (0x0).
The thread 0x3 has exited with code 0 (0x0).

which lines up with a drop to zero in CUDA usage (each drop is the end of an epoch).

(Screenshot: Task Manager showing CUDA usage dropping to zero at the end of each epoch)

The print of torch.cuda.get_device_properties(0).total_memory in the code above reports

Available VRAM: 7.99951171875 GB

so I know my GPU is visible to PyTorch.
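
(For completeness, the same visibility check can be made explicit with the standard torch.cuda queries; a small sketch:)

import torch

print(torch.cuda.is_available())       # True if a usable CUDA device is present
print(torch.cuda.device_count())       # number of visible GPUs
print(torch.version.cuda)              # CUDA version this PyTorch build targets
print(torch.cuda.get_device_name(0))   # e.g. 'NVIDIA GeForce RTX 3070 Laptop GPU'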

I tried increasing the general complexity of my model architecture, but that didn’t help either. The dataset being used has been heavily preprocessed in R; I tried the original, un-preprocessed dataset, but that only increased the usage to

VRAM in use: 0.01770639419555664 GB

Running the unprocessed dataset with a batch size of 16384 and 4 workers (trying my hardest to force a worst-case scenario), the output of nvidia-smi during training is as follows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 528.49       Driver Version: 528.49       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ... WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   59C    P5    19W / 125W |    301MiB /  8192MiB |     14%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      6276      C   ...oeva\anaconda3\python.exe    N/A      |
|    0   N/A  N/A     17164    C+G   ...bbwe\PaintStudio.View.exe    N/A      |
+-----------------------------------------------------------------------------+

I can provide nvidia-smi output for other configurations if needed, but I don’t want to clutter my question even more.

I am out of ideas on how to drag any significant GPU usage out of this code, so any advice would be appreciated. Sorry this is all over the place; I wanted to cover all of my attempted debugging.

How large is the dataset you are using? It could be that the entire dataset fits in memory, and increasing the batch size beyond the size of the dataset would not increase memory usage.
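
(A rough back-of-the-envelope, assuming float32 features and int64 labels as in the code above: each row is 5 * 4 + 8 = 28 bytes, so even the 16384-row batch is well under a megabyte.)

bytes_per_row = 5 * 4 + 8                      # 5 float32 features + 1 int64 label
batch_bytes = 16384 * bytes_per_row
print(f'{batch_bytes / 1024 ** 2:.2f} MiB')    # ~0.44 MiB per batch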

If the entire dataset fits into memory, you could consider copying it in its entirety to GPU memory before creating the TensorDataset to avoid repeatedly copying things from host to device memory for each “batch.”
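
(A minimal sketch of that idea, assuming X_np and y_np are the NumPy arrays taken from the DataFrame in the original code; with CUDA tensors inside the dataset, num_workers should stay at 0, since worker processes are not meant to return CUDA tensors:)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# move the full dataset to the GPU once, up front
X = torch.from_numpy(X_np).float().to(device)
y = torch.from_numpy(y_np).long().to(device)

dataset = TensorDataset(X, y)
dataloader = DataLoader(dataset, batch_size=batchSize, shuffle=True, num_workers=0)

# inside the training loop, inputs and labels are already on the GPU,
# so the per-batch .to(device) calls become no-ops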


Thanks, this was the solution! I was able to store it all in memory.