Hi everyone,
I’m training a model on a g2-standard-32
machine with a single NVIDIA L4 GPU on Google Vertex AI. My dataset consists of 4000 sharded parquet files (Snappy-compressed), each with ~4500 examples, stored in a GCS bucket. I’m using Hugging Face Datasets to stream the data and Hugging Face Trainer for training. My DL framework is PyTorch 1.13.
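For context, the loading side looks roughly like this (the bucket path is a placeholder, and this assumes `gcsfs` is installed so `datasets` can resolve `gs://` URLs):

```python
from datasets import load_dataset

# Placeholder path; the real bucket/prefix differ.
dataset = load_dataset(
    "parquet",
    data_files="gs://my-bucket/train/*.parquet",  # 4000 Snappy-compressed shards
    split="train",
    streaming=True,  # IterableDataset: shards are read lazily over the network
)
```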
Here’s what I have tried:
- Different batch sizes: 512, 1024, 2048, 4096
- Grid search on DataLoader parameters with batch size fixed to 512 (sketch below):
  - `num_workers`: [2, 4, 8, 16]
  - `prefetch_factor`: [2, 4, 8, 16]
Observations:
- `num_workers` improves performance but runs out of memory with more than 8 workers. At 8 workers, GPU utilization is still very low (~10%).
- `prefetch_factor` has minimal impact on GPU utilization.
- `batch_size` has minimal impact on GPU utilization, and overall GPU memory usage is low (the biggest batch size used ~50% of GPU memory).
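For reference, the DataLoader side of this sweep looks roughly like this, shown as a plain `torch.utils.data.DataLoader` for clarity (I set these through the Trainer's dataloader arguments); `pin_memory` and `persistent_workers` are illustrative settings I keep fixed, not part of the grid:

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,                  # the streaming IterableDataset
    batch_size=512,
    num_workers=8,            # swept over [2, 4, 8, 16]; OOMs above 8
    prefetch_factor=4,        # swept over [2, 4, 8, 16]; little effect
    pin_memory=True,          # faster host-to-device copies
    persistent_workers=True,  # keep workers alive between epochs
    collate_fn=collate_fn,
)
```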
Below is the collate function I use:
```python
from typing import Any

import torch

# num1 and num2 are module-level constants defining the per-example shape.

def collate_fn(batch: list[dict[str, Any]]) -> dict[str, torch.Tensor]:
    # Collect the integer class labels into a single 1-D tensor.
    labels = torch.tensor([item["label"] for item in batch], dtype=torch.long)

    # Each feature column is a flat list per example; reshape it to
    # (num1, num2, -1) and stack along a new batch dimension.
    features = {
        col: torch.stack(
            [
                torch.tensor(item[col], dtype=torch.float32).reshape(num1, num2, -1)
                for item in batch
            ]
        )
        for col in ("col1", "col2", "col3", "col4")
    }

    return {"labels": labels, **features}
```
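One variant I have sketched but not yet benchmarked, in case the per-item `torch.tensor` calls on Python lists are the slow path (this batches the conversion through NumPy and converts once per column; same `num1`/`num2` constants):

```python
from typing import Any

import numpy as np
import torch

def collate_fn_np(batch: list[dict[str, Any]]) -> dict[str, torch.Tensor]:
    # Build one contiguous NumPy array per column, then convert to a torch
    # tensor once; torch.tensor() on nested Python lists is much slower.
    out = {
        col: torch.from_numpy(
            np.stack([np.asarray(item[col], dtype=np.float32) for item in batch])
        ).reshape(len(batch), num1, num2, -1)
        for col in ("col1", "col2", "col3", "col4")
    }
    out["labels"] = torch.tensor([item["label"] for item in batch], dtype=torch.long)
    return out
```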
Questions:
- Could this low GPU utilization be due to the dataset sharding strategy or the parquet format with Snappy compression?
- Are there any best practices I could adopt to optimize streaming from cloud storage?
- Is there something specific about my DataLoader settings or collate function that could be causing the bottleneck?
Any advice or recommendations would be greatly appreciated! Let me know if more information is needed.
Thanks!