Hi,
I’m working on video large language models, and my dataset contains videos, some of which are very long (over an hour). I’ve implemented some preprocessing utilities to:
- load paths to videos from json files
- load vision encoders to extract features from video frames
- customize a Dataset subclass to read frames from videos (using the decord library) and preprocess them (resizing, padding, stacking into tensors, etc.) (a rough sketch of this step follows the list)
- customize a data collator function to group samples into batches
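For context, the frame-loading/preprocessing step looks roughly like the sketch below. This is only an illustration (assuming decord with uniform frame sampling and a Hugging Face image processor); the helper name and the sampling strategy are placeholders, not the exact code in the repo.

import numpy as np
import torch
from decord import VideoReader, cpu

def load_and_preprocess(video_path, image_processor, num_frames=32):
    # Illustrative sketch: read frames with decord, then let the vision
    # encoder's image processor resize/pad them and stack them into a tensor.
    vr = VideoReader(video_path, ctx=cpu(0))
    indices = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    frames = list(vr.get_batch(indices).asnumpy())  # list of (H, W, C) arrays
    pixel_values = image_processor(images=frames, return_tensors="pt")["pixel_values"]
    return pixel_values  # e.g. (num_frames, 3, 384, 384)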
The main script I run is as follows:
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    ...
    args = parser.parse_args()

    # mp.set_start_method('spawn')
    cambrianConfig = CambrianConfig.from_json_file(args.config_file)
    processor = CambrianEncoders(cambrianConfig)
    image_processors = []
    if not processor.vision_tower_aux_list[0].is_loaded:
        processor.vision_tower_aux_list[0].load_model()
    image_processors.append(processor.vision_tower_aux_list[0].image_processor)

    folder_paths: List[str] = args.folders
    data_tensor = dict()
    entube_dataset = EnTubeDataset(folder_paths, image_processors)
    dataloader = DataLoader(
        entube_dataset,
        batch_size=1,
        collate_fn=collate_fn,
        num_workers=0  # I've tried different values for this param
    )

    for batch_idx, (videos, image_sizes) in enumerate(dataloader):
        print(f"Processing batch {batch_idx + 1}/{len(dataloader)}")
        assert isinstance(videos, list), "List of videos features for each processor (vision encoder)"
        assert isinstance(videos[0], list) or isinstance(videos[0], torch.Tensor), "List of videos in the batch"
        image_aux_features_list = processor.prepare_mm_features(videos, image_sizes)
        for i, image_aux_features in enumerate(image_aux_features_list):
            print(f"In main(): image_aux_features[{i}].shape={image_aux_features.shape}")
If needed, my short repo containing all the code is here: GitHub - tcm03/EnTube_preprocessing: Preprocessing steps for EnTube before model running
The context in which the error occurs is as follows:
- When I run the code, the Dataset subclass loads and preprocesses the video frames successfully, and returns them to the data loader:
class EnTubeDataset(Dataset):

    def __init__(
        self,
        folder_paths: List[str],
        image_processors: List[BaseImageProcessor],
    ) -> None:
        self.file_paths = []
        self.image_processors = image_processors
        for folder_path in folder_paths:
            file_names = os.listdir(folder_path)
            for file_name in file_names:
                file_path = os.path.join(folder_path, file_name)
                self.file_paths.append(file_path)

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        print(f'@tcm: In EnTubeDataset.__getitem__(): idx={idx}')
        video, image_size = process_video_frames(self.file_paths[idx], self.image_processors)
        return video, image_size
- In the collate function, I reorganize the list of tensors produced by the preprocessing step and transfer each tensor to CUDA using .to(device='cuda'):
def collate_fn(batch):
    """
    batch: list of samples from EnTubeDataset.__getitem__()
    """
    assert isinstance(batch, list)
    assert isinstance(batch[0], tuple)
    print('collate_fn')
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    image_sizes = batch[0][1]
    batch_videos = [video for video, _ in batch]
    # batch_videos = [[video.to(device) for video in videos] for videos in zip(*batch_videos)]
    tmp_batch_videos = []
    for i, videos in enumerate(zip(*batch_videos)):
        print(f'processor {i}')
        tmp = []
        for j, video in enumerate(videos):
            print(f'video {j} shape: {video.shape}')
            video = video.to(device)  # the stall stems from this line, no more code is executed afterwards
            tmp.append(video)
        tmp_batch_videos.append(tmp)
    batch_videos = tmp_batch_videos

    return batch_videos, image_sizes
This is the end of the log produced:
...
collate_fn
processor 0
video 0 shape: torch.Size([1, 1140, 3, 384, 384])
As you can see, once execution reaches the line video = video.to(device), the running cell keeps running forever and never finishes, and the next iterations of the loops never happen, so the problem lies in that first .to(device) call. The full log is available in this notebook: EnTube_preprocessing | Kaggle
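To pin down exactly where the process is stuck, a stack dump can also be requested periodically with Python's built-in faulthandler (this is not in the notebook; it's just a minimal sketch of the check):

import faulthandler

# Dump the traceback of every thread every 60 seconds; if the process hangs
# inside .to(device), the dump shows the exact frame it is blocked in.
faulthandler.dump_traceback_later(timeout=60, repeat=True)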
I tried a few things to track down the cause of this problem:
- When I change the num_workers argument of the DataLoader, strange errors occur, such as:
RuntimeError: DataLoader worker (pid 5678) is killed by signal: Illegal instruction
RuntimeError: DataLoader worker (pid 8928) is killed by signal: Segmentation fault.
- I did suspect an out-of-memory issue, but if you run the notebook above, there is no GPU memory usage at all (see the sketch after this list for one way to check this).
- To verify that transferring tensors to the GPU works in general, I transferred a random tensor of a similar shape, and it worked fine:
import torch

def test_gpu_transfer():
    try:
        print("Testing GPU transfer with a small tensor...")
        test_tensor = torch.rand((1, 1334, 3, 384, 384), dtype=torch.float16)
        test_tensor = test_tensor.to("cuda")
        print(f"Test tensor successfully transferred to: {test_tensor.device}")
    except Exception as e:
        print(f"Error during test transfer: {e}")

test_gpu_transfer()  # this runs successfully
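For completeness, this is one way to confirm there is no GPU memory pressure around the stalled call (the helper below is just a sketch to sprinkle around the .to(device) line, not code from the notebook):

import torch

def report_gpu_memory(tag: str = "") -> None:
    # Print how much GPU memory PyTorch has allocated and reserved so far.
    allocated = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    print(f"[{tag}] allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")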
Therefore, I’m confused about why the transfer of tensors to CUDA in my dataset pipeline doesn’t work. In the notebook, no errors are reported; execution simply stalls after .to(device).
You can reproduce the scenario above by copying and running my notebook: EnTube_preprocessing | Kaggle
I’ve tested the same code on Colab and the result is the same. I’ve tried many ideas to solve this issue without success, so I’d really appreciate your help!
Thanks in advance.