Efficient shuffling buffer in pinned memory

Hello everyone,

I have a question about efficiently implementing a shuffling buffer during data loading. I am working with video data and use NVIDIA DALI to load and transform the data efficiently on the GPU. However, since my data is read sequentially, I additionally need to implement some sort of shuffling buffer. To avoid filling up GPU memory, this buffer must reside on the CPU side.

The data flow therefore is:

  1. Process samples on the GPU
  2. Put samples into a large buffer residing on the CPU (device change #1)
  3. Randomly pick samples from the buffer and transfer them back to the GPU (device change #2)

I would guess the most efficient way to implement such a structure is to use pinned memory and non-blocking device transfers as much as possible. Sample code:

...
data_pipe = make_data_pipe(...)
buffer = []
...
for sample in data_pipe:
    # copy into an explicitly pinned CPU tensor so the device-to-host
    # copy can actually run asynchronously (a non-blocking copy
    # requires the destination to be pinned)
    pinned = torch.empty(sample.shape, dtype=sample.dtype, pin_memory=True)
    pinned.copy_(sample, non_blocking=True)
    # needs synchronization here?
    # torch.cuda.current_stream().synchronize()
    buffer.append(pinned)
    if len(buffer) >= max_buffer_size:
        sample = randomly_pick_from_buffer(buffer)
        sample = sample.to('cuda', non_blocking=True)
        # or needs synchronization here?
        # torch.cuda.current_stream().synchronize()
        yield sample

  1. Am I correct in assuming that I need to synchronize the current CUDA stream when transferring data from CUDA to (pinned) CPU memory in a non-blocking way?
  2. If yes, is there a way to synchronize only the sample that is actually picked from the buffer (see the sketch after this list)? The buffer is pretty large, so there should be no need to synchronize every transfer into the buffer, as a sample might sit there for quite a while before it is drawn. Synchronizing every transfer on each iteration therefore seems inefficient. Or is my general view of the CUDA semantics entirely wrong?
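
To make question 2 concrete, below is a sketch of the per-sample synchronization I have in mind, using one torch.cuda.Event per buffered sample. The shuffled generator and its bookkeeping are just my assumption of how this could look, not an established pattern:

import random
import torch

def shuffled(data_pipe, max_buffer_size):
    """Shuffling buffer in pinned CPU memory; each device-to-host copy
    is synchronized only when its sample is actually drawn."""
    buffer = []  # list of (pinned_tensor, cuda_event) pairs
    for sample in data_pipe:
        pinned = torch.empty(sample.shape, dtype=sample.dtype, pin_memory=True)
        pinned.copy_(sample, non_blocking=True)
        # record an event right after enqueueing the copy; it completes
        # once the copy has finished on the current stream
        event = torch.cuda.Event()
        event.record()
        buffer.append((pinned, event))
        if len(buffer) >= max_buffer_size:
            idx = random.randrange(len(buffer))
            pinned, event = buffer.pop(idx)
            # wait only for this sample's copy, not for the whole buffer
            event.synchronize()
            yield pinned.to('cuda', non_blocking=True)

The idea would be that event.synchronize() blocks only on the one copy that produced the drawn sample, while all other transfers stay in flight. Is this a sensible use of CUDA events, or is there a better mechanism?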

I would really appreciate some clarification and help!