Manually moving data to the GPU becomes much slower when using multiprocessing. How can I solve this?

I have a large tabular dataset with 300,000,000 samples and 400 features. I first copy all the data into RAM, which takes about 500 GB, and then sequentially move it to the GPU batch by batch for model training.

The problem is that training one batch takes much less time than moving one batch from RAM to the GPU. This means it will not help to simply use a separate thread and a queue, where the background thread keeps copying batches to the GPU and the main thread pulls them from the queue for training (refer to How do I copy data to GPU in parallel?). A sketch of that thread-plus-queue pattern is shown below for reference.
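(This is only an illustrative sketch of that pattern; the batch shape and the prefetch_batches / train_step names are placeholders, not my real code.)

import queue
import threading

import numpy as np
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
NUM_BATCHES = 20

def prefetch_batches(q):
    # Background thread: build each batch and copy it to the GPU.
    for _ in range(NUM_BATCHES):
        x = np.random.normal(0, 1, size=(10, 400)).astype(np.float32)
        q.put(torch.from_numpy(x).to(device))
    q.put(None)  # sentinel: no more batches

q = queue.Queue(maxsize=4)
threading.Thread(target=prefetch_batches, args=(q,), daemon=True).start()

while True:
    batch = q.get()
    if batch is None:
        break
    # train_step(batch)  # main thread consumes batches for training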

So my idea is to use multiprocessing to build the batches and move them to the GPU with 10+ cores. I simulated this approach with the code below and found that it does not work as expected. Could you please help me make it work?

import ctypes
import os
import time

os.environ["CUDA_VISIBLE_DEVICES"] = '0'

import numpy as np
import torch
import torch.multiprocessing as mp

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

# Flatten the data and wrap it in a multiprocessing Array, intended to share
# it between processes without copying.
train_df_np = np.random.normal(0, 1, size=(799200, 400)).astype(np.float32)
train_df_np = np.ravel(train_df_np)
train_df_np = mp.Array(ctypes.c_float, train_df_np)

def move_data(df_np):
    # Build a small batch (10 slices of 400 values) from the shared array
    # and copy it to the GPU.
    x = np.array([df_np[i:i + 400] for i in range(10)])
    inputs = torch.from_numpy(x).to(device)
    return inputs


if __name__ == "__main__":
    # Parallel version: takes ~90 s
    ctx = mp.get_context('spawn')
    pool = ctx.Pool(10)
    time_start = time.time()
    data = {}
    for i in range(20):
        # args must be a one-element tuple, hence the trailing comma
        res = pool.apply_async(move_data, args=(train_df_np,))
        data[i] = res  # AsyncResult handles; call .get() to retrieve the tensors
    pool.close()
    pool.join()
    time_end = time.time()
    print('time cost', time_end - time_start, 's')

    # Sequential version: takes ~2 s
    time_start = time.time()
    data = {}
    for i in range(20):
        res = move_data(train_df_np)
        data[i] = res
    time_end = time.time()
    print('time cost', time_end - time_start, 's')

If your goal is asynchronous (non-blocking) copies, have you looked into using pinned memory instead? (E.g., check the details of the non_blocking keyword in torch.Tensor.to — PyTorch 2.0 documentation.)

Note that you may need to make some modifications to use pinned memory and to do your numpy conversion before the actual data movement (I would also verify that these steps are not the actual bottleneck in your code).
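Roughly, the pattern would look like the sketch below. This is only an illustration under assumed shapes (the 10x400 batch is made up): the numpy-to-tensor conversion and pin_memory() happen first, and non_blocking=True then lets the host-to-device copy overlap with other work.

import numpy as np
import torch

device = torch.device('cuda')

# Do the numpy -> CPU tensor conversion first ...
batch_np = np.random.normal(0, 1, size=(10, 400)).astype(np.float32)
cpu_batch = torch.from_numpy(batch_np)

# ... then pin the host memory so the copy can use an asynchronous DMA transfer.
cpu_batch = cpu_batch.pin_memory()

# With pinned memory, non_blocking=True queues the host-to-device copy and
# returns immediately, so it can overlap with other CPU/GPU work.
gpu_batch = cpu_batch.to(device, non_blocking=True)

# Synchronize before timing or otherwise depending on the copied data.
torch.cuda.synchronize()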