I have a large tabular dataset with sampling number of 300,000,000 and dimension of 400. I first copy all data into RAM which costs about 500GB, and then sequentially move data batch by batch to GPU for the model training.
The problem is that training one batch takes much less time than moving one batch from RAM to GPU, so it means that it will not work if we simply use a separate thread and queue to asynchronously keep saving the batch data moved to GPU and get data from the queue for the model training in the main thread (refer to How do I copy data to GPU in parallel?)
So my solution is that I use multiprocessing to get the batch and move them to GPU with 10+ cores. I simulate this solution and find it will not work. Could you please make it work?
import ctypes
import torch.multiprocessing as mp
import numpy as np
import os
os.environ["CUDA_VISIBLE_DEVICES"]= '0'
import time
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
train_df_np = np.random.normal(0, 1, size=(799200, 400)).astype(np.float32)
train_df_np = np.ravel(train_df_np)
train_df_np = mp.Array(ctypes.c_float, train_df_np) # **share data between different cores without copying**
def move_data(df_np):
x = np.array([df_np[i:i+400] for i in range(10)])
inputs = torch.from_numpy(x).to(device)
return inputs
if __name__ == "__main__":
#**parallel takes 90s**
ctx = mp.get_context('spawn')
pool = ctx.Pool(10)
time_start=time.time()
data = {}
for i in range(20):
res = pool.apply_async(move_data, args=(train_df_np))
data[i] = res
pool.close()
pool.join()
time_end=time.time()
print('time cost',time_end-time_start,'s') #
#**sequential takes 2s**
time_start=time.time()
data = {}
for i in range(20):
res = move_data(train_df_np)
data[i] = res
time_end=time.time()
print('time cost',time_end-time_start,'s')