Hi.
I’m training a model with two parts: part A randomly generates data, and part B consumes it. Part B reuses the generated data for several epochs, after which part A regenerates it. The pseudocode is as follows:
for epoch in range(n_epochs):
    if epoch % 10 == 0:
        dataset = partA()  # part A regenerates the data every 10 epochs
    for i, (data, label) in enumerate(dataset):
        partB(data)  # training part B
The dataset generated by part A is pretty large (2,000,000×16×16×16 in total), so I think I need to save it somewhere. I tried json/ujson. Is there a faster way to do this?
Thanks!
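For scale, 2,000,000×16×16×16 float32 values come to roughly 33 GB, and JSON must encode every number as text, which is slow and inflates the file. A minimal sketch of a binary alternative, assuming the generated data is a single float32 NumPy array (the array size and file name here are small illustrative stand-ins):

import numpy as np

# Small stand-in for the real 2,000,000 x 16 x 16 x 16 array.
data = np.random.rand(10_000, 16, 16, 16).astype(np.float32)

np.save("partA_data.npy", data)  # writes raw float32 bytes plus a small header
loaded = np.load("partA_data.npy", mmap_mode="r")  # memory-map instead of reading everything into RAM
print(loaded.shape, loaded.dtype)  # (10000, 16, 16, 16) float32

torch.save and torch.load work the same way for tensors.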
I think a DataLoader can serve here, since it is stream-based: it loads objects one batch at a time via next(iterator).
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

num_epochs = 20

class partA(Dataset):
    def __init__(self):
        # data loading
        xy = np.loadtxt('file.csv', delimiter=",", dtype=np.float32, skiprows=1)
        self.x = torch.from_numpy(xy[:, 1:])
        self.y = torch.from_numpy(xy[:, [0]])
        self.n_samples = xy.shape[0]

    def __getitem__(self, index):
        # dataset[0]
        return self.x[index], self.y[index]

    def __len__(self):
        return self.n_samples

dataset = partA()
dataloader = DataLoader(dataset=dataset, batch_size=40, shuffle=True, num_workers=2)
n_iterations = len(dataloader)  # number of batches per epoch

for epoch in range(num_epochs):
    for i, (inputs, labels) in enumerate(dataloader):
        # forward, backward, update weights here
        if (i + 1) % 10 == 0:
            print(f'epoch {epoch+1}/{num_epochs}, step {i+1}/{n_iterations}, inputs {inputs.shape}')
Apologies if I misunderstood your question. I am fairly new to PyTorch.
I’m sorry, but that is not the case. Note that dataset is dynamically generated by part A during training; it is not fixed. In my description, dataset changes its data every 10 epochs, whereas in your code dataset is unchanged.
Thanks for your reply!
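For illustration, a minimal sketch of one way to keep the DataLoader pattern while still regenerating the data every 10 epochs: rebuild the dataset and DataLoader inside the epoch loop whenever the data should be refreshed (generate_data below is a hypothetical stand-in for part A, with made-up shapes).

import torch
from torch.utils.data import TensorDataset, DataLoader

n_epochs = 20

def generate_data():
    # Hypothetical stand-in for part A: returns freshly generated data and labels.
    data = torch.rand(1000, 16, 16, 16)   # small stand-in for the 2,000,000-sample array
    labels = torch.randint(0, 2, (1000,))
    return TensorDataset(data, labels)

dataloader = None
for epoch in range(n_epochs):
    if epoch % 10 == 0:
        # part A regenerates the data; a new DataLoader then streams the new tensors
        dataloader = DataLoader(generate_data(), batch_size=40, shuffle=True)
    for i, (data, label) in enumerate(dataloader):
        pass  # partB(data): forward/backward/update weights here

Since the freshly generated tensors are handed straight to a TensorDataset, this also sidesteps the JSON round-trip entirely unless the data must persist on disk between runs.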