Hi.

I’m training a model with two parts: part A randomly generates data, and part B consumes it. Part B reuses the generated data several times, after which part A regenerates it. The pseudocode is as follows:

```
for epoch in range(n_epochs):
    if epoch % 10 == 0:
        dataset = partA()  # regenerate the data every 10 epochs
    for i, (data, label) in enumerate(dataset):
        partB(data)  # training partB
```

The dataset generated by part A is pretty large (2,000,000×16×16×16 in total), so I think I need to save it somewhere. I tried json/ujson. Is there any way to do this faster?
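For arrays of this size, a binary format is typically much faster than JSON, which serializes every number as text. A minimal sketch of the idea, using a small stand-in tensor in place of the real 2,000,000×16×16×16 dataset:

```python
import numpy as np
import torch

# Small stand-in for the generated dataset (the real one is 2,000,000x16x16x16)
data = torch.randn(100, 16, 16, 16)

# torch.save writes the raw float32 bytes instead of a text representation
torch.save(data, 'dataset.pt')
loaded = torch.load('dataset.pt')
assert torch.equal(data, loaded)

# np.save does the same for plain NumPy arrays
np.save('dataset.npy', data.numpy())
loaded_np = np.load('dataset.npy')
assert np.array_equal(data.numpy(), loaded_np)
```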

Thanks!

I think a DataLoader can serve here, since it does stream-based loading: it loads objects one batch at a time via next(iterator).
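To illustrate that streaming access pattern (a toy sketch; `TensorDataset` here is just a stand-in for a real dataset):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 10 samples of (feature, label)
ds = TensorDataset(torch.arange(10).float().unsqueeze(1), torch.arange(10))
loader = DataLoader(ds, batch_size=4)

it = iter(loader)
features, labels = next(it)  # pulls one batch at a time from the stream
print(features.shape)        # torch.Size([4, 1])
```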

```
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

num_epochs = 20

class partA(Dataset):
    def __init__(self):
        # data loading
        xy = np.loadtxt('file.csv', delimiter=",", dtype=np.float32, skiprows=1)
        self.x = torch.from_numpy(xy[:, 1:])
        self.y = torch.from_numpy(xy[:, [0]])
        self.n_samples = xy.shape[0]

    def __getitem__(self, index):
        # dataset[index]
        return self.x[index], self.y[index]

    def __len__(self):
        return self.n_samples

dataset = partA()
dataloader = DataLoader(dataset=dataset, batch_size=40, shuffle=True, num_workers=2)
n_iterations = len(dataloader)

for epoch in range(num_epochs):
    for i, (inputs, labels) in enumerate(dataloader):
        # forward, backward, update weights here
        if (i + 1) % 10 == 0:
            print(f'epoch {epoch+1}/{num_epochs}, step {i+1}/{n_iterations}, inputs {inputs.shape}')
```

Apologies if I misunderstood your question; I am quite new to PyTorch.

I’m sorry, but that is not the case.

Note that `dataset` is dynamically generated by part A **during training**; it is not fixed. In my description, `dataset` changes its data every 10 epochs. In your code, however, `dataset` is unchanged.
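In case it helps, one way to reflect that in code is to rebuild the dataset (and its DataLoader) every 10 epochs inside the training loop. A minimal sketch, with a hypothetical `PartADataset` standing in for the real generator:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PartADataset(Dataset):
    """Hypothetical stand-in for part A: regenerates random data on construction."""
    def __init__(self, n_samples=100):
        self.data = torch.randn(n_samples, 16, 16, 16)
        self.labels = torch.randint(0, 2, (n_samples,))

    def __getitem__(self, index):
        return self.data[index], self.labels[index]

    def __len__(self):
        return len(self.data)

n_epochs = 20
dataloader = None
for epoch in range(n_epochs):
    if epoch % 10 == 0:
        # part A regenerates the data; a fresh DataLoader wraps the new dataset
        dataloader = DataLoader(PartADataset(), batch_size=40, shuffle=True)
    for i, (inputs, labels) in enumerate(dataloader):
        pass  # forward/backward for part B here
```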

Thanks for your reply!