Why does training work with a small dataset but shut down with a big one?

:bug: Bug

I am training a BAM (Bottleneck Attention Module, source code: https://github.com/Jongchan/attention-module) on my own data, and I wrote a custom dataset as follows:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

import kaldi_io


class TorchDataset(Dataset):
    def __init__(self, utt2member, feats_scp, resize_height=256, resize_width=300, repeat=1):
        # image_label_list: e.g. 1020 entries like ('BAC009S0002W0122', [0])
        self.image_label_list = self.read_file_audio(utt2member)

        self.feats = feats_scp
        # map uttid -> ark path/offset taken from feats.scp
        self.arklist = self.getArklist(feats_scp)

        self.len = len(self.image_label_list)
        self.repeat = repeat
        self.resize_height = resize_height
        self.resize_width = resize_width

    def __getitem__(self, i):
        index = i % self.len

        uttid, label = self.image_label_list[index]
        audio = self.load_data_audio(uttid, self.resize_height, self.resize_width, normalization=False)
        audio = torch.from_numpy(audio)
        audio = torch.unsqueeze(audio, 0)
        label = np.array(label)
        label = torch.from_numpy(label)
        return (audio, label)

    def __len__(self):
        if self.repeat is None:
            data_len = 10000000
        else:
            data_len = len(self.image_label_list) * self.repeat
        return data_len

    # filename = utt2spk_no: each line is "<uttid> <spkid>"
    def read_file_audio(self, filename):
        image_label_list = []
        with open(filename, "r") as f:
            lines = f.readlines()
            for line in lines:
                target = []
                audio_name = line.split()[0]
                spkid = line.split()[1]
                target.append(int(spkid))
                image_label_list.append((audio_name, target))
        return image_label_list

    # organize feats.scp into a dictionary: uttid -> ark path
    def getArklist(self, scp_path):
        dicts = {}
        with open(scp_path) as f:
            for line in f:
                items = line.split()
                dicts[items[0]] = items[1]
        return dicts

    def load_data_audio(self, uttid, resize_height, resize_width, normalization):
        # get the ark path for this uttid and read its feature matrix
        arkpath = self.arklist.get(uttid)
        print("load_data_audio:::", uttid)
        mat = kaldi_io.read_mat(arkpath)

        mat = self.resize(mat, resize_width)
        return mat

    # tile the utterance along the time axis until it has at least
    # resize_width frames, then truncate to exactly resize_width frames
    def resize(self, input_tensor, resize_width):
        shape = input_tensor.shape
        width = shape[0]
        multi_width = resize_width // width
        mat = input_tensor

        if multi_width > 0:
            dump = multi_width + 1
            for i in range(0, dump):
                mat = np.concatenate((mat, input_tensor), axis=0)

        mat = mat[:resize_width, :]
        return mat
```
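
For reference, a minimal sketch of how the dataset is constructed and sanity-checked on one item; the two paths below are only placeholders for my utt2member list and Kaldi feats.scp, not the exact files:

```python
# Placeholder paths: substitute the real utt2member ("uttid spkid" per line)
# and feats.scp files from the data directory.
dataset = TorchDataset("data/speaker_model/utt2member",
                       "data/speaker_model/feats.scp",
                       resize_height=256, resize_width=300, repeat=1)

audio, label = dataset[0]
print(len(dataset))   # number of items = len(image_label_list) * repeat
print(audio.shape)    # (1, 300, feat_dim) after tiling/truncation to 300 frames
print(label)          # speaker id tensor, e.g. tensor([0])
```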

I modified two lines in train_image.py as follows:

```python
val_dataset = prepare_data.getEvalDataSet()
train_dataset = prepare_data.getDataSet()

train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
    num_workers=args.workers, pin_memory=False, sampler=train_sampler, drop_last=True)
val_loader = torch.utils.data.DataLoader(
    val_dataset, batch_size=args.batch_size, shuffle=False,
    num_workers=args.workers, pin_memory=False, drop_last=True)
```

called by:

```bash
python train_audio.py \
    --ngpu 2 \
    --workers 10 \
    --arch resnet --depth 50 \
    --epochs 100 \
    --batch-size 64 --lr 0.001 \
    --att-type BAM \
    --prefix RESNET50_IMAGENET_BAM \
    data/speaker_model/
```
When the dataset has about 500,000 audios, the program shuts down. It always shuts down at that scale, whether I make batch_size smaller or bigger, or set num_workers to 0 or higher. With a smaller dataset of about 110,000 audios it works fine. Please help me, I have been stuck on this problem for two days. Thanks very much!
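
To get a concrete traceback, a minimal sketch like the following (assuming the `train_dataset` built above) iterates the dataset in the main process with no DataLoader workers and catches any exception, so a failure in `kaldi_io` or `resize` is printed instead of the job silently disappearing:

```python
import traceback

# Minimal sketch: walk the dataset in the main process (no workers) so any
# exception raised by kaldi_io, resize, or a bad scp entry is printed rather
# than hidden inside a DataLoader worker process.
try:
    for idx in range(len(train_dataset)):
        audio, label = train_dataset[idx]
        if idx % 10000 == 0:
            print(idx, audio.shape, label)
except Exception:
    traceback.print_exc()
    raise
```

If every item loads cleanly, the shutdown is more likely the OS killing the process for memory; on Linux, `dmesg` usually shows an out-of-memory kill in that case.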

Hi,

What is the error you’re seeing?