EDIT: It was fixed by simply increasing the limit of open files (although this was one of the first solutions I tried, I probably didn't try hard enough the first time). It was suggested in this GitHub issue: https://github.com/pytorch/pytorch/issues/14768
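For anyone who lands here with the same error, a minimal sketch of raising the limit from inside the script (Linux-specific; roughly equivalent to running ulimit -n in the shell before launching):

import resource

# Raise the soft limit on open file descriptors up to the current hard limit
# (Linux-specific; run this before the DataLoader workers are created).
soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard_limit, hard_limit))

Another workaround that comes up in similar threads is calling torch.multiprocessing.set_sharing_strategy('file_system') at the start of the script, but increasing the descriptor limit was enough in my case.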
Hello,
I've been stuck for the past several days on a very strange and annoying error with the dataloader/dataset I use in my project. The error occurs when using my custom Dataset with a default DataLoader and 1 or more workers; it does not happen when the number of workers is set to 0 (which makes it even harder to investigate, because the error output specifically states "Rerunning with num_workers=0 may give better error trace").
At the beginning I thought the problem was caused by using SQLite inside the custom dataset. The error contained an "OSError: [Errno 24] Too many open files" line, so I assumed there were somehow too many open connections. I searched for the keywords from the error output on this forum/SO/Google and tried every solution I found, but nothing helped. Later I found out that SQLite is probably not the source of the error.
The error output is attached at the end of the post.
To make it easier, I wrote a minimal, simplified script that replicates the error. Only at this step was I able to locate the exact lines causing the error, and to my surprise they were not the lines directly involving SQLite. From earlier testing I understand that the error occurs right before the per-worker batches are collected and returned by the dataloader as a minibatch. Unfortunately, I was not able to work out from the stack trace what exactly triggers the error.
Simple explanation of the custom dataset:
The important part is __getitem__(): every batch consists of several entries from an SQLite database (the -bss flag controls how many entries to fetch). After getting the required entries, a new list is created to store the "processed" entries. I then iterate over each entry, perform some calculations, and append each processed entry to the new list. Finally, I return the list as the batch.
The error occurs during the last two operations; you can see them in lines 94-102 of the code. For some reason, the error occurs only with a high number of entries per batch; for example, if the code is executed with -bss 60 there is no error.
As an afterthought: you might ask why I'm returning a list as a batch and not a tensor. I was planning on returning either a tensor or an np array as the batch, but to test my dataset I went with a simple Python list. I'm not sure what will happen if I change the list to a tensor/np array; a rough sketch of that change is shown below.
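For reference, an untested sketch of what returning a tensor instead of the list might look like at the end of __getitem__ (it assumes every processed sequence has the same length, and long is just a guessed dtype):

import torch

# at the end of __getitem__, instead of returning the plain Python list
# (assumes all entries of processed_read_sequences have equal length):
if (self.makeError2):
    return torch.tensor(processed_read_sequences, dtype=torch.long)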
EDIT: code replicating the error without any additional required packages is available in my second post
The code runs on Python 3.6 and requires a recent version of torch; sqlite3 is part of the Python standard library.
The flow of the code is:
Creating a temporary SQLite database.
Initiating the custom dataset.
Running over all the batches of the custom dataset with the default DataLoader.
Command to run the code and reproduce the error (remove one of the makeError flags and the error disappears):
./Test_OpenTooManyFiles_error.py -n 100000 -bss 6000 --makeError --makeError2
When running the code with a smaller number of entries per batch there is no error:
./Test_OpenTooManyFiles_error.py -n 100000 -bss 60 --makeError --makeError2
Simple code to replicate the error:
#!/usr/bin/env python
import os, errno
import glob
import argparse
import random
import sqlite3
from torch.utils.data import Dataset, DataLoader
parser = argparse.ArgumentParser()
parser.add_argument('-n', '--number-of-db-lines')
parser.add_argument('-e', '--makeError', action='store_true')
parser.add_argument('-ee', '--makeError2', action='store_true')
parser.add_argument('-bss', '--batch-size-of-sample', type = int, default = 6000)
parser.add_argument('-b', '--db-batch-size', type = int, default = 1e6)
parser.add_argument('-s', '--sequence-read-length', type = int, default = 10)
args = parser.parse_args()
######################
###################### Code for creating temp sqlite database for testing
######################
def remove_file_if_exist(filename):
    try:
        os.remove(filename)
    except OSError as e:
        if e.errno != errno.ENOENT:
            raise

def write_to_db(values_batch, con):
    con.executemany('INSERT INTO `sequences` (read) VALUES(?)', values_batch)
    con.commit()

def create_sqlite3_db():
    temp_db_name = "temp_db_for_test.db"
    remove_file_if_exist(temp_db_name)
    print("Creating file:", temp_db_name)
    con = sqlite3.connect(temp_db_name)
    create_table_cmd = ['CREATE TABLE IF NOT EXISTS ',
                        "sequences",
                        '(',
                        'read BLOB',
                        ')']
    con.execute(''.join(create_table_cmd))
    values_batch = []
    i = 0
    for line in range(int(args.number_of_db_lines)):
        i += 1
        values_batch.append(["Db_line_"+str("ASDFASF"*10)+str(line)])
        if values_batch and i % args.db_batch_size == 0:
            print(".", end="", flush=True)
            write_to_db(values_batch, con)
            values_batch = []
    print(str(i), 'reads processed')
    write_to_db(values_batch, con)
    con.close()
######################
###################### Code for dataset
######################
class Multi_line_Dataset(Dataset):
    def __init__(self, batch_size_of_sample, sequence_read_length, makeError, makeError2):
        self.batch_size_of_sample = batch_size_of_sample
        self.makeError = makeError
        self.makeError2 = makeError2
        self.sequence_read_length = sequence_read_length
        self.db_filename_list = ["temp_db_for_test.db"]

    def __len__(self):
        return int(args.number_of_db_lines)//int(self.batch_size_of_sample)

    def __getitem__(self, idx):
        ## getting a bunch of lines from the sqlite DB
        bd_file_name = self.db_filename_list[0]
        temp_db_connection = sqlite3.connect(bd_file_name)
        temp_group_of_samples = random.sample(range(1, int(args.number_of_db_lines)), int(args.batch_size_of_sample))
        query = "SELECT * FROM sequences where rowid IN (%s)" % ','.join(str(v) for v in temp_group_of_samples)
        readstrs = temp_db_connection.execute(query).fetchall()
        temp_db_connection.close()
        ## processing each line we got from the DB
        processed_read_sequences = []
        for read_string in readstrs:
            sequence = read_string[0][:-1][0:self.sequence_read_length]
            SomeNewSequence = [0]*len(sequence)
            # doing something
            # doing something
            # doing something
            # doing something
            if (self.makeError):
                processed_read_sequences.append(SomeNewSequence)
            else:
                pass
        if (self.makeError2):
            return processed_read_sequences
        else:
            return read_string
if __name__ == '__main__':
    create_sqlite3_db()
    ML_dataset = Multi_line_Dataset(batch_size_of_sample=args.batch_size_of_sample, sequence_read_length=args.sequence_read_length, makeError=args.makeError, makeError2=args.makeError2)
    dataloader = DataLoader(ML_dataset, batch_size=16,
                            shuffle=True, num_workers=1)
    print("")
    print("Getting batches")
    for i_batch, sample_batched in enumerate(dataloader):
        print(".", end="", flush=True)
    print("")
    remove_file_if_exist("temp_db_for_test.db")
    print("Finished without errors")
Error output:
Creating file: temp_db_for_test.db
100000 reads processed
Getting batches
Traceback (most recent call last):
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 110, in _worker_loop
    data_queue.put((idx, samples))
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/queues.py", line 341, in put
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 190, in reduce_storage
RuntimeError: unable to open shared memory object </torch_27548_937556507> in read-write mode

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/util.py", line 262, in _run_finalizers
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/util.py", line 186, in __call__
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/shutil.py", line 476, in rmtree
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/shutil.py", line 474, in rmtree
OSError: [Errno 24] Too many open files: '/tmp/pymp-r_w0xw5b'
Process Process-1:
Traceback (most recent call last):
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 110, in _worker_loop
    data_queue.put((idx, samples))
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/queues.py", line 341, in put
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 190, in reduce_storage
RuntimeError: unable to open shared memory object </torch_27548_937556507> in read-write mode
Traceback (most recent call last):
  File "./Test_OpenTooManyFiles_error.py", line 114, in <module>
    for i_batch, sample_batched in enumerate(dataloader):
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 330, in __next__
    idx, batch = self._get_batch()
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 309, in _get_batch
    return self.data_queue.get()
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/queues.py", line 335, in get
    res = self._reader.recv_bytes()
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 227, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 27548) exited unexpectedly with exit code 1. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.