Very strange DataLoader Error - Simplified code inside

EDIT: this was fixed by simply increasing the limit of open files (it was one of the first solutions I tried, but apparently I didn’t try hard enough the first time). The fix was suggested in the GitHub issue: https://github.com/pytorch/pytorch/issues/14768

Hello,
I’ve been stuck for the past several days on a very strange and annoying error with the dataloader/dataset I use in my project. The error occurs when using my custom Dataset with a default DataLoader and 1 or more workers; it does not happen when the number of workers is set to 0 (which makes it even harder to investigate, because the error output specifically states “Rerunning with num_workers=0 may give better error trace”).

At first I thought the problem was caused by using SQLite inside the custom dataset: the error output contains the line “OSError: [Errno 24] Too many open files”, so I assumed there were somehow too many open connections. I searched for the keywords from the error output on this forum/SO/Google and tried every solution I found, but nothing helped. Later I found out that SQLite is probably not the source of the error.

The error output is attached at the end of this post.

To make things easier I wrote a minimal, simplified script that replicates the error. Only at this point was I able to locate the exact lines causing the error, and to my surprise they were not the lines directly involving SQLite. From earlier testing I understand that the error occurs right before the samples from the multiple workers are collected and returned by the dataloader as a minibatch. Unfortunately, I was not able to understand from the stack trace what exactly triggers the error.

Simple explanation of the custom dataset:
The important part is __getitem__(): every batch consists of several entries from an SQLite database (the -bss flag controls how many entries to fetch). After getting the required entries, a new list is created to store the “processed” entries. I then iterate over each entry, perform some calculations, and add the processed entry to the new list. Finally I return the list as the batch.
The error occurs during the last two operations; you can see it in lines 94-102 of the code. For some reason, the error occurs only with a high number of entries per batch; for example, if the code is executed with -bss 60 there is no error.
As an afterthought: you might ask why I’m returning a list as a batch and not a tensor. I was planning on returning either a tensor or an np array as a batch, but to test my dataset I went with a simple Python list. I’m not sure what will happen if I change the list to a tensor/np array.
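For completeness, this is a rough sketch of what that change could look like (untested, and the padding helper is my own addition, not part of the current script):

import torch

def to_tensor(processed_read_sequences, sequence_read_length):
    # Sketch: pad/truncate every inner list to a fixed length so the nested
    # list is rectangular, then hand one LongTensor per sample to the
    # DataLoader instead of a large nested Python list.
    padded = [seq[:sequence_read_length] + [0] * (sequence_read_length - len(seq))
              for seq in processed_read_sequences]
    return torch.tensor(padded, dtype=torch.long)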

EDIT: code that replicates the error without any additional required packages is available in my second post below.
The code runs on Python 3.6 and requires recent versions of the torch and sqlite3 packages.
The flow of the code is:
1. Create a temporary SQLite database.
2. Initialize the custom dataset.
3. Iterate over all the batches of the custom dataset with the default DataLoader.

Command to run the code and reproduce the error (remove one of the makeError flags and the error disappears):

./Test_OpenTooManyFiles_error.py -n 100000 -bss 6000 --makeError --makeError2

When running the code with a smaller number of entries per batch there is no error:

./Test_OpenTooManyFiles_error.py -n 100000 -bss 60 --makeError --makeError2

Simple code to replicate the error:

#!/usr/bin/env python
import os, errno
import glob
import argparse
import random
import sqlite3

from torch.utils.data import Dataset, DataLoader


parser = argparse.ArgumentParser()
parser.add_argument('-n', '--number-of-db-lines')
parser.add_argument('-e', '--makeError', action='store_true')
parser.add_argument('-ee', '--makeError2', action='store_true')
parser.add_argument('-bss', '--batch-size-of-sample', type = int, default = 6000)
parser.add_argument('-b', '--db-batch-size', type = int, default = 1e6)
parser.add_argument('-s', '--sequence-read-length', type = int, default = 10)
args = parser.parse_args()


######################
###################### Code for creating temp sqlite database for testing
######################

def remove_file_if_exist(filename):
    try:
        os.remove(filename)
    except OSError as e: 
        if e.errno != errno.ENOENT: 
            raise 
def write_to_db(values_batch, con):
    con.executemany('INSERT INTO `sequences` (read) VALUES(?)', values_batch)
    con.commit()
def create_sqlite3_db():
    temp_db_name = "temp_db_for_test.db"
    remove_file_if_exist(temp_db_name)
    print("Creating file:",  temp_db_name)
    con = sqlite3.connect(temp_db_name)
    create_table_cmd = [    'CREATE TABLE IF NOT EXISTS ',
                "sequences",
                '(',
                'read BLOB',
                ')']
    con.execute(''.join(create_table_cmd))
    values_batch = []
    i = 0
    for line in range(int(args.number_of_db_lines)):
        i += 1
        values_batch.append(["Db_line_" + "ASDFASF" * 10 + str(line)])
        if values_batch and i % args.db_batch_size == 0:
            print(".", end="", flush=True)
            write_to_db(values_batch, con)
            values_batch = []
    print(str(i), 'reads processed')
    write_to_db(values_batch, con)
    con.close()



######################
###################### Code for dataset
######################

class Multi_line_Dataset(Dataset):

    def __init__(self, batch_size_of_sample, sequence_read_length, makeError, makeError2):
        self.batch_size_of_sample = batch_size_of_sample
        self.makeError = makeError
        self.makeError2 = makeError2
        self.sequence_read_length = sequence_read_length
        self.db_filename_list = ["temp_db_for_test.db"]

    def __len__(self):
        return int(args.number_of_db_lines)//int(self.batch_size_of_sample)

    def __getitem__(self, idx):
        ## getting bunch of lines from the sqlite DB
        db_file_name = self.db_filename_list[0]
        temp_db_connection = sqlite3.connect(db_file_name)
        temp_group_of_samples = random.sample(range(1, int(args.number_of_db_lines)), int(args.batch_size_of_sample))
        query = "SELECT * FROM sequences where rowid IN (%s)" % ','.join(str(v) for v in temp_group_of_samples)
        readstrs = temp_db_connection.execute(query).fetchall()
        temp_db_connection.close()

        ## processing each line we got from the DB
        processed_read_sequences = []
        for read_string in readstrs:
            sequence = read_string[0][:-1][0:self.sequence_read_length]
            SomeNewSequence = [0]*len(sequence)
            # ... some per-entry processing happens here ...
            if self.makeError:
                processed_read_sequences.append(SomeNewSequence)
            else:
                pass
        if self.makeError2:
            return processed_read_sequences
        else:
            return read_string
    



if __name__ == '__main__':
    create_sqlite3_db()
    ML_dataset = Multi_line_Dataset(batch_size_of_sample=args.batch_size_of_sample, sequence_read_length=args.sequence_read_length, makeError=args.makeError , makeError2=args.makeError2 )
    dataloader = DataLoader(ML_dataset, batch_size=16,
                            shuffle=True, num_workers=1)
    print("")
    print("Getting batches")
    for i_batch, sample_batched in enumerate(dataloader):
        print(".", end="", flush=True)
    print("")
    remove_file_if_exist("temp_db_for_test.db")
    print("Finished without errors")

Error output:

Creating file: temp_db_for_test.db
100000 reads processed

Getting batches
Traceback (most recent call last):
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 110, in _worker_loop
    data_queue.put((idx, samples))
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/queues.py", line 341, in put
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 190, in reduce_storage
RuntimeError: unable to open shared memory object </torch_27548_937556507> in read-write mode

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/util.py", line 262, in _run_finalizers
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/util.py", line 186, in __call__
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/shutil.py", line 476, in rmtree
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/shutil.py", line 474, in rmtree
OSError: [Errno 24] Too many open files: '/tmp/pymp-r_w0xw5b'
Process Process-1:
Traceback (most recent call last):
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 110, in _worker_loop
    data_queue.put((idx, samples))
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/queues.py", line 341, in put
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 190, in reduce_storage
RuntimeError: unable to open shared memory object </torch_27548_937556507> in read-write mode
Traceback (most recent call last):
  File "./Test_OpenTooManyFiles_error.py", line 114, in <module>
    for i_batch, sample_batched in enumerate(dataloader):
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 330, in __next__
    idx, batch = self._get_batch()
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 309, in _get_batch
    return self.data_queue.get()
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/queues.py", line 335, in get
    res = self._reader.recv_bytes()
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
  File "/home/user1/anaconda3/envs/PytorchEnv1/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 227, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 27548) exited unexpectedly with exit code 1. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.

Maybe you are running out of shared memory.
Could you try to increase it as suggested in this issue?
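For example, you can check how much space the shared-memory mount currently has from Python (a quick sketch, assuming a Linux system where it is mounted at /dev/shm):

import shutil

# Report the size and free space of the shared-memory mount that the
# DataLoader workers use to pass tensors to the main process.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {total / 2**30:.1f} GiB, free: {free / 2**30:.1f} GiB")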


That’s one of the first things I tried; it didn’t help.

I also modified the code to require only PyTorch, so it should be easier to run now:

#!/usr/bin/env python
import os, errno
import glob
import argparse
import random
# import sqlite3

from torch.utils.data import Dataset, DataLoader


parser = argparse.ArgumentParser()
parser.add_argument('-n', '--number-of-db-lines')
parser.add_argument('-e', '--makeError', action='store_true')
parser.add_argument('-ee', '--makeError2', action='store_true')
parser.add_argument('-bss', '--batch-size-of-sample', type = int, default = 6000)
parser.add_argument('-b', '--db-batch-size', type = int, default = 1e6)
parser.add_argument('-s', '--sequence-read-length', type = int, default = 50)
args = parser.parse_args()


######################
###################### SQLite database-creation code from the first post removed (not needed to reproduce the error)
######################



######################
###################### Code for dataset
######################

class Multi_line_Dataset(Dataset):

    def __init__(self, batch_size_of_sample, sequence_read_length, makeError, makeError2):
        self.batch_size_of_sample = batch_size_of_sample
        self.makeError = makeError
        self.makeError2 = makeError2
        self.sequence_read_length = sequence_read_length
        # self.db_filename_list = ["temp_db_for_test.db"]

    def __len__(self):
        return int(args.number_of_db_lines)//int(self.batch_size_of_sample)

    def __getitem__(self, idx):
        # the SQLite query from the first post is replaced below by generating the lines in memory

        values_batch = []
        for line in range(int(args.batch_size_of_sample)):
            values_batch.append(["Db_line_" + "ASDFASF" * 10 + str(line)])

        ## processing each line we got from the DB
        processed_read_sequences = []
        for read_string in values_batch:
            sequence = read_string[0][:-1][0:self.sequence_read_length]
            SomeNewSequence = [0]*len(sequence)
            # ... some per-entry processing happens here ...
            if self.makeError:
                processed_read_sequences.append(SomeNewSequence)
            else:
                pass
        if self.makeError2:
            return processed_read_sequences
        else:
            return read_string
    



if __name__ == '__main__':
    # create_sqlite3_db()
    ML_dataset = Multi_line_Dataset(batch_size_of_sample=args.batch_size_of_sample, sequence_read_length=args.sequence_read_length, makeError=args.makeError , makeError2=args.makeError2 )
    dataloader = DataLoader(ML_dataset, batch_size=16,
                            shuffle=True, num_workers=1, pin_memory = True)
    print("")
    print("Getting batches")
    for i_batch, sample_batched in enumerate(dataloader):
        print(".", end="", flush=True)
    print("")
    # remove_file_if_exist("temp_db_for_test.db")
    print("Finished without errors")

I’ve also opened an issue on github here: https://github.com/pytorch/pytorch/issues/14768

I think it’s probably due to the fact that multiple workers are trying to create a connection object to the same SQLite file. Perhaps, if you open the SQLite connection in the constructor __init__(self), each worker can then access self.db_connection in the __getitem__(self) method; that may resolve the issue.
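Something like this is what I had in mind (only a sketch: the class and argument names are made up, and I haven’t checked how SQLite connections behave when the DataLoader forks its worker processes):

import sqlite3
from torch.utils.data import Dataset

class SqliteBackedDataset(Dataset):
    def __init__(self, db_filename, number_of_lines):
        self.number_of_lines = number_of_lines
        # open the connection once instead of on every __getitem__ call;
        # check_same_thread=False so it can be used outside the creating thread
        self.db_connection = sqlite3.connect(db_filename, check_same_thread=False)

    def __len__(self):
        return self.number_of_lines

    def __getitem__(self, idx):
        # rowid is 1-based in SQLite
        row = self.db_connection.execute(
            "SELECT read FROM sequences WHERE rowid = ?", (idx + 1,)).fetchone()
        return row[0]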

That was my first thought too; I spent many hours researching this possibility.
However, in my second post here I posted code without any SQLite that produces the same error.

The second code does not have any problems, except the following error:

TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

which was due to the parsed argument number-of-db-lines being None, so I added a type and a default value:

parser.add_argument('-n', '--number-of-db-lines', type=int, default=60000)

and it works fine:


Getting batches
.
Finished without errors

That’s what I was afraid of: that there is some hardware/software configuration problem on my side.
What if you increase the “batch_size_of_sample” variable? Low values like 60 didn’t produce an error for me either, but higher values did.
If you’ve set this value to 60,000 or higher and still didn’t get an error, I will probably have a hard time identifying and solving the underlying issue :frowning:

I see! Well, for me even 60000000 worked. What error do you get in the second case, the same as with the first code?

Unfortunately yes, I get the same error I mentioned in the first post.

I found several other posts here and on GitHub describing a similar problem:
https://github.com/facebookresearch/maskrcnn-benchmark/issues/103
https://discuss.pytorch.org/t/runtimeerror-unable-to-open-shared-memory-object/22641
https://github.com/pytorch/pytorch/issues/2706
https://github.com/pytorch/pytorch/issues/2926

None of these posts offer an apparent solution besides either turning off multiprocessing by setting num_workers to 0 or a suggestion to increase shared memory.

But I already tried increasing shared memory and it didn’t help. There are probably different ways of increasing shared memory, and if anyone suggests particular steps I will happily try them.
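For anyone who wants to compare setups, this is a quick way to check the relevant open-file limits (a sketch; the resource module is Unix-only):

import resource

# Print the soft/hard limits on open file descriptors for this process
# (the same numbers `ulimit -n` / `ulimit -Hn` report in the shell).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("RLIMIT_NOFILE soft:", soft, "hard:", hard)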

That said, the workstation hardware and software were set up with the default configuration, so I’m not sure why I get this error while others don’t. Also, if I have this problem with a default setup, others might have the same error (which is supported by the various posts I’ve linked above).

PS: it might be relevant that I’m dual booting Windows and Linux. Windows was installed first and Linux was added later. The code runs on Linux.

EDIT: this was fixed by simply increasing the limit of open files (it was one of the first solutions I tried, but apparently I didn’t try hard enough the first time). The fix was suggested in the GitHub issue: https://github.com/pytorch/pytorch/issues/14768
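A rough sketch of what “increasing the limit of open files” can look like from inside the script (the exact method depends on your system; raising it with ulimit in the shell or in limits.conf works as well, and the sharing-strategy line is the alternative workaround described in the torch.multiprocessing docs rather than something I needed in the end):

import resource
import torch.multiprocessing

# Raise the soft limit on open file descriptors up to the hard limit
# (raising the hard limit itself requires root / shell configuration).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

# Alternative workaround: share tensors through the file system instead of
# keeping one file descriptor open per shared-memory block.
torch.multiprocessing.set_sharing_strategy('file_system')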

