PyTorch DataLoader: OSError: [Errno 9] Bad file descriptor

Description of the problem

The error occurs whenever num_workers > 0. When I set num_workers = 0 the error disappears, but training becomes much slower, so I think the multiprocessing really matters here. How can I solve this problem?
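
For reference, here is a minimal sketch of the two configurations. DummyDataset and all the sizes below are placeholders, not my actual dataset:

import torch
from torch.utils.data import DataLoader, Dataset

class DummyDataset(Dataset):
    """Placeholder dataset; the real one loads TIFF images and txt files from disk."""
    def __len__(self):
        return 1000

    def __getitem__(self, index):
        return torch.randn(3, 256, 512), index

dataset = DummyDataset()

# num_workers > 0: batches are loaded in parallel worker processes;
# this is the configuration where the "Bad file descriptor" error appears.
fast_loader = DataLoader(dataset, batch_size=4, num_workers=4)

# num_workers = 0: everything runs in the main process; no error,
# but data loading no longer overlaps with training, so it is much slower.
slow_loader = DataLoader(dataset, batch_size=4, num_workers=0)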

Environment

Docker, Python 3.8, PyTorch 1.11.0+cu113

Error output

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 149, in _serve
    send(conn, destination_pid)
  File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 50, in send
    reduction.send_handle(conn, new_fd, pid)
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 184, in send_handle
    sendfds(s, [handle])
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 149, in sendfds
    sock.sendmsg([msg], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])
OSError: [Errno 9] Bad file descriptor

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 151, in _serve
    close()
  File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 52, in close
    os.close(new_fd)
OSError: [Errno 9] Bad file descriptor

Traceback (most recent call last):
  File "save_disp.py", line 85, in <module>
    test()
  File "save_disp.py", line 55, in test
    for batch_idx, sample in enumerate(TestImgLoader):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
    idx, data = self._get_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1173, in _get_data
    success, data = self._try_get_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 116, in get
    return _ForkingPickler.loads(res)
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 295, in rebuild_storage_fd
    fd = df.detach()
  File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 159, in recvfds
    raise EOFError
EOFError

Hey,
please share the code, and also the format of the file you are loading the data from.

Dataloader

TrainImgLoader = DataLoader(train_dataset, args.batch_size, shuffle=True, num_workers=0, drop_last=True)

TestImgLoader = DataLoader(test_dataset, args.test_batch_size, shuffle=False, num_workers=0, drop_last=False)

    def __getitem__(self, index):
        left_img = self.load_image(os.path.join(self.datapath, self.left_filenames[index]))
        right_img = self.load_image(os.path.join(self.datapath, self.right_filenames[index]))
        disparity = self.load_disp(os.path.join(self.datapath, self.disp_filenames[index]))
        roi = self.load_mask(os.path.join(self.datapath, self.mask_filenames[index]))
        try:
            if self.training:
                w, h = left_img.size
                crop_w, crop_h = 512, 256

                x1 = random.randint(0, w - crop_w)
                y1 = random.randint(0, h - crop_h)

                # random crop
                left_img = left_img.crop((x1, y1, x1 + crop_w, y1 + crop_h))
                right_img = right_img.crop((x1, y1, x1 + crop_w, y1 + crop_h))
                disparity = disparity[y1:y1 + crop_h, x1:x1 + crop_w]
                roi = roi[y1:y1 + crop_h,x1:x1 + crop_w]
                # to tensor, normalize
                processed = get_transform()
                left_img = processed(left_img)
                right_img = processed(right_img)

                return {"left": left_img,
                        "right": right_img,
                        "disparity": disparity,
                        "left_filename": self.left_filenames[index],
                        "right_filename": self.right_filenames[index],
                        "roi":roi}
            else:
                w, h = left_img.size
                # crop_w, crop_h = 1024, 1024

                # left_img = left_img.crop((w - crop_w, h - crop_h, w, h))
                # right_img = right_img.crop((w - crop_w, h - crop_h, w, h))
                # disparity = disparity[h - crop_h:h, w - crop_w: w]
                # roi = roi[h - crop_h:h, w - crop_w: w]

                processed = get_transform()
                left_img = processed(left_img)
                right_img = processed(right_img)

                return {"left": left_img,
                        "right": right_img,
                        "disparity": disparity,
                        "top_pad": 0,
                        "right_pad": 0,
                        "left_filename": self.left_filenames[index],
                        "right_filename": self.right_filenames[index],
                        "roi":roi}
        except Exception as e:
            print(e.args)
            print(str(e))
            print(repr(e))
            print('here is get_item error')

File format
TIFF image and txt file

From what I am able to see, it's an EOFError, and even before that a bad file descriptor error, which in this case looks like it stems from an attempt to close a file descriptor that wasn't open in the first place. See this:

  File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 52, in close
    os.close(new_fd)
OSError: [Errno 9] Bad file descriptor

This might help. Multiprocessing on Windows is error-prone for many reasons (pickling, etc.), so potentially check this as well.

Not sure, though, whether these links will solve your error.
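
One more general diagnostic for this family of errors (just a suggestion, not something taken from your script) is to check the per-process open-file limit, since the default file_descriptor sharing strategy can run into it:

import resource  # note: the resource module is Unix-only

# Query the per-process limit on open file descriptors (soft, hard).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open file limit: soft={soft}, hard={hard}")

# Optionally raise the soft limit up to the hard limit for this process.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))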

I am using Ubuntu to run my program, and I have already wrapped my train() call in an if __name__ == '__main__': guard, so I don't think that is the source of the error.
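
For clarity, this is the pattern I mean (a sketch; the body of train() is elided):

def train():
    # build the datasets and DataLoaders (num_workers > 0) and run the training loop here
    ...

if __name__ == '__main__':
    # Guard the entry point so that worker processes importing this module
    # do not re-run the training code.
    train()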

I'm using torch 1.12.1 with pytorch-lightning 1.7.6 and getting the exact same issue, sporadically. I have 96 vCPUs available, and when I allow the DataLoader to use them, training takes about an hour, but I get bad file descriptor errors no matter which sharing strategy I tell my workers to use (file_system or file_descriptor).

It seems to occur with any num_workers > 0, and when num_workers == 0, training time explodes (4+ hours versus roughly 1 hour with multiprocessing).

Hi, have you solved this problem? I also had this problem.

I have solved this problem. Adding this configuration to the dataset script works:

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')
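
In case it helps, a sketch of where I call it: at the top of the script that builds the DataLoaders, before any workers are started (the loader line is only illustrative):

import torch.multiprocessing
from torch.utils.data import DataLoader

# 'file_system' shares tensors through files in shared memory instead of
# passing file descriptors between processes, which is what was running out here.
print(torch.multiprocessing.get_all_sharing_strategies())
torch.multiprocessing.set_sharing_strategy('file_system')

# Any DataLoader created after this point uses the new strategy, e.g.:
# TrainImgLoader = DataLoader(train_dataset, args.batch_size, shuffle=True, num_workers=8, drop_last=True)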

Hi @zhaoworking, I am getting a similar error with a similar file extension. I was wondering how you load the images: are you using Pillow? If so, which version of Pillow did you use? Also, have you been able to fix the issue? Thanks.