PyTorch not using GPU. Worked on fastai

Hi,

My GPU: NVIDIA GTX 1050 Ti

I am trying to train on the GPU, but during training I only see 60-90 percent CPU utilization and around 5 percent GPU utilization, maybe due to the copying of tensors to the GPU, I don't know. The GPU usage just spikes to 5 percent and drops back down.

I tried increasing the batch size to 64 or 128 based on some solutions online, but that just gives me a CUDA out-of-memory error. It says about 2.3 GB out of 4 GB is used, it needs another 120 MB, and only 110 MB is free. I don't even understand how that works at this point.

Then I tried decreasing the batch size to 16 and also got a CUDA out-of-memory error, this time saying it needs about 40 MB and only 16 MB is free, with the same 2.3 GB used out of 4 GB.

In the end it worked with a batch size of 8, but it only uses the CPU, not the GPU.

With fastai and a batch size of 128 it works fine and uses the GPU. I don't know where I went wrong. Any help is appreciated. Below is my code, written based on the PyTorch image classifier tutorial.

Model: ResNet, pretrained=True.
The data has 205 labels with approximately 117,000 training images.
I assume the code just uses the pretrained weights as a starting point, so the weights are updated during training without being frozen and without training from scratch. Feel free to correct me if I did something wrong or if there is a better solution. In the end I am just a noob in PyTorch… this is my first PyTorch code.

trainset: images in PIL format, converted to RGB to prevent 4-channel errors, with files skipped if PIL cannot open them (invalid / corrupted images).
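The kind of pre-filtering described here might look roughly like this (a sketch, not the exact code used; is_valid_image is a hypothetical helper):

from PIL import Image

def is_valid_image(path):
    # skip files PIL cannot open (corrupted / invalid images)
    try:
        with Image.open(path) as img:
            img.verify()
        return True
    except Exception:
        return False

# images that pass the check are later opened with .convert('RGB')
# to avoid 4-channel (RGBA) errors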

import torch
import torch.nn as nn
from torchvision import models, transforms

device = torch.device("cuda:0")  # device 0 is my NVIDIA GTX 1050 Ti when printed

model = models.resnet50(pretrained=True)  # ResNet-50 with pretrained weights, not frozen
model.fc = nn.Linear(2048, 205)           # replace the final layer for my 205 classes

t = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

trainloader = torch.utils.data.DataLoader(trainset, batch_size=8,
                                          shuffle=True)

if torch.cuda.is_available():
    print('yes gpu')
    torch.set_default_tensor_type('torch.cuda.FloatTensor')
    model = model.cuda()

import torch.optim as optim
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(6):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels,img_name = data
        inputs = inputs.to(device) 
        labels = labels.to(device)
        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        total = 0
        correct = 0
        if i % 2000 == 1999:
            # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            # training accuracy on the current mini-batch
            outputs = model(inputs)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

            print('Accuracy of the network on the last training batch: %d %%' % (
                100 * correct / total))
            
            running_loss = 0.0

print('Finished Training')

You might face a data loading bottleneck, especially since you are loading the data in the main process.
Try to increase num_workers in your DataLoader to e.g. 4 and check the GPU utilization again.
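Something along these lines (the exact values are just an illustration, not the original settings):

trainloader = torch.utils.data.DataLoader(trainset, batch_size=8,
                                          shuffle=True,
                                          num_workers=4,    # load batches in background worker processes
                                          pin_memory=True)  # can speed up host-to-GPU copies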

When I try to increase num_workers to 4 in my trainloader (torch.utils.data.DataLoader), I get a broken pipe error. My CPU is an i7, 4 cores, 2.8 GHz.

I am also using a Jupyter notebook for interactivity, checking progress, saving, etc.
I also tried num_workers = 1 and got the same error:
Errno 32, broken pipe

     3     running_loss = 0.0
----> 4     for i, data in enumerate(trainloader, 0):
      5         # get the inputs; data is a list of [inputs, labels]
      6         inputs, labels,img_name = data

~\Documents\Python\lib\site-packages\torch\utils\data\dataloader.py in __iter__(self)
    191 
    192     def __iter__(self):
--> 193         return _DataLoaderIter(self)
    194 
    195     def __len__(self):

~\Documents\Python\lib\site-packages\torch\utils\data\dataloader.py in __init__(self, loader)
    467                 #     before it starts, and __del__ tries to join but will get:
    468                 #     AssertionError: can only join a started process.
--> 469                 w.start()
    470                 self.index_queues.append(index_queue)
    471                 self.workers.append(w)

~\Documents\Python\lib\multiprocessing\process.py in start(self)
    110                'daemonic processes are not allowed to have children'
    111         _cleanup()
--> 112         self._popen = self._Popen(self)
    113         self._sentinel = self._popen.sentinel
    114         # Avoid a refcycle if the target function holds an indirect

~\Documents\Python\lib\multiprocessing\context.py in _Popen(process_obj)
    221     @staticmethod
    222     def _Popen(process_obj):
--> 223         return _default_context.get_context().Process._Popen(process_obj)
    224 
    225 class DefaultContext(BaseContext):

~\Documents\Python\lib\multiprocessing\context.py in _Popen(process_obj)
    320         def _Popen(process_obj):
    321             from .popen_spawn_win32 import Popen
--> 322             return Popen(process_obj)
    323 
    324     class SpawnContext(BaseContext):

~\Documents\Python\lib\multiprocessing\popen_spawn_win32.py in __init__(self, process_obj)
     63             try:
     64                 reduction.dump(prep_data, to_child)
---> 65                 reduction.dump(process_obj, to_child)
     66             finally:
     67                 set_spawning_popen(None)

~\Documents\Python\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
     58 def dump(obj, file, protocol=None):
     59     '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60     ForkingPickler(file, protocol).dump(obj)
     61 
     62 #

BrokenPipeError: [Errno 32] Broken pipe

This is my custom dataset, loaded through a CSV file.


import os
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset

class Dataset(Dataset):
    """Image classification dataset read from a CSV file."""

    def __init__(self, csv_file, root_dir, transforms=None):
        """
        Args:
            csv_file (string): Path to the csv file with annotations.
            root_dir (string): Directory with all the images.
            transforms (callable, optional): Optional transform to be applied
                on a sample.
        """
        self.df_frame = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.transforms = transforms

    def __len__(self):
        return len(self.df_frame)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        img_name = os.path.join(self.root_dir, self.df_frame.iloc[idx, 0])
        img = Image.open(img_name).convert('RGB')
        label = self.df_frame.iloc[idx, 2]

        if self.transforms is not None:
            img = self.transforms(img)

        return (img, label, img_name)

transformed_dataset = Dataset(csv_file='images_data.csv',
                              root_dir='./', transforms=t)

train_size = int(0.8 * len(transformed_dataset))
test_size = len(transformed_dataset) - train_size
trainset, testset = torch.utils.data.random_split(transformed_dataset, [train_size, test_size])

Try to wrap your code using the if-clause protection as described here.
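In a standalone .py script that protection would look roughly like this (a minimal sketch; the DataLoader is created inside the guarded code so that the worker processes can safely re-import the module):

import torch

def main():
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=8,
                                              shuffle=True, num_workers=4)
    for epoch in range(6):
        for inputs, labels, img_name in trainloader:
            pass  # training step as in the loop above

if __name__ == '__main__':
    main()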

Same error, no change. I guess multiprocessing is not supported…

import torch

def main():
    for epoch in range(6):
        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels, img_name = data
            inputs = inputs.to(device)
            labels = labels.to(device)
            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()
            if i % 2000 == 1999:
                # print every 2000 mini-batches
                print('[%d, %5d] loss: %.3f' %
                      (epoch + 1, i + 1, running_loss / 2000))
                running_loss = 0.0

    print('Finished Training')

if __name__ == '__main__':
    main()

I guess I should give up on running this on Windows and just go for Linux…

The error I get for the above code is:

BrokenPipeError                           Traceback (most recent call last)
<ipython-input-103-4ec33088ead4> in <module>
     29 
     30 if __name__ == '__main__':
---> 31     main()

<ipython-input-103-4ec33088ead4> in main()
      2 
      3 def main():
----> 4     for i, data in enumerate(trainloader, 0):
      5             # get the inputs; data is a list of [inputs, labels]
      6             inputs, labels,img_name = data

~\Documents\Python\lib\site-packages\torch\utils\data\dataloader.py in __iter__(self)
    191 
    192     def __iter__(self):
--> 193         return _DataLoaderIter(self)
    194 
    195     def __len__(self):

~\Documents\Python\lib\site-packages\torch\utils\data\dataloader.py in __init__(self, loader)
    467                 #     before it starts, and __del__ tries to join but will get:
    468                 #     AssertionError: can only join a started process.
--> 469                 w.start()
    470                 self.index_queues.append(index_queue)
    471                 self.workers.append(w)

~\Documents\Python\lib\multiprocessing\process.py in start(self)
    110                'daemonic processes are not allowed to have children'
    111         _cleanup()
--> 112         self._popen = self._Popen(self)
    113         self._sentinel = self._popen.sentinel
    114         # Avoid a refcycle if the target function holds an indirect

~\Documents\Python\lib\multiprocessing\context.py in _Popen(process_obj)
    221     @staticmethod
    222     def _Popen(process_obj):
--> 223         return _default_context.get_context().Process._Popen(process_obj)
    224 
    225 class DefaultContext(BaseContext):

~\Documents\Python\lib\multiprocessing\context.py in _Popen(process_obj)
    320         def _Popen(process_obj):
    321             from .popen_spawn_win32 import Popen
--> 322             return Popen(process_obj)
    323 
    324     class SpawnContext(BaseContext):

~\Documents\Python\lib\multiprocessing\popen_spawn_win32.py in __init__(self, process_obj)
     63             try:
     64                 reduction.dump(prep_data, to_child)
---> 65                 reduction.dump(process_obj, to_child)
     66             finally:
     67                 set_spawning_popen(None)

~\Documents\Python\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
     58 def dump(obj, file, protocol=None):
     59     '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60     ForkingPickler(file, protocol).dump(obj)
     61 
     62 #

BrokenPipeError: [Errno 32] Broken pipe

This is the fastai code I remember using for a small dataset, which worked on the GPU:

import time
import torch
from fastai.vision import *  # fastai v1 imports (get_transforms, ImageDataBunch, create_cnn, models, error_rate)

print("cuda Available", torch.cuda.is_available())
# torch.cuda.current_device()
s = time.time()
print("time started at:", s)
tfms = get_transforms(do_flip=False)
#path = '/content/drive/My Drive/Colab Notebooks/fastai/'
path = './'
data = ImageDataBunch.from_folder(path,ds_tfms=tfms, size=224)
# data.show_batch(rows=3, figsize=(10,10))
my_trained_mod = create_cnn(data, models.resnet50, metrics=error_rate)
my_trained_mod.unfreeze()
my_trained_mod.fit_one_cycle(6,max_lr=slice(1e-5,1e-3))

my_trained_mod.save("trained_model")

my_trained_mod.export()
print(time.time()-s,'seconds')

Can you paste the exact error message? Sometimes when we kill the process in between, the GPU memory is not released right away. That could lead to the model complaining about only ~100 MB of available memory.

This is the error message I get for batch size 64:

trainloader = torch.utils.data.DataLoader(trainset, batch_size=64,num_workers=0,
                                          shuffle=True)

RuntimeError: CUDA out of memory. Tried to allocate 98.00 MiB (GPU 0; 4.00 GiB total capacity; 2.76 GiB already allocated; 72.80 MiB free; 67.93 MiB cached)

and this is for batch size 16:

RuntimeError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 4.00 GiB total capacity; 2.82 GiB already allocated; 22.80 MiB free; 58.68 MiB cached)

It only works with batch size 8 and 0 workers.

Hi,

Could you check the total size of the parameters and of the inputs and outputs? If you have xxx MiB of memory, then dividing it by that total size gives approximately the batch size that will fit.
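For example, a rough way to estimate that (just a sketch; gradients, optimizer state, and intermediate activations need memory as well, so the real requirement per batch is larger):

param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print('parameters: %.1f MiB' % (param_bytes / 1024**2))

# one input batch of shape [batch_size, 3, 224, 224] in float32
batch_size = 8
input_bytes = batch_size * 3 * 224 * 224 * 4
print('one input batch: %.1f MiB' % (input_bytes / 1024**2))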

I've recently run into the same issue with multiprocessing under Windows from a Jupyter notebook. It seems that __name__ is always __main__ there, and that multiprocessing just doesn't work in a notebook; the trick with __main__ only works in a Python script, not in a notebook. However, see this article on overcoming the problems you are getting with multiprocessing from a notebook: https://medium.com/@grvsinghal/speed-up-your-python-code-using-multiprocessing-on-windows-and-jupyter-or-ipython-2714b49d6fac . The solution is to move the code / worker you are running out into a separate Python file, which you can then import and call from your notebook with multiprocessing. This might also explain why the fast.ai code worked, since you are inevitably importing their classes from external Python modules, though it doesn't address your lack of GPU usage…
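Concretely, that workaround might look like this (a sketch; my_dataset.py and CSVImageDataset are hypothetical names, and the class body is the same one posted above):

# my_dataset.py -- a separate file next to the notebook
import os
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class CSVImageDataset(Dataset):
    def __init__(self, csv_file, root_dir, transforms=None):
        self.df_frame = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.transforms = transforms

    def __len__(self):
        return len(self.df_frame)

    def __getitem__(self, idx):
        img_name = os.path.join(self.root_dir, self.df_frame.iloc[idx, 0])
        img = Image.open(img_name).convert('RGB')
        label = self.df_frame.iloc[idx, 2]
        if self.transforms is not None:
            img = self.transforms(img)
        return img, label, img_name

In the notebook the class is then imported instead of defined inline, so the worker processes can pickle and re-import it:

from my_dataset import CSVImageDataset

transformed_dataset = CSVImageDataset(csv_file='images_data.csv',
                                      root_dir='./', transforms=t)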

This is the issue: it seems there is some other process holding on to the GPU memory. Sometimes, when I force-kill a torch process on the server, I get this error. Try restarting the system once.
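To check whether another process is holding the memory, you can compare what PyTorch itself reports with what nvidia-smi shows (a quick diagnostic sketch):

import torch

# memory allocated / cached by this PyTorch process on GPU 0
print('allocated: %.1f MiB' % (torch.cuda.memory_allocated(0) / 1024**2))
print('cached:    %.1f MiB' % (torch.cuda.memory_cached(0) / 1024**2))

# if nvidia-smi shows another process owning most of the 4 GB,
# killing that process or restarting the machine will free it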