Training on multiple GPUs but stuck in loss.backward()

Hi! I ran my code on a single GPU and it worked well. But when I try to run it on a server that has 2 GPUs, it hangs on loss.backward(). I use torch.nn.DataParallel to train on multiple GPUs. The train code is as follows:

def train_batch(
        model,
        optimizer,
        baseline,
        epoch,
        batch_id,
        step,
        batch,
        tb_logger,
        opts
):
    x, bl_val = baseline.unwrap_batch(batch)
    x = move_to(x, opts.device)
    bl_val = move_to(bl_val, opts.device) if bl_val is not None else None
    # Evaluate model, get costs and log probabilities
    cost, log_likelihood = model(x)
    # Evaluate baseline, get baseline loss if any (only for critic)
    bl_val, bl_loss = baseline.eval(x, cost) if bl_val is None else (bl_val, 0)
    # Calculate loss
    reinforce_loss = ((cost - bl_val) * log_likelihood).mean()
    loss = reinforce_loss + bl_loss

    # Perform backward pass and optimization step
    optimizer.zero_grad()
    print('1')
    with torch.autograd.set_detect_anomaly(True):
        print('in')
        loss.backward()
        print('out')
    print('2')
    # Clip gradient norms and get (clipped) gradient norms for logging
    grad_norms = clip_grad_norms(optimizer.param_groups, opts.max_grad_norm)
    print('3')
    optimizer.step()
    print('7')
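
For reference, the model is wrapped in torch.nn.DataParallel roughly like this (a simplified sketch; nn.Linear stands in for the actual model, which is not shown here):

import torch
import torch.nn as nn

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = nn.Linear(8, 1)                 # stand-in for the actual model
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)      # replicates the module and scatters each batch along dim 0
model = model.to(device)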

The output is:
1
in
and then it just stays there with no further output.
The state of the GPUs (screenshot omitted) and my environment:

PyTorch: 1.12.1
Python: 3.8.11
GPU: Nvidia GeForce RTX 2080 Ti

Was this setup working before with multi-GPUs or is it new?
In the latter case, could you check if disabling IOMMU might help as described here?

Thanks for your reply! But I'm a little confused about what you mean :sob:. Do you mean whether training with the same setup already worked on multiple GPUs before? Actually, I have the original code with the same setup (GPU, PyTorch, Python) and it worked well. But then I made some changes (not to the train code) and it started having problems.

What kind of changes did you make that caused the issues?

The original work is a seq2seq model. I just added some states and changed the mask rule, and it still worked well on a single GPU. Could the added variables cause the issue?

I ran the command sudo lspci -vvv | grep ACSCtl and it seems that ACS is not enabled.

I don’t think IOMMU is at fault since the original code was already working while your changes seem to have broken the code as you have explained.
Therefore, I would recommend checking each change separately to narrow down which one exactly broke the code.

Thank you for your reply! I followed your suggestion and found that two new inputs caused the hang. But I still don't know why this happens. The code that includes the two inputs (ltw, rtw) is below.

import os
import pickle
from typing import NamedTuple

import torch
from torch.utils.data import Dataset


class StateCVRP(NamedTuple):
    # Fixed input
    coords: torch.Tensor
    demand: torch.Tensor
    ltw: torch.Tensor  # New input (added together with rtw)
    rtw: torch.Tensor  # New input
    ids: torch.Tensor  # Keeps track of original fixed data index of rows

    # State
    prev_a: torch.Tensor  
    used_capacity: torch.Tensor   
    visited_: torch.Tensor  # Keeps track of tasks that have been visited
    lengths: torch.Tensor  
    cur_index: torch.Tensor
    i: torch.Tensor  # Keeps track of step
    VEHICLE_CAPACITY = 1.0  # Hardcoded
    VEHICLE_V = 3   

    @staticmethod
    def initialize(input, visited_dtype=torch.uint8):

        demand = input['demand']
        depot = input['depot']
        loc_id = input['loc_id']
        ltw = input['ltw']
        rtw = input['rtw']

        batch_size, n_loc, _ = loc_id.size()  
        return StateCVRP(
            coords=torch.cat((depot[:, None, :], loc_id), -2).type(torch.long),
            demand=demand,
            ltw=ltw,
            rtw=rtw,
            ids=torch.arange(batch_size, dtype=torch.int64, device=loc_id.device)[:, None],  # Add steps dimension
            prev_a=torch.zeros([batch_size, 10], dtype=torch.long, device=loc_id.device),
            used_capacity=demand.new_zeros(batch_size, 10, 1),
            cur_index=input['depot'][:,None,:].type(torch.long),  # 1024,1,2
            visited_=(  
                torch.zeros(
                    batch_size, 1, n_loc + 1,
                    dtype=torch.uint8, device=loc_id.device
                )
                if visited_dtype == torch.uint8
                else torch.zeros(batch_size, 1, (n_loc + 63) // 64, dtype=torch.int64, device=loc_id.device)  
            ),
            lengths=torch.zeros(batch_size, 1, device=loc_id.device),
            i=torch.zeros(1, dtype=torch.int64, device=loc_id.device)  # Vector with length num_steps
        )
def make_instance(args):
    depot = args['depot']
    loc_id = args['loc_id']
    demand = args['demand']
    ltw = args['ltw']
    rtw = args['rtw']
    capacity = args['CAPACITIES']
    grid_size = 1

    return {
        'loc_id': torch.tensor(loc_id, dtype=torch.float) / grid_size,
        'demand': torch.tensor(demand, dtype=torch.float) / capacity,
        'depot': torch.tensor(depot, dtype=torch.float) / grid_size,
        'ltw' : torch.tensor(ltw, dtype=torch.float) / grid_size,
        'rtw' : torch.tensor(rtw, dtype=torch.float) / grid_size,
    }
class VRPDataset(Dataset):
    
    def __init__(self, filename=None, size=50, num_samples=1000000, offset=0, distribution=None):
        super(VRPDataset, self).__init__()

        self.data_set = []
        if filename is not None:
            assert os.path.splitext(filename)[1] == '.pkl'

            with open(filename, 'rb') as f:
                data = pickle.load(f)
            self.data = [make_instance(args) for args in data[offset:offset + num_samples]]

        else:

            CAPACITIES = {
                10: 20.,
                20: 30.,
                50: 40.,
                100: 50.
            }

            self.data = [
                {
                    'depot': torch.full([2], 2, dtype=torch.float),
                    'loc_id': torch.randint(0, 57, (size,2)),
                    'demand': (torch.FloatTensor(size).uniform_(0, 9).int() + 1).float() / CAPACITIES[size],
                    'ltw': torch.randint(0, 100, (size,)),
                    'rtw': torch.randint(500, 600, (size,)),
                }
                for i in range(num_samples)
            ]

        self.size = len(self.data)

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        return self.data[idx]
    # (Method of the model, not of VRPDataset) builds the initial node embeddings,
    # including the new ltw / rtw features alongside loc_id and demand.
    def _init_embed(self, input):
        features = ('demand', 'ltw', 'rtw')
        return torch.cat(
            (
                self.init_embed_depot(input['depot'])[:, None, :],
                self.init_embed(torch.cat((
                    input['loc_id'],
                    *(input[feat][:, :, None] for feat in features)
                ), -1))
            ),
            1
        )
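
Side note for anyone debugging a similar hang: nn.DataParallel scatters every tensor in the input dict along dim 0, so new fields like ltw and rtw have to carry the same batch dimension as the other inputs. A minimal sanity check (shapes are only illustrative) could look like this:

import torch

def check_batch_dims(batch):
    # DataParallel chunks each tensor along dim 0; inconsistent batch dims break the scatter.
    sizes = {k: v.size(0) for k, v in batch.items() if torch.is_tensor(v)}
    assert len(set(sizes.values())) == 1, f"inconsistent batch dims: {sizes}"

batch = {
    'loc_id': torch.rand(1024, 50, 2),
    'demand': torch.rand(1024, 50),
    'depot': torch.rand(1024, 2),
    'ltw': torch.randint(0, 100, (1024, 50)),
    'rtw': torch.randint(500, 600, (1024, 50)),
}
check_batch_dims(batch)  # raises if one of the inputs cannot be scattered consistently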

OMGGG! Finally I found that the os.environ part of the code seems to conflict with torch.nn.DataParallel. When I deleted the os.environ line, it worked!!

Could you share what exactly you have set via os.environ which apparently caused the hang?

It's
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
I added it earlier for debugging but forgot to delete it.
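
For anyone who hits the same hang: CUDA_LAUNCH_BLOCKING=1 forces every kernel launch to run synchronously, and it appears to conflict with nn.DataParallel, which drives its per-GPU replicas from separate threads. It is best enabled only for single-GPU debugging, and set (or removed) before any CUDA work happens in the process. A small sketch of how it could be guarded (the DEBUG_SYNC flag is just a made-up example):

import os
import torch

if torch.cuda.device_count() <= 1 and os.environ.get('DEBUG_SYNC') == '1':
    # Synchronous launches make CUDA errors surface at the offending line,
    # but with DataParallel they can hang in loss.backward().
    os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
else:
    # Make sure a leftover debug setting does not carry over into multi-GPU runs.
    os.environ.pop('CUDA_LAUNCH_BLOCKING', None)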


I have been stuck on this for hours! This post saved me!