How to make sure tensors stay on the same GPU

Hi everyone!

I have a question about how to make sure that tensors stay on the same device.

A bit of context: I am training a YOLOv3-based detector, and the code runs perfectly on a single GPU. I now want to change it to use two or more GPUs on the same machine via the nn.DataParallel module to shorten training time.

Wrapping the model for multiple GPUs and running inference with it seems to work well, but the problem arises when building the targets to compute the loss.

I use the following code to build the targets:

def build_targets(pred_boxes, pred_cls, target, anchors, ignore_thres):

    BoolTensor = torch.cuda.BoolTensor if pred_boxes.is_cuda else torch.BoolTensor
    FloatTensor = torch.cuda.FloatTensor if pred_boxes.is_cuda else torch.FloatTensor

    nB = pred_boxes.size(0)
    nA = pred_boxes.size(1)
    nC = pred_cls.size(-1)
    nG = pred_boxes.size(2)

    # Output tensors
    obj_mask = BoolTensor(nB, nA, nG, nG).fill_(0)
    noobj_mask = BoolTensor(nB, nA, nG, nG).fill_(1)
    class_mask = FloatTensor(nB, nA, nG, nG).fill_(0)
    iou_scores = FloatTensor(nB, nA, nG, nG).fill_(0)
    tx = FloatTensor(nB, nA, nG, nG).fill_(0)
    ty = FloatTensor(nB, nA, nG, nG).fill_(0)
    tw = FloatTensor(nB, nA, nG, nG).fill_(0)
    th = FloatTensor(nB, nA, nG, nG).fill_(0)
    tcls = FloatTensor(nB, nA, nG, nG, nC).fill_(0)

    # Convert to position relative to box
    target_boxes = target[:, 2:6] * nG
    gxy = target_boxes[:, :2]
    gwh = target_boxes[:, 2:]
    # Get anchors with best iou
    ious = torch.stack([bbox_wh_iou(anchor, gwh) for anchor in anchors])
    best_ious, best_n = ious.max(0)
    # Separate target values
    b, target_labels = target[:, :2].long().t()
    gx, gy = gxy.t()
    gw, gh = gwh.t()
    gi, gj = gxy.long().t()

    ############################################
    # Problems from here in nn.Dataparallel
    ############################################

    # Set masks
    obj_mask[b, best_n, gj, gi] = 1
    noobj_mask[b, best_n, gj, gi] = 0

    # Set noobj mask to zero where iou exceeds ignore threshold
    for i, anchor_ious in enumerate(ious.t()):
        noobj_mask[b[i], anchor_ious > ignore_thres, gj[i], gi[i]] = 0

    # Coordinates
    tx[b, best_n, gj, gi] = gx - gx.floor()
    ty[b, best_n, gj, gi] = gy - gy.floor()
    # Width and height
    tw[b, best_n, gj, gi] = torch.log(gw / anchors[best_n][:, 0] + 1e-16)
    th[b, best_n, gj, gi] = torch.log(gh / anchors[best_n][:, 1] + 1e-16)
    # One-hot encoding of label
    tcls[b, best_n, gj, gi, target_labels] = 1
    # Compute label correctness and iou at best anchor
    class_mask[b, best_n, gj, gi] = (pred_cls[b, best_n, gj, gi].argmax(-1) == target_labels).float()
    iou_scores[b, best_n, gj, gi] = bbox_iou(pred_boxes[b, best_n, gj, gi], target_boxes, x1y1x2y2=False)

    tconf = obj_mask.float()
    return iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tcls, tconf
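
(For reference, the function can be exercised standalone on the CPU with dummy data, assuming the bbox_wh_iou and bbox_iou helpers from the same codebase are importable; the numbers below are made up:)

import torch

nB, nA, nG, nC = 4, 3, 13, 80
pred_boxes = torch.rand(nB, nA, nG, nG, 4)
pred_cls = torch.rand(nB, nA, nG, nG, nC)
# Each target row is [image_number, class_label, x, y, w, h] with coordinates normalized to [0, 1]
target = torch.tensor([[0, 1, 0.50, 0.50, 0.20, 0.30],
                       [2, 5, 0.30, 0.70, 0.10, 0.10]])
anchors = torch.tensor([[3.6, 2.8], [5.0, 6.1], [11.6, 10.1]])
outputs = build_targets(pred_boxes, pred_cls, target, anchors, ignore_thres=0.5)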

When I try to run training on two GPUs, a device-side assertion is triggered when executing this line:

obj_mask[b, best_n, gj, gi] = 1

The error seems to originate here and not earlier; I stepped through the whole function line by line, and executing this statement always triggers the following device-side assertion:

/opt/conda/conda-bld/pytorch_1603729096996/work/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [15,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
(The same assertion is repeated for threads [16,0,0] through [30,0,0].)

Doing some debugging, I found that on that line tensors with different device IDs are sometimes used, which I think is what triggers the assertion error.

My question is: is there a way to make sure that all the tensors stay on the same device?

Any help is appreciated.

Thank you!

The error is not pointing to a device mismatch, but to invalid indices.
Make sure that all index tensors used in obj_mask[b, best_n, gj, gi] contain valid values for the shape of obj_mask.
You could, e.g., print their min and max values and compare them to obj_mask.shape.
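
Something along these lines, using the variable names from your snippet (just a quick sanity check, not a fix):

# Print the shape that will be indexed and the value range of each index tensor
print("obj_mask shape:", obj_mask.shape)
for name, idx in [("b", b), ("best_n", best_n), ("gj", gj), ("gi", gi)]:
    print(name, "min:", idx.min().item(), "max:", idx.max().item(), "device:", idx.device)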

Hello @ptrblck , thanks for the response!

Could you please give me some clues on how to debug this?

What I don’t understand is why this error is triggered only when I wrap the model in nn.DataParallel, and not when I run the model on a single GPU.

If it helps, my only change to enable multi-GPU usage is a simple conditional:

if torch.cuda.device_count() > 1:
    # Multi-GPU Training
    model = nn.DataParallel(model)
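
(For completeness, the wrapping combined with moving the model to the default GPU looks roughly like this; the device variable is just shorthand here:)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)  # parameters live on the default GPU
if torch.cuda.device_count() > 1:
    # Multi-GPU Training: inputs passed to model(...) are scattered along dim 0 across replicas
    model = nn.DataParallel(model)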

Thank you for any help that can be provided.

To debug this, I would try to check the obj_mask as well as the indices passed to it as described before.
Based on the raised error, one (or multiple) indices seem to contain invalid values.

I don’t know how e.g. b is created, but if it’s the “global” batch size, it would create an error, since each model replica in nn.DataParallel will use a split of the original input tensor.
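
To illustrate what the scatter does (a standalone sketch, not your actual code): every input tensor is chunked along dim 0, so indices that were valid for the full batch can exceed the per-replica batch size.

import torch

full_batch = torch.randn(16, 3, 416, 416)   # global batch of 16 images
chunks = torch.chunk(full_batch, 2, dim=0)  # roughly what nn.DataParallel does for 2 GPUs
print([c.size(0) for c in chunks])          # [8, 8]

b = torch.tensor([0, 5, 12, 15])                            # indices built for the global batch
per_gpu_mask = torch.zeros(8, 3, 22, 22, dtype=torch.bool)  # per-replica tensor
# per_gpu_mask[b, 0, 0, 0] = True  # 12 and 15 would be out of bounds here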


EDIT: I posted a clearer question as a reply to this comment. I am leaving this here to give some context for the previously shared code and to explain why the question I asked here would not have solved the original problem.

Looking at the code, b refers to the batch index to which each target belongs.

I think I found the problem, but I don’t have a clear idea of how to solve it. After more debugging, I understand that since the targets are being split, I need to modify them somehow.

The targets are built as follows, where image_number refers to the batch index of each box:

[image_number, class_label, x, y, w, h]

And since each image can contain a different number of objects, the size of the target tensor varies.

In the targets, b is an array whose values lie in [0, gB-1], where gB is the global batch size, while the first dimension of obj_mask has size nB, the batch size on the current GPU. In single-GPU mode gB equals nB, so everything works fine.

But when using multiple GPUs, the values of b can still fall anywhere in [0, gB-1], while the first dimension of obj_mask now has size gB/2. This is where the assertion error is triggered.

As a first solution, I thought to check on which device the code is running and, if it is running on the second GPU, shift the values of b by the batch size (here nB is the per-GPU batch size):

if obj_mask.device.index > 0:
    for i in range(len(b)):
        if b[i] > nB - 1:
            b[i] = target[i, 0] - nB

But the problem with this approach is that there is no way of knowing whether one image has many more objects than the others. When target is split, this imbalance may still leave b with values spanning [0, gB-1] on one GPU, and I cannot simply apply the offset, because then different batch items could end up with the same index in the targets.

Here is a pseudo-code example of one of the cases where this is true:

#These are the global values
gB = 16
target[:,0] = 
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 
 0, 0, 0, 1, 1, 1, 2, 2, 3, 4, 5, 5, 6, 7, 8, 8, 8, 8, 9, 9, 10, 10, 10, 11, 11, 11, 12, 12, 13, 13, 14, 14, 14, 15]
#len(target[:,0]) = 67

#In GPU 0, after the split
nB = 8
obj_mask = BoolTensor(nB, 3, 22, 22).fill_(0)

b = target[:,0][ : len(target[:,0])//2]
b = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
#len(b) = 33
obj_mask[b, 0, 0, 0] = 1

#In GPU 1, after the split
nB = 8
obj_mask = BoolTensor(nB, 3, 22, 22).fill_(0)
b = target[:,0][len(target[:,0])//2 : ]
b = [0, 0, 0, 1, 1, 1, 2, 2, 3, 4, 5, 5, 6, 7, 8, 8, 8, 8, 9, 9, 10, 10, 10, 11, 11, 11, 12, 12, 13, 13, 14, 14, 14, 15]
#len(b) = 34
obj_mask[b, 0, 0, 0] = 1

Here it can be seen that, since image zero has the largest number of boxes, after the split the range of values of b on GPU 1 will trigger the assertion. And if I apply the solution I came up with, the following happens:

#In GPU 1
b = [0, 0, 0, 1, 1, 1, 2, 2, 3, 4, 5, 5, 6, 7, 8, 8, 8, 8, 9, 9, 10, 10, 10, 11, 11, 11, 12, 12, 13, 13, 14, 14, 14, 15]

if obj_mask.device.index > 0:
    for i in range(len(b)):
        if b[i] > nB - 1:
            b[i] = target[i, 0] - nB

b = [0, 0, 0, 1, 1, 1, 2, 2, 3, 4, 5, 5, 6, 7, 0, 0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7]

My new question is: how can I avoid this problem when the targets do not always have a fixed shape?

My main idea would be to modify the way the targets are built by giving them a shape of [gB, max_Boxes, 6], where I would determine the maximum number of boxes (max_Boxes) in a batch and zero-pad the other images so that the per-image tensors can be stacked to form the targets (a rough sketch is below). I am not a fan of this approach since it would require modifying a large amount of code, but I will do it if I have to.
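
For what it’s worth, a rough sketch of that padding idea (pad_targets and max_boxes are placeholders, not existing code):

def pad_targets(target, batch_size, max_boxes=50):
    """Convert a [num_boxes, 6] target tensor of rows [image_number, class, x, y, w, h]
    into a fixed-shape [batch_size, max_boxes, 6] tensor, zero-padded per image."""
    padded = torch.zeros(batch_size, max_boxes, 6, dtype=target.dtype)
    for img_idx in range(batch_size):
        boxes = target[target[:, 0] == img_idx]
        n = min(boxes.size(0), max_boxes)
        padded[img_idx, :n] = boxes[:n]
    return padded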

So if anyone could come up with a different strategy, it would be most appreciated!

Thanks for taking the time to read such a long post.

Actually, after some thinking I realized that the problem is not about the size of the target tensor.

What I think would help me solve this problem is to know whether there is any way to override how nn.DataParallel splits certain tensors. @ptrblck

That way, the targets corresponding to each image would be sent to the GPU that is handling that image.

You could check the internal implementation and probably use the functional distributed API to send the appropriate splits to the desired device.
From a general point of view: nn.DataParallel will split the tensors in dim0, so would it be possible to make sure the data and target tensors are constructed in this way?

Also, you might take a look at DistributedDataParallel using one process per GPU, which should be faster and could even simplify your use case, as each process would load its own data.
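
A bare-bones sketch of the one-process-per-GPU setup (train_worker and build_model are placeholders for your own code, and the process-group settings are only an example):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train_worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = build_model().to(rank)         # build_model() stands in for your YOLOv3 constructor
    model = DDP(model, device_ids=[rank])

    # Each process loads its own shard of the data, e.g. via
    # torch.utils.data.distributed.DistributedSampler, so the framework
    # never has to split images and targets for you.

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train_worker, args=(world_size,), nprocs=world_size)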


DistributedDataParallel was the solution, thank you so much for all of your help!

I found this amazing tutorial that explains step by step how to use DistributedDataParallel:

https://yangkky.github.io/2019/07/08/distributed-pytorch-tutorial.html

I hope it can help others.

@ptrblck Is there any way I could add the #distributed tag to the original post, so it would be easier for people to find?


Good to hear it’s working! 🙂
I’ve moved it to the distributed category.