Batching samples with a variable number of instances per sample

I have a use case where each sample in a batch contains a variable number of input instances. For example, an image contains several objects, each with its own attributes. Each object needs to be encoded separately (with a shared network for object-level encoding), and its encoding then needs to be concatenated with features cropped from the image feature map at the object's location. So multiple object instances correspond to one image, and one image is just one sample in the batch.
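To make the cropping step concrete, here is a minimal sketch, assuming PyTorch. The tensor sizes, the `boxes` list, and the 2x2 pooled output size are all made up for illustration; boxes are assumed to be given as (image index, x1, y1, x2, y2) in feature-map coordinates.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for the CNN output: N x C x H x W (sizes are illustrative).
feature_map = torch.randn(2, 8, 16, 16)

# Hypothetical object boxes: (image index, x1, y1, x2, y2).
# Image 0 has three objects, image 1 has one.
boxes = [(0, 2, 3, 10, 12), (0, 0, 0, 4, 4), (0, 5, 5, 16, 9), (1, 1, 1, 8, 8)]

crops = []
for b, x1, y1, x2, y2 in boxes:
    # Slice the region for this object; its spatial size varies per box.
    region = feature_map[b:b + 1, :, y1:y2, x1:x2]   # 1 x C x h x w
    # Pool every region to the same fixed size so the crops can be stacked.
    crops.append(F.adaptive_avg_pool2d(region, output_size=(2, 2)))

crops = torch.cat(crops, dim=0)   # (total objects) x C x 2 x 2
print(crops.shape)
```

The Python loop is only for readability. If torchvision is available, `torchvision.ops.roi_align` is one batched alternative for this kind of fixed-size crop.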

What’s the best way to structure the input tensor?
Images can be structured as NxCxHxW and go through a CNN to obtain a feature map.
What about object level tensors?
Option 1:
Structure the objects tensor as NxCxLxS, where N is the batch size, C is the number of channels, L is the input length of each object, and S is the number of objects in each sample?
Option 2:
Structure the objects tensor as MxCxL, where M = N*S, and slice it later by start and end index to concatenate with the image feature map? Will autograd work in this case?
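For reference, a minimal sketch of what the second layout could look like, assuming PyTorch. Because the number of objects varies per image, M here is the total number of objects in the batch rather than exactly N*S. The sizes, the `counts` tensor, the `object_encoder` network, and the global average pooling used in place of real cropping are all hypothetical stand-ins.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes only.
N, C_img, H, W = 2, 8, 16, 16      # image feature maps: N x C_img x H x W
C_obj, L = 4, 10                   # each object: C_obj x L
counts = torch.tensor([3, 1])      # objects per image (variable)
M = int(counts.sum())              # total objects in the batch

feature_map = torch.randn(N, C_img, H, W, requires_grad=True)
objects = torch.randn(M, C_obj, L, requires_grad=True)

# batch_index[i] is the image that object i belongs to: [0, 0, 0, 1].
batch_index = torch.repeat_interleave(torch.arange(N), counts)

# Shared object-level encoder (a stand-in for the real network).
object_encoder = nn.Sequential(
    nn.Conv1d(C_obj, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
)
obj_feat = object_encoder(objects)                     # M x 16

# Stand-in for cropping: pool each image's feature map, then pick the
# row matching each object's image via indexing.
img_feat = feature_map.mean(dim=(2, 3))[batch_index]   # M x C_img

fused = torch.cat([obj_feat, img_feat], dim=1)         # M x (16 + C_img)

# Recover per-image groups when needed.
per_image = torch.split(fused, counts.tolist())

# Indexing, slicing, cat and split are all differentiable.
fused.sum().backward()
print(feature_map.grad is not None, objects.grad is not None)
```

In this sketch gradients reach both `feature_map` and `objects`, which suggests that slicing or indexing a flat MxCxL tensor does not break autograd. The first layout would instead need padding up to the largest object count, plus a mask to ignore the padded slots.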

What is the recommended approach here? Thanks a lot.