Trained rectangular CNN broken after pytorch and numpy version update

Hi,

I am training a deep network regressor that takes a rectangular image as input and predicts a pixel location on that image. The initial convolution layers have rectangular kernels; after one inception layer the feature map is square and the subsequent layers use square kernels.
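To illustrate what I mean by rectangular kernels (a simplified sketch, not my actual architecture):

import torch
import torch.nn as nn

# Simplified sketch, not the actual architecture: the early layers use rectangular
# kernels and strides to shrink the width faster than the height; later layers use
# square kernels.
rect_block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=(3, 7), stride=(1, 2), padding=(1, 3)),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=(3, 5), stride=(1, 2), padding=(1, 2)),
    nn.ReLU(),
)
square_block = nn.Sequential(
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
)

x = torch.randn(1, 3, 68, 240)                 # (b, c, h, w)
print(rect_block(x).shape)                     # torch.Size([1, 32, 68, 60])
print(square_block(rect_block(x)).shape)       # torch.Size([1, 64, 68, 60])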

I had trained a feed-forward CNN in PyTorch 0.2.0_3. The system was then restarted and the numpy and pytorch versions were updated. With the new numpy version the rows and columns appear interchanged: an image now loads as (h,w,c), whereas previously it was (w,h,c). I do not remember the previous numpy version.

I changed the input data format to the network accordingly, but the predictions from my previously trained network are now completely wrong: it predicts random points in and around a central region of the image.

When I try retraining the model with the same exact parameters, it overfits and does not generalize to the test dataset. I tried reinstalling both the older version of pytorch (0.2.0_3) and the newer version (1.0.0). In both cases, a new training run overfits, and the previously trained model predicts random points in a central region of the image.

Has anyone experienced such issues with a rectangular CNN? Is there some internal data processing when using rectangular kernels that could have changed? The input to my pytorch network is (b,c,h,w).
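For reference, the layout conversion on my side looks roughly like this (the file name is a placeholder):

import cv2
import numpy as np
import torch

img = cv2.imread("frame.png")                         # placeholder path; OpenCV loads (h, w, c), BGR
print(img.shape)

chw = np.ascontiguousarray(img.transpose(2, 0, 1))    # (h, w, c) -> (c, h, w)
x = torch.from_numpy(chw).unsqueeze(0)                # add the batch dim -> (b, c, h, w)
print(x.shape)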

How did you load your images?
As far as I know, numpy doesn’t support image loading out of the box; other libraries like scipy do.
What happens if you just swap your axes? Does the model still yield bad results?
Do I understand correctly that after downgrading to the previous setup you can’t reproduce the original results?
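By swapping I mean something like this on your numpy images, before any other change (dummy shapes, just as a quick check):

import numpy as np

img = np.zeros((100, 200, 3), dtype=np.float32)  # dummy (h, w, c) image
swapped = np.transpose(img, (1, 0, 2))           # swap the first two axes -> (w, h, c)
print(swapped.shape)                             # (200, 100, 3)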

Hi,

I am reading the images using opencv. I have created a list of lists that contains the data in the format ((image1, point1), (image2, point2), …). The image data is normalized to [-1,1] and the pixel point is stored as floating point. I have saved this as a numpy file (.npy). When training the network, I read the numpy file, pass it to the DataLoader, and then read the data for training. As this is a custom dataset, I have to typecast the data to a torch Variable and float. Below I have first a simplified sketch of how the data file is created, and then the code for reading the data and training the network. Please let me know if the latter is correct.
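Creating the data file (the paths and values here are placeholders, and the exact resize/IO steps are omitted):

import cv2
import numpy as np

# Placeholder annotations: (image path, target pixel as (x, y)); the real list,
# resizing and file handling are of course different.
annotations = [("frame_0001.png", (206.8865, 10.2172)),
               ("frame_0002.png", (30.5665, 49.0771))]

samples = []
for path, (x, y) in annotations:
    img = cv2.imread(path)                        # (h, w, 3), BGR
    img = img.astype(np.float32) / 127.5 - 1.0    # normalize to [-1, 1]
    img = img.transpose(2, 0, 1)                  # (h, w, 3) -> (3, h, w)
    samples.append((img, np.array([x, y])))       # label stays float64

np.save("training_data.npy", np.array(samples, dtype=object))

Reading the data and training the network: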

import numpy as np
import torch
from torch.autograd import Variable

training_data = np.load("training_data.npy")

# numpy reads it as an array; convert to a list for compatibility with the DataLoader
# training data format is a list of lists with the image and the point
training_data = training_data.tolist()

trainloader = torch.utils.data.DataLoader(training_data, batch_size=64, shuffle=True)

net.train()
batch_loss = 0.0
for i, data in enumerate(trainloader):
	inputs, labels = data

	inputs, labels = Variable(inputs).float(), Variable(labels).float()
	inputs = inputs.cuda()
	labels = labels.cuda()

	optimizer.zero_grad()

	outputs = net(inputs)

	loss = criterion(outputs, labels)
	loss.backward()
	optimizer.step()

	batch_loss += loss.cpu().data.numpy()

The model does not learn anything when I swap the axes. The previously trained model doesn't work either. And yes, reverting to the older version does not help with training either.

I find it strange that a model that was learning and predicting very accurately is now suddenly predicting random points in a central region of the image. I understand that pytorch converts the python code to C++ internally for faster computation. Could it be that some package related to that has been updated? If so, what are the relevant packages?

Could you print the shape and maybe some examples of training_data before and after converting it to a list and also inside your training loop?
Which criterion are you using?

Hi,

I am using mean squared error (nn.MSELoss()) and the Adam optimizer (optim.Adam(net.parameters(), lr=1e-4, weight_decay=1e-2)).

Unfortunately I cannot post the images due to privacy and security concerns, but I did read the torch image (inputs), convert it to numpy (inputs.numpy() followed by a transpose(1, 2, 0)), save it with opencv and plot the corresponding label on it. It looks correct; the check is roughly the sketch below. After that you can find the code and output for the data shapes.
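This runs on a batch from the DataLoader before the .cuda() calls; the de-normalization constant and the output file name below are placeholders:

import cv2
import numpy as np

# inputs, labels: one batch from the DataLoader above, before the .cuda() calls
img = inputs[0].numpy().transpose(1, 2, 0)       # (3, 68, 240) -> (68, 240, 3)
img = ((img + 1.0) * 127.5).astype(np.uint8)     # undo the [-1, 1] normalization

x, y = labels[0].numpy()                         # assuming the label is (x, y) in pixels
cv2.circle(img, (int(round(x)), int(round(y))), 3, (0, 0, 255), -1)
cv2.imwrite("check.png", img)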

training_data = np.load("training_data.npy")

print "data shape:", training_data.shape

a,b = training_data[0]
print "image shape:", a.shape
print "label:", b
print "label shape:", b.shape

training_data = training_data.tolist()
print "length of training data:", len(training_data)
a,b = training_data[0]
print "image data:", a.shape
print "label:",b




for i, data in enumerate(trainloader):
    # get the inputs
    inputs, labels = data

    print "torch image:", inputs.shape
    print "torch labels shape:", labels.shape
    print "torch labels", labels

data shape: (5000,2)
image shape: (3,68,240)
label: [206.8865, 10.2172]

length of training data: 5000
image data: (3,68,240)
label: [206.8865, 10.2172]

torch image: (8, 3, 68, 240) (in ipython this shows as torch.Size([8, 3, 68, 240]))

torch labels shape: (8, 2) (torch.Size([8, 2]) in ipython)
torch labels:
tensor([[206.8865, 10.2172],
[ 30.5665, 49.0771],
[191.4570, 14.4499],
[171.6556, 30.7590],
[157.3364, 20.8971],
[ 75.7549, 13.3485],
[ 68.8032, 17.6255],
[107.8486, 23.5131]], dtype=torch.float64)

Thanks for the information!
Could you additionally check the shape of output before passing it to the criterion?

Hi,

The output is [8, 2], where 8 is the batch size. When I set the batch size to 1, the output reads something like this:

tensor([[144.8181, 24.1730]], device='cuda:0', grad_fn=)

I can’t find any mistake in your code/shapes.

Do you have any old model snapshots you could reload?

Yeah, I have tried that. All of them are corrupted: they are pretty much predicting constant points.

I have an implementation of the same network that uses a square input image instead of a rectangular one. That network is fine, but I am getting the same errors when loading a previously trained model. That's why I was wondering whether something has changed internally when loading a trained model using the state_dict function. This is the command I use to load a trained model for inference: net.load_state_dict(torch.load(trained_model_path)).

If everything here seems okay, I think it might be a major bug in my code. I will look into that. Thank you for your help!

It should be alright to load the state_dict. 0.2 is pretty old by now and a lot of things have changed.
However, you should get a warning or error if something goes wrong.
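A quick way to verify that the weights really arrive unchanged would be to compare a simple checksum before and after loading, e.g. (sketch; the path is a placeholder):

import torch

# "net" is assumed to be the same architecture that was trained;
# the checkpoint path is a placeholder.
state = torch.load("trained_model.pth", map_location="cpu")
print(sum(p.double().sum().item() for p in state.values()))

net.load_state_dict(state)
print(sum(p.double().sum().item() for p in net.state_dict().values()))

If the two sums differ, something went wrong during loading.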

Sure, I hope you can figure it out! :slight_smile: