Validation and testing accuracy much higher than production accuracy

Sorry for such a long post; I tried to provide all the relevant information. This is my first attempt at neural networks, so I’ve done a few things the long way to allow for some debugging. It mostly works, so I haven’t gone through and done any cleanup.

I’m using four 7-channel images in an attempt to separate wetlands from upland and water. Training, validation, and testing are showing very promising results, with accuracy around 90% in all classes. But when I save the model, load it, and classify one of the training images, I get only around 60% accuracy.

I break up the original files into 128x128 images and do some data augmentation, which includes flipping, rotating, and transposing. The paths are all saved in a CSV file. There are around 35k tiny images.
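
The tiling script itself isn’t shown here; the idea is roughly the following (a simplified sketch rather than the actual script, with save_tile standing in for the real raster I/O):

import csv
import numpy as np

def eight_fold(tile):
    """Return the 8 dihedral variants of a (C, H, W) tile:
    each of the 4 rotations plus the transpose of each."""
    variants = []
    for k in range(4):
        rot = np.rot90(tile, k=k, axes=(1, 2))
        variants.append(rot)
        variants.append(np.transpose(rot, (0, 2, 1)))  # swap H and W
    return variants

def tile_image(image, mask, tile_size=128):
    """Yield (image_tile, mask_tile) chunks of tile_size x tile_size."""
    _, height, width = image.shape
    for row in range(0, height - tile_size + 1, tile_size):
        for col in range(0, width - tile_size + 1, tile_size):
            yield (image[:, row:row + tile_size, col:col + tile_size],
                   mask[row:row + tile_size, col:col + tile_size])

# image is a (channels, H, W) array and mask an (H, W) array loaded elsewhere;
# save_tile() is a placeholder for the real raster I/O.
# with open(csvfile, 'w', newline='') as f:
#     writer = csv.writer(f)
#     for i, (img_tile, map_tile) in enumerate(tile_image(image, mask)):
#         for j, (aug_img, aug_map) in enumerate(zip(eight_fold(img_tile),
#                                                    eight_fold(map_tile[None]))):
#             path = save_tile(aug_img, aug_map[0], i, j)
#             writer.writerow([path, 'train'])  # split assignment omitted here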

I’m using:
Python 3.6.6
PyTorch 0.4.1
UNet model
CrossEntropyLoss
Adam optimizer

Setting up the environment to get reproducible results

np.random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
model = unet.UNet(num_classes, in_channels=in_channels+extra_indices, depth=6, start_filts=64)
if torch.cuda.is_available():
    model.cuda()

I get the population mean and stdev as well as class weights based on a shuffled sample of my training data. I have some areas of nodata which I ignore.

dataset_train = data_utils.SatIn(data_path, csvfile, 'train', transform=transforms.Compose([aug.RemoveZeros(), aug.RemoveNoData(), aug.WetlandOnly(), aug.SpecificBands(usebands), aug.NDVI(bands=[1,2])]))
train_dataloader = DataLoader(dataset_train, batch_size=2000, num_workers=1, shuffle=True)
if normalize or weighting:
    for i, data in enumerate(train_dataloader):
        if i == 0:
            if normalize:
                numpy_image = data['sat_img'].numpy()
                popmean = torch.from_numpy(np.mean(np.multiply(numpy_image, (numpy_image < -3.5028235e+4) | (numpy_image > -3.3028235e+4)), axis=(0,2,3))).float()
                popstd = torch.from_numpy(np.std(np.multiply(numpy_image, (numpy_image < -3.5028235e+4) | (numpy_image > -3.3028235e+4)), axis=(0,2,3))).float()
            if weighting:
                unique, counts = np.unique(data['map_img'].numpy(), return_counts=True)
                mapunique = unique
                mapcount = counts
                print('Unique:', unique)
                print('Counts:', counts)
                weights = 1 - (counts / counts.sum())
            break
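
(Side note: the statistics above come from a single 2,000-tile batch. A running computation over every training batch would look roughly like this; this is only a sketch, not the code I actually run, and it skips the nodata masking for brevity.)

import numpy as np
import torch

channel_sum = 0.0
channel_sqsum = 0.0
pixel_count = 0
for data in train_dataloader:
    x = data['sat_img'].numpy().astype(np.float64)  # (N, C, H, W)
    channel_sum = channel_sum + x.sum(axis=(0, 2, 3))
    channel_sqsum = channel_sqsum + (x ** 2).sum(axis=(0, 2, 3))
    pixel_count += x.shape[0] * x.shape[2] * x.shape[3]
channel_mean = channel_sum / pixel_count
popmean = torch.from_numpy(channel_mean).float()
popstd = torch.from_numpy(np.sqrt(channel_sqsum / pixel_count - channel_mean ** 2)).float()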

Set up the loss and optimizer

# criterion (class weights must be a FloatTensor, on the GPU if the model is)
if weighting:
    class_weights = torch.from_numpy(weights).float()
    if torch.cuda.is_available():
        class_weights = class_weights.cuda()
    criterion = nn.CrossEntropyLoss(weight=class_weights)
else:
    criterion = nn.CrossEntropyLoss()

# optimizer
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# decay LR
lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

Load data

transform = transforms.Compose([aug.RemoveZeros(), aug.RemoveNoData(), aug.WetlandOnly(),
                                aug.SpecificBands(usebands), aug.NDVI(bands=[1,2]),
                                aug.ToTensorTarget(), aug.NormalizeTarget(mean=popmean, std=popstd)])
dataset_train = data_utils.SatIn(data_path, csvfile, 'train', transform=transform)
dataset_val = data_utils.SatIn(data_path, csvfile, 'valid', transform=transform)
dataset_test = data_utils.SatIn(data_path, csvfile, 'test', transform=transform)
train_dataloader = DataLoader(dataset_train, batch_size=batch_size, num_workers=4, shuffle=True)
val_dataloader = DataLoader(dataset_val, batch_size=3, num_workers=4, shuffle=False)
test_dataloader = DataLoader(dataset_test, batch_size=3, num_workers=4, shuffle=False)

Training, validation, and testing have very similar methods. I call model.eval() and don’t do any optimizer steps or backpropagation in validation and testing.

def train(train_loader, model, criterion, optimizer, scheduler, epoch_num):
    model.train()
    correct = 0
    totalcount = 0
    scheduler.step()

    # iterate over data
    for idx, data in enumerate(tqdm(train_loader, desc="training")):
        # get the inputs and wrap in Variable
        if torch.cuda.is_available():
            inputs = Variable(data['sat_img'].cuda())
            labels = Variable(data['map_img'].cuda())
        else:
            inputs = Variable(data['sat_img'])
            labels = Variable(data['map_img'])
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels.long())
        loss.backward()
        optimizer.step()
        _, predicted = torch.max(outputs.data, 1)
        test = predicted == labels.long()
        correct += test.sum().item()
        totalcount += test.size()[0] * test.size()[1] * test.size()[2]

    print('Training Loss: {:.4f}, Accuracy: {:.4f}'.format(loss.item(), correct/totalcount))
    return {'train_loss': loss.item(), 'train_acc': correct/totalcount}

def validation(valid_loader, model, criterion, epoch_num):
    correct = 0
    totalcount = 0
    model.eval()

    # Iterate over data.
    for idx, data in enumerate(tqdm(valid_loader, desc='validation')):
        # get the inputs (volatile=True is a no-op in 0.4; use torch.no_grad() instead)
        if torch.cuda.is_available():
            inputs = data['sat_img'].cuda()
            labels = data['map_img'].cuda()
        else:
            inputs = data['sat_img']
            labels = data['map_img']
        # forward without tracking gradients
        with torch.no_grad():
            outputs = model(inputs)
            loss = criterion(outputs, labels.long())
        _, predicted = torch.max(outputs.data, 1)
        test = predicted == labels.long()
        correct += test.sum().item()
        totalcount += test.size()[0] * test.size()[1] * test.size()[2]

    print('Validation Loss: {:.4f} Acc: {:.4f}'.format(loss.item(), correct/totalcount))
    return {'valid_loss': loss.item(), 'valid_acc': correct/totalcount}

Epoch loop

for epoch in range(start_epoch, num_epochs):
    # lr_scheduler.step() is already called inside train(), so it isn't repeated here
    training = train(train_dataloader, model, criterion, optimizer, lr_scheduler, epoch)
    validating = validation(val_dataloader, model, criterion, epoch)

Save model

torch.save(model.state_dict(), os.path.join(rootdir, '3Image_8bands.pth'))
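
(Something that can help keep a separate production script in sync is saving the preprocessing settings alongside the weights. A sketch of what that could look like; the file name and dictionary keys are arbitrary:)

# Bundle the normalization statistics and band settings with the weights so the
# classification script can reuse exactly the same preprocessing.
checkpoint = {
    'state_dict': model.state_dict(),
    'popmean': popmean,
    'popstd': popstd,
    'usebands': usebands,
    'num_classes': num_classes,
}
torch.save(checkpoint, os.path.join(rootdir, '3Image_8bands_checkpoint.pth'))

# Later, in the classification script:
# checkpoint = torch.load(os.path.join(rootdir, '3Image_8bands_checkpoint.pth'))
# model.load_state_dict(checkpoint['state_dict'])
# popmean, popstd = checkpoint['popmean'], checkpoint['popstd']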

At this point everything looks great. Now to load the model:

#Set seeds
np.random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.enabled = False
# get model
model = unet.UNet(num_classes, in_channels=in_channels+extra_indices, depth=6, start_filts=64)
if torch.cuda.is_available():
    model.cuda()
model.load_state_dict(torch.load(os.path.join(rootdir, modelName)))
model.eval()

I break up the image to be classified in the same way I did for training: the original image is broken down into 128x128 chunks, and each of those chunks is run through the classifier.

collectpredictions = []
dataset_test = classify.ClassOut(data_path, csvfilename, 'test',  transform=transforms.Compose([aug.RemoveZeros(), aug.RemoveNoData(), aug.WetlandOnly(), aug.SpecificBands(usebands), aug.NDVI(bands=[1,2]), aug.ToTensorTarget(), aug.NormalizeTarget(mean=popmean, std=popstd)]))
test_dataloader = DataLoader(dataset_test, batch_size=1, num_workers=3, shuffle=False)
for idx, data in enumerate(test_dataloader):
    # get the inputs (volatile is a no-op in 0.4; use torch.no_grad() instead)
    if torch.cuda.is_available():
        inputs = data['sat_img'].cuda()
    else:
        inputs = data['sat_img']
    # forward without tracking gradients
    with torch.no_grad():
        outputs = model(inputs)
    _, predicted = torch.max(outputs.data, 1)
    collectpredictions.append(predicted[0].cpu().numpy())

Where might I be going wrong?
Thanks!

I tested with:
torch.backends.cudnn.enabled = False

and still get terrible results. It’s also much slower.

I also tried classifying the image in the same script as my training to avoid saving and loading the model. I’m still getting bad results, so saving/loading isn’t the issue.

Hi Mike!

At the risk of asking the obvious, what happens if you classify a
different training image (or one of your test or validation images)?
Could your “one of the training images” be an unlucky image?

As I understand it, you don’t data-augment the image for which
you get 60% accuracy. (Do you data-augment your validation
and test images?) What happens if you train without data
augmentation? Do you still see a discrepancy between your
training accuracy and your single-image accuracy?

Best regards.

K. Frank

Thanks K. Frank.

I do augment my training data; I’ll try removing that. My accuracy shoots up to ~90% in the first epoch, so I believe I may be overfitting.

Hi Mike!

I’m still not completely clear on what you are doing and what
your problem is.

Just to make sure that we’re on the same page regarding context
and terminology, here are a few comments:

Training, validation, and test data: You train on your training data.
That is you fit your network to get good results on your training
data. You then run your network on your test data to see if you
get similar results. It’s okay if your test results are a little worse
than your training results – after all, you did fit your training data.
But if your test results are significantly worse, then you probably
optimized too far, and overfit your training data.

If you want to get a little fancier, while you’re training, you can
run your partially-trained network on your so-called validation
data. You might then stop training when your validation results
get good enough, or when they start to get worse, or when they
start to diverge too much from your training results.
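
For example, a validation-based stopping rule might look something
like this (just a sketch, using the train() and validation() functions
from your post; the patience value and file name are arbitrary):

best_val_loss = float('inf')
epochs_without_improvement = 0
patience = 3

for epoch in range(start_epoch, num_epochs):
    train(train_dataloader, model, criterion, optimizer, lr_scheduler, epoch)
    metrics = validation(val_dataloader, model, criterion, epoch)
    if metrics['valid_loss'] < best_val_loss:
        # validation improved: remember it and keep a copy of the weights
        best_val_loss = metrics['valid_loss']
        epochs_without_improvement = 0
        torch.save(model.state_dict(), 'best_model.pth')  # arbitrary file name
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # validation stopped improving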

Your validation and test results ought to be quite similar and in
many cases you don’t really need separate validation and test
data. But if you use your validation results as a more substantive
part of your model building, for example to tweak your learning
rate or explore modified network architectures, then it would
be prudent to run your final model on some “clean” test data
that hadn’t been used at all when training your model.

You said in your original post that when you reran your model
on some training data, your accuracy fell from 90% to 60%.
This would be very surprising and suggest an outright error.

Even if you overfit your training data, you should still get
good results on your training data. Of course, you might
get significantly worse results on your test data, but that’s not
what you said in your original post.

The point is that whether or not you overfit, or whether or not
you augment your training data, if you rerun your model on
training data, you should get the same results as when you
ran it on the training data before. If you get significantly
different results (i.e., 60% accuracy instead of 90%) something
looks very fishy.

So I only see two choices: You have an error in your calculation
somewhere, or your 60% accuracy came from running your model
on what turned out to be an unusual subsample of your training
data. This latter seems quite unlikely, but not impossible.

If you are still having this issue, perhaps you could clarify what
you did, and add a few more details about which, and which kinds
of data give you which accuracies.

Best.

K. Frank

Thanks for being patient, and for that explanation. Here is a bit more detail.

I have 4 large satellite images that include multiple channels. I have a script that tiles each satellite image separately into 128x128xchannels tiles (red, NIR, slope, and a few indices). Each of these tiles is rotated 3 times, transposed, and rotated again, so one tile becomes 8 tiles. The script randomly assigns 72% of the tiles as training, 8% as validation, and 20% as test.
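
(A sketch of that kind of random split assignment; tile_paths here is just a placeholder list, not my actual variable:)

import numpy as np

tile_paths = ['tile_000.tif', 'tile_001.tif']  # placeholder for the saved tile paths
rng = np.random.RandomState(0)
splits = rng.choice(['train', 'valid', 'test'], size=len(tile_paths), p=[0.72, 0.08, 0.20])
rows = list(zip(tile_paths, splits))  # written out to the CSV along with the paths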

I do a little augmentation in the form of setting nodata to a specific value in my channels and in my mask. I also normalize my data. I do this for ALL tiles (training, validation, and test).

I load the data and check it with a few image comparisons to make sure my training data aligns properly with my mask. Everything here looks good.

I start training my model with the training tiles. Currently there are about 30k total 128x128 tiles (72% of those as training) and I’m using a batch size of 16; I run out of memory if I jump to 32. After the first epoch my training accuracy is ~90%. I run validation and also get about 90% accuracy. I’ve been letting it run for a few epochs and then stopping. Each time, training and validation accuracy are around 90%.

I now apply the model to the test tiles. I check a few test predictions against the mask and a band of satellite imagery to make sure everything is looking good. Sure enough, it all lines up as it should and reports about 90% accuracy as well. I actually ran a confusion matrix on it and all 3 classes are performing great (88% accuracy for the worst one). There are really 4 classes, because I also classify some nodata as a 4th class; this nodata class typically gets 99 or 100% accuracy.

At this point I’m happy with how things are looking and everything makes sense.

To run my “production” model I’ve written another script that does the tiling and the little bit of augmentation described above (nodata removal and normalization), applies the model, and then outputs the prediction to a tile. I merge all the tiles and it’s done.
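
(A sketch of what a merge step like that can look like; the tile-origin bookkeeping here is hypothetical, not my actual code:)

import numpy as np

def merge_tiles(predictions, tile_origins, out_height, out_width, tile_size=128):
    """Stitch 128x128 tile predictions back into a full-size map.
    tile_origins is a list of (row, col) upper-left corners, in the same
    order the tiles were fed through the dataloader."""
    merged = np.zeros((out_height, out_width), dtype=predictions[0].dtype)
    for pred, (row, col) in zip(predictions, tile_origins):
        merged[row:row + tile_size, col:col + tile_size] = pred
    return merged

# full_map = merge_tiles(collectpredictions, tile_origins, height, width)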

I figured I could feed this production classification script one of my initial satellite images and should get pretty good results. This is where I’m getting 60% accuracy. As I would expect, and as you said, it should be much higher. Maybe I need to dig into my classification script more.

Hi Mike!

Well, as you say, since your classification script is separate from
your training script, it would certainly make sense to check that no
outright error has crept into one or the other.

You do, however, say one thing that caught my eye. You normalize
your training (and test) data, but you don’t normalize your data in
the classification script. I could well believe that training on
normalized data would give you a model that performs much less
well on unnormalized data. (Without seeing your data or how you
normalize I don’t really have an expectation one way or the other,
but it wouldn’t surprise me for this to be the issue.)

(The same comments apply to the “data removal” you mentioned.)

By the way, there’s nothing the matter with having your final
production classification pipeline start with a normalization step
so that it matches how you did your training. If normalization
helps you train your model, I would normalize the production
data to be classified, as well.
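
Concretely, that just means reusing the training-set statistics in the
production transform, something like this (only a sketch, mirroring the
transforms from your scripts; train_popmean and train_popstd stand for
the values you saved from training):

production_transform = transforms.Compose([
    aug.RemoveZeros(),
    aug.RemoveNoData(),
    aug.WetlandOnly(),
    aug.SpecificBands(usebands),
    aug.NDVI(bands=[1, 2]),
    aug.ToTensorTarget(),
    aug.NormalizeTarget(mean=train_popmean, std=train_popstd),  # training stats, not recomputed
])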

Best.

K. Frank

K Frank,

You were absolutely right. My training data had the class mask as channel 1; in my production script’s data input I ignored that (thinking of it as a true production scenario), so all of my input channels were off. After adding a single offset everything worked much better. Gotta love debugging.

It’s hard to keep all this stuff straight sometimes! I do normalize my training/validation/test data using the mean and standard deviation derived from the training samples. I normalize my “production” test data with values derived from the input test data.

It may be a better idea to normalize based on samples of whichever dataset is going in: training normalization would use mean/stdev derived from itself, while validation would derive mean/stdev from validation samples, and the same for test.

Thank you very much for your help.

Mike

I’m using the same dataset for training and validation, but I get different accuracies at training time and at evaluation time:
during training it gives an accuracy of 46.658333, but validation on the same epoch and dataset gives 55.625000. Do you think there is anything wrong with my model? Here is a summary from training:

Epoch: 0	Training Loss: 1.138581	Validation Loss: 1.013368
 Training Accuracy: 46.658333	Validation Accuracy: 55.625000


Epoch: 1	Training Loss: 0.932076	Validation Loss: 0.648011
 Training Accuracy: 59.308334	Validation Accuracy: 67.849998

Epoch: 2	Training Loss: 0.704100	Validation Loss: 0.423650
 Training Accuracy: 71.991669	Validation Accuracy: 80.199997

Epoch: 3	Training Loss: 0.447224	Validation Loss: 0.471296
 Training Accuracy: 82.716667	Validation Accuracy: 89.824997

Epoch: 4	Training Loss: 0.274694	Validation Loss: 0.292243
 Training Accuracy: 89.974998	Validation Accuracy: 93.008331
Validation loss decreased...........
saving model