I don't think this is because of momentum. It is more likely caused by the way new samples are selected from the dataset.
PyTorch selects samples from the dataset without replacement.
This means that at the beginning of a new epoch, you may well see a sample that you just saw at the end of the previous epoch, but within a single epoch no sample is ever repeated.
Caffe probably samples with replacement, so a repeated sample is equally likely at any point in the epoch.
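For intuition, here is a minimal sketch (not tied to any DataLoader) of what the two index streams look like over one epoch; the dataset size of 10 is just a made-up example:

import torch

n = 10  # hypothetical dataset size, for illustration only

# without replacement (PyTorch's default shuffling): a permutation,
# so every index appears exactly once per epoch
epoch_without = torch.randperm(n)

# with replacement (the Caffe-like behaviour): indices drawn independently,
# so duplicates can show up anywhere in the epoch
epoch_with = torch.randint(0, n, (n,))

print(epoch_without.tolist())  # e.g. [3, 7, 0, 9, 1, 5, 2, 8, 6, 4]
print(epoch_with.tolist())     # e.g. [2, 2, 9, 0, 5, 9, 1, 4, 4, 7]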
To verify this theory, you can write a with-replacement sampler and see whether it removes the sawtooth shape from the loss curve:
import torch
from torch.utils.data import Sampler


class WithReplacementRandomSampler(Sampler):
    """Samples elements randomly, with replacement.

    Arguments:
        data_source (Dataset): dataset to sample from
    """

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        # draw `len(data_source)` indices uniformly from `0` to `len(data_source) - 1`,
        # so the same index can be drawn more than once
        samples = torch.LongTensor(len(self.data_source))
        samples.random_(0, len(self.data_source))
        return iter(samples)

    def __len__(self):
        return len(self.data_source)
# then construct train_loader with the custom sampler; shuffle must stay False when a sampler is passed
self.train_loader = torch.utils.data.DataLoader(
    dataset, ..., sampler=WithReplacementRandomSampler(dataset), shuffle=False
)
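(Depending on your PyTorch version, the built-in torch.utils.data.RandomSampler may already accept a replacement=True argument, which would give the same behaviour without a custom sampler class.)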