I copied the code from https://github.com/pytorch/examples/blob/master/mnist/main.py
and added these lines:

    def reinforce():
        model.train()
        reward = torch.zeros([64,10]).cuda()
        for batch_idx, (data, target) in enumerate(train_loader):
            if len(target)<64:
                continue
            temp_target = target.clone()
            if args.cuda:
                data, target = data.cuda(), target.cuda()
            data, target = Variable(data), Variable(target)
            optimizer.zero_grad()
            output = model(data)
            pred = output.data.max(1, keepdim=True)[1]
            for i in range(64):
                temp1 = temp_target[i]
                temp2 = pred[i].cpu().numpy()
                if temp1!=temp2[0]:
                    reward[i][temp_target[i]] = -1
            sample = torch.multinomial(output,10)
            sample.reinforce(reward)
            sample.backward()
            # loss = F.nll_loss(output, target)
            # loss.backward()
            optimizer.step()
            if batch_idx % args.log_interval == 0:
                print(batch_idx)

    for epoch in range(1, args.epochs + 1):
        train(epoch)
        test()
        reinforce()
        test()

The core idea is in this part:

    for i in range(64):
        temp1 = temp_target[i]
        temp2 = pred[i].cpu().numpy()
        if temp1!=temp2[0]:
            reward[i][temp_target[i]] = -1

I want to penalize the model whenever the predicted label does not equal the ground truth.
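To make the intended reward explicit, the same matrix could be built without the per-sample Python loop. This is only a sketch under some assumptions: `build_reward` is a hypothetical helper name (it is not in the MNIST example), it works on plain tensors rather than `Variable`, and it starts from a fresh zero matrix on every call, whereas `reinforce()` reuses a single `reward` tensor across batches. As in the loop, it puts -1 in the ground-truth column of each misclassified row.

```python
import torch

def build_reward(output, target):
    """Hypothetical helper: -1 in the ground-truth column of every misclassified row."""
    pred = output.argmax(dim=1)                    # predicted class per sample, shape [N]
    rows = torch.nonzero(pred != target).squeeze(1)  # indices of misclassified samples
    reward = torch.zeros_like(output)              # fresh [N, C] matrix of zeros
    reward[rows, target[rows]] = -1.0              # penalize at the ground-truth index
    return reward

# Tiny made-up batch: sample 0 is classified correctly, sample 1 is not.
output = torch.tensor([[0.1, 0.9, 0.0],
                       [0.8, 0.1, 0.1]])
target = torch.tensor([1, 2])
print(build_reward(output, target))
```

In this made-up batch only the second row gets a nonzero entry, at column 2, which is its ground-truth class.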
The result: after the first train(), the accuracy is 94%, but after the first reinforce(), the accuracy drops to 11%.