What's wrong with my REINFORCE training on MNIST?

I copied the code from https://github.com/pytorch/examples/blob/master/mnist/main.py

And added these lines:

    def reinforce():
        model.train()
        reward = torch.zeros([64,10]).cuda()
        for batch_idx, (data, target) in enumerate(train_loader):
            if len(target)<64:
                continue
            temp_target = target.clone()
            if args.cuda:
                data, target = data.cuda(), target.cuda()
            data, target = Variable(data), Variable(target)
            optimizer.zero_grad()
            output = model(data)
            pred = output.data.max(1, keepdim=True)[1]
    
            for i in range(64):
                temp1 = temp_target[i]
                temp2 = pred[i].cpu().numpy()
                if temp1!=temp2[0]:
                    reward[i][temp_target[i]] = -1
    
            sample = torch.multinomial(output,10)
            sample.reinforce(reward)
            sample.backward()
            # loss = F.nll_loss(output, target)
            # loss.backward()
            optimizer.step()
            if batch_idx % args.log_interval == 0:
                print(batch_idx)
    
    for epoch in range(1, args.epochs + 1):
        train(epoch)
        test()
        reinforce()
        test()

The core idea is in this loop:

            for i in range(64):
                temp1 = temp_target[i]
                temp2 = pred[i].cpu().numpy()
                if temp1!=temp2[0]:
                    reward[i][temp_target[i]] = -1

I want to penalize the model whenever the predicted label does not match the ground truth.
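To make the intended penalty concrete, here is a minimal sketch of that rule as a score-function (REINFORCE) loss, written in plain Python so it runs without PyTorch. The helper names (`log_softmax`, `sample_action`, `reinforce_loss`) and the toy logits are assumptions for illustration and are not part of the MNIST example. It assumes one scalar reward per sampled action: 0 when the sampled class equals the target and -1 otherwise.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sample_action(log_probs, rng):
    """Draw one class index from the categorical distribution."""
    probs = [math.exp(lp) for lp in log_probs]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def reinforce_loss(batch_logits, targets, rng):
    """Surrogate loss: mean over the batch of -reward * log pi(action)."""
    total = 0.0
    rewards = []
    for logits, target in zip(batch_logits, targets):
        log_probs = log_softmax(logits)
        action = sample_action(log_probs, rng)
        # Penalty rule from the question: -1 on a mismatch, 0 otherwise.
        reward = 0.0 if action == target else -1.0
        rewards.append(reward)
        total += -reward * log_probs[action]
    return total / len(targets), rewards


rng = random.Random(0)  # fixed seed so the sketch is repeatable
batch_logits = [[2.0, 0.1, -1.0], [0.0, 0.0, 3.0]]  # toy scores, 3 classes
targets = [0, 2]
loss, rewards = reinforce_loss(batch_logits, targets, rng)
print(loss, rewards)
```

Because the reward is never positive here, the loss is never above zero, and minimizing it only lowers the probability of sampled classes that were wrong. In PyTorch the same pattern could be written with `torch.distributions.Categorical`, using `sample()` to draw the action and `log_prob()` for the term multiplied by the reward.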

The result: after the first train() the accuracy is 94%, but after the first reinforce() it drops to 11%.