GAN-based caption generation (how to propagate adversarial loss gradients correctly)?

I am looking at some papers on GAN-based image-to-caption generation:

They produce multiple captions for a single image using a GAN-based architecture, and they encode the predicted caption from the generator (an RNN) before feeding it into the discriminator.

My Question:

  1. When encoding the predicted caption, taking the torch.max indices / argmax of the generator outputs converts the logits into a discrete sentence, which can then be embedded and fed to the discriminator. The discriminator's output is then used in the adversarial loss for the generator. But argmax is not differentiable, so no gradient can flow back from the discriminator to the generator (I'm not sure how these papers implemented it).
  2. Instead of taking the argmax, if we feed the logits (or softmax probabilities) from the generator directly, wouldn't it be very easy for the discriminator to tell the two apart? Soft distributions (not one-hot) must come from the generator (fake), while one-hot vectors must come from the ground truth.
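To make the problem in point 1 concrete, here is a minimal PyTorch sketch (all names and sizes are made up for illustration, not taken from any of the papers). It shows that backpropagating through an argmax-based embedding lookup leaves the generator's logits without a gradient, and that one common workaround, straight-through Gumbel-Softmax (`F.gumbel_softmax` with `hard=True`) combined with a matmul against the embedding matrix, does deliver a gradient while still presenting a one-hot vector to the discriminator:

```python
import torch
import torch.nn.functional as F

vocab_size, embed_dim = 10, 4
torch.manual_seed(0)

# Hypothetical generator output for a single timestep.
logits = torch.randn(1, vocab_size, requires_grad=True)
# Hypothetical embedding table on the discriminator side.
embedding = torch.nn.Embedding(vocab_size, embed_dim)

# (1) Hard argmax: the token index is an integer tensor, so the autograd
# graph is cut and no gradient ever reaches the generator's logits.
token = logits.argmax(dim=-1)
emb_hard = embedding(token)
emb_hard.sum().backward()
print(logits.grad)  # None: nothing flowed back through argmax

# (2) Straight-through Gumbel-Softmax: the forward pass produces a one-hot
# sample (so the discriminator never sees a soft distribution), while the
# backward pass uses the soft softmax gradient.
one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)  # shape (1, vocab_size), one-hot
emb_soft = one_hot @ embedding.weight                   # differentiable "lookup" via matmul
emb_soft.sum().backward()
print(logits.grad is not None)  # True: gradient reaches the generator
```

The matmul trick in (2) also addresses point 2: because `hard=True` discretizes the forward pass, the discriminator sees genuine one-hot inputs for both real and generated captions, so it cannot cheat by checking for soft values. (REINFORCE-style policy gradients are the other common way such papers sidestep the argmax.)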

If anyone has ideas or experience with this topic (a generator with discrete classification outputs in a GAN), please share.

Thank you,