Maintaining Gradients While Connecting Two Models

I am connecting two models: one extracts entities from a document, and the other uses these entities to generate a text summary. Both models were originally trained separately in a pipeline. I am now trying to train them jointly so that gradients can flow all the way from the second model back to the first. However, I am having trouble maintaining the gradients across this integration.

For example, the first model generates candidate entity starts, ends, and their corresponding scores. The candidate starts and ends have no gradient function attached to them, since they are simply all possible candidate spans up to a maximum length. The candidate scores, however, are produced by scoring every candidate with an FNN. I would like to feed the best candidates to the next model without detaching the gradients. What I have now is a plain Python function that takes the scores and returns the indices of the best candidates. I then use these indices to gather the corresponding candidate starts and ends. However, that does not preserve the gradient, since neither the candidate_starts nor the candidate_ends matrix has a grad function attached to it.

### Model 1
# [num-sentences, #candidates]
candidate_starts = [
[1,1,3],
...
]

# [num-sentences, #candidates]
candidate_ends = [
[1,2,4],
...
]

# [num-sentences, #candidates]
candidate_scores = FNN(...) = [
[0.1, 0, 0.3],
...
]

# where candidate_scores.grad_fn != None

### Model 2
# best entities (entity 1 and entity 3)
entity_input = [
   [1],    # each entity expands to the range from entity_start to entity_end, here 1 to 1
   [3, 4]  # 3 to 4
]

I tried gathering candidate_starts and candidate_ends with the best indices returned from the function, but that does not keep the gradients. Is there a way to gather the indices from the scores tensor itself (in a way similar to max) so that they can be used later in the backward pass?
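For concreteness, here is a minimal runnable sketch of what I am doing now (the shapes and the top_k value are just placeholders for illustration):

```python
import torch

# Toy versions of the tensors above: candidate_starts/ends are plain integer
# tensors with no autograd history; candidate_scores stands in for the FNN
# output (requires_grad=True here, grad_fn from the FNN in the real model).
candidate_starts = torch.tensor([[1, 1, 3]])                            # [num_sentences, num_candidates]
candidate_ends   = torch.tensor([[1, 2, 4]])                            # [num_sentences, num_candidates]
candidate_scores = torch.tensor([[0.1, 0.0, 0.3]], requires_grad=True)  # [num_sentences, num_candidates]

# Select the best candidates by score, then gather their starts/ends.
top_k = 2
best_idx = candidate_scores.topk(top_k, dim=-1).indices    # LongTensor, no grad history
best_starts = torch.gather(candidate_starts, 1, best_idx)  # grad_fn is None
best_ends   = torch.gather(candidate_ends,   1, best_idx)  # grad_fn is None
print(best_starts.grad_fn, best_ends.grad_fn)              # None None -> graph is cut here
```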

Thanks!

Hi,

You will need to use the scores in a differentiable manner and give a differentiable input to the second network if you want gradients to flow back.
Note that if the inputs of the second network are integers, this most likely won't be possible, as you cannot get gradients for non-continuous types.
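A minimal sketch of what that could look like, assuming the second model can consume float representations of the spans (candidate_embeddings is a placeholder for whatever span representation you have, e.g. pooled token embeddings); this is one possible approach, not the only one:

```python
import torch

# Placeholder tensors: scores from the FNN and float span representations.
candidate_scores = torch.randn(1, 3, requires_grad=True)         # [num_sentences, num_candidates]
candidate_embeddings = torch.randn(1, 3, 8, requires_grad=True)  # [num_sentences, num_candidates, dim]

# Option A: keep the top-k *values* (they retain grad_fn) and use them to gate
# the representations fed to the second model; the hard indices stay integer,
# but gradients still flow through the selected scores.
top = candidate_scores.topk(2, dim=-1)
top_scores = top.values                                           # has grad_fn
top_embs = torch.gather(
    candidate_embeddings, 1,
    top.indices.unsqueeze(-1).expand(-1, -1, candidate_embeddings.size(-1)))
model2_input_a = top_scores.unsqueeze(-1) * top_embs

# Option B: soft selection: a softmax over all candidate scores used as
# attention weights over the candidate representations.
weights = torch.softmax(candidate_scores, dim=-1)                 # [num_sentences, num_candidates]
model2_input_b = (weights.unsqueeze(-1) * candidate_embeddings).sum(dim=1)

# Both options keep candidate_scores in the graph.
loss = model2_input_a.sum() + model2_input_b.sum()
loss.backward()
print(candidate_scores.grad is not None)                          # True
```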
