Different results using jit models

Hi all, I have a custom regression model implemented in PyTorch and I am noticing discrepancies in performance that seem larger than random drift.

First, let me describe my data. I have a training set (generated ‘randomly’), a dummy test set (generated ‘randomly’ in the same way as the training set), and a real-world test set (significantly smaller than the dummy test set and not generated in the same way).

There are 3 cases I am comparing (a rough sketch of the setups is below):
Case 1: training the model (a custom nn.Module) with no JIT at all
Case 2: training torch.jit.script(model), where the modules are custom nn.Modules
Case 3: training the model where the nn.Modules have been rewritten as torch.jit.ScriptModule subclasses with all methods decorated with @torch.jit.script_method.
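
To make the three cases concrete, here is a minimal sketch of the setups (the architecture and sizes are just placeholders, not my actual model):

```python
import torch
import torch.nn as nn

# Case 1/2: a plain nn.Module (case 2 simply wraps it with torch.jit.script)
class PlainNet(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.hidden = nn.Linear(in_features, 64)
        self.out = nn.Linear(64, out_features)

    def forward(self, x):
        return self.out(torch.relu(self.hidden(x)))

# Case 3: the same architecture rewritten as a ScriptModule subclass
class ScriptedNet(torch.jit.ScriptModule):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.hidden = nn.Linear(in_features, 64)
        self.out = nn.Linear(64, out_features)

    @torch.jit.script_method
    def forward(self, x):
        return self.out(torch.relu(self.hidden(x)))

model_case1 = PlainNet(10, 3)                    # case 1: eager
model_case2 = torch.jit.script(PlainNet(10, 3))  # case 2: scripted nn.Module
model_case3 = ScriptedNet(10, 3)                 # case 3: ScriptModule subclass
```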

In cases 1 and 2 the results are similar enough in terms of test/prediction error and run times. Case 3 is where things get interesting. In case 3 I obtain significantly worse results on a subset of the output features (~10% worse MAPE than cases 1 and 2) on the dummy test set. On its own this would be fine; it probably means there’s a bug in my ScriptModule code. BUT, using this ScriptModule to predict on the real-world test set yields results ~10% better than case 1 or 2.

The issue here is twofold. First, I don’t actually care about my results on the dummy test set. Second, in this case I happen to have ground-truth labels for my real-world test data, but in practice I will not have access to them and can only compare models by their performance on the dummy test data. How do I proceed in situations like this? I feel like I’ve opened Pandora’s box here, because I only started using the JIT to obtain some performance gains and was not expecting large differences in model results.
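
For reference, here is a rough sketch of how I compare the cases via per-feature MAPE on the dummy test set (model_case1/model_case3 and the data tensors are placeholders from the sketch above, not my real pipeline):

```python
import torch

def mape(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # mean absolute percentage error, computed per output feature
    return ((pred - target).abs() / (target.abs() + eps)).mean(dim=0) * 100

# stand-ins for the dummy test set (my real data is generated differently)
x_dummy = torch.randn(256, 10)
y_dummy = torch.randn(256, 3)

with torch.no_grad():
    mape_case1 = mape(model_case1(x_dummy), y_dummy)
    mape_case3 = mape(model_case3(x_dummy), y_dummy)

print("per-feature MAPE, case 1:", mape_case1)
print("per-feature MAPE, case 3:", mape_case3)
```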

You could try to narrow down which module creates the different outputs between models 1/2 and 3 to further isolate the issue, as it seems your custom ScriptModule approach might be causing it.
Once you’ve isolated the layer, could you post it here so that we can take a look at it?
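
For example, here is a rough sketch of one way to do that: capture each submodule’s input/output on the eager model with forward hooks, then feed the same captured input through the corresponding submodule of the scripted model (this assumes both versions share submodule names and that each submodule takes and returns plain tensors; eager_model, scripted_model, and x are placeholders, and the same state_dict should be loaded into both models first):

```python
import torch

@torch.no_grad()
def capture_io(model, x):
    """Record each direct child's (input, output) on the eager model via forward hooks."""
    records = {}
    handles = []
    for name, child in model.named_children():
        def hook(mod, inp, out, name=name):
            records[name] = (inp, out.detach())
        handles.append(child.register_forward_hook(hook))
    model(x)
    for h in handles:
        h.remove()
    return records

# eager_model (case 1/2) and scripted_model (case 3) are placeholders;
# x is a fixed input batch.
records = capture_io(eager_model, x)
for name, (inp, eager_out) in records.items():
    scripted_child = getattr(scripted_model, name, None)
    if scripted_child is None:
        print(f"{name}: no matching submodule in the scripted model")
        continue
    with torch.no_grad():
        scripted_out = scripted_child(*inp)
    max_diff = (eager_out - scripted_out).abs().max().item()
    print(f"{name}: max abs diff = {max_diff:.3e}")
```

The first layer that shows a large difference would be the one to post here.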