Problem saving nn.Module as a TorchScript module (DLRM model)

pollo_loco · July 23, 2020, 11:23pm

Hi, I am trying to create a TorchScript module from Facebook’s deep learning recommendation model (DLRM) using the torch.jit.script() method. The conversion fails with the following runtime error:

RuntimeError: 
cannot call a value of type 'Tensor':
  File "dlrm_s_pytorch.py", line 275
        # return x
        # approach 2: use Sequential container to wrap all layers
        return layers(x)
               ~~~~~~ <--- HERE
'DLRM_Net.apply_mlp' is being compiled since it was called from 'DLRM_Net.sequential_forward'
  File "dlrm_s_pytorch.py", line 343
    def sequential_forward(self, dense_x, lS_o, lS_i):
        # process dense features (using bottom mlp), resulting in a row vector
        x = self.apply_mlp(dense_x, self.bot_l)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        # debug prints
        # print("intermediate")
'DLRM_Net.sequential_forward' is being compiled since it was called from 'DLRM_Net.forward'
  File "dlrm_s_pytorch.py", line 337
    def forward(self, dense_x, lS_o, lS_i):
        if self.ndevices <= 1:
            return self.sequential_forward(dense_x, lS_o, lS_i)
                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        else:
            return self.parallel_forward(dense_x, lS_o, lS_i)

To recreate the error:

Clone the DLRM repository and install the requirements.

<activate virtual environment>
git clone https://github.com/facebookresearch/dlrm.git
cd dlrm
pip install -r requirements.txt

Add the following line to dlrm_s_pytorch.py after line 179 to work around a type conversion issue:

n = n.item()

Add the following snippet in dlrm_s_pytorch.py after the architecture object is initialized:

dlrm_jit = torch.jit.script(dlrm)
sys.exit() # successful exit after compiling, no need to train

Run the following command:

python dlrm_s_pytorch.py --arch-sparse-feature-size=32 --arch-embedding-size="70446-298426-33086-133729-61823" --data-size=20480 --arch-mlp-bot="256-256-128-32" --arch-mlp-top="256-64-1" --max-ind-range=400000 --data-generation=random --loss-function=bce --nepochs=5 --round-targets=True --learning-rate=1.0 --mini-batch-size=2048
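A note on what the trace seems to point at: apply_mlp receives the layer container as an ordinary argument (self.apply_mlp(dense_x, self.bot_l)), and TorchScript treats unannotated function arguments as Tensor, which would explain "cannot call a value of type 'Tensor'" at layers(x). The sketch below reproduces that pattern on a toy model and shows one possible workaround, calling the submodule through self instead of passing it around. The class names, the try_script helper, and the layer sizes are made up for illustration; only bot_l and apply_mlp mirror names from the trace, and this is not a tested patch for dlrm_s_pytorch.py.

```python
import torch
import torch.nn as nn


class PassesLayersAsArgument(nn.Module):
    """Toy model mirroring the pattern in the trace (hypothetical class)."""

    def __init__(self):
        super().__init__()
        self.bot_l = nn.Sequential(nn.Linear(4, 8), nn.ReLU())

    def apply_mlp(self, x, layers):
        # `layers` has no annotation, so TorchScript assumes it is a Tensor
        # and refuses to compile the call below.
        return layers(x)

    def forward(self, dense_x):
        return self.apply_mlp(dense_x, self.bot_l)


class CallsSubmoduleDirectly(nn.Module):
    """Same toy model, but the submodule is called through `self`."""

    def __init__(self):
        super().__init__()
        self.bot_l = nn.Sequential(nn.Linear(4, 8), nn.ReLU())

    def forward(self, dense_x):
        # Submodules reached as attributes of `self` are compiled normally.
        return self.bot_l(dense_x)


def try_script(module):
    """Return (scripted_module, None) on success or (None, message) on failure."""
    try:
        return torch.jit.script(module), None
    except Exception as err:  # broad on purpose: the exact type may vary by version
        return None, str(err)


if __name__ == "__main__":
    _, err = try_script(PassesLayersAsArgument())
    print("argument-passing version compiled:", err is None)

    scripted, err = try_script(CallsSubmoduleDirectly())
    print("direct-call version compiled:", err is None)
    if scripted is not None:
        print(scripted(torch.randn(2, 4)).shape)
```

If this is indeed the cause, the equivalent change in DLRM would be to call self.bot_l(x) and self.top_l(x) directly wherever apply_mlp is used; other parts of the model may still need their own TorchScript fixes.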

1chimaruGin · March 7, 2024, 6:47pm

Did you solve it?

I’m also facing that issue.