Function 'ScaledDotProductEfficientAttentionBackward0' returned nan values in its 0th output

Hello,

I got the error “Function ‘ScaledDotProductEfficientAttentionBackward0’ returned nan values in its 0th output.” while training a custom model.

The exception is not triggered immediately; it only appears after training for many epochs.

At first I thought it was a problem with my input data, so I checked the data fed to the model for invalid values (x.isnan().any() or x.isinf().any()), and there were none. I also checked that the forward pass and the loss value contained no invalid values.
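The checks looked roughly like the sketch below (the helper name and the example tensor are illustrative, not my exact code):

import torch

def assert_finite(name: str, t: torch.Tensor) -> None:
    # Fail fast if a tensor contains NaN or +/-inf values.
    if t.isnan().any() or t.isinf().any():
        raise ValueError(f"{name} contains invalid values")

x = torch.randn(4, 16, 8)  # stands in for a batch of model inputs
assert_finite("x", x)      # applied to every input, the forward output, and the loss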

BUT, when loss.backward() runs, some of the model's parameter gradients become NaN, and I don't know why. So I enabled torch.autograd.set_detect_anomaly(True) to track down the exception, and the debug tool reported “Function ‘ScaledDotProductEfficientAttentionBackward0’ returned nan values in its 0th output.” It points at a sublayer of TransformerEncoderLayer, which is an official module.
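For reference, anomaly detection only needs one line before training (a minimal sketch):

import torch

# Records forward-pass metadata so a backward error names the op that
# produced the NaN; it slows training down, so enable it only while debugging.
torch.autograd.set_detect_anomaly(True)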

My code snippet is below:


import math

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.nn.utils import clip_grad_value_


class PositionalEncodingLayer(nn.Module):
    def __init__(self, d_model, max_len=10000):
        super(PositionalEncodingLayer, self).__init__()
        self.d_model = d_model
        self.max_len = max_len
        # Precompute the sinusoidal positional-encoding table once at
        # construction time; it is moved to the input's device in forward().
        self.pos_encoding = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        self.pos_encoding[:, 0::2] = torch.sin(position * div_term)
        self.pos_encoding[:, 1::2] = torch.cos(position * div_term)
        self.pos_encoding = self.pos_encoding.unsqueeze(0)

    def forward(self, x):
        x = x * math.sqrt(self.d_model+1e-8)
        seq_len = x.size(1)
        if seq_len > self.max_len:
            raise ValueError("Sequence length exceeds maximum length")
        else:
            pos_enc = self.pos_encoding[:, :seq_len, :]
            x = x + pos_enc.to(x.device)
            return x


class TransformerNet(nn.Module):
    def __init__(
        self, input_size, d_model, output_size, activation: str = "leaky_relu"
    ):
        super(TransformerNet, self).__init__()
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.input_size = input_size
        self.output_size = output_size
        self.activation = activation
        self.negative_slope = 0.01
        self.batch_norm = torch.nn.BatchNorm2d(
            num_features=self.input_size, affine=False
        )
        self.encoder_layer = nn.Linear(self.input_size, d_model)
        self.pos_encoding = PositionalEncodingLayer(d_model)
        self.transformer_encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=8,
            dim_feedforward=4 * d_model,
            dropout=0.1,
            activation="gelu",
            layer_norm_eps=1e-05,
            batch_first=True,
        )
        self.fc_layer = nn.Linear(d_model, output_size)


    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """

        Args:
            x (torch.Tensor): (batch_size, seq_len, features)

        Returns:
            output (torch.Tensor): (batch_size, output_size)
        """
        # Normalize
        _shape = x.shape  # (batch_size, seq_len, features)
        x = x.contiguous().view(-1, _shape[-1])  # (batch_size * seq_len, features)
        x = x.unsqueeze(-1).unsqueeze(-1)  # (batch_size * seq_len, features, 1, 1)
        x = (
            self.batch_norm(x).squeeze(-1).squeeze(-1)
        )  # shape: (batch_size * seq_len, features)
        x = x.contiguous().view(_shape)  # (batch_size, seq_len, features)
        x = self.encoder_layer(x)  # (batch_size, seq_len, d_model)
        x = self.pos_encoding(x)  # (batch_size, seq_len, d_model)
        seq_len = x.shape[1]
        # Boolean causal mask: True entries (above the diagonal) mark
        # positions that are not allowed to attend.
        attn_mask = nn.Transformer.generate_square_subsequent_mask(
            seq_len, self.device
        ).bool()
        # Transformer Encoder Layer
        x = self.transformer_encoder_layer(
            x, src_mask=attn_mask, is_causal=True
        )  # (batch_size, seq_len, d_model)
        x = x[:, -1, :]  # (batch_size, d_model)

        # FC layer
        if self.activation == "relu":
            output = F.relu(self.fc_layer(x))
        elif self.activation == "leaky_relu":
            output = F.leaky_relu(self.fc_layer(x), self.negative_slope)
        else:
            raise ValueError("Unknown activation function " + str(self.activation))

        return output


class FCNet(nn.Module):
    def __init__(self, input_size, output_size, activation: str = "leaky_relu"):
        super(FCNet, self).__init__()
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.input_size = input_size
        self.output_size = output_size
        self.fc_layer = nn.Linear(self.input_size, self.output_size)
        self.activation = activation
        self.negative_slope = 0.01
        self.init_weights()

    def zscore(self, x):
        mean = x.mean()
        std = x.std()
        z_score = (x - mean) / (std+1e-8)
        return z_score

    def init_weights(self):
        for name, param in self.fc_layer.named_parameters():
            if "weight" in name:
                if self.activation == "relu":
                    nn.init.kaiming_normal_(param, nonlinearity=self.activation)
                elif self.activation == "leaky_relu":
                    nn.init.kaiming_normal_(
                        param, a=self.negative_slope, nonlinearity=self.activation
                    )
                else:
                    raise ValueError(
                        "Unknown activation function " + str(self.activation)
                    )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """

        Args:
            x (torch.Tensor): (batch_size, features)

        Returns:
            output (torch.Tensor): (batch_size, output_size)
        """
        # Z-Score
        x = self.zscore(x)  # shape: (batch_size, features)
        # FC layer
        if self.activation == "relu":
            output = F.relu(self.fc_layer(x))
        elif self.activation == "leaky_relu":
            output = F.leaky_relu(self.fc_layer(x), self.negative_slope)
        else:
            raise ValueError("Unknown activation function " + str(self.activation))

        return output


class QActor(nn.Module):
    def __init__(
        self,
        num_actions: int,
        feature_5_size: int,
        feature_1_size: int,
        feature_4_size: int,
        feature_2_size: int,
        feature_3_size: int,
        feature_extract_size: int = 1024,
        hidden_layers: tuple = (1024,),
        d_model: int = 512,
        activation: str = "leaky_relu",
        **kwargs,
    ):
        super(QActor, self).__init__()

        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.num_actions = num_actions
        self.feature_5_size = feature_5_size
        self.feature_1_size = feature_1_size
        self.feature_4_size = feature_4_size
        self.feature_2_size = feature_2_size
        self.feature_3_size = feature_3_size
        self.activation = activation
        self.negative_slope = 0.01

        self.feature_extractor_1 = TransformerNet(
            input_size=self.feature_1_size,
            d_model=d_model,
            output_size=feature_extract_size,
            activation=activation,
        )

        self.feature_extractor_2 = TransformerNet(
            input_size=self.feature_2_size,
            d_model=d_model,
            output_size=feature_extract_size,
            activation=activation,
        )

        self.feature_extractor_3 = nn.Sequential(
            nn.Flatten(start_dim=1, end_dim=-1),
            FCNet(
                input_size=self.feature_3_size,
                output_size=512,
                activation=activation,
            ),
            FCNet(
                input_size=512,
                output_size=feature_extract_size,
                activation=activation,
            ),
        )

        self.feature_extractor_4 = FCNet(
            input_size=self.feature_4_size,
            output_size=feature_extract_size,
            activation=activation,
        )

        self.feature_extractor_5 = FCNet(
            input_size=self.feature_5_size,
            output_size=feature_extract_size,
            activation=activation,
        )

        # FC layer
        self.fc_layers = nn.ModuleList()
        input_size = feature_extract_size
        last_hidden_layer_size = input_size
        if hidden_layers is not None:
            nh = len(hidden_layers)
            self.fc_layers.append(nn.Linear(input_size, hidden_layers[0]))
            for i in range(1, nh):
                self.fc_layers.append(nn.Linear(hidden_layers[i - 1], hidden_layers[i]))
            last_hidden_layer_size = hidden_layers[nh - 1]
        self.fc_layers.append(nn.Linear(last_hidden_layer_size, self.num_actions))

        self.init_weights()

    def init_weights(self):
        for name, param in self.fc_layers.named_parameters():
            if "weight" in name:
                if self.activation == "relu":
                    nn.init.kaiming_normal_(param, nonlinearity=self.activation)
                elif self.activation == "leaky_relu":
                    nn.init.kaiming_normal_(
                        param, a=self.negative_slope, nonlinearity=self.activation
                    )
                else:
                    raise ValueError(
                        "Unknown activation function " + str(self.activation)
                    )

    def forward(
        self,
        feature_1: torch.Tensor,
        feature_4: torch.Tensor,
        feature_2: torch.Tensor,
        feature_3: torch.Tensor,
        feature_5: torch.Tensor,
    ) -> torch.Tensor:
        """

        Args:
            feature_1 (torch.Tensor): (batch_size, seq_len, features1)
            feature_5 (torch.Tensor): (batch_size, features2)

        Returns:
        """
        # ---------------------------------------------------------------- #
        feature = self.feature_extractor_1(
            feature_1
        )  # (batch_size, feature_extract_size)

        # --------------------------------------------------------------- #
        # Create a mask that marks all-zero samples as False and non-zero samples as True
        mask = ~feature_2.eq(0).all(dim=1).all(dim=1)
        # Apply a mask to select samples to process
        feature_2_state_selected = feature_2[mask]
        feature_2_state_selected_output = self.feature_extractor_2(
            feature_2_state_selected
        )
        feature_2_output = torch.zeros(
            feature_2.shape[0],
            feature_2_state_selected_output.shape[1],
            device=self.device,
        )  # (batch_size, feature_extract_size)
        feature_2_output[mask] = feature_2_state_selected_output
        feature = feature + feature_2_output

        # -------------------------------------------------------------- #
        mask = ~feature_3.eq(0).all(dim=1).all(dim=1)
        feature_3_selected = feature_3[mask]
        feature_3_selected_output = self.feature_extractor_3(feature_3_selected)
        feature_3_output = torch.zeros(
            feature_3.shape[0], feature_3_selected_output.shape[1], device=self.device
        )  # (batch_size, feature_extract_size)
        feature_3_output[mask] = feature_3_selected_output
        feature = feature + feature_3_output

        # ------------------------------------------------------------------ #
        feature_4_output = self.feature_extractor_4(
            feature_4
        )  # (batch_size, feature_extract_size)
        feature = feature + feature_4_output

        # ------------------------------------------------------------------ #
        feature_5_output = self.feature_extractor_5(
            feature_5
        )  # (batch_size, feature_extract_size)
        feature = feature + feature_5_output

        # ----------------------------------------------------------------------- #
        x = feature  # (batch_size, feature_extract_size)
        num_layers = len(self.fc_layers)
        for i in range(0, num_layers - 1):
            if self.activation == "relu":
                x = F.relu(self.fc_layers[i](x))
            elif self.activation == "leaky_relu":
                x = F.leaky_relu(self.fc_layers[i](x), self.negative_slope)
            else:
                raise ValueError("Unknown activation function " + str(self.activation))
        Q = self.fc_layers[-1](x)
        return Q

    def get_batchnorm_params(self):
        params = [
            self.feature_extractor_1.batch_norm.running_mean,
            self.feature_extractor_1.batch_norm.running_var,
            self.feature_extractor_2.batch_norm.running_mean,
            self.feature_extractor_2.batch_norm.running_var,
        ]
        return params

class TransformerAgent(object):
    def __init__(
        self,
        num_actions,
        feature_5_size,
        feature_1_size,
        feature_4_size,
        feature_2_size,
        feature_3_size,
        window_size,
        save_dir,
        actor_class=QActor,
        actor_kwargs={},
        epsilon_initial=1.0,
        epsilon_final=0.05,
        epsilon_steps=1000000,
        batch_size=64,
        gamma=0.99,
        tau_actor=0.01,
        replay_memory_size=2048,
        learning_rate_actor=0.001,
        initial_memory_threshold=0,
        loss_func=F.smooth_l1_loss,
        clip_grad=10.0,
        device="cuda" if torch.cuda.is_available() else "cpu",
        name="TransformerAgent",
        ckpt_path=None,
        seed=None,
    ):
        #######......######

        self.actor = actor_class(
            self.num_actions,
            feature_5_size,
            feature_1_size,
            feature_4_size,
            feature_2_size[1],
            feature_3_size[1] * feature_3_size[0],
            **actor_kwargs,
        ).to(device)
        
        self.actor_target = actor_class(
            self.num_actions,
            feature_5_size,
            feature_1_size,
            feature_4_size,
            feature_2_size[1],
            feature_3_size[1] * feature_3_size[0],
            **actor_kwargs,
        ).to(device)

        hard_update_target_network(self.actor, self.actor_target)
        self.actor_target.eval()

        # l1_smooth_loss performs better but original paper used MSE
        self.loss_func = loss_func
        self.actor_optimiser = optim.AdamW(
            self.actor.parameters(), lr=self.learning_rate_actor, weight_decay=0.01
        )

        episodes = epsilon_steps * 10
        self.actor_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
            self.actor_optimiser, T_max=episodes * window_size, eta_min=0.000001
        )

    def _optimize_td_loss(self, pipe_child):
        if self._step < self.batch_size or self._step < self.initial_memory_threshold:
            return

        (
            feature_1,
            feature_4,
            feature_2,
            feature_3,
            feature_5,
            actions,
            rewards,
            next_feature_1,
            next_feature_4,
            next_feature_2,
            next_feature_3,
            next_feature_5,
            terminals,
        ) = self.replay_memory.sample(self.batch_size, random_machine=self.np_random)

        rewards_tensor = torch.from_numpy(rewards).to(self.device).squeeze()
        feature_1_tensor = torch.from_numpy(feature_1).to(self.device)
        feature_4_tensor = torch.from_numpy(feature_4).to(self.device)
        feature_2_tensor = torch.from_numpy(feature_2).to(self.device)
        feature_3_tensor = torch.from_numpy(feature_3).to(self.device)
        feature_5_tensor = torch.from_numpy(feature_5).to(self.device)

        next_feature_1_tensor = torch.from_numpy(next_feature_1).to(self.device)
        next_feature_4_tensor = torch.from_numpy(next_feature_4).to(
            self.device
        )
        next_feature_2_tensor = torch.from_numpy(next_feature_2).to(
            self.device
        )
        next_feature_3_tensor = torch.from_numpy(next_feature_3).to(self.device)
        next_feature_5_tensor = torch.from_numpy(next_feature_5).to(
            self.device
        )
        actions_tensor = torch.from_numpy(actions).to(self.device, dtype=torch.int64)

        with torch.no_grad():
            pred_Q_a = self.actor_target(
                next_feature_1_tensor,
                next_feature_4_tensor,
                next_feature_2_tensor,
                next_feature_3_tensor,
                next_feature_5_tensor,
            )
            Qprime = torch.max(pred_Q_a, 1, keepdim=True)[0].squeeze()
            target = rewards_tensor + self.gamma * Qprime

        q_values = self.actor(
            feature_1_tensor,
            feature_4_tensor,
            feature_2_tensor,
            feature_3_tensor,
            feature_5_tensor,
        )
        y_predicted = q_values.gather(1, actions_tensor.view(-1, 1)).squeeze()
        y_expected = target
        loss_Q = self.loss_func(y_predicted, y_expected)
        
        self.actor_optimiser.zero_grad()
        ###############################
        ###############################
        ###############################
        loss_Q.backward()
        ###############################
        ###############################
        ###############################
        
        if self.clip_grad > 0:
            clip_grad_value_(self.actor.parameters(), self.clip_grad)

        self.actor_optimiser.step()
        self.actor_scheduler.step()

Could anyone tell me how to solve this problem? Thanks!!

Update: I printed each parameter's max and min gradient values at the point where loss.backward() triggered the exception.
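The values below were produced with a loop roughly like this over the actor's parameters (a sketch of the logging only; None means the gradient had not been written before backward failed):

for name, param in self.actor.named_parameters():
    if param.grad is None:
        print(name, None)
    else:
        print(
            name,
            "Max Grad:", param.grad.max().item(),
            "Min Grad:", param.grad.min().item(),
        )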

feature_extractor_1.encoder_layer.weight None
feature_extractor_1.encoder_layer.bias None
feature_extractor_1.transformer_encoder_layer.self_attn.in_proj_weight None
feature_extractor_1.transformer_encoder_layer.self_attn.in_proj_bias None
feature_extractor_1.transformer_encoder_layer.self_attn.out_proj.weight Max Grad: 8.951514610089362e-05 Min Grad: -8.035175414988771e-05
feature_extractor_1.transformer_encoder_layer.self_attn.out_proj.bias Max Grad: 3.7594652724237676e-08 Min Grad: -3.374162460545449e-08
feature_extractor_1.transformer_encoder_layer.linear1.weight Max Grad: 2.5694633222883567e-05 Min Grad: -2.45029223151505e-05
feature_extractor_1.transformer_encoder_layer.linear1.bias Max Grad: 6.5152853494510055e-06 Min Grad: -5.0420130719430745e-06
feature_extractor_1.transformer_encoder_layer.linear2.weight Max Grad: 0.00013401411706581712 Min Grad: -0.00012108684313716367
feature_extractor_1.transformer_encoder_layer.linear2.bias Max Grad: 4.693609753303463e-06 Min Grad: -5.495152436196804e-06
feature_extractor_1.transformer_encoder_layer.norm1.weight Max Grad: 4.1498555219732225e-05 Min Grad: -4.8228852392639965e-05
feature_extractor_1.transformer_encoder_layer.norm1.bias Max Grad: 4.146261198911816e-05 Min Grad: -3.850227585644461e-05
feature_extractor_1.transformer_encoder_layer.norm2.weight Max Grad: 0.0020644201431423426 Min Grad: -0.0006027113413438201
feature_extractor_1.transformer_encoder_layer.norm2.bias Max Grad: 0.0003106297517661005 Min Grad: -0.0003301977994851768
feature_extractor_1.fc_layer.weight Max Grad: 0.016715578734874725 Min Grad: -0.011968234553933144
feature_extractor_1.fc_layer.bias Max Grad: 0.00209520454518497 Min Grad: -0.0003535989671945572
feature_extractor_2.encoder_layer.weight Max Grad: 0.0001694482343737036 Min Grad: -0.00017870690498966724
feature_extractor_2.encoder_layer.bias Max Grad: 3.4907207009382546e-05 Min Grad: -3.513778938213363e-05
feature_extractor_2.transformer_encoder_layer.self_attn.in_proj_weight Max Grad: 4.703424565377645e-05 Min Grad: -4.74515873065684e-05
feature_extractor_2.transformer_encoder_layer.self_attn.in_proj_bias Max Grad: 4.548175525087572e-07 Min Grad: -3.601705031996971e-07
feature_extractor_2.transformer_encoder_layer.self_attn.out_proj.weight Max Grad: 6.503146141767502e-05 Min Grad: -6.849414785392582e-05
feature_extractor_2.transformer_encoder_layer.self_attn.out_proj.bias Max Grad: 9.9262919661669e-08 Min Grad: -1.081510134781638e-07
feature_extractor_2.transformer_encoder_layer.linear1.weight Max Grad: 7.056924368953332e-05 Min Grad: -5.751454227720387e-05
feature_extractor_2.transformer_encoder_layer.linear1.bias Max Grad: 1.1851476301671937e-05 Min Grad: -1.3480660527420696e-05
feature_extractor_2.transformer_encoder_layer.linear2.weight Max Grad: 0.00014439018559642136 Min Grad: -0.00018459925195202231
feature_extractor_2.transformer_encoder_layer.linear2.bias Max Grad: 4.5472838792193215e-06 Min Grad: -6.726608262397349e-06
feature_extractor_2.transformer_encoder_layer.norm1.weight Max Grad: 5.6497643527109176e-05 Min Grad: -0.00011246895883232355
feature_extractor_2.transformer_encoder_layer.norm1.bias Max Grad: 5.521976709133014e-05 Min Grad: -5.4526339226868004e-05
feature_extractor_2.transformer_encoder_layer.norm2.weight Max Grad: 0.0023929523304104805 Min Grad: -0.0012163110077381134
feature_extractor_2.transformer_encoder_layer.norm2.bias Max Grad: 0.0003384850570000708 Min Grad: -0.0003447728231549263
feature_extractor_2.fc_layer.weight Max Grad: 0.014368354342877865 Min Grad: -0.018293900415301323
feature_extractor_2.fc_layer.bias Max Grad: 0.0019083074294030666 Min Grad: -0.0006002900190651417
feature_extractor_3.1.fc_layer.weight Max Grad: 0.0009951384272426367 Min Grad: -0.0012072987155988812
feature_extractor_3.1.fc_layer.bias Max Grad: 1.6049791156547144e-05 Min Grad: -1.7335705706500448e-05
feature_extractor_3.2.fc_layer.weight Max Grad: 0.007816547527909279 Min Grad: -0.021214749664068222
feature_extractor_3.2.fc_layer.bias Max Grad: 0.0012966389767825603 Min Grad: -0.004363502841442823
feature_extractor_4.fc_layer.weight Max Grad: 0.05917682498693466 Min Grad: -0.0565340518951416
feature_extractor_4.fc_layer.bias Max Grad: 0.002984903287142515 Min Grad: -0.007486139424145222
feature_extractor_5.fc_layer.weight Max Grad: 0.01591765321791172 Min Grad: -0.016907459124922752
feature_extractor_5.fc_layer.bias Max Grad: 0.005399928893893957 Min Grad: -0.004758366383612156
fc_layers.0.weight Max Grad: 0.0259133018553257 Min Grad: -0.03804416581988335
fc_layers.0.bias Max Grad: 0.0037254367489367723 Min Grad: -0.005426391027867794
fc_layers.1.weight Max Grad: 0.5714874863624573 Min Grad: -0.40693944692611694
fc_layers.1.bias Max Grad: 0.0858272835612297 Min Grad: -0.054877158254384995

And the model parameters' min and max values:

market_feature_extractor.encoder_layer.weight Max: 0.13042187690734863 Min: -0.1363317370414734
market_feature_extractor.encoder_layer.bias Max: 0.07871631532907486 Min: -0.06503499299287796
market_feature_extractor.transformer_encoder_layer.self_attn.in_proj_weight Max: 0.15611980855464935 Min: -0.1491391658782959
market_feature_extractor.transformer_encoder_layer.self_attn.in_proj_bias Max: 0.05269873887300491 Min: -0.06690569967031479
market_feature_extractor.transformer_encoder_layer.self_attn.out_proj.weight Max: 0.12881654500961304 Min: -0.14617609977722168
market_feature_extractor.transformer_encoder_layer.self_attn.out_proj.bias Max: 0.029221799224615097 Min: -0.030345002189278603
market_feature_extractor.transformer_encoder_layer.linear1.weight Max: 0.13619253039360046 Min: -0.1354689598083496
market_feature_extractor.transformer_encoder_layer.linear1.bias Max: 0.08933115005493164 Min: -0.06349475681781769
market_feature_extractor.transformer_encoder_layer.linear2.weight Max: 0.11837390810251236 Min: -0.1328926831483841
market_feature_extractor.transformer_encoder_layer.linear2.bias Max: 0.05904746055603027 Min: -0.06853967159986496
market_feature_extractor.transformer_encoder_layer.norm1.weight Max: 1.0596601963043213 Min: 0.9467008113861084
market_feature_extractor.transformer_encoder_layer.norm1.bias Max: 0.04963892325758934 Min: -0.05117850750684738
market_feature_extractor.transformer_encoder_layer.norm2.weight Max: 0.9734431505203247 Min: 0.8732218742370605
market_feature_extractor.transformer_encoder_layer.norm2.bias Max: 0.035640131682157516 Min: -0.043161049485206604
market_feature_extractor.fc_layer.weight Max: 0.11654757708311081 Min: -0.15153095126152039
market_feature_extractor.fc_layer.bias Max: 0.04450284689664841 Min: -0.07761233299970627
one_minute_feature_extractor.encoder_layer.weight Max: 0.4350268244743347 Min: -0.4258303642272949
one_minute_feature_extractor.encoder_layer.bias Max: 0.44492506980895996 Min: -0.4592067003250122
one_minute_feature_extractor.transformer_encoder_layer.self_attn.in_proj_weight Max: 0.16726215183734894 Min: -0.15999849140644073
one_minute_feature_extractor.transformer_encoder_layer.self_attn.in_proj_bias Max: 0.08657839149236679 Min: -0.06956083327531815
one_minute_feature_extractor.transformer_encoder_layer.self_attn.out_proj.weight Max: 0.13722757995128632 Min: -0.14118105173110962
one_minute_feature_extractor.transformer_encoder_layer.self_attn.out_proj.bias Max: 0.049497347325086594 Min: -0.051223743706941605
one_minute_feature_extractor.transformer_encoder_layer.linear1.weight Max: 0.14259777963161469 Min: -0.13675884902477264
one_minute_feature_extractor.transformer_encoder_layer.linear1.bias Max: 0.11711406707763672 Min: -0.11487700790166855
one_minute_feature_extractor.transformer_encoder_layer.linear2.weight Max: 0.1463344693183899 Min: -0.12965907156467438
one_minute_feature_extractor.transformer_encoder_layer.linear2.bias Max: 0.08169647306203842 Min: -0.06548243016004562
one_minute_feature_extractor.transformer_encoder_layer.norm1.weight Max: 1.0735344886779785 Min: 0.9051686525344849
one_minute_feature_extractor.transformer_encoder_layer.norm1.bias Max: 0.0857841745018959 Min: -0.08571480959653854
one_minute_feature_extractor.transformer_encoder_layer.norm2.weight Max: 0.9974909424781799 Min: 0.8585764765739441
one_minute_feature_extractor.transformer_encoder_layer.norm2.bias Max: 0.052284181118011475 Min: -0.0555570013821125
one_minute_feature_extractor.fc_layer.weight Max: 0.12667956948280334 Min: -0.14535324275493622
one_minute_feature_extractor.fc_layer.bias Max: 0.06178232654929161 Min: -0.08643589913845062
tick_feature_extractor.1.fc_layer.weight Max: 0.14137594401836395 Min: -0.15506727993488312
tick_feature_extractor.1.fc_layer.bias Max: 0.0733787938952446 Min: -0.04333639517426491
tick_feature_extractor.2.fc_layer.weight Max: 0.29540860652923584 Min: -0.3080158233642578
tick_feature_extractor.2.fc_layer.bias Max: 0.055248893797397614 Min: -0.09371976554393768
finance_feature_extractor.fc_layer.weight Max: 0.46307581663131714 Min: -0.5139583945274353
finance_feature_extractor.fc_layer.bias Max: 0.10385173559188843 Min: -0.1452408730983734
position_state_extractor.fc_layer.weight Max: 1.9500083923339844 Min: -2.0138347148895264
position_state_extractor.fc_layer.bias Max: 0.4354550838470459 Min: -0.49683696031570435
fc_layers.0.weight Max: 0.2654631733894348 Min: -0.26132845878601074
fc_layers.0.bias Max: 0.06119512394070625 Min: -0.06702601909637451
fc_layers.1.weight Max: 0.11877717077732086 Min: -0.14617018401622772
fc_layers.1.bias Max: 0.03824847191572189 Min: -0.0007357856375165284

My PyTorch version is:

torch 2.1.0+cu121

Same issue here, did you solve it?

Try adding a LayerNorm before the TransformerEncoderLayer to stabilize the input tensor. I'm testing this approach, and it's OK for now.
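For anyone trying the same, a standalone sketch of what "LayerNorm before the encoder layer" means (the layer sizes here are illustrative, not taken from the model above):

import torch
import torch.nn as nn

d_model = 512
pre_norm = nn.LayerNorm(d_model)
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model,
    nhead=8,
    dim_feedforward=4 * d_model,
    dropout=0.1,
    activation="gelu",
    batch_first=True,
)

x = torch.randn(4, 16, d_model)        # (batch_size, seq_len, d_model)
out = encoder_layer(pre_norm(x))       # normalize the input before the attention block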

Hey @guoyaohua, same issue here! I've tried several things (e.g. gradient clipping, regularization, early stopping, adding LayerNorm as you suggested, and others…) but nothing worked. Were you able to solve it?

I have the same issue and have been trying very hard to solve it. I feel like I'm dying.

I have the same issue. I tried adding LayerNorm before the TransformerEncoder, but it didn't solve the problem.

Does anyone have any idea what causes this problem?

For what it's worth, I retrained my model in half precision (16-bit floats instead of 32-bit) and it solved the problem. Still no idea what is actually causing it, though.

I think 16-bit floats help because the FlashAttention kernel becomes usable for fp16, so torch picks it instead of the memory-efficient ScaledDotProductAttention kernel named in the error. Also, when automatic mixed precision is enabled, optimizer steps for batches with non-finite gradients are skipped automatically.
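If that is the mechanism, the workaround can be made explicit: restrict scaled_dot_product_attention to the flash kernel (fp16/bf16 only) and let the AMP GradScaler skip optimizer steps whose gradients are not finite. A standalone sketch using the PyTorch 2.1 APIs (the toy model, sizes, and loss are illustrative, not the model from this thread):

import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_value_

device = "cuda"
model = nn.TransformerEncoderLayer(512, 8, batch_first=True).to(device)
optimiser = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(4, 16, 512, device=device)
target = torch.randn(4, 16, 512, device=device)

with torch.autocast("cuda", dtype=torch.float16):
    # Allow only the flash kernel instead of the memory-efficient kernel
    # named in the error message.
    with torch.backends.cuda.sdp_kernel(
        enable_flash=True, enable_math=False, enable_mem_efficient=False
    ):
        out = model(x)
    loss = nn.functional.mse_loss(out, target)

optimiser.zero_grad()
scaler.scale(loss).backward()
scaler.unscale_(optimiser)                 # clip on unscaled gradients
clip_grad_value_(model.parameters(), 10.0)
scaler.step(optimiser)                     # skipped automatically if grads are inf/NaN
scaler.update()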

I managed to reproduce it and submitted a bug report.
