Index out of range from trainer.fit

Hello All,

I am using Google's TFT model (Temporal Fusion Transformer, via PyTorch Forecasting) for multi-step time-series forecasting. However, when fitting my trainer object I get the following error: index out of range.

Does anyone know what is the problem?
Thanks in advance

You are not giving enough information for any valid suggestion, besides that an indexing operation apparently fails.
Check the stack trace to see where the error is coming from.

Yes, in fact I am sorry I didn't provide more information… I have already checked the stack trace to see where the error is coming from:

IndexError                                Traceback (most recent call last)
Cell In[326], line 2
      1 # fit network
----> 2 trainer.fit(
      3     tft,
      4     train_dataloaders=train_dataloader,
      5     val_dataloaders=val_dataloader,
      6 )

File ~\AppData\Local\anaconda3\lib\site-packages\lightning\pytorch\trainer\trainer.py:529, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    527 model = _maybe_unwrap_optimized(model)
    528 self.strategy._lightning_module = model
--> 529 call._call_and_handle_interrupt(
    530     self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    531 )

File ~\AppData\Local\anaconda3\lib\site-packages\lightning\pytorch\trainer\call.py:42, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     40     if trainer.strategy.launcher is not None:
     41         return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
---> 42     return trainer_fn(*args, **kwargs)
     44 except _TunerExitException:
     45     _call_teardown_hook(trainer)

File ~\AppData\Local\anaconda3\lib\site-packages\lightning\pytorch\trainer\trainer.py:568, in Trainer._fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    558 self._data_connector.attach_data(
    559     model, train_dataloaders=train_dataloaders, val_dataloaders=val_dataloaders, datamodule=datamodule
    560 )
    562 ckpt_path = self._checkpoint_connector._select_ckpt_path(
    563     self.state.fn,
    564     ckpt_path,
    565     model_provided=True,
    566     model_connected=self.lightning_module is not None,
    567 )
--> 568 self._run(model, ckpt_path=ckpt_path)
    570 assert self.state.stopped
    571 self.training = False

File ~\AppData\Local\anaconda3\lib\site-packages\lightning\pytorch\trainer\trainer.py:973, in Trainer._run(self, model, ckpt_path)
    968 self._signal_connector.register_signal_handlers()
    970 # ----------------------------
    971 # RUN THE TRAINER
    972 # ----------------------------
--> 973 results = self._run_stage()
    975 # ----------------------------
    976 # POST-Training CLEAN UP
    977 # ----------------------------
    978 log.debug(f"{self.__class__.__name__}: trainer tearing down")

File ~\AppData\Local\anaconda3\lib\site-packages\lightning\pytorch\trainer\trainer.py:1014, in Trainer._run_stage(self)
   1012 if self.training:
   1013     with isolate_rng():
-> 1014         self._run_sanity_check()
   1015     with torch.autograd.set_detect_anomaly(self._detect_anomaly):
   1016         self.fit_loop.run()

File ~\AppData\Local\anaconda3\lib\site-packages\lightning\pytorch\trainer\trainer.py:1043, in Trainer._run_sanity_check(self)
   1040 call._call_callback_hooks(self, "on_sanity_check_start")
   1042 # run eval step
-> 1043 val_loop.run()
   1045 call._call_callback_hooks(self, "on_sanity_check_end")
   1047 # reset logger connector

File ~\AppData\Local\anaconda3\lib\site-packages\lightning\pytorch\loops\utilities.py:177, in _no_grad_context.<locals>._decorator(self, *args, **kwargs)
    175     context_manager = torch.no_grad
    176 with context_manager():
--> 177     return loop_run(self, *args, **kwargs)

File ~\AppData\Local\anaconda3\lib\site-packages\lightning\pytorch\loops\evaluation_loop.py:122, in _EvaluationLoop.run(self)
    120         self._restarting = False
    121 self._store_dataloader_outputs()
--> 122 return self.on_run_end()

File ~\AppData\Local\anaconda3\lib\site-packages\lightning\pytorch\loops\evaluation_loop.py:244, in _EvaluationLoop.on_run_end(self)
    241 self.trainer._logger_connector._evaluation_epoch_end()
    243 # hook
--> 244 self._on_evaluation_epoch_end()
    246 logged_outputs, self._logged_outputs = self._logged_outputs, []  # free memory
    247 # include any logged outputs on epoch_end

File ~\AppData\Local\anaconda3\lib\site-packages\lightning\pytorch\loops\evaluation_loop.py:326, in _EvaluationLoop._on_evaluation_epoch_end(self)
    324 hook_name = "on_test_epoch_end" if trainer.testing else "on_validation_epoch_end"
    325 call._call_callback_hooks(trainer, hook_name)
--> 326 call._call_lightning_module_hook(trainer, hook_name)
    328 trainer._logger_connector.on_epoch_end()

File ~\AppData\Local\anaconda3\lib\site-packages\lightning\pytorch\trainer\call.py:144, in _call_lightning_module_hook(trainer, hook_name, pl_module, *args, **kwargs)
    141 pl_module._current_fx_name = hook_name
    143 with trainer.profiler.profile(f"[LightningModule]{pl_module.__class__.__name__}.{hook_name}"):
--> 144     output = fn(*args, **kwargs)
    146 # restore current_fx when nested context
    147 pl_module._current_fx_name = prev_fx_name

File ~\AppData\Local\anaconda3\lib\site-packages\pytorch_forecasting\models\base_model.py:636, in BaseModel.on_validation_epoch_end(self)
    635 def on_validation_epoch_end(self):
--> 636     self.on_epoch_end(self.validation_step_outputs)
    637     self.validation_step_outputs.clear()

File ~\AppData\Local\anaconda3\lib\site-packages\pytorch_forecasting\models\temporal_fusion_transformer\__init__.py:539, in TemporalFusionTransformer.on_epoch_end(self, outputs)
    535 """
    536 run at epoch end for training or validation
    537 """
    538 if self.log_interval > 0 and not self.training:
--> 539     self.log_interpretation(outputs)

File ~\AppData\Local\anaconda3\lib\site-packages\pytorch_forecasting\models\temporal_fusion_transformer\__init__.py:796, in TemporalFusionTransformer.log_interpretation(self, outputs)
    789 """
    790 Log interpretation metrics to tensorboard.
    791 """
    792 # extract interpretations
    793 interpretation = {
    794     # use padded_stack because decoder length histogram can be of different length
    795     name: padded_stack([x["interpretation"][name].detach() for x in outputs], side="right", value=0).sum(0)
--> 796     for name in outputs[0]["interpretation"].keys()
    797 }
    798 # normalize attention with length histogram squared to account for: 1. zeros in attention and
    799 # 2. higher attention due to less values
    800 attention_occurances = interpretation["encoder_length_histogram"][1:].flip(0).float().cumsum(0)

IndexError: list index out of range

Also, I think it will be helpful to show how I defined my validation dataloader, which I believe is causing the problem:

training = TimeSeriesDataSet(
    data=data_PA6[lambda x: x.time_idx <= training_cutoff],
    group_ids=["group"],
    time_idx="time_idx",
    target="best_price_compound",
    min_encoder_length=1,  # allow short encoder windows
    max_encoder_length=3,
    min_prediction_length=6,
    max_prediction_length=6,
    time_varying_unknown_reals=["best_price_compound"],
    predict_mode=False,
)

# create validation set (predict=True) which means to predict the last max_prediction_length points in time
# for each series
validation = TimeSeriesDataSet.from_dataset(training, data_PA6, predict=True, stop_randomization=True)
# create dataloaders for model
batch_size = 32  # set this between 32 and 128
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=0, shuffle=False)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size, num_workers=0, shuffle=False)

And here is my trainer object:

# configure network and trainer
early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=10, verbose=False, mode="min")
lr_logger = LearningRateMonitor()  # log the learning rate
logger = TensorBoardLogger("lightning_logs")  # logging results to a tensorboard

trainer = pl.Trainer(
    max_epochs=100,
    accelerator="cpu",
    enable_model_summary=True,
    gradient_clip_val=0.1,
    limit_train_batches=10,  # comment in for training, running validation every 10 batches
    # fast_dev_run=True,  # comment in to check that network or dataset has no serious bugs
    callbacks=[lr_logger, early_stop_callback],
    logger=logger,
    #num_sanity_val_steps=3,
)

Thanks in advance for the help

From the traceback we see that outputs is empty (so outputs[0] raises "list index out of range"), and outputs = self.validation_step_outputs… but how does this help?

After debugging, I think the problem comes from validation_step not being called, since the trainer goes straight to on_validation_epoch_end:

def on_validation_epoch_end(self):
    self.on_epoch_end(self.validation_step_outputs)
    self.validation_step_outputs.clear()

Here self.validation_step_outputs is an empty list, as mentioned before,

while this function is, I think, the one responsible for appending the validation log to that list:

def validation_step(self, batch, batch_idx):
    x, y = batch
    log, out = self.step(x, y, batch_idx)
    log.update(self.create_log(x, y, out, batch_idx))
    self.validation_step_outputs.append(log)
    return log

Does that sound correct? And why is validation_step not called then? Why does the list stay empty?
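One quick way to check (a rough sketch, assuming the validation / val_dataloader objects defined above; nothing here is TFT-specific) is to see whether the validation dataloader yields any batches at all, since if it yields none, validation_step never runs and the list stays empty:

# number of samples the validation TimeSeriesDataSet produces
print("val samples:", len(validation))
# number of batches the dataloader will yield; if this is 0,
# validation_step is never called and validation_step_outputs stays empty
print("val batches:", len(val_dataloader))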

I don't know, but it sounds more like a Lightning-related question, as I'm not deeply familiar with this higher-level API.

@El-Hassan-Hajbi, this is PyTorch Forecasting. Can you share how you define your data? I am talking about the TimeSeriesDataSet part. There are a couple of things that could potentially be wrong.

training = TimeSeriesDataSet(
    data=data_PA6[lambda x: x.time_idx <= training_cutoff],
    group_ids=["group"],
    time_idx="time_idx",
    target="best_price_compound",
    min_encoder_length=1,  # allow short encoder windows
    max_encoder_length=3,
    min_prediction_length=6,
    max_prediction_length=6,
    time_varying_unknown_reals=["best_price_compound"],
    predict_mode=False,
)

# create validation set (predict=True) which means to predict the last max_prediction_length points in time
# for each series
validation = TimeSeriesDataSet.from_dataset(training, data_PA6, predict=True, stop_randomization=True)

Add this to your TimeSeriesDataSet:

add_relative_time_idx=True
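
i.e. something like this (a sketch reusing the arguments you posted above):

training = TimeSeriesDataSet(
    data=data_PA6[lambda x: x.time_idx <= training_cutoff],
    group_ids=["group"],
    time_idx="time_idx",
    target="best_price_compound",
    min_encoder_length=1,
    max_encoder_length=3,
    min_prediction_length=6,
    max_prediction_length=6,
    time_varying_unknown_reals=["best_price_compound"],
    add_relative_time_idx=True,  # adds "relative_time_idx" as a time-varying known feature
    predict_mode=False,
)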

Let me know if this solves your problem.

Thanks! It works.

I am interested in knowing more about how it solved the problem :slight_smile:

Yes, TFT is based on the Transformer model, which means an encoder-decoder with attention. You have features for the encoder, but you need the same for the decoder; if add_relative_time_idx is set to False, the decoder has nothing, the outputs list/dataframe stays empty while the code tries to access its elements, hence the error message.
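
You can see the effect on the dataset itself (a sketch; if I recall correctly, the flag appends "relative_time_idx" to the dataset's time_varying_known_reals, the features fed to both encoder and decoder):

# after constructing the dataset with add_relative_time_idx=True,
# "relative_time_idx" should appear among the known reals,
# so the decoder has at least one input feature
print(training.time_varying_known_reals)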


Okay, thanks. Also, I didn't understand when to set add_encoder_length=True.