Slow libtorch/tch-rs - Output Archive trace?

Hello,

I’m working on moving some Python code into Rust using tch-rs. I’m getting better performance than I did with Python calling into my Rust through PyO3 bindings, but only barely, which seems wrong.

When I take a trace, my thread is spending almost all of its time in torch::serialize::OutputArchive::save_to, and I’m not sure why, since I’m not intentionally outputting or saving anything. I’ll attach a call-stack example.

Here is the struct that holds the network, then the relevant part of new, then the forward-pass loop.

pub struct GymWrapper {
    gym: Gym,
    net: Box<dyn nn::Module>,
}

impl GymWrapper {
    pub fn new() -> GymWrapper {
        // other code to set up the gym removed...
        let var_store = nn::VarStore::new(Device::Cpu);
        let net = network::net(&var_store.root());
        GymWrapper { gym, net: Box::new(net) }
    }

    pub fn step_episode(&mut self, seed: Option<u64>) -> bool {
        tch::set_num_threads(1);

        let mut steps = 0;
        let mut done = false;
        for _i in 0..10 {
            done = false;
            let mut obs = self.gym.reset(Some(false), seed);

            while !done {
                let tens_obs = Tensor::from_slice2(&obs);
                let actions: Tensor = tch::no_grad(|| self.net.forward(&tens_obs));
                let act_vec: Vec<Vec<f32>> =
                    actions.try_into().expect("error from tensor to vector");
                let result = self.gym.step(act_vec);
                obs = result.0;
                done = result.2;
                steps += 1; // steps is currently only counted, not reported
            }
        }
        done
    }
}

The test sets up the gym wrapper and then times a run of step_episode. I took the attached trace while step_episode was running; about 30% of the exclusive time is in torch::serialize::OutputArchive::save_to.
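For reference, the timing harness itself is trivial. A minimal sketch of it (the seed value 42 is an arbitrary choice, and GymWrapper::new() is as above with the gym setup elided):

use std::time::Instant;

fn main() {
    let mut wrapper = GymWrapper::new();

    // step_episode runs 10 episodes internally; time a single call to it.
    let start = Instant::now();
    let done = wrapper.step_episode(Some(42));
    println!("step_episode returned {done} after {:?}", start.elapsed());
}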

Any ideas what I’m doing wrong or what it’s trying to do?

The network is below; apologies that it’s long and ugly.

const HIDDEN_NODES: i64 = 256;
const INPUT_DIM: i64 = 231;
const OUTPUT_DIM: i64 = 90;


use tch::nn::{self, Module};
use tch::nn::init::{FanInOut::FanIn, Init::Kaiming, NonLinearity::ReLU, NormalOrUniform::Normal};

pub fn net(my_net: &nn::Path) -> impl Module {
    // Every layer uses the same Kaiming-normal initialization for both
    // weights and biases (LinearConfig and Init are Copy, so cfg is reusable).
    let kaiming = Kaiming { dist: Normal, fan: FanIn, non_linearity: ReLU };
    let cfg = nn::LinearConfig { ws_init: kaiming, bs_init: Some(kaiming), bias: true };

    nn::seq()
        .add(nn::linear(my_net / "layer1", INPUT_DIM, HIDDEN_NODES, cfg))
        .add_fn(|xs| xs.leaky_relu())
        .add(nn::linear(my_net / "layer2", HIDDEN_NODES, HIDDEN_NODES, cfg))
        .add_fn(|xs| xs.leaky_relu())
        .add(nn::linear(my_net / "layer3", HIDDEN_NODES, HIDDEN_NODES, cfg))
        .add_fn(|xs| xs.leaky_relu())
        .add(nn::linear(my_net / "layer4", HIDDEN_NODES, OUTPUT_DIM, cfg))
}
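In case it helps reproduce things, here's a minimal shape check I can run against this net (a sketch, assuming it sits next to the constants and net function above; the zero tensor is just a stand-in for a real observation batch):

use tch::{nn, nn::Module, Device, Kind, Tensor};

fn main() {
    let var_store = nn::VarStore::new(Device::Cpu);
    let model = net(&var_store.root());

    // Dummy batch of one observation: shape [1, INPUT_DIM].
    let obs = Tensor::zeros(&[1, INPUT_DIM], (Kind::Float, Device::Cpu));
    let out = tch::no_grad(|| model.forward(&obs));

    // Expect a [1, OUTPUT_DIM] action tensor back.
    assert_eq!(out.size(), vec![1, OUTPUT_DIM]);
}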

Trace here