How to store device information per thread?

I want to launch a function from multiple threads, where each thread may run in a different device context. To do that, I want to use the C++ thread_local keyword.

I wrote the code shown below. Importing the extension fails with an undefined symbol error, which looks like this:

ImportError: /home/user/anaconda3/.../_my_ext.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN10foo14MyDeviceInfoD1Ev

However, this code closely follows the way thread_local is used in ThreadLocalDebugInfo.cpp.

Could you give me a hint on how to solve this problem? How should I store device information per thread?

// mydevice.h
#pragma once
#include <c10/core/Device.h>

namespace foo {
class MyDeviceInfo {
 private:
  c10::Device device_;
  // ...
 public:
  MyDeviceInfo() : device_(c10::Device(c10::DeviceType::CPU, 0)) {}
  MyDeviceInfo(c10::Device device) : device_(device) {}
  c10::Device& get_device() { return device_; }
  // ...
};

// Accessors for the calling thread's MyDeviceInfo.
MyDeviceInfo& get_tls_MyDeviceInfo();
void set_tls_MyDeviceInfo(c10::Device device);
} // namespace foo

//-------------------------

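// mydevice.cpp
#include <memory>
#include "mydevice.h"

// One MyDeviceInfo per thread; null until set_tls_MyDeviceInfo() runs on that thread.
thread_local std::shared_ptr<foo::MyDeviceInfo> device_info;

namespace foo {
void set_tls_MyDeviceInfo(c10::Device device) {
  device_info = std::make_shared<MyDeviceInfo>(device);
}

MyDeviceInfo& get_tls_MyDeviceInfo() {
  // Return the info object itself; get_device() would yield a c10::Device&,
  // which does not match the declared return type.
  return *device_info;
}
} // namespace foo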

The problem seems to lie in c10::Device. If I split the c10::Device member into two parts, a device type and an index, like this:

class foo {
 private:
  c10::DeviceType device_type_;
  int16_t index_;

 public:
  // Initializers are listed in member declaration order.
  foo() : device_type_(c10::DeviceType::CPU), index_(-1) {}
  foo(c10::Device device) : device_type_(device.type()), index_(device.index()) {}
  foo(c10::DeviceType device_type, int index) : device_type_(device_type), index_(index) {}
  int get_index() const { return index_; }
  c10::DeviceType get_device_type() const { return device_type_; }
};

This works fine. Is this behavior documented anywhere? Why is using c10::Device as a member wrong?
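
For reference, here is a minimal, self-contained sketch of the per-thread pattern I am after. It deliberately does not depend on PyTorch: FakeDevice is a hypothetical stand-in for c10::Device, and the names demo, set_tls_device, get_tls_device and run_demo are made up for illustration only. It shows just the thread_local part, where each thread stores its own value and never sees the other thread's value. It does not reproduce the link error.

```cpp
#include <cassert>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical stand-in for c10::Device so the sketch builds without PyTorch.
struct FakeDevice {
  int type;   // made-up encoding: 0 = CPU, 1 = CUDA
  int index;  // device ordinal
};

namespace demo {

// One slot per thread; it starts out empty in every new thread.
thread_local std::shared_ptr<FakeDevice> tls_device;

void set_tls_device(FakeDevice device) {
  tls_device = std::make_shared<FakeDevice>(device);
}

// Returns the calling thread's device, or a CPU default if none was set.
FakeDevice get_tls_device() {
  return tls_device ? *tls_device : FakeDevice{0, 0};
}

// Runs two threads that each store a different device and read it back.
std::pair<int, int> run_demo() {
  int seen_a = -1;
  int seen_b = -1;
  std::thread a([&] {
    assert(!tls_device);  // a fresh thread has nothing stored yet
    set_tls_device({1, 0});
    seen_a = get_tls_device().index;
  });
  std::thread b([&] {
    assert(!tls_device);
    set_tls_device({1, 1});
    seen_b = get_tls_device().index;
  });
  a.join();
  b.join();
  return {seen_a, seen_b};
}

}  // namespace demo
```

On Linux this needs to be built with -pthread. Calling demo::run_demo() from the main thread leaves the main thread's own slot untouched, so demo::get_tls_device() there still returns the CPU default.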