Ray Tune and "The actor ImplicitFunc is very large" error

Greetings to the community!!

I am trying to grid search some parameters of my training function using Ray Tune.
The input data to train_cifar() used for training and testing are two lists of dimensions
400x13000 and 40x13000, respectively.

Due to the data size I cannot post a fully reproducible example, but below I show three
different ways I have tried to tune my model with Ray Tune.

In each case I receive the following error:

The actor ImplicitFunc is very large (95 MiB). Check that its definition is not implicitly
capturing a large array or other object in scope. Tip: use ray.put() to put large objects
in the Ray object store.
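
(For context, the ray.put() pattern the tip refers to looks roughly like this; a minimal sketch with made-up data, not my actual code:)

import ray

ray.init()
big_list = list(range(1_000_000))
ref = ray.put(big_list)   # store the object once in the shared object store
restored = ray.get(ref)   # workers fetch it through the small reference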

or this one:

debug_error_string = "{"created":"@1643300850.335447653","description":"Error received from peer ipv4:172.28.0.2:45437","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"Received message larger than max (137418486 vs. 104857600)","grpc_status":8}"

I don't understand what the 95 MiB limit refers to, since my lists are really small.
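
(As far as I can tell, the 95 MiB is not a fixed limit but the measured pickled size of the trainable itself, so anything the function captures from the surrounding scope counts toward it. A minimal sketch to measure it, using the cloudpickle bundled with Ray; train_cifar is the trainable defined below:)

from ray import cloudpickle

size_bytes = len(cloudpickle.dumps(train_cifar))
print(size_bytes / 1024 ** 2, "MiB")   # roughly the size Ray reports for the actor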

Any ideas of what I am doing wrong?

I am running the following code on Google Colab.

Kostas

CODE I

import ray
from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler

X_scaled_train = ...
X_scaled_test = ...

ray.init()
X_scaled_train1 = ray.put(X_scaled_train)
X_scaled_test1 = ray.put(X_scaled_test)

def train_cifar(config, data=None, checkpoint_dir=None):
    # The object refs travel through the config dict; resolve them here.
    X_scaled_train2 = ray.get(config["data1"])
    X_scaled_test2 = ray.get(config["data2"])

def tunerTrain():
    config = {
        "data1": X_scaled_train1,
        "data2": X_scaled_test1,
    }
    scheduler = ASHAScheduler(
        ...
    )
    reporter = CLIReporter(
        ...
    )
    result = tune.run(
        train_cifar,  # the data refs already travel via config, so no partial() wrapper
        config=config,
        ...
    )

tunerTrain()

CODE II

X_scaled_train = ...
X_scaled_test = ...

ray.init()
X_scaled_train1 = ray.put(X_scaled_train)
X_scaled_test1 = ray.put(X_scaled_test)

def train_cifar(config, data=None, checkpoint_dir=None):
    # data is a list of object refs: [train, train_trait, test, test_trait]
    X_scaled_train2 = ray.get(data[0])
    X_scaled_test2 = ray.get(data[2])

def tunerTrain():
    config = {
        ...
    }
    scheduler = ASHAScheduler(
        ...
    )
    reporter = CLIReporter(
        ...
    )
    result = tune.run(
        tune.with_parameters(train_cifar,
                             data=[X_scaled_train1, X_scaled_train_trait,
                                   X_scaled_test1, X_scaled_test_trait]),
        ...
    )

tunerTrain()

CODE III

X_scaled_train = ...
X_scaled_test = ...

def train_cifar(config, data=None, checkpoint_dir=None):
    # tune.with_parameters stores data in the object store itself,
    # so the items arrive here as plain objects.
    X_scaled_train2 = data[0]
    X_scaled_test2 = data[2]

def tunerTrain():
    config = {
        ...
    }
    scheduler = ASHAScheduler(
        ...
    )
    reporter = CLIReporter(
        ...
    )
    result = tune.run(
        tune.with_parameters(train_cifar,
                             data=[X_scaled_train, X_scaled_train_trait,
                                   X_scaled_test, X_scaled_test_trait]),
        ...
    )

tunerTrain()
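
(For completeness, a minimal self-contained sketch of the CODE III pattern with dummy data standing in for my real lists; the dummy metric and parameter names are made up:)

import numpy as np
import ray
from ray import tune

X_scaled_train = np.random.rand(400, 13000)
X_scaled_test = np.random.rand(40, 13000)

def train_cifar(config, data=None, checkpoint_dir=None):
    X_train, X_test = data[0], data[1]
    # real training loop elided; report a dummy metric
    tune.report(loss=config["lr"])

def tunerTrain():
    config = {"lr": tune.grid_search([1e-3, 1e-2])}
    tune.run(
        tune.with_parameters(train_cifar, data=[X_scaled_train, X_scaled_test]),
        config=config,
    )

tunerTrain()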

This forum section is meant for PyTorch distributed questions. The code and error you are asking about come from Ray, so it would be best to ask on the Ray forum (https://discuss.ray.io/).
