Help with tee, redirects configuration

Hello,

In my use case, instead of running torchrun, we need to invoke the module as a Python function call. My setup looks like this:

In the stripped-down example below, func is exactly the same as any train function that would normally be called by invoking torchrun script.py.

Instead of passing the script args on the command line, the wrapper function reads them from a config file and calls the main function.

def training_subprocess():
    # get the parameters from some config file (json or yaml)
    parameters = load_parameters()
    func(**parameters)

torchrun configuration and setup:

Since we have to call torchrun programmatically, I created a pydantic model that captures the config. It is modeled after the args that torchrun itself accepts.

from enum import Enum

from pydantic import BaseModel, Field


class StartMethod(str, Enum):
    spawn = "spawn"
    fork = "fork"
    forkserver = "forkserver"

class TorchConfig(BaseModel):
    nnodes: str = Field(default="1:1")
    nproc_per_node: int = Field(default=1)

    rdzv_backend: str | None = Field(default="static")
    rdzv_endpoint: str | None = Field(default="")
    rdzv_id: str | None = Field(default="none")
    rdzv_conf: str | None = Field(default="")

    max_restarts: int | None = Field(default=None)
    monitor_interval: float | None = Field(default=0.1)
    start_method: str | None = Field(default=StartMethod.spawn)
    role: str | None = Field(default="default")
    log_dir: str | None = Field(default="torch_logs")
    redirects: str | None = Field(default="0")
    tee: str | None = Field(default="0")
    master_addr: str | None = Field(default="localhost")
    master_port: str | None = Field(default="29500")
    training_script: str = Field(default="dummy_training_script")
    training_script_args: str = Field(default="")
    logs_specs: str | None = None
    local_addr: str | None = Field(default=None)

    # Optional fields
    local_ranks_filter: str | None = Field(default="0")
    node_rank: int | None = Field(default=0)
    standalone: bool | None = Field(default=None)
    module: bool | None = Field(default=False)
    no_python: bool | None = Field(default=False)
    run_path: bool | None = Field(default=False)

Using the config_from_args function:

I can convert this into a launch_config by:

launch_config, _, _ = config_from_args(**torch_config)
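
For completeness, the glue in my setup looks roughly like this: config_from_args reads plain attributes off its argument, so I build an argparse-style namespace from the pydantic model (this assumes the field names line up one-to-one with torchrun's CLI arguments; model_dump() is pydantic v2, use .dict() on v1):

from argparse import Namespace

from torch.distributed.run import config_from_args

# argparse-style namespace whose attribute names mirror torchrun's CLI args
args = Namespace(**TorchConfig().model_dump())

# config_from_args returns (LaunchConfig, entrypoint, entrypoint_args);
# only the LaunchConfig is needed since we pass our own entrypoint below
launch_config, _, _ = config_from_args(args)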

I can then call elastic_launch:

from torch.distributed.launcher.api import elastic_launch

launcher = elastic_launch(
    launch_config,
    training_subprocess,
)
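
The launcher object is itself a callable; invoking it is what actually starts the workers and it returns a dict mapping local rank to the entrypoint's return value:

# training_subprocess takes no arguments, so the call is empty;
# the result maps local rank -> whatever training_subprocess returns
results = launcher()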

This setup works as expected and behaves just like running torchrun.

I want to redirect stdout and stderr from the workers but am struggling to find the right configuration. The default parameters do not capture any output during the model training phase.

I played around with the redirects and tee parameters, but most of the time I get "io.UnsupportedOperation: fileno" errors; roughly what I tried is shown below.
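
For reference, this is the kind of configuration I have been trying (assuming the redirects/tee strings map to Std the same way torchrun's --redirects/--tee CLI flags do, where "3" means both streams; the worker count is just an example):

torch_config = TorchConfig(
    nproc_per_node=2,      # example worker count
    log_dir="torch_logs",
    redirects="3",         # 3 = redirect both stdout and stderr to log files
    tee="3",               # 3 = also tee both streams to the console
)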

Could you please help with the right configuration values to stream both stdout and stderr? I think I can get them from the torch logs at the end of the run, but it would be nice to see the progress as it happens.

Cheers,
