Error at starting point of textual inversion - stable diffusion

Lorenzo · February 14, 2023, 6:44pm

Hello!
I was trying to run textual inversion (for Stable Diffusion) on an EC2 instance.
I executed the following command:

export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export DATA_DIR="cropped_selfies"

accelerate launch textual_inversion.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="<Lorenzo>" --initializer_token="man" \
  --resolution=512 \
  --train_batch_size=5 \
  --gradient_accumulation_steps=1 \
  --max_train_steps=2000 \
  --learning_rate=0.01 --scale_lr \
  --lr_scheduler="linear" \
  --lr_warmup_steps=10 \
  --output_dir="embeddings"

NB: I took the textual_inversion.py file from the huggingface/diffusers repository on GitHub (diffusers/textual_inversion.py at main).

This is the output I got:

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `1`
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/home/ec2-user/.local/lib/python3.7/site-packages/accelerate/accelerator.py:233: FutureWarning: `logging_dir` is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use `project_dir` instead.
  FutureWarning,
02/14/2023 18:26:24 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cpu
Mixed precision type: no

{'variance_type', 'prediction_type'} was not found in config. Values will be initialized to default values.
{'scaling_factor'} was not found in config. Values will be initialized to default values.
{'conv_in_kernel', 'time_cond_proj_dim', 'conv_out_kernel', 'time_embedding_type', 'num_class_embeds', 'upcast_attention', 'resnet_time_scale_shift', 'timestep_post_act', 'use_linear_projection', 'only_cross_attention', 'mid_block_type', 'dual_cross_attention', 'class_embed_type'} was not found in config. Values will be initialized to default values.
02/14/2023 18:27:00 - INFO - __main__ - ***** Running training *****
02/14/2023 18:27:00 - INFO - __main__ -   Num examples = 3400
02/14/2023 18:27:00 - INFO - __main__ -   Num Epochs = 3
02/14/2023 18:27:00 - INFO - __main__ -   Instantaneous batch size per device = 5
02/14/2023 18:27:00 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 5
02/14/2023 18:27:00 - INFO - __main__ -   Gradient Accumulation steps = 1
02/14/2023 18:27:00 - INFO - __main__ -   Total optimization steps = 2000
Steps:   0%|                                                                                   | 0/2000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/ec2-user/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ec2-user/.local/lib/python3.7/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
    simple_launcher(args)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/accelerate/commands/launch.py", line 552, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'textual_inversion.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--train_data_dir=cropped_selfies', '--learnable_property=object', '--placeholder_token=<Lorenzo>', '--initializer_token=man', '--resolution=512', '--train_batch_size=5', '--gradient_accumulation_steps=1', '--max_train_steps=2000', '--learning_rate=0.01', '--scale_lr', '--lr_scheduler=linear', '--lr_warmup_steps=10', '--output_dir=embeddings']' died with <Signals.SIGKILL: 9>.

I have no idea where the problem comes from. Any ideas?

suraj.pt · February 14, 2023, 6:50pm

The actual line throwing the error might be further below in the stacktrace. What’s the full error output?

Lorenzo · February 14, 2023, 7:23pm

No. The error ends at: <Signals.SIGKILL: 9>.
That is the full error, hence my confusion.

Lorenzo · February 14, 2023, 7:42pm

Oh, sorry! I just thought about it: I chose a t2.large EC2 instance, and I'm not even sure the machine has a GPU. Maybe my problem comes from that. I'll try another EC2 instance that I know has a GPU, and I'll update this post.
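
One way to check for a GPU before relaunching is to ask PyTorch whether it can see a CUDA device. The "Device: cpu" line in the log above already suggests the run did not use one. The snippet below is only a sketch: it assumes PyTorch is importable by the same Python interpreter that runs textual_inversion.py, and the helper name describe_device is made up for illustration.

```python
import importlib.util


def describe_device():
    """Report which device PyTorch would use (hypothetical helper name)."""
    # If torch cannot be imported at all, no GPU training is possible here.
    if importlib.util.find_spec("torch") is None:
        return "cpu (torch not installed)"
    import torch

    # torch.cuda.is_available() is True only when a usable CUDA GPU is visible.
    return "cuda" if torch.cuda.is_available() else "cpu"


print(describe_device())
```

If this prints "cpu", the training would run on the CPU and rely on host RAM alone.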

ptrblck · February 14, 2023, 8:15pm

died with <Signals.SIGKILL: 9> sometimes indicates an out-of-memory (OOM) issue on the host, as also described here. You might find a related message in dmesg, which could explain what exactly killed the process.
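
A minimal sketch of that dmesg check, assuming a Linux host where the dmesg command is available (on some systems it needs root). The function names and the search pattern are illustrative assumptions; the exact wording of kernel OOM messages varies between kernel versions, so the pattern may need adjusting.

```python
import re
import subprocess

# Phrases that commonly appear in kernel OOM-killer messages (assumed pattern,
# not an exhaustive list).
OOM_PATTERN = re.compile(r"out of memory|oom-kill|killed process", re.IGNORECASE)


def find_oom_lines(log_text):
    """Return the lines of log_text that look like OOM-killer messages."""
    return [line for line in log_text.splitlines() if OOM_PATTERN.search(line)]


def read_kernel_log():
    """Return the output of dmesg, or an empty string if it cannot be run."""
    try:
        result = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=False
        )
    except OSError:
        # dmesg is missing or not executable on this system.
        return ""
    return result.stdout


if __name__ == "__main__":
    for line in find_oom_lines(read_kernel_log()):
        print(line)
```

If a matching line names the python3 process from the training run, the host most likely ran out of RAM; a smaller --train_batch_size or an instance with more memory would be the things to try.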