Error at starting point of textual inversion - stable diffusion

Hello!
I was trying to use an EC2 instance to run textual inversion (for Stable Diffusion).
I executed the following command:

export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export DATA_DIR="cropped_selfies"

accelerate launch textual_inversion.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="<Lorenzo>" --initializer_token="man" \
  --resolution=512 \
  --train_batch_size=5 \
  --gradient_accumulation_steps=1 \
  --max_train_steps=2000 \
  --learning_rate=0.01 --scale_lr \
  --lr_scheduler="linear" \
  --lr_warmup_steps=10 \
  --output_dir="embeddings"

NB: I took the textual_inversion.py file from here: diffusers/textual_inversion.py at main · huggingface/diffusers · GitHub

I got the following output:

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `1`
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/home/ec2-user/.local/lib/python3.7/site-packages/accelerate/accelerator.py:233: FutureWarning: `logging_dir` is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use `project_dir` instead.
  FutureWarning,
02/14/2023 18:26:24 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cpu
Mixed precision type: no

{'variance_type', 'prediction_type'} was not found in config. Values will be initialized to default values.
{'scaling_factor'} was not found in config. Values will be initialized to default values.
{'conv_in_kernel', 'time_cond_proj_dim', 'conv_out_kernel', 'time_embedding_type', 'num_class_embeds', 'upcast_attention', 'resnet_time_scale_shift', 'timestep_post_act', 'use_linear_projection', 'only_cross_attention', 'mid_block_type', 'dual_cross_attention', 'class_embed_type'} was not found in config. Values will be initialized to default values.
02/14/2023 18:27:00 - INFO - __main__ - ***** Running training *****
02/14/2023 18:27:00 - INFO - __main__ -   Num examples = 3400
02/14/2023 18:27:00 - INFO - __main__ -   Num Epochs = 3
02/14/2023 18:27:00 - INFO - __main__ -   Instantaneous batch size per device = 5
02/14/2023 18:27:00 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 5
02/14/2023 18:27:00 - INFO - __main__ -   Gradient Accumulation steps = 1
02/14/2023 18:27:00 - INFO - __main__ -   Total optimization steps = 2000
Steps:   0%|                                                                                   | 0/2000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/ec2-user/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ec2-user/.local/lib/python3.7/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
    simple_launcher(args)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/accelerate/commands/launch.py", line 552, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'textual_inversion.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--train_data_dir=cropped_selfies', '--learnable_property=object', '--placeholder_token=<Lorenzo>', '--initializer_token=man', '--resolution=512', '--train_batch_size=5', '--gradient_accumulation_steps=1', '--max_train_steps=2000', '--learning_rate=0.01', '--scale_lr', '--lr_scheduler=linear', '--lr_warmup_steps=10', '--output_dir=embeddings']' died with <Signals.SIGKILL: 9>.

I have no idea where the problem comes from. Any ideas?

The actual line throwing the error might be further below in the stacktrace. What’s the full error output?

No. The error ends at: <Signals.SIGKILL: 9>.
That is the full error output, hence my confusion.

Oh, sorry! I just thought about it: I chose a t2.large EC2 instance, and I'm not even sure there is a GPU on the machine (the log above does say "Device: cpu"). Maybe my problem comes from that. I'll try with another EC2 instance that I'm sure has a GPU, and I'll update this post.
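A quick way to confirm whether the machine exposes a GPU before launching training is sketched below. This is only a minimal check, not part of the training script: the function name gpu_report is made up for illustration, and it assumes that an NVIDIA GPU would show up either through the nvidia-smi tool on the PATH or through torch.cuda.is_available().

```python
import shutil


def gpu_report():
    """Return a small dict describing whether a CUDA GPU looks usable.

    `gpu_report` is a hypothetical helper name used only for this example.
    """
    report = {
        # nvidia-smi ships with the NVIDIA driver; absence suggests no driver/GPU.
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "torch_installed": False,
        "cuda_available": False,
    }
    try:
        import torch  # imported lazily so the check also runs without PyTorch
    except ImportError:
        return report

    report["torch_installed"] = True
    # True only if PyTorch was built with CUDA and can see at least one device.
    report["cuda_available"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If cuda_available comes back False, accelerate will fall back to the CPU, which matches the "Device: cpu" line in the log.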

"died with <Signals.SIGKILL: 9>" sometimes indicates an out-of-memory (OOM) issue on the host, as also described here. You might find a similar message in dmesg, which could explain what exactly killed the process.
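To make the dmesg check concrete, here is a rough sketch that scans kernel log text for OOM-killer entries. The helper names (find_oom_kills, read_dmesg) are hypothetical, and the pattern assumes the usual Linux wording "Kill process <pid> (<name>)" or "Killed process <pid> (<name>)"; the exact message can differ between kernel versions, so treat a missing match as inconclusive.

```python
import re
import subprocess

# Assumed wording of Linux OOM-killer log lines; may vary by kernel version.
OOM_PATTERN = re.compile(r"Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(log_text):
    """Return unique (pid, process_name) pairs the OOM killer reported."""
    matches = [(int(pid), name) for pid, name in OOM_PATTERN.findall(log_text)]
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(matches))


def read_dmesg():
    """Return the output of `dmesg`, or an empty string if it is unavailable.

    On many systems reading the kernel log requires root privileges.
    """
    try:
        result = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return ""
    return result.stdout
```

Calling find_oom_kills(read_dmesg()) right after the crash should list the killed training process if the kernel ran out of memory. If it does, the usual options are an instance with more RAM (or a GPU) or a smaller --train_batch_size.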