How do I determine the longest sequence length that fits into memory?

I am trying to fine-tune a transformer embedding model on long documents of around 8,000-10,000 tokens, but I keep running into CUDA out-of-memory errors. The model itself is only a few hundred megabytes and easily fits in my 40 GB of GPU memory. I tried distributed training with DeepSpeed and FSDP to shard the model across 8 GPUs, but I still run out of memory.

I am wondering how I can determine the longest sequence length I can train this model with, either through calculation or code?
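For reference, this is roughly the empirical probe I was imagining: a binary search over candidate lengths, where a `fits(seq_len)` callable would run one forward/backward step on a dummy batch of that length and return `False` if it raises `torch.cuda.OutOfMemoryError`. The search helper below is just a sketch (the `fits` function is a placeholder I would have to write for my model), and it assumes fitting is monotone in length:

```python
def max_fitting_length(fits, lo: int = 512, hi: int = 16384) -> int:
    """Binary-search the largest sequence length for which fits(length) is True.

    Assumes monotonicity: if a length runs out of memory, so do all longer ones.
    In practice, fits() would run a full training step (forward + backward +
    optimizer step) at that length and catch torch.cuda.OutOfMemoryError,
    calling torch.cuda.empty_cache() before returning False.
    """
    best = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if fits(mid):
            best = mid       # mid fits; try longer sequences
            lo = mid + 1
        else:
            hi = mid - 1     # mid OOMs; try shorter sequences
    return best


# Example with a fake fits() that pretends 6000 tokens is the limit:
print(max_fitting_length(lambda length: length <= 6000))  # → 6000
```

Is something like this a reasonable way to measure the limit, or is there a more principled calculation?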
Additionally, shouldn’t FSDP/DeepSpeed be supplying enough memory? My intuition says that 8 × 40 GB should be plenty for a sequence length of 10,000 and a ~200 MB model, but I don’t know how to verify that.
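My rough attempt at checking the numbers is below. I'm assuming vanilla (non-flash) attention, which materialises a full `[heads, seq_len, seq_len]` score matrix per layer; the head count is a placeholder, not necessarily my model's actual config:

```python
# Back-of-envelope estimate of attention-score memory for ONE layer,
# ONE example, assuming standard attention materialises the full
# [heads, seq_len, seq_len] score matrix in fp16 (2 bytes/element).
heads = 12                  # placeholder head count
seq_len = 10_000
bytes_per_elem = 2          # fp16
scores_per_layer = heads * seq_len**2 * bytes_per_elem
print(f"{scores_per_layer / 1e9:.1f} GB per layer per example")  # → 2.4 GB
```

If this is right, the quadratic attention term alone dwarfs the 200 MB of weights, but I'm not sure whether FSDP/DeepSpeed actually shard these activations across GPUs the way they shard parameters. Can someone confirm whether this calculation is the right way to reason about it?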