Flash attention with padding mask or nested tensors

Hi everyone,
I’m trying to figure out how to use flash attention for long sequences of variable length during training. Flash attention currently doesn’t support (padding) masks.
People have suggested nested tensors, but with flash attention those seem to work only in evaluation (the backward pass isn’t supported). Another possibility is to manually set the padded key/query/value elements to -inf or 0 to imitate masking. Are there any other options for using flash attention with variable-length sequences?
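For context, here is roughly what the nested tensor attempt looks like. This is a minimal sketch (made-up sizes, assuming PyTorch 2.x on CUDA with the `torch.backends.cuda.sdp_kernel` context manager): the forward pass runs, but backward through nested tensors is what seems to be unsupported.

```python
import torch
import torch.nn.functional as F

num_heads, head_dim = 8, 64
lengths = [5, 7, 2]  # three sequences of different lengths, no padding anywhere

# One (seq_len, num_heads, head_dim) tensor per sequence; dim 0 is the jagged dim.
def make_nested():
    return torch.nested.nested_tensor(
        [torch.randn(L, num_heads, head_dim, dtype=torch.float16) for L in lengths],
        device="cuda",  # the flash backend is CUDA-only and needs fp16/bf16
    )

# scaled_dot_product_attention expects (batch, num_heads, seq_len, head_dim),
# so swap the sequence and head dims.
query, key, value = (make_nested().transpose(1, 2) for _ in range(3))

# No attn_mask needed: with nested tensors the padding simply doesn't exist.
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    out = F.scaled_dot_product_attention(query, key, value)
```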

Were you able to figure it out? I’m also facing the same issue.

Could you share the code snippet where you’re calling scaled_dot_product_attention? Or are you using the default nn.MultiheadAttention?
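I ask because the two entry points take masks differently, so the answer depends on which one you use. A rough sketch with made-up sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, L, H, E = 4, 128, 8, 64

# Functional call: attn_mask broadcasts to (B, H, L, L); passing a non-None
# mask generally routes SDPA away from the flash backend.
q = k = v = torch.randn(B, H, L, E)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=None)

# Module call: key_padding_mask is (B, L), where True marks padded positions.
mha = nn.MultiheadAttention(embed_dim=H * E, num_heads=H, batch_first=True)
x = torch.randn(B, L, H * E)
pad = torch.zeros(B, L, dtype=torch.bool)
out2, _ = mha(x, x, x, key_padding_mask=pad)
```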