| Topic | Replies | Views | Activity |
|---|---|---|---|
| Support for Ulysses/Ring distributed attention for long-context training (32k) for 32B dense models | 0 | 154 | September 15, 2025 |
| WebDataset Multi-GPU Single-Node | 3 | 278 | September 15, 2025 |
| DDP overwriting a buffer with random values | 1 | 33 | September 15, 2025 |
| Low-level errors when retrying training after OOMs | 3 | 107 | September 12, 2025 |
| Proper way to combine Tensor subclass with FSDP | 2 | 60 | September 8, 2025 |
| Cannot execute loss.backward() for training a specific layer | 1 | 40 | September 8, 2025 |
| DDP does not work with custom gradient (backward) computations | 3 | 117 | September 5, 2025 |
| Avoid OOM due to optimizer state in DDP | 6 | 115 | September 4, 2025 |
| Work vs. Future sync primitives for Distributed Torch backends | 2 | 75 | September 4, 2025 |
| Does FSDP2 support shared modules | 1 | 95 | September 2, 2025 |
| OOM When Resuming From Checkpoint XLA | 0 | 37 | September 1, 2025 |
| Multi-GPU training hangs: Watchdog caught collective operation timeout | 16 | 16736 | August 31, 2025 |
| Zero optimizer.consolidate_state_dict(to=0) hangs | 3 | 54 | August 31, 2025 |
| Switching between multi-processing and main process | 3 | 1012 | August 30, 2025 |
| [Distributed w/ TorchTitan] Breaking Barriers: Training Long Context LLMs with 1M Sequence Length in PyTorch Using Context Parallel | 11 | 9085 | August 29, 2025 |
| FSDP model with non-standard gradient computation | 2 | 70 | August 29, 2025 |
| DDP Training Issue | 0 | 44 | August 28, 2025 |
| Handling signals in distributed train loop | 5 | 351 | August 25, 2025 |
| Support for Overlapping AllGather and ReduceScatter in FSDP | 2 | 97 | August 25, 2025 |
| In NVIDIA container environments, PyTorch's NCCL allreduce operation exhibits extremely poor performance | 3 | 110 | August 23, 2025 |
| Torch.distributed.dcp.save does not save on all ranks | 2 | 86 | August 22, 2025 |
| Capture training graph with collectives via TorchTitan | 8 | 231 | August 15, 2025 |
| Using DistributedDataParallel with dataloader num_workers > 0 | 2 | 3880 | August 15, 2025 |
| In multi-processing, when one process exits unexpectedly, how to get others out of hang? | 0 | 36 | August 13, 2025 |
| FSDP.HYBRID_SHARD leads to parameter inconsistency between two DP replicas | 0 | 38 | August 6, 2025 |
| Purpose and communication of set reshard_after_forward=int in fsdp2 | 0 | 66 | August 6, 2025 |
| Variable batch size in Multi-GPU trainings | 3 | 87 | July 31, 2025 |
| Question about GPU memory usage when using pipeline parallelism training under larger micro batch count | 4 | 150 | July 30, 2025 |
| Can I shard a subset of weights and replicate others in FSDP2? | 0 | 39 | July 30, 2025 |
| Gradient not accumulated across nodes in deepspeed code | 2 | 126 | July 27, 2025 |