| Cannot execute loss.backward() for training a specific layer | 1 | 65 | September 8, 2025 |
| DDP does not work with custom gradient (backward) computations | 3 | 202 | September 5, 2025 |
| Avoid OOM due to optimizer state in DDP | 6 | 216 | September 4, 2025 |
| Work vs. Future sync primitives for Distributed Torch backends | 2 | 95 | September 4, 2025 |
| Does FSDP2 support shared modules | 1 | 139 | September 2, 2025 |
| OOM When Resuming From Checkpoint XLA | 0 | 64 | September 1, 2025 |
| Multi-GPU training hangs: Watchdog caught collective operation timeout | 16 | 17315 | August 31, 2025 |
| Zero optimizer.consolidate_state_dict(to=0) hangs | 3 | 83 | August 31, 2025 |
| Switching between multi-processing and main process | 3 | 1055 | August 30, 2025 |
| FSDP model with non-standard gradient computation | 2 | 109 | August 29, 2025 |
| DDP Training Issue | 0 | 68 | August 28, 2025 |
| Handling signals in distributed train loop | 5 | 472 | August 25, 2025 |
| Support for Overlapping AllGather and ReduceScatter in FSDP | 2 | 157 | August 25, 2025 |
| In NVIDIA container environments, PyTorch's NCCL allreduce operation exhibits extremely poor performance | 3 | 163 | August 23, 2025 |
| Torch.distributed.dcp.save does not save on all ranks | 2 | 155 | August 22, 2025 |
| Capture training graph with collectives via TorchTitan | 8 | 312 | August 15, 2025 |
| Using DistributedDataParallel with dataloader num_workers > 0 | 2 | 3941 | August 15, 2025 |
| In multi-processing, when one process exits unexpectedly, how to get others out of hang? | 0 | 54 | August 13, 2025 |
| FSDP.HYBRID_SHARD leads to parameter inconsistency between two DP replicas | 0 | 51 | August 6, 2025 |
| Purpose and communication of set reshard_after_forward=int in fsdp2 | 0 | 99 | August 6, 2025 |
| Variable batch size in Multi-GPU trainings | 3 | 134 | July 31, 2025 |
| Question about GPU memory usage when using pipeline parallelism training under larger micro batch count | 4 | 223 | July 30, 2025 |
| Can I shard a subset of weights and replicate others in FSDP2? | 0 | 59 | July 30, 2025 |
| Gradient not accumulated across nodes in deepspeed code | 2 | 176 | July 27, 2025 |
| Ddp training and eval question | 2 | 79 | July 26, 2025 |
| Continued pre-training large models with FSDP2? | 2 | 128 | July 26, 2025 |
| FullyShardedDataParallel hangs depending on wrap policy for Llama-3.2-1B | 1 | 272 | July 26, 2025 |
| Why would functional and non-functional broadcast use `src` with different semantics? | 1 | 64 | July 26, 2025 |
| Why does init_device_mesh() or DeviceMesh() have to be called globally? | 3 | 148 | July 22, 2025 |
| [DCP] how to load dcp ckpts? | 3 | 103 | July 21, 2025 |