| Topic | Replies | Views | Activity |
| --- | --- | --- | --- |
| About the distributed category | 1 | 2795 | January 20, 2021 |
| Variable batch size in Multi-GPU trainings | 3 | 28 | July 31, 2025 |
| Question about GPU memory usage when using pipeline parallelism training under larger micro batch count | 4 | 61 | July 30, 2025 |
| Can I shard a subset of weights and replicate others in FSDP2? | 0 | 9 | July 30, 2025 |
| Gradient not accumulated across nodes in deepspeed code | 2 | 56 | July 27, 2025 |
| Ddp training and eval question | 2 | 18 | July 26, 2025 |
| Continued pre-training large models with FSDP2? | 2 | 40 | July 26, 2025 |
| FullyShardedDataParallel hangs depending on wrap policy for Llama-3.2-1B | 1 | 43 | July 26, 2025 |
| Why would functional and non-functional broadcast use `src` with different semantics? | 1 | 23 | July 26, 2025 |
| Why does init_device_mesh() or DeviceMesh() have to be called globally? | 3 | 46 | July 22, 2025 |
| Work vs. Future sync primitives for Distributed Torch backends | 1 | 41 | July 21, 2025 |
| [DCP] how to load dcp ckpts? | 3 | 29 | July 21, 2025 |
| Dist.all_gather with uneven tensor sizes | 1 | 38 | July 20, 2025 |
| Does elastic torch support model parallelism | 1 | 22 | July 20, 2025 |
| FSDP2 and gradient w.r.t. inputs | 1 | 25 | July 20, 2025 |
| Gathering dictionaries of DistributedDataParallel | 11 | 4013 | July 17, 2025 |
| Copying params between 2 identically sharded (FSDP) networks | 2 | 47 | July 16, 2025 |
| Split backward into multiple gpus | 2 | 56 | July 16, 2025 |
| NCCL timeout when reducing batch size | 1 | 42 | July 15, 2025 |
| How to Efficiently Gather Python Objects Across GPUs Without GPU-to-CPU-to-GPU-to-CPU Overhead in torch.distributed? | 1 | 31 | July 11, 2025 |
| torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3368, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.25.1 | 2 | 108 | July 8, 2025 |
| FSDP1/FSDP2 sharding across CPUs? | 2 | 41 | July 8, 2025 |
| NotImplementedError: could not run 'aten::as_strided' | 1 | 27 | July 7, 2025 |
| BatchNorm for multi GPU Training | 9 | 4661 | July 7, 2025 |
| DTensor TP collectives missing? | 0 | 29 | July 3, 2025 |
| Capture training graph with collectives via TorchTitan | 6 | 94 | June 28, 2025 |
| [Distributed w/ TorchTitan] FLUX is Here: Experience Diffusion Model Training on TorchTitan | 0 | 705 | June 27, 2025 |
| [Distributed w/ TorchTitan] Breaking Barriers: Training Long Context LLMs with 1M Sequence Length in PyTorch Using Context Parallel | 9 | 6936 | June 27, 2025 |
| Socket error - broken pipe during rendezvous | 7 | 162 | June 24, 2025 |
| Using IterableDataset with DistributedDataParallel | 9 | 10203 | June 23, 2025 |