| Topic | Replies | Views | Activity |
|---|---|---|---|
| About the distributed category | 1 | 2786 | January 20, 2021 |
| Copying params between 2 identically sharded (FSDP) networks | 1 | 22 | July 9, 2025 |
| torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3368, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.25.1 | 2 | 11 | July 8, 2025 |
| Gradient not accumulated across nodes in deepspeed code | 0 | 12 | July 8, 2025 |
| FSDP1/FSDP2 sharding across CPUs? | 2 | 20 | July 8, 2025 |
| NotImplementedError: could not run 'aten::as_strided' | 1 | 12 | July 7, 2025 |
| BatchNorm for multi GPU Training | 9 | 4619 | July 7, 2025 |
| How to Efficiently Gather Python Objects Across GPUs Without GPU-to-CPU-to-GPU-to-CPU Overhead in torch.distributed? | 0 | 10 | July 6, 2025 |
| DTensor TP collectives missing? | 0 | 15 | July 3, 2025 |
| Capture training graph with collectives via TorchTitan | 6 | 63 | June 28, 2025 |
| [Distributed w/ TorchTitan] FLUX is Here: Experience Diffusion Model Training on TorchTitan | 0 | 458 | June 27, 2025 |
| [Distributed w/ TorchTitan] Breaking Barriers: Training Long Context LLMs with 1M Sequence Length in PyTorch Using Context Parallel | 9 | 6535 | June 27, 2025 |
| Work vs. Future sync primitives for Distributed Torch backends | 0 | 24 | June 25, 2025 |
| Socket error - broken pipe during rendezvous | 7 | 130 | June 24, 2025 |
| Using IterableDataset with DistributedDataParallel | 9 | 10092 | June 23, 2025 |
| Model.to(device) vs. tensor.to(device) | 2 | 91 | June 18, 2025 |
| Distributed learning in windows | 0 | 11 | June 18, 2025 |
| Tensor parallel numeric mismatch | 1 | 25 | June 18, 2025 |
| Unexplained behaviour in accumulate gradients vs in a ddp setting - why are the gradients different? | 2 | 29 | June 17, 2025 |
| When variables are transferred between GPUs, their values change | 6 | 68 | June 10, 2025 |
| How to interpret the output of CommDebugMode | 0 | 10 | June 9, 2025 |
| Using torch rpc to connect to remote machine | 2 | 1021 | June 7, 2025 |
| WebDataset Multi-GPU Single-Node | 1 | 38 | June 2, 2025 |
| NCCL+Torch Distributed Error | 2 | 31 | June 2, 2025 |
| Potential Bug with HYBRID_SHARD and (n, 1) Device Mesh Falling Back to NO_SHARD | 1 | 26 | June 2, 2025 |
| FSDP hybrid sharding on multiple nodes | 1 | 133 | March 31, 2025 |
| PyTorch Multiprocessing: Train only some parameters at some epochs | 2 | 45 | May 29, 2025 |
| Stalling on Simple Distributed Barrier | 12 | 145 | May 30, 2025 |
| Unable to complete training on Multi-GPU setup | 5 | 119 | May 29, 2025 |
| [Distributed w/ TorchTitan] Introducing Async Tensor Parallelism in PyTorch | 11 | 12496 | May 29, 2025 |