[PyTorch] Expert Parallelism: PyTorch wrapper + autograd ops with symm-mem zero-copy #3035

Open
phu0ngng wants to merge 39 commits into NVIDIA:main from phu0ngng:phuong/ep-3-pytorch-on-commwindow

@phu0ngng phu0ngng commented May 22, 2026

Collaborator

Summary

Second PR in the TE Expert Parallelism (EP) series. Adds the PyTorch binding on top of the common C API (#3034): exposes EP dispatch/combine as torch.library custom ops with autograd, and plumbs NCCL symmetric-memory windows through for the zero-copy path.

Payload tensors allocated via te.pytorch.ep.symm_mem_alloc take the one-sided zero-copy path when EP is bootstrapped with ep_bootstrap(zero_copy=True); any other tensor falls back to the staged-copy path, so the API stays drop-in compatible with any allocator.

Implementation

Public Python API (transformer_engine/pytorch/ep.py)

    from transformer_engine.pytorch.ep import (
        EpBuffer, ep_bootstrap, ep_finalize,
        ep_dispatch, ep_combine,
        symm_mem_alloc,
    )
  • ep_bootstrap / ep_finalize - one-time per-process init/teardown. ep_bootstrap borrows the NCCL comm from ep_group via ProcessGroupNCCL._comm_ptr() (no separate ncclUniqueId bootstrap) and requires ep_group.size() >= 2. ep_finalize is optional because an atexit handler covers normal shutdown; when tearing down the process group manually, call it explicitly before dist.destroy_process_group().
  • symm_mem_alloc(shape, dtype, ep_group) - per-rank tensor backed by NCCL symmetric memory, rendezvoused on ep_group.
  • EpBuffer - per-layer state: routing handle + persistent payload slots (recv_tokens, combine_in, grad buffers). One per concurrently-in-flight call (e.g. PP-1F1B microbatch). Symm-mem-backed when zero_copy=True.
  • ep_dispatch / ep_combine - autograd-aware per-step ops, registered as torch.library.custom_op with correct mutates_args, so they compose with torch.compile fullgraph and CUDA graphs.
    Current payload dtype is restricted to bfloat16; FP8 quantize/dequantize stays outside the EP boundary.
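
The "one EpBuffer per concurrently in-flight call" rule above can be illustrated with a small round-robin pool. This is only a sketch: `FakeEpBuffer` and `BufferPool` are hypothetical stand-ins that do not exist in this PR, and a real pool would hold actual EpBuffer instances.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeEpBuffer:
    """Hypothetical stand-in for EpBuffer; it only records which microbatch owns it."""
    slot: int
    owner: Optional[int] = None


class BufferPool:
    """Hands out one buffer per in-flight microbatch and refuses to share a busy one."""

    def __init__(self, num_in_flight: int):
        self.buffers = [FakeEpBuffer(slot=i) for i in range(num_in_flight)]

    def acquire(self, microbatch: int) -> FakeEpBuffer:
        # Round-robin slot selection; a busy slot means two calls would overlap.
        buf = self.buffers[microbatch % len(self.buffers)]
        if buf.owner is not None:
            raise RuntimeError(
                f"slot {buf.slot} is still in flight for microbatch {buf.owner}"
            )
        buf.owner = microbatch
        return buf

    def release(self, buf: FakeEpBuffer) -> None:
        # Called once the backward pass for that microbatch has finished.
        buf.owner = None
```

With three slots, as in the PP-1F1B 3-buffer test listed under Testing, microbatch 3 can reuse slot 0 only after microbatch 0 has been released.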

C++ bindings (transformer_engine/pytorch/csrc/extensions/ep.cpp)

  • POD-only pybind boundary (primitives + pybind11::object for dtype) - no c10d ABI on the boundary.
  • maybe_make_window() resolves each payload tensor to an NVTECommWindow via c10d::symmetric_memory::rendezvous; non-symm-mem tensors return kNoWindow and the backend picks staged-copy automatically.
  • Zero-copy toggle captured at ep_initialize and forwarded into NVTEEpGroupConfig.zero_copy.

Build

build_tools/pytorch.py propagates -DNVTE_WITH_NCCL_EP (gated on NVTE_BUILD_WITH_NCCL_EP=1, default on) and -DUSE_NCCL so PyTorch's symm-mem feature macros are visible. When NCCL EP is off, ep.cpp no-ops behind the #ifdef.
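
As a rough illustration of the gating described above, the flag selection could look like the sketch below. `ep_compile_flags` is a hypothetical helper, not the code in build_tools/pytorch.py. It assumes that an unset variable means "on", that only the value "0" turns the feature off, and that both flags are gated together; the PR description confirms none of these details.

```python
import os


def ep_compile_flags(env=None):
    """Return the extra compile flags for the EP extension (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumption: unset means enabled ("default on"); only "0" disables it.
    enabled = env.get("NVTE_BUILD_WITH_NCCL_EP", "1") != "0"
    if not enabled:
        # With no define, ep.cpp compiles to no-ops behind its #ifdef.
        return []
    # Assumption: -DUSE_NCCL is emitted together with the EP define.
    return ["-DNVTE_WITH_NCCL_EP", "-DUSE_NCCL"]
```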

Testing

  • tests/pytorch/distributed/run_ep.py - 8-test suite: prepare correctness, raw dispatch/combine identity round-trip, dispatch fwd+bwd VJP, full fwd+bwd round-trip, multi-iter bit-stability, CUDA graph capture, PP-1F1B 3-buffer interleave, int64 topk_idx validation. Launcher run_test_ep.sh auto-detects GPUs (skips with <4). Pytest driver: tests/pytorch/distributed/test_ep.py.
  • Example: examples/pytorch/ep/ep_moe.py - minimal end-to-end MoE fwd+bwd driver with --check against an analytical reference.
  • Bench: examples/pytorch/ep/bench/ep_bench.py - times raw + autograd dispatch/combine, optional --cuda-graph capture and --kineto/--nsys profiling.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@phu0ngng phu0ngng requested review from ksivaman and ptrendx as code owners May 22, 2026 02:54
greptile-apps Bot commented May 22, 2026

Contributor

Greptile Summary

This PR adds the PyTorch binding layer for Expert Parallelism (EP) on top of the C API from #3034, exposing ep_dispatch/ep_combine as torch.library.custom_op with full autograd support, and plumbing NCCL symmetric-memory windows for the zero-copy path.

  • transformer_engine/pytorch/ep.py: Implements EpBuffer, ep_bootstrap/ep_finalize, and autograd-aware ep_dispatch/ep_combine wrappers using torch.autograd.Function subclasses; zero-copy mode allocates persistent symm-mem slots via symm_mem_alloc.
  • transformer_engine/pytorch/csrc/extensions/ep.cpp: C++ pybind layer with per-op contiguity/shape validation, maybe_make_window for symm-mem window resolution (staged-copy fallback for non-symm-mem tensors), and safe borrowing of torch's NCCL comm pointer.
  • Tests and examples: 8-test suite covering identity round-trip, VJP correctness, multi-iter stability, CUDA graph capture, PP-1F1B interleaving, and an end-to-end MoE driver with analytical reference check.

Confidence Score: 4/5

The core forward dispatch/combine path is solid, but the backward pass of _EpDispatch can crash with AttributeError when recv_topk_weights does not contribute to the loss — the .contiguous() call is not guarded against a None gradient.

The C++ layer is well-validated with contiguity checks, shape invariants, and staged-copy fallback for non-symm-mem grads. The Python autograd wrappers are structurally correct for the happy path. However, _EpDispatch.backward calls .contiguous() on g_recv_topk_weights without a None guard; the test workaround (0.0 * rw.float().sum()) and the comment about backward not fabricating Nones both confirm this is a real edge case rather than a hypothetical one.

transformer_engine/pytorch/ep.py — specifically _EpDispatch.backward, where the None-gradient case for g_recv_topk_weights is unguarded.

Important Files Changed

| Filename | Overview |
|---|---|
| transformer_engine/pytorch/ep.py | New public EP API; contains known outstanding issues (g_recv_topk_weights None crash in backward) and minor validation gaps. |
| transformer_engine/pytorch/csrc/extensions/ep.cpp | C++ pybind layer; addressed contiguity and token-count validation gaps; backward path correctly uses maybe_make_window instead of check_symm_mem_required for upstream grads. |
| transformer_engine/pytorch/distributed.py | Adds symm_mem_alloc helper; correctly uses variadic *shape with symm_mem.empty (matches PyTorch API) and passes ProcessGroup directly to rendezvous. |
| tests/pytorch/distributed/run_ep.py | Comprehensive 8-test suite covering identity round-trip, VJP, stability, CUDA graph, PP-1F1B; ep_group created in setUpClass but never explicitly destroyed. |
| examples/pytorch/ep/ep_moe.py | End-to-end MoE example with fwd+bwd and analytical reference check; ep_group is created but not explicitly destroyed in the finally block. |
| build_tools/pytorch.py | Adds -DNVTE_WITH_NCCL_EP and -DUSE_NCCL compile flags gated on NVTE_WITH_NCCL_EP env var; straightforward and correct. |
| transformer_engine/pytorch/csrc/extensions.h | Adds EP function declarations with correct signatures; clean addition of cstdint header. |
| transformer_engine/pytorch/csrc/extensions/pybind.cpp | Delegates EP binding registration to register_ep_bindings() guarded by NVTE_WITH_NCCL_EP; minimal and correct change. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant User as User Code
    participant PyEP as ep.py (Python)
    participant CppEP as ep.cpp (C++)
    participant Backend as NVTE EP Backend
    participant NCCL as NCCL / Symm-Mem

    Note over User, NCCL: Bootstrap (once per process)
    User->>PyEP: "ep_bootstrap(ep_group, zero_copy=True)"
    PyEP->>CppEP: ep_initialize(comm_ptr, group_name, ...)
    CppEP->>Backend: nvte_ep_initialize(borrowed_comm, cfg)
    Backend-->>CppEP: OK
    CppEP-->>PyEP: "g_ep_initialized=true"

    Note over User, NCCL: Forward pass (per microbatch)
    User->>PyEP: ep_dispatch(buffer, tokens, topk_idx, topk_weights)
    PyEP->>CppEP: ep_prepare(handle_mem, topk_idx, token_counts, top_k, alignment)
    CppEP->>Backend: nvte_ep_prepare(...)
    Backend->>NCCL: AllGather routing table
    PyEP->>CppEP: ep_dispatch(handle_mem, ..., recv_tokens, recv_topk_weights)
    CppEP->>CppEP: maybe_make_window(recv_tokens)
    CppEP->>Backend: nvte_ep_dispatch(..., tokens_win, recv_tokens_win, ...)
    Backend->>NCCL: One-sided NCCL dispatch (zero-copy or staged)
    NCCL-->>PyEP: recv_tokens, recv_topk_weights filled

    User->>PyEP: ep_combine(buffer, expert_out)
    PyEP->>CppEP: ep_combine(handle_mem, expert_out, result)
    CppEP->>Backend: nvte_ep_combine(...)
    Backend->>NCCL: One-sided NCCL combine
    NCCL-->>User: result tensor

    Note over User, NCCL: Backward pass (autograd)
    User->>PyEP: loss.backward()
    PyEP->>CppEP: ep_combine_bwd(handle_mem, g_result, grad_expert_out)
    CppEP->>Backend: nvte_ep_combine_bwd(...)
    Backend->>NCCL: Reverse scatter
    PyEP->>CppEP: ep_dispatch_bwd(handle_mem, g_recv_tokens, g_recv_topk_weights, grad_tokens, grad_topk_weights)
    CppEP->>Backend: nvte_ep_dispatch_bwd(...)
    Backend->>NCCL: Reverse gather
    NCCL-->>User: grad_tokens, grad_topk_weights

Reviews (17): Last reviewed commit: "Update transformer_engine/pytorch/distri..."

Comment thread transformer_engine/pytorch/ep.py Outdated
Comment thread transformer_engine/pytorch/csrc/extensions/ep.cpp Outdated
Comment thread transformer_engine/pytorch/ep.py Outdated
Comment on lines +558 to +568
@contextlib.contextmanager
def _zero_copy_scope(enabled: bool):
    """Toggles whether per-step ops apply the symm-mem NCCL window annotation."""
    if enabled:
        yield
        return
    tex.ep_set_zero_copy(False)
    try:
        yield
    finally:
        tex.ep_set_zero_copy(True)

Contributor

P2 _zero_copy_scope does not save/restore the previous flag value

When enabled=False, the manager unconditionally sets g_zero_copy_enabled=False on entry and g_zero_copy_enabled=True on exit. If two callers both use zero_copy=False concurrently (e.g., pipeline-parallel microbatches dispatched from separate Python threads) or if the context is nested, the inner scope's finally block prematurely re-enables zero-copy while the outer scope is still active. The outer scope's finally then sets True again, but between the inner finally and the outer finally the C++ layer sees True unexpectedly.

The fix is to capture the previous value before writing and restore it unconditionally: save old = tex.ep_get_zero_copy() (adding a corresponding getter), then tex.ep_set_zero_copy(old) in the finally block. At minimum, document the single-caller-at-a-time assumption prominently so pipeline-parallel users know to serialize.
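
A minimal sketch of the save/restore variant suggested above, runnable on its own. `_FakeTex` is a hypothetical stand-in for the `tex` extension module, and `ep_get_zero_copy` is the getter the comment proposes adding; neither is taken from the PR.

```python
import contextlib


class _FakeTex:
    """Hypothetical stand-in for the tex extension; holds the process-global flag."""

    def __init__(self):
        self._zero_copy = True

    def ep_get_zero_copy(self):  # assumed getter, proposed in the review comment
        return self._zero_copy

    def ep_set_zero_copy(self, value):
        self._zero_copy = bool(value)


tex = _FakeTex()


@contextlib.contextmanager
def zero_copy_scope(enabled: bool):
    """Disable zero-copy for the block if requested, then restore the previous value."""
    if enabled:
        yield
        return
    previous = tex.ep_get_zero_copy()  # remember what the caller had
    tex.ep_set_zero_copy(False)
    try:
        yield
    finally:
        tex.ep_set_zero_copy(previous)  # restore instead of forcing True
```

This covers nesting only: the inner scope restores False, so the outer scope keeps zero-copy disabled until it exits. Concurrent threads would still race on the single global flag, so the single-caller assumption would still need documenting.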

Comment thread transformer_engine/common/ep/ep_backend.cpp Outdated
@phu0ngng phu0ngng marked this pull request as draft May 22, 2026 03:03
@phu0ngng phu0ngng force-pushed the phuong/ep-3-pytorch-on-commwindow branch 4 times, most recently from 540ef54 to bacae5f Compare May 24, 2026 00:06
Comment thread transformer_engine/pytorch/ep.py Outdated
device = expert_out.device
# Weight in payload dtype: single fused broadcast multiply into combine_in.
w = recv_topk_weights.unsqueeze(-1).to(expert_out.dtype)
torch.mul(expert_out, w, out=combine_in)

Contributor

Why do we need this? 🤔
In the training scenario, the weight is multiplied onto the activation between fc1 and fc2 (we also dispatch the weight at the same time as the tokens). Or am I misunderstanding something here?

My understanding is that this multiplication is unnecessary. Furthermore, if it is removed, another problem becomes more prominent: how do we add symm buffer support for the combine input? This would require changes on the grouped GEMM side.

Seconded. I saw an unexpected kernel here and traced it to the same problem. A potential solution is to provide a separate path for when the weight is not provided: that means the weight multiplication is handled elsewhere, so the multiplication here can be skipped.

@phu0ngng phu0ngng May 26, 2026

Collaborator Author

Good to learn that we can fuse the weight multiplication into the activation. I will make this optional.

We will need to change the grouped GEMM (GG) to return the symmetric-memory buffer.

Contributor

Yes, I think we need to change the grouped GEMM.

ep_group: dist.ProcessGroup,
num_experts: int,
max_tokens_per_rank: int,
recv_capacity_per_rank: int,

Contributor

When allocating the buffer, we need to allocate according to the worst case. There are two scenarios here:

  • The first is rank-major, where the memory footprint is max_tokens_per_rank × num_of_ranks. This generally stays below 10 GB, which is the primary memory overhead of typical EP setups and is acceptable.
  • The second is expert-major, where the memory footprint is max_tokens_per_rank × num_of_ranks × min(topk, num_of_experts). This could reach 40–50 GB, which is unacceptable.

If I understand this correctly, we must find a way to optimize the memory usage in the expert-major layout — or alternatively, we need to fall back to the rank-major layout + explicit permutation approach.
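
To make the two footprints concrete, here is a small arithmetic sketch. The comment counts rows (tokens); the conversion to bytes via `hidden_size` and `bytes_per_elem`, and every number in `cfg`, are assumptions chosen for illustration rather than values from this PR.

```python
def recv_buffer_bytes(max_tokens_per_rank, num_ranks, hidden_size, bytes_per_elem,
                      top_k=None, num_experts=None):
    """Worst-case receive-buffer size in bytes.

    Rank-major when top_k is None; the expert-major layout multiplies the
    row count by min(top_k, num_experts), as described in the comment above.
    """
    rows = max_tokens_per_rank * num_ranks
    if top_k is not None:
        rows *= min(top_k, num_experts)
    return rows * hidden_size * bytes_per_elem


# Hypothetical configuration, used only to make the arithmetic concrete.
cfg = dict(max_tokens_per_rank=8192, num_ranks=64, hidden_size=7168, bytes_per_elem=2)
rank_major = recv_buffer_bytes(**cfg)
expert_major = recv_buffer_bytes(**cfg, top_k=8, num_experts=256)

print(f"rank-major:   {rank_major / 2**30:.1f} GiB")
print(f"expert-major: {expert_major / 2**30:.1f} GiB")
print(f"ratio:        {expert_major // rank_major}x")
```

Whatever the real configuration is, the ratio between the two layouts is min(top_k, num_experts), which is why the expert-major worst case grows so quickly with top_k.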

Collaborator Author

With rank-major, you still need to over-allocate the output buffer of the local permute, just as in expert-major, right?

Contributor

There are two types of buffers:

The first is the EP buffer, which serves as the destination for communication (NCCL EP is a push-based design), so it requires a relatively costly registration process. These are reused globally as static buffers as much as possible, so they are allocated based on the worst-case size. In HEP, the rank-major output buffer is an EP buffer, so we only need a rank-major worst-case-size buffer. I haven't studied NCCL EP in detail, but my understanding is that if our output is a symmetric buffer, we don't need a built-in static comm buffer inside NCCL EP — meaning recv_capacity_per_rank is not needed when the output buffer is a symm buffer. I think this is worth discussing and clarifying.

The second type is regular GPU memory, which can be managed by the caching allocator. In HEP, the output of the permute operation falls into this category — it can be dynamically allocated each iteration based on the scan result, with just one additional sync required. Additionally, in sync-free mode, the size of this buffer is specified by the user.

To summarize, we may need to confirm whether recv_capacity_per_rank requires building an expert-major worst-case-size buffer inside NCCL EP. If the output is a symm buffer, we theoretically don't need such a buffer. However, if it is necessary, then we cannot accept an expert-major worst-case-size buffer. I also observed in my draft PR that NCCL EP uses more memory.

Collaborator Author

Hi,
It's correct that if the output buffer is symmetric memory, we should not need to register the gigantic IPC/MC buffer in ep_group, whose size is based on recv_capacity_per_rank. Let's ask NCCL EP to add an option to skip this buffer allocation.

However, I think we should still ask users to specify recv_capacity_per_rank so that we can handle the overflow policy in metadata_preprocessing rather than delaying it to the dispatch phase.

@Autumn1998 Autumn1998 May 28, 2026

Contributor

We need an option to skip this internal buffer.
Also, are you thinking of using recv_capacity_per_rank to support the sync-free mechanism? That is, tokens exceeding the threshold get dropped, which then flips the overflow flag? I don't think that is correct: we should not set it at buffer initialization, but instead pass it as a parameter before the preprocess step of each dispatch, because the threshold changes every iteration.
cc @nanz-nv, please correct me if I'm wrong.
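
The drop-and-flag policy under discussion can be sketched as a per-call preprocessing step. This is an illustration under stated assumptions, not the NCCL EP or TE implementation: `preprocess_routing` is a hypothetical name, `topk_idx` is assumed to be the already-gathered routing table for all source ranks, experts are assumed to be laid out contiguously across ranks, and a token is assumed to be sent once per destination rank.

```python
def preprocess_routing(topk_idx, experts_per_rank, num_ranks, recv_capacity_per_rank):
    """Count tokens routed to each rank, clamp to capacity, and report overflow.

    topk_idx: one list of expert ids per token.
    Returns (kept_counts, dropped_counts, overflow_flag).
    """
    routed = [0] * num_ranks
    for experts in topk_idx:
        # Deduplicate destinations: several experts of one token may share a rank.
        for rank in {e // experts_per_rank for e in experts}:
            routed[rank] += 1
    kept = [min(n, recv_capacity_per_rank) for n in routed]
    dropped = [n - k for n, k in zip(routed, kept)]
    return kept, dropped, any(dropped)


# Four tokens, top-2 routing, four experts spread over two ranks.
table = [[0, 1], [0, 2], [3, 1], [0, 3]]
kept, dropped, overflow = preprocess_routing(
    table, experts_per_rank=2, num_ranks=2, recv_capacity_per_rank=3
)
```

Because the capacity is an argument of the per-step function, it can differ between calls, which is the behaviour requested above. Whether that is compatible with CUDA graph replay is the open question in the replies that follow.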

Collaborator Author

> because the threshold changes every iteration.

I'm curious to learn about this possibility. From my understanding, the output buffers need to have a static size for CUDA Graph replay, and so does the recv_capacity.

@Autumn1998 Autumn1998 May 29, 2026

Contributor

I think for each global batch, we recalculate a new output size, since each batch has its own CUDA graph — but I'm not 100% sure on this. You may want to confirm with @nanz-nv.

I think it is something in between. With the current way of doing full-iteration CUDA graphs, recv_capacity_per_rank should ideally stay the same across training, but it can sometimes get updated. So I'd treat it as something that may change, but not frequently.

@timmoon10 timmoon10 self-requested a review June 1, 2026 17:34
@phu0ngng phu0ngng force-pushed the phuong/ep-3-pytorch-on-commwindow branch 4 times, most recently from 40d8011 to 2153492 Compare June 10, 2026 01:27
@phu0ngng phu0ngng marked this pull request as ready for review June 10, 2026 01:28
Comment thread transformer_engine/pytorch/csrc/extensions/ep.cpp
@phu0ngng phu0ngng force-pushed the phuong/ep-3-pytorch-on-commwindow branch from 9ec1aff to 7ce8d8b Compare June 10, 2026 03:20
Comment thread transformer_engine/pytorch/ep.py
@phu0ngng phu0ngng force-pushed the phuong/ep-3-pytorch-on-commwindow branch from b2ab069 to c8c54fd Compare June 11, 2026 00:22
Comment thread transformer_engine/pytorch/ep.py
Comment thread transformer_engine/pytorch/csrc/extensions/ep.cpp
@phu0ngng phu0ngng force-pushed the phuong/ep-3-pytorch-on-commwindow branch from df732a5 to 67917a3 Compare June 11, 2026 16:16
@phu0ngng

Collaborator Author

/te-ci pytorch L1

@phu0ngng

Collaborator Author

Pipeline #54455868: TE EP tests passed in L1_pytorch_distributed_unittest--B200_8GPU and L1_pytorch_distributed_unittest--H100_4GPU. The other failures are unrelated to TE EP.

@phu0ngng phu0ngng force-pushed the phuong/ep-3-pytorch-on-commwindow branch 2 times, most recently from 52bbf88 to d6c5745 Compare June 13, 2026 00:08
…ights loss term in dispatch autograd test

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@phu0ngng phu0ngng force-pushed the phuong/ep-3-pytorch-on-commwindow branch from ff06156 to 53a3834 Compare June 23, 2026 16:39
@phu0ngng

Collaborator Author

/te-ci pytorch L1

@phu0ngng phu0ngng requested a review from Autumn1998 June 23, 2026 16:40
Comment thread build_tools/pytorch.py Outdated
phu0ngng added 9 commits June 23, 2026 09:48
…_NCCL_EP

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…der zero-copy, allocate in-flight otherwise

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…ad and rename dispatch autograd tests

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…include and combine_bwd grad_eo locals to grad_expert_out

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
… and run 1f1b interleave both eager and CUDA-graph-captured

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…s allocated internally

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…opy; ep_dispatch falls back to them

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…ute instead of save_for_backward

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@phu0ngng phu0ngng added the 2.7.0 label Jun 24, 2026
phu0ngng and others added 5 commits June 24, 2026 06:39
…ler_provides_combine_grad_buffer; recv_topk_weights is always buffer-owned

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…ovides_combine_grad_buffer CLI flags to ep_moe example and ep_bench (default False)

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…buffer, dispatch/combine, tests and examples

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Comment thread transformer_engine/pytorch/ep.py Outdated
# to opt in; the C++ backend then operates the EP group in zero-copy mode.


def symm_mem_alloc(

nit: This function is a fairly generic symm-mem allocator; maybe consider moving it to general utils?

Collaborator Author

Done

self.token_counts = torch.empty(self.num_local_experts, dtype=torch.int32, device=device)
# Persistent tensor; keep resident if activation CPU offloading is on.
mark_not_offload(self.handle_mem)
self._alloc_symm_buffers()

Just a note: this buffer allocation logic may change in the future.

Collaborator Author

There is an existing warning that the API related to zero-copy is subject to change.

ctx.mark_non_differentiable(token_counts)
# Detach so the long-lived buffers aren't tracked as differentiable outputs;
# autograd re-attaches grad_fn pointing back at this Function.
return recv_tokens.detach(), recv_topk_weights.detach(), token_counts

It's not super clear to me why we detach here. The autograd Function runs in a no_grad context anyway. If these tensors are long-lived buffers, shouldn't the user allocate them with requires_grad=False?

Collaborator Author

I think when recv_tokens is symm-mem, we need to detach so that the grad_fn does not stick to this tensor; when it is non-symm-mem, we have an in-place modification that requires marking the tensor dirty, and detach has a similar effect. I'm new to PyTorch, so let me know if this is incorrect.

If we allocate recv_tokens with requires_grad=False, it should not have a grad_fn. I think we should leave it to the user to decide whether this tensor requires grad: if the user explicitly wants the output of dispatch to carry a grad_fn for some reason, we should not forbid it. Otherwise, they can just make the tensor not require grad.

torch.ops.transformer_engine_ep.combine(handle_mem, expert_out, result)
ctx.save_for_backward(handle_mem)
ctx.grad_symm_buf = grad_symm_buf
if grad_symm_buf is None:

If grad_symm_buf is not None, shouldn't we use ctx.save_for_backward to save it?

Collaborator Author

I think save_for_backward should be used only when you need to read the value of the tensor in the backward path.
Here, we only want to pass the reference to the buffer so that we can write to it.

All tensors that will be used by backward should be passed through ctx.save_for_backward to prevent memory leaks; see https://docs.pytorch.org/docs/2.12/generated/torch.autograd.function.FunctionCtx.save_for_backward.html. I think torch needs this to manage the lifecycle of its autograd graph. If the tensor does not require grad, it is probably okay to assign it to ctx directly, but it is always safe to use ctx.save_for_backward.

phu0ngng added 2 commits June 25, 2026 05:09
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…/combine_grad_expert_out buffers on EpBuffer

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Comment thread examples/pytorch/ep/ep_moe.py Outdated
phu0ngng and others added 2 commits June 25, 2026 15:54
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…ard docstring

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@phu0ngng

Collaborator Author

/te-ci pytorch L1

Comment thread transformer_engine/pytorch/distributed.py Outdated
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@phu0ngng

Collaborator Author

/te-ci pytorch L1
