[#14225][perf] AutoDeploy MTP + ADP enablement and MoE all-to-all optimization by MrGeva · Pull Request #15063 · NVIDIA/TensorRT-LLM

MrGeva · 2026-06-07T13:45:30Z

Description

Optimizes the SuperV3 MTP + attention-DP path, building on the recently merged
MoE all-to-all stateful cache (#13718 / #13723) and the SSM-replay PR.

MoE all-to-all per-rank token budget (runtime_max_tokens_per_rank)

Replaces the per-iteration cross-rank-max read (an int(batch_info_host[14].item())
on a pinned-host tensor, fed by a per-forward tp_allgather) with a sync-free,
shape-based budget via _hybrid_runtime_max_tokens_per_rank:
- Under cuda-graph capture/warm-up the budget is x.shape[0] — uniform across DP
  ranks because maybe_pad_for_cuda_graph pads every rank to a common
  cg_batch_size and MTP tokens-per-seq is uniform — gated so the tight budget is
  only taken while the MoE-GEMM row count stays in the fast small-M tactic region.
- In eager (prefill, or any step the mixed-mode bypass forces eager) it falls back
  to the static max_num_tokens, which every rank computes identically.
- No per-layer host read, and no extra collective.
Drops the now-dead batch_info_host DP-max plumbing: the slot-14 (max_dp_num_tokens)
storage + accessors in BatchInfo (_NUM_ELEMENTS 15→14), the pre-forward
tp_allgather+update in the AutoDeploy shim, the injection into the MoE op in both
the dict-based (sharding.py) and IR-based (sharding_ir.py) sharding paths, and the
corresponding op signatures in trtllm_moe.py / torch_moe.py.
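
The budget selection described above can be sketched as a pure function. This is a minimal illustration, not the PR's actual `_hybrid_runtime_max_tokens_per_rank`: the function name, the `capturing_or_warmup` flag, and the `small_m_factor` parameter are assumptions made for the example, and the gate expression (`budget * ep_size * 4 <= max_num_tokens`) is the one quoted in the review comment on `trtllm_moe.py` further down this page.

```python
def hybrid_runtime_max_tokens_per_rank(
    local_num_tokens: int,
    max_num_tokens: int,
    ep_size: int,
    capturing_or_warmup: bool,
    small_m_factor: int = 4,  # assumption: mirrors the "* 4" quoted in the review
) -> int:
    """Pick the per-rank all-to-all token budget from shapes only (no host read)."""
    if capturing_or_warmup:
        # Under capture/warm-up every rank is padded to the same batch size,
        # so the local row count is already uniform across ranks.
        budget = local_num_tokens
        # Gate: take the tight budget only while the scaled row count stays
        # within max_num_tokens; otherwise use the static budget.
        if budget > 0 and budget * ep_size * small_m_factor <= max_num_tokens:
            return budget
    # Eager (prefill or bypassed step): static value every rank computes identically.
    return max_num_tokens


if __name__ == "__main__":
    print(hybrid_runtime_max_tokens_per_rank(16, 8192, 4, True))
    print(hybrid_runtime_max_tokens_per_rank(16, 8192, 4, False))
```

Because both branches depend only on values that are identical on every rank (the padded shape or the static limit), no collective or `.item()` call is needed to agree on the budget.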

Scheduling / sharding

Enables the attention-DP request balancer (PyExecutor._balance_adp_requests) in the
AD executor so prefill is co-scheduled across ranks, avoiding a single prefill
straggler stalling the others at the MoE all-to-all collective.
Keeps the draft-EP revert under attention-DP in sharding (replicate the draft model's
MoE rather than EP-sharding it) to avoid the shared-workspace corruption that hangs the
all-to-all at concurrency.

Mixed-mode cuda-graph bypass uses the process-wide BypassCapturedGraphs() context
manager (cuda_graph_state.in_bypass()), keeping all ranks consistent when one is in prefill.
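
To illustrate what "process-wide" means here, the following is a minimal sketch of a module-level bypass flag exposed through a context manager. The names `bypass_captured_graphs`, `in_bypass`, and `Layer` are hypothetical stand-ins, not the real `BypassCapturedGraphs` / `cuda_graph_state` implementation, and the sketch does not model how ranks coordinate with each other.

```python
from contextlib import contextmanager

# Module-level state: every object in this process observes the same value,
# unlike a flag stored on one wrapper instance.
_bypass_depth = 0


@contextmanager
def bypass_captured_graphs():
    """Mark the enclosed region as 'run eager, skip captured graphs'."""
    global _bypass_depth
    _bypass_depth += 1
    try:
        yield
    finally:
        # Restored even if the forward pass raises.
        _bypass_depth -= 1


def in_bypass() -> bool:
    return _bypass_depth > 0


class Layer:
    """Hypothetical nested module that consults the shared flag."""

    def forward(self) -> str:
        return "eager" if in_bypass() else "graph"


if __name__ == "__main__":
    layer = Layer()
    print(layer.forward())
    with bypass_captured_graphs():
        print(layer.forward())
```

A depth counter rather than a boolean keeps nested `with` blocks safe: the flag only clears when the outermost region exits.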

Perf Impact

see: #14225 (comment)

Test Coverage

tests/integration/defs/accuracy/test_llm_api_autodeploy.py::TestNemotronSuperV3::test_mtp[*]
and ::test_accuracy[*-4-attn_dp_on-trtllm] (bf16/fp8/nvfp4) — exercise the MTP +
attention-DP MoE all-to-all path.
tests/unittest/auto_deploy/multigpu/compile/test_bypass_captured_graphs.py — guards the
mixed-mode captured-graph bypass.
New perf-sanity config + post-merge enrollment:
perf/test_perf_sanity.py::test_e2e[aggr_upload-super_mtp_ad_nvfp4_blackwell-super_mtp_ad_nvfp4_ws4_1k1k].
Local e2e validation (4× GB200, TP4, BF16 SuperV3-MTP, attention-DP): builds end-to-end
with no NaN / hang / all-to-all timeout; GSM8K 47/50 = 94.0%.

PR Checklist

Please check this after reviewing the above items as appropriate for this PR.

Summary by CodeRabbit

New Features
- Added low-latency SuperV3 MTP configuration optimized for reduced serving latency.
Configuration Updates
- Enhanced SuperV3 MTP configuration with batch-size targets and attention-DP support.
- Updated speculative decoding (Eagle) to properly handle attention-DP mode.
Performance Improvements
- Optimized MoE all-to-all dispatch to use actual token counts instead of padded estimates, reducing overhead.
- Streamlined batch metadata handling for more efficient attention computation.

MrGeva · 2026-06-07T13:49:57Z

/bot run

coderabbitai · 2026-06-07T13:53:14Z

📝 Walkthrough

Walkthrough

This PR removes the DP-aware token-info slot from the BatchInfo metadata, stops passing the batch_info_host parameter through the MoE custom ops, introduces CUDA-graph-aware hybrid dispatch logic, and wires the enable_attention_dp flag through the Eagle model and factory initialization.

Changes

Batch-info slot removal and MoE dispatch refactoring

| Layer / File(s) | Summary |
| --- | --- |
| BatchInfo layout and accessor removal `tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py` | `BatchInfo._NUM_ELEMENTS` reduced from 15 to 14, removing DP-aware token slot. Accessor methods `update_max_dp_num_tokens` and `get_max_dp_num_tokens` deleted. `SequenceInfo.nest_sequences` and `SequenceInfo.switch_to_generate_` updated to no longer write to/call removed slot methods. |
| MoE custom op signature cleanup `tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/torch_moe.py`, `tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py` | All torch-based and trtllm-based MoE custom op functions (and their `register_fake` counterparts) remove the `batch_info_host: Optional[torch.Tensor] = None` parameter. Call sites delete corresponding `batch_info_host=...` argument passing. |
| MoE all-to-all dispatch refactoring `tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py` | New `_hybrid_runtime_max_tokens_per_rank` helper selects dispatch budget based on CUDA graph capture state: uses `local_num_tokens` during capture/warmup, otherwise uses `max_num_tokens`. Input tensors no longer padded; dispatch uses real `local_num_tokens` with `runtime_max_tokens_per_rank` only governing workspace/receive shapes. |
| Graph sharding and placeholder cleanup `tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py`, `tensorrt_llm/_torch/auto_deploy/transform/library/sharding_ir.py` | Remove `batch_info_host` placeholder insertion/retrieval in `ShardingTransformExecutor` and graph IR. Update MoE mapping-arg injection to pass only `mapping_config` and `max_num_tokens`. Add guard in draft sharding to clear EP transforms when attention-DP is enabled, preventing shared MoE all-to-all workspace corruption. |
| Eagle model attention-DP integration `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py`, `tensorrt_llm/_torch/auto_deploy/models/eagle.py` | Add `enable_attention_dp: bool` field to `EagleWrapperConfig`. Store flag in `EagleWrapper` and make rank-0 token broadcast in `sample_greedy` conditional (skipped when attention-DP is enabled). Wire `enable_attention_dp` through `EagleOneModelFactory`. |
| Factory and executor integration `tensorrt_llm/_torch/auto_deploy/llm_args.py`, `tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py` | `LlmArgs.create_factory` derives `enable_attention_dp` from sharding config and passes to factory. `ADEngine.__init__` initializes `stream_interval` and `attention_dp_config` early, removes old DP slot-14 allgather update. `create_autodeploy_executor` computes draft sizes and forwards them to `PyExecutor`. |
| Configuration and test updates `examples/auto_deploy/model_registry/configs/*`, `tests/integration/defs/accuracy/test_llm_api_autodeploy.py`, `tests/integration/test_lists/test-db/l0_dgx_b200.yml`, `tests/scripts/perf-sanity/aggregated/super_mtp_ad_nvfp4_blackwell.yaml`, `tests/unittest/auto_deploy/multigpu/compile/test_bypass_captured_graphs.py`, `tests/unittest/auto_deploy/singlegpu/models/test_eagle.py`, `tensorrt_llm/_torch/auto_deploy/utils/cuda_graph.py` | New SuperV3 low-latency config with attention-DP enabled and MTP speculative decoding. Updated SuperV3 MTP config header and batch-size targets. Added Eagle `sample_greedy` broadcast control tests. Updated test docstrings and comments to reference capture-time scalar reads and `runtime_max_tokens_per_rank` dispatch logic. Registered performance sanity test with new configs. |
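
The Eagle change summarized in the table (rank-0 token broadcast in `sample_greedy` skipped under attention-DP) can be sketched as below. This is only an illustration under stated assumptions: the signature, the list-based logits, and the injected `broadcast_from_rank0` callable are hypothetical and do not reflect the real `EagleWrapper.sample_greedy` API. The likely motivation, not stated in the walkthrough, is that under attention-DP each rank samples for its own requests, so overwriting them with rank 0's tokens would be wrong.

```python
from typing import Callable, List, Sequence


def sample_greedy(
    logits: Sequence[Sequence[float]],
    enable_attention_dp: bool,
    broadcast_from_rank0: Callable[[List[int]], List[int]],  # hypothetical hook
) -> List[int]:
    """Greedy-sample one token per row, optionally syncing to rank 0's result."""
    # Greedy sampling: index of the largest logit in each row.
    tokens = [max(range(len(row)), key=row.__getitem__) for row in logits]
    if not enable_attention_dp:
        # Without attention-DP all ranks process the same batch, so the
        # sampled tokens are synchronized to rank 0's values.
        tokens = broadcast_from_rank0(tokens)
    return tokens


if __name__ == "__main__":
    identity = lambda toks: toks  # stand-in for a real collective
    print(sample_greedy([[0.1, 0.9], [0.8, 0.2]], True, identity))
```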

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

NVIDIA/TensorRT-LLM#13723: Prior MoE all-to-all metadata plumbing that introduced batch_info_host and runtime_max_tokens_per_rank slot contract (this PR removes the DP-aware slot and eliminates host-tensor passing).
NVIDIA/TensorRT-LLM#11073: Original Eagle model implementation and sample_greedy flow (this PR adds attention-DP conditional broadcast control).
NVIDIA/TensorRT-LLM#14943: Concurrent changes to ad_executor.py stream_interval and attention_dp_config initialization (aligned with attention-DP enablement plumbing).

Suggested reviewers

greg-kwasniewski1
nv-guomingz
danielafrimi
zhaoyangwang-nvidia
shaharmor98

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 45.83% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: AutoDeploy MTP enablement and MoE all-to-all optimization with attention-DP integration, which is the primary focus across the changeset. |
| Description check | ✅ Passed | The pull request description comprehensively explains the changes, including detailed MoE optimization details, scheduling improvements, test coverage, and performance impact. |

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)
405-434: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Sync the remaining batch_info docs with the new 14-slot contract.

This docstring is updated, but SequenceInfo still says batch_info_host includes “DP-aware token info”, and nest_sequences() still documents batch_info as a 3-element shape. That leaves this file with multiple incompatible specs for the same tensor.

Based on learnings, ensure that batch_info's format matches its documented spec.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py` around
lines 405 - 434, Update the remaining docs and any related constants to match
the 14-slot batch_info contract: change SequenceInfo's description of
batch_info_host to list the 14 elements (or refer to the shared doc) instead of
"DP-aware token info", and update nest_sequences() docstring to describe
batch_info as a 14-element vector (not a 3-element shape); also verify any
references to _NUM_ELEMENTS, batch_info, or batch_info_host in SequenceInfo,
nest_sequences(), and adjacent helpers reflect the 14-slot semantics (including
slots 0–13 names like num_prefill, max_context_length, max_draft_len,
use_replay) so all docs and constants are consistent.
Source: Learnings

🧹 Nitpick comments (1)

tests/integration/test_lists/test-db/l0_dgx_b200.yml (1)
397-397: Coverage sufficiency looks good for this PR scope.

Adding perf/test_perf_sanity.py::test_e2e[aggr_upload-super_mtp_ad_nvfp4_blackwell-super_mtp_ad_nvfp4_ws4_1k1k] in tests/integration/test_lists/test-db/l0_dgx_b200.yml gives targeted post-merge coverage for the new NVFP4 Super MTP AutoDeploy profile.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/test_lists/test-db/l0_dgx_b200.yml` at line 397, Add the
new targeted test entry for the NVFP4 Super MTP AutoDeploy profile to the
l0_dgx_b200 test list by inserting the exact test identifier
perf/test_perf_sanity.py::test_e2e[aggr_upload-super_mtp_ad_nvfp4_blackwell-super_mtp_ad_nvfp4_ws4_1k1k]
(use the identifier from the review content) into
tests/integration/test_lists/test-db/l0_dgx_b200.yml under the appropriate test
list block, preserving YAML list syntax and indentation, avoid duplicates, and
ensure the test string is quoted or escaped if needed to prevent YAML parsing
issues; verify the entry appears exactly as the reviewer requested and run a
quick YAML lint to confirm validity.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/auto_deploy/model_registry/configs/super_v3_mtp_low_latency.yaml`:
- Around line 68-69: Update the explanatory comment that currently states
"Triton SSM + causal conv are required for MTP as currently they are the only
backends that support speculative mamba state caching." to reflect that
FlashInfer SSM is also supported: mention that Triton SSM and FlashInfer SSM
(`flashinfer_ssm`) — together with causal conv — are supported backends for MTP
speculative mamba state caching where applicable, and ensure consistency with
the paired config `super_v3_mtp.yaml`.
- Around line 4-10: Update the header diff notes to match the actual config keys
and values: change the "16 vs 128" phrasing to reflect that super_v3_mtp.yaml in
this PR uses max_batch_size: 64 (so say "16 vs 64" or just "max_batch_size
lowered to 16 from 64"), and replace references to cuda_graph_batch_sizes with
the correct nested key cuda_graph_config.batch_sizes; ensure the explanatory
bullets reference the actual keys max_batch_size and
cuda_graph_config.batch_sizes and the actual removed batch sizes (drop 24, 32,
64, 128) so the header is accurate and not misleading.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py`:
- Around line 62-66: The branch that returns a capture-time token budget uses
cuda_graph_state.in_warm_up() and torch.cuda.is_current_stream_capturing() but
does not respect the global bypass flag, so steps forced to eager via
BypassCapturedGraphs() still get local_num_tokens; update the conditional that
checks capture/warm-up (the if using torch.cuda.is_current_stream_capturing() or
cuda_graph_state.in_warm_up()) to also require that BypassCapturedGraphs() is
false (i.e., only apply the capture-time budget when not bypassed), keeping the
existing budget checks (budget > 0 and budget * ep_size * 4 <= max_num_tokens)
and otherwise return max_num_tokens as before.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0fb5e177-877d-4d70-8e49-f7d13437e036

📥 Commits

Reviewing files that changed from the base of the PR and between 47666de and 02fbb3d.

📒 Files selected for processing (17)

examples/auto_deploy/model_registry/configs/super_v3_mtp.yaml
examples/auto_deploy/model_registry/configs/super_v3_mtp_low_latency.yaml
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/torch_moe.py
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py
tensorrt_llm/_torch/auto_deploy/llm_args.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py
tensorrt_llm/_torch/auto_deploy/models/eagle.py
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py
tensorrt_llm/_torch/auto_deploy/transform/library/sharding_ir.py
tensorrt_llm/_torch/auto_deploy/utils/cuda_graph.py
tests/integration/defs/accuracy/test_llm_api_autodeploy.py
tests/integration/test_lists/test-db/l0_dgx_b200.yml
tests/scripts/perf-sanity/aggregated/super_mtp_ad_nvfp4_blackwell.yaml
tests/unittest/auto_deploy/multigpu/compile/test_bypass_captured_graphs.py
tests/unittest/auto_deploy/singlegpu/models/test_eagle.py

💤 Files with no reviewable changes (1)

tensorrt_llm/_torch/auto_deploy/transform/library/sharding_ir.py

coderabbitai · 2026-06-07T13:53:17Z

+# Diff from ``super_v3_mtp.yaml``:
+# - ``max_batch_size`` lowered (16 vs 128). Less mamba/KV cache pressure
+#   and only small captured graphs are exercised.
+# - ``cuda_graph_batch_sizes`` trimmed to the latency-relevant range
+#   (drop 24, 32, 64, 128). Every captured graph costs warmup time and
+#   GPU memory; if you'll never serve c>16, capturing larger sizes is
+#   wasted.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Header diff notes are stale and currently misleading.

Line [5] says max_batch_size is "16 vs 128", but super_v3_mtp.yaml in this PR uses max_batch_size: 64.
Line [7] references cuda_graph_batch_sizes, while this file uses cuda_graph_config.batch_sizes. Please sync the header notes with the actual config keys/values to avoid operator confusion.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/auto_deploy/model_registry/configs/super_v3_mtp_low_latency.yaml` around lines 4 - 10, Update the header diff notes to match the actual config keys and values: change the "16 vs 128" phrasing to reflect that super_v3_mtp.yaml in this PR uses max_batch_size: 64 (so say "16 vs 64" or just "max_batch_size lowered to 16 from 64"), and replace references to cuda_graph_batch_sizes with the correct nested key cuda_graph_config.batch_sizes; ensure the explanatory bullets reference the actual keys max_batch_size and cuda_graph_config.batch_sizes and the actual removed batch sizes (drop 24, 32, 64, 128) so the header is accurate and not misleading.

tensorrt-cicd · 2026-06-07T13:56:21Z

PR_Github #52586 [ run ] triggered by Bot. Commit: 02fbb3d Link to invocation

tensorrt-cicd · 2026-06-07T15:28:28Z

PR_Github #52586 [ run ] completed with state SUCCESS. Commit: 02fbb3d
/LLM/main/L0_MergeRequest_PR pipeline #41868 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

MrGeva · 2026-06-07T16:29:22Z

/bot run

tensorrt-cicd · 2026-06-07T16:34:46Z

PR_Github #52591 [ run ] triggered by Bot. Commit: c0acbd7 Link to invocation

…mization

Rebases the SuperV3-MTP attention-DP optimization onto current upstream/main (which now carries gk's MoE all-to-all stateful cache NVIDIA#13718/NVIDIA#13723 and gagam's SSM-replay PR). All net changes from the optimization branch are preserved.

MoE all-to-all per-rank token budget (runtime_max_tokens_per_rank):
- Replace the per-iteration cross-rank-max read (an int(batch_info_host[14].item()) on a pinned-host tensor, fed by a per-forward tp_allgather) with a sync-free shape-based budget via _hybrid_runtime_max_tokens_per_rank: under cuda-graph capture/warm-up the budget is x.shape[0] (uniform across DP ranks because maybe_pad_for_cuda_graph pads every rank to a common cg_batch_size and MTP tokens-per-seq is uniform), gated so the tight budget is only taken while the MoE-GEMM row count stays in the fast small-M tactic region; in eager (prefill or bypass) it falls back to the static max_num_tokens every rank computes identically. No per-layer host read.
- Drop the now-dead batch_info_host plumbing for the DP-max slot: the slot-14 (max_dp_num_tokens) storage and update/get accessors in BatchInfo (_NUM_ELEMENTS 15->14), the pre-forward tp_allgather + update in the AD shim, and the batch_info_host injection into the MoE op in both the dict-based (sharding.py) and IR-based (sharding_ir.py) sharding paths, plus the op signatures in trtllm_moe.py / torch_moe.py.

Scheduling / sharding:
- Enable the attention-DP request balancer (PyExecutor._balance_adp_requests) in the AD executor so prefill is co-scheduled across ranks, avoiding a single prefill straggler stalling the others at the MoE all-to-all collective.
- Keep the draft-EP revert under attention-DP in sharding (replicate the draft model's MoE rather than EP-sharding it) to avoid the shared-workspace corruption that hangs the all-to-all at concurrency.

Mixed-mode cuda-graph bypass uses upstream's process-wide BypassCapturedGraphs() context manager (cuda_graph_state.in_bypass()) instead of the per-instance flag, keeping all ranks consistent when one is in prefill.

Tests:
- Add NVFP4 SuperV3-MTP attn-DP perf-sanity config + post-merge enrollment.
- test_mtp / test_accuracy NVFP4 coverage resolved to upstream's parametrization.

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>

tensorrt-cicd · 2026-06-07T20:20:42Z

PR_Github #52591 [ run ] completed with state SUCCESS. Commit: c0acbd7
/LLM/main/L0_MergeRequest_PR pipeline #41873 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

MrGeva requested review from a team as code owners June 7, 2026 13:45

MrGeva requested a review from tfogal June 7, 2026 13:45

github-actions Bot assigned MrGeva Jun 7, 2026

coderabbitai Bot reviewed Jun 7, 2026

MrGeva force-pushed the eg/superv3-on-upstream branch from 02fbb3d to 9563db8 Compare June 7, 2026 13:57

MrGeva changed the title ~~[None][perf][autodeploy] SuperV3-MTP attention-DP MoE all-to-all optimization~~ [None][perf][autodeploy] MTP + ADP enablement and MoE all-to-all optimization Jun 7, 2026

MrGeva changed the title ~~[None][perf][autodeploy] MTP + ADP enablement and MoE all-to-all optimization~~ [#14225][perf] AutoDeploy MTP + ADP enablement and MoE all-to-all optimization Jun 7, 2026

MrGeva force-pushed the eg/superv3-on-upstream branch from 9563db8 to c0acbd7 Compare June 7, 2026 16:28

MrGeva commented Jun 7, 2026

Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py

Comment thread examples/auto_deploy/model_registry/configs/super_v3_mtp_low_latency.yaml Outdated

MrGeva force-pushed the eg/superv3-on-upstream branch from c0acbd7 to 25a8ad1 Compare June 7, 2026 17:16
