[None][chore] Enable skip memory estimation on kv cache manager v2 by HuiGao-NV · Pull Request #15052 · NVIDIA/TensorRT-LLM

HuiGao-NV · 2026-06-07T01:17:28Z

Summary by CodeRabbit

Tests
- Test infrastructure now captures SLURM log files during result collection, providing enhanced diagnostic data for debugging test failures and performance analysis.
Chores
- KV cache estimation behavior updated to skip by default based on cache manager implementation type, improving default system efficiency.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: HuiGao <huig@nvidia.com>

HuiGao-NV · 2026-06-07T01:17:40Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-07T01:24:02Z

PR_Github #52532 [ run ] triggered by Bot. Commit: d4f5566 Link to invocation

coderabbitai · 2026-06-07T01:24:28Z

📝 Walkthrough

Walkthrough

Jenkins post-test result collection now downloads SLURM log artifacts. Python executor creation refines KV cache estimation skip behavior to check both an environment variable default (now true) and KV cache manager implementation type, with conditional validation updates.

Changes

CI Test Artifact Collection

Layer / File(s)	Summary
SLURM log artifact download `jenkins/L0_Test.groovy`	Post-test result upload now enumerates and downloads SLURM log files (`.log`, `.out`, `slurm-*`) from the remote test environment into a local `${stageName}/slurm-logs/` directory; whitespace adjusted in isolated test rerun failure handler.

KV Cache Estimation Behavior

Layer / File(s)	Summary
KV cache manager type accessibility `tensorrt_llm/_torch/pyexecutor/_util.py`, `tensorrt_llm/_torch/pyexecutor/py_executor_creator.py`	New `KvCacheCreator.kv_cache_manager_cls()` accessor exposes the selected manager class; `KVCacheManagerV2` is imported to enable type checks in executor creation.
Type-aware estimation skip logic `tensorrt_llm/_torch/pyexecutor/py_executor_creator.py`, `tensorrt_llm/_torch/pyexecutor/_util.py`	Environment variable `TRTLLM_SKIP_KV_CACHE_ESTIMATION` default changes to `'1'` (skip enabled by default); skip decision is further gated on whether the manager is a `KVCacheManagerV2` subclass; CP-type constraint validation is now conditional on estimation not being skipped.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description contains only the template structure with empty sections; the Description, Test Coverage, and PR Checklist items are not filled out with actual content.	Fill in the Description section explaining the issue and solution, provide Test Coverage details listing relevant tests, and complete the PR Checklist items.
Docstring Coverage	⚠️ Warning	Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: enabling skip memory estimation for KV cache manager v2, which aligns with the code modifications in py_executor_creator.py and _util.py.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)
896-898: 💤 Low value

Consider adding a setter method instead of direct private attribute assignment.

The logic is correct: skip_est is enabled only when both the environment variable allows it AND the manager is a KVCacheManagerV2 subclass. However, directly assigning to kv_cache_creator._skip_est modifies a private attribute from outside the class.

Consider adding a setter method to KvCacheCreator:
def set_skip_est(self, skip_est: bool) -> None:
    """Update skip estimation flag after manager class is determined."""
    self._skip_est = skip_est
Then use:
kv_cache_creator.set_skip_est(skip_est)
This makes the intent clearer and provides a defined interface for updating the flag.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/py_executor_creator.py` around lines 896 -
898, The code currently sets kv_cache_creator._skip_est directly after checking
issubclass(kv_cache_creator.kv_cache_manager_cls(), KVCacheManagerV2); instead,
add a public setter on the KvCacheCreator class (e.g., def set_skip_est(self,
skip_est: bool) -> None: self._skip_est = skip_est) and replace the direct
assignment with kv_cache_creator.set_skip_est(skip_est) so the private field is
updated via a defined interface.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@jenkins/L0_Test.groovy`:
- Around line 219-229: The upload gate currently checks only hasTimeoutTest,
downloadResultSucceed, and downloadPerfResultSucceed so SLURM-only runs are
skipped; modify the download block that handles slurmLogFiles (where
slurmLogSources is built) to set a boolean flag (e.g., downloadSlurmLogsSucceed)
when the scp/download succeeds, and then include that flag in the upload
condition (replace if (hasTimeoutTest || downloadResultSucceed ||
downloadPerfResultSucceed) with a check that also ORs downloadSlurmLogsSucceed).
Ensure the new flag is initialized false and updated by Utils.exec's return
status so artifact/archive steps run when only SLURM logs are present.
- Around line 217-224: The slurm log filenames in slurmLogFiles are being
interpolated directly into slurmLogSources and passed to
Utils.exec/scpFromRemoteCmd, allowing shell injection; instead, change the logic
that builds slurmLogSources to iterate slurmLogFiles and invoke scpFromRemoteCmd
once per file (using the individual file path constructed from
perfResultsBasePath and the filename) so each remote path is strictly
quoted/escaped before being handed to the shell, and call Utils.exec(sc
pFromRemoteCmd(remote, singleSource, "${stageName}/slurm-logs/"), returnStatus:
true, numRetries: 3) for each file to avoid expanding an untrusted list in a
single shell interpolation.

---

Nitpick comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor_creator.py`:
- Around line 896-898: The code currently sets kv_cache_creator._skip_est
directly after checking issubclass(kv_cache_creator.kv_cache_manager_cls(),
KVCacheManagerV2); instead, add a public setter on the KvCacheCreator class
(e.g., def set_skip_est(self, skip_est: bool) -> None: self._skip_est =
skip_est) and replace the direct assignment with
kv_cache_creator.set_skip_est(skip_est) so the private field is updated via a
defined interface.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1f80affb-9512-4994-ab63-c6d0cea1bd4d

📥 Commits

Reviewing files that changed from the base of the PR and between 520262d and d4f5566.

📒 Files selected for processing (3)

jenkins/L0_Test.groovy
tensorrt_llm/_torch/pyexecutor/_util.py
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py

coderabbitai · 2026-06-07T01:24:31Z

+            def slurmLogFiles = slurmLogListOutput.split(/\s+/).collect { it.trim() }.findAll { it }
+            echo "Slurm Log Files: ${slurmLogFiles}"
+            if (slurmLogFiles) {
+                sh "mkdir -p ${stageName}/slurm-logs"
+                def slurmLogSources = slurmLogFiles.size() == 1
+                    ? "${perfResultsBasePath}/${slurmLogFiles[0]}"
+                    : "{${slurmLogFiles.collect { "${perfResultsBasePath}/${it}" }.join(',')}}"
+                Utils.exec(pipeline, script: scpFromRemoteCmd(remote, slurmLogSources, "${stageName}/slurm-logs/"), returnStatus: true, numRetries: 3)


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Unsafe shell interpolation in SCP source construction can enable command injection.

At Line 217 and Lines 221-224, filenames returned from remote find are interpolated into a shell command without robust quoting/escaping. A crafted log filename (e.g., containing shell metacharacters) can execute unintended commands on the Jenkins side when scpFromRemoteCmd() builds the command string.

Suggested hardening (copy one file at a time with strict quoting)

- def slurmLogFiles = slurmLogListOutput.split(/\s+/).collect { it.trim() }.findAll { it } + def slurmLogFiles = slurmLogListOutput + .split('\n') + .collect { it.trim() } + .findAll { it } echo "Slurm Log Files: ${slurmLogFiles}" if (slurmLogFiles) { sh "mkdir -p ${stageName}/slurm-logs" - def slurmLogSources = slurmLogFiles.size() == 1 - ? "${perfResultsBasePath}/${slurmLogFiles[0]}" - : "{${slurmLogFiles.collect { "${perfResultsBasePath}/${it}" }.join(',')}}" - Utils.exec(pipeline, script: scpFromRemoteCmd(remote, slurmLogSources, "${stageName}/slurm-logs/"), returnStatus: true, numRetries: 3) + slurmLogFiles.each { logName -> + def remoteFile = "${perfResultsBasePath}/${logName}".replace("'", "'\"'\"'") + def localDir = "${stageName}/slurm-logs/".replace("'", "'\"'\"'") + Utils.exec( + pipeline, + script: scpFromRemoteCmd(remote, "'${remoteFile}'", "'${localDir}'"), + returnStatus: true, + numRetries: 3 + ) + } }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@jenkins/L0_Test.groovy` around lines 217 - 224, The slurm log filenames in slurmLogFiles are being interpolated directly into slurmLogSources and passed to Utils.exec/scpFromRemoteCmd, allowing shell injection; instead, change the logic that builds slurmLogSources to iterate slurmLogFiles and invoke scpFromRemoteCmd once per file (using the individual file path constructed from perfResultsBasePath and the filename) so each remote path is strictly quoted/escaped before being handed to the shell, and call Utils.exec(sc pFromRemoteCmd(remote, singleSource, "${stageName}/slurm-logs/"), returnStatus: true, numRetries: 3) for each file to avoid expanding an untrusted list in a single shell interpolation.

coderabbitai · 2026-06-07T01:24:31Z

+            if (slurmLogFiles) {
+                sh "mkdir -p ${stageName}/slurm-logs"
+                def slurmLogSources = slurmLogFiles.size() == 1
+                    ? "${perfResultsBasePath}/${slurmLogFiles[0]}"
+                    : "{${slurmLogFiles.collect { "${perfResultsBasePath}/${it}" }.join(',')}}"
+                Utils.exec(pipeline, script: scpFromRemoteCmd(remote, slurmLogSources, "${stageName}/slurm-logs/"), returnStatus: true, numRetries: 3)
+            }
+
            echo "hasTimeoutTest: ${hasTimeoutTest}, downloadResultSucceed: ${downloadResultSucceed}, downloadPerfResultSucceed: ${downloadPerfResultSucceed}"
+
            if (hasTimeoutTest || downloadResultSucceed || downloadPerfResultSucceed) {


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

SLURM logs can be downloaded but never uploaded as artifacts.

At Line 229, artifact upload is gated only by XML/perf flags. The new SLURM log path has no corresponding success flag in this condition, so runs with only SLURM logs still hit “No results xml to submit” and skip archive/upload.

Suggested fix (include SLURM log download status in upload gate)

def downloadPerfResultSucceed = false + def downloadSlurmLogsSucceed = false @@ if (slurmLogFiles) { sh "mkdir -p ${stageName}/slurm-logs" @@ - Utils.exec(pipeline, script: scpFromRemoteCmd(remote, slurmLogSources, "${stageName}/slurm-logs/"), returnStatus: true, numRetries: 3) + downloadSlurmLogsSucceed = + Utils.exec( + pipeline, + script: scpFromRemoteCmd(remote, slurmLogSources, "${stageName}/slurm-logs/"), + returnStatus: true, + numRetries: 3 + ) == 0 } @@ - if (hasTimeoutTest || downloadResultSucceed || downloadPerfResultSucceed) { + if (hasTimeoutTest || downloadResultSucceed || downloadPerfResultSucceed || downloadSlurmLogsSucceed) {

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@jenkins/L0_Test.groovy` around lines 219 - 229, The upload gate currently checks only hasTimeoutTest, downloadResultSucceed, and downloadPerfResultSucceed so SLURM-only runs are skipped; modify the download block that handles slurmLogFiles (where slurmLogSources is built) to set a boolean flag (e.g., downloadSlurmLogsSucceed) when the scp/download succeeds, and then include that flag in the upload condition (replace if (hasTimeoutTest || downloadResultSucceed || downloadPerfResultSucceed) with a check that also ORs downloadSlurmLogsSucceed). Ensure the new flag is initialized false and updated by Utils.exec's return status so artifact/archive steps run when only SLURM logs are present.

tensorrt-cicd · 2026-06-07T12:47:01Z

PR_Github #52532 [ run ] completed with state FAILURE. Commit: d4f5566
/LLM/main/L0_MergeRequest_PR pipeline #41818 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

HuiGao-NV · 2026-06-08T01:16:57Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-08T01:23:31Z

PR_Github #52609 [ run ] triggered by Bot. Commit: d4f5566 Link to invocation

Signed-off-by: HuiGao <huig@nvidia.com>

HuiGao-NV · 2026-06-08T06:29:58Z

/bot run --stage-list="B300-PyTorch-2,DGX_H100-PyTorch-5" --disable-fail-fast

tensorrt-cicd · 2026-06-08T06:39:35Z

PR_Github #52680 [ run ] triggered by Bot. Commit: 7ac775b Link to invocation

tensorrt-cicd · 2026-06-08T06:44:49Z

PR_Github #52609 [ run ] completed with state ABORTED. Commit: d4f5566

Link to invocation

tensorrt-cicd · 2026-06-08T07:10:41Z

PR_Github #52680 [ run ] completed with state FAILURE. Commit: 7ac775b
/LLM/main/L0_MergeRequest_PR pipeline #41949 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

HuiGao-NV · 2026-06-08T07:48:30Z

/bot run --stage-list="B300-PyTorch-2,DGX_H100-PyTorch-5" --disable-fail-fast

tensorrt-cicd · 2026-06-08T07:54:21Z

PR_Github #52705 [ run ] triggered by Bot. Commit: 7ac775b Link to invocation

tensorrt-cicd · 2026-06-08T08:44:44Z

PR_Github #52705 [ run ] completed with state FAILURE. Commit: 7ac775b
/LLM/main/L0_MergeRequest_PR pipeline #41972 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Enable skip memory estimation on kv cache manager v2

d4f5566

Signed-off-by: HuiGao <huig@nvidia.com>

HuiGao-NV requested review from a team as code owners June 7, 2026 01:17

HuiGao-NV requested review from EmmaQiaoCh, achartier and mlefeb01 June 7, 2026 01:17

github-actions Bot assigned HuiGao-NV Jun 7, 2026

coderabbitai Bot reviewed Jun 7, 2026

View reviewed changes

Add method to get kv cache manager class in DummyKvCache in test

7ac775b

Signed-off-by: HuiGao <huig@nvidia.com>

Conversation

HuiGao-NV commented Jun 7, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

HuiGao-NV commented Jun 7, 2026

Uh oh!

tensorrt-cicd commented Jun 7, 2026

Uh oh!

coderabbitai Bot commented Jun 7, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

tensorrt-cicd commented Jun 7, 2026

Uh oh!

HuiGao-NV commented Jun 8, 2026

Uh oh!

tensorrt-cicd commented Jun 8, 2026

Uh oh!

HuiGao-NV commented Jun 8, 2026

Uh oh!

tensorrt-cicd commented Jun 8, 2026

Uh oh!

tensorrt-cicd commented Jun 8, 2026

Uh oh!

tensorrt-cicd commented Jun 8, 2026

Uh oh!

HuiGao-NV commented Jun 8, 2026

Uh oh!

tensorrt-cicd commented Jun 8, 2026

Uh oh!

tensorrt-cicd commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

HuiGao-NV commented Jun 7, 2026 •

edited by coderabbitai Bot

Loading