Skip to content

[None][chore] Enable skip memory estimation on kv cache manager v2#15052

Open
HuiGao-NV wants to merge 2 commits into
NVIDIA:mainfrom
HuiGao-NV:skip_mem_test_kvm2
Open

[None][chore] Enable skip memory estimation on kv cache manager v2#15052
HuiGao-NV wants to merge 2 commits into
NVIDIA:mainfrom
HuiGao-NV:skip_mem_test_kvm2

Conversation

@HuiGao-NV
Copy link
Copy Markdown
Collaborator

@HuiGao-NV HuiGao-NV commented Jun 7, 2026

Summary by CodeRabbit

  • Tests

    • Test infrastructure now captures SLURM log files during result collection, providing enhanced diagnostic data for debugging test failures and performance analysis.
  • Chores

    • KV cache estimation behavior updated to skip by default based on cache manager implementation type, improving default system efficiency.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: HuiGao <huig@nvidia.com>
@HuiGao-NV HuiGao-NV requested review from a team as code owners June 7, 2026 01:17
@HuiGao-NV
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #52532 [ run ] triggered by Bot. Commit: d4f5566 Link to invocation

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 7, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

Jenkins post-test result collection now downloads SLURM log artifacts. Python executor creation refines KV cache estimation skip behavior to check both an environment variable default (now true) and KV cache manager implementation type, with conditional validation updates.

Changes

CI Test Artifact Collection

Layer / File(s) Summary
SLURM log artifact download
jenkins/L0_Test.groovy
Post-test result upload now enumerates and downloads SLURM log files (*.log, *.out, slurm-*) from the remote test environment into a local ${stageName}/slurm-logs/ directory; whitespace adjusted in isolated test rerun failure handler.

KV Cache Estimation Behavior

Layer / File(s) Summary
KV cache manager type accessibility
tensorrt_llm/_torch/pyexecutor/_util.py, tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
New KvCacheCreator.kv_cache_manager_cls() accessor exposes the selected manager class; KVCacheManagerV2 is imported to enable type checks in executor creation.
Type-aware estimation skip logic
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py, tensorrt_llm/_torch/pyexecutor/_util.py
Environment variable TRTLLM_SKIP_KV_CACHE_ESTIMATION default changes to '1' (skip enabled by default); skip decision is further gated on whether the manager is a KVCacheManagerV2 subclass; CP-type constraint validation is now conditional on estimation not being skipped.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description contains only the template structure with empty sections; the Description, Test Coverage, and PR Checklist items are not filled out with actual content. Fill in the Description section explaining the issue and solution, provide Test Coverage details listing relevant tests, and complete the PR Checklist items.
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: enabling skip memory estimation for KV cache manager v2, which aligns with the code modifications in py_executor_creator.py and _util.py.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)

896-898: 💤 Low value

Consider adding a setter method instead of direct private attribute assignment.

The logic is correct: skip_est is enabled only when both the environment variable allows it AND the manager is a KVCacheManagerV2 subclass. However, directly assigning to kv_cache_creator._skip_est modifies a private attribute from outside the class.

Consider adding a setter method to KvCacheCreator:

def set_skip_est(self, skip_est: bool) -> None:
    """Update skip estimation flag after manager class is determined."""
    self._skip_est = skip_est

Then use:

kv_cache_creator.set_skip_est(skip_est)

This makes the intent clearer and provides a defined interface for updating the flag.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/py_executor_creator.py` around lines 896 -
898, The code currently sets kv_cache_creator._skip_est directly after checking
issubclass(kv_cache_creator.kv_cache_manager_cls(), KVCacheManagerV2); instead,
add a public setter on the KvCacheCreator class (e.g., def set_skip_est(self,
skip_est: bool) -> None: self._skip_est = skip_est) and replace the direct
assignment with kv_cache_creator.set_skip_est(skip_est) so the private field is
updated via a defined interface.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@jenkins/L0_Test.groovy`:
- Around line 219-229: The upload gate currently checks only hasTimeoutTest,
downloadResultSucceed, and downloadPerfResultSucceed so SLURM-only runs are
skipped; modify the download block that handles slurmLogFiles (where
slurmLogSources is built) to set a boolean flag (e.g., downloadSlurmLogsSucceed)
when the scp/download succeeds, and then include that flag in the upload
condition (replace if (hasTimeoutTest || downloadResultSucceed ||
downloadPerfResultSucceed) with a check that also ORs downloadSlurmLogsSucceed).
Ensure the new flag is initialized false and updated by Utils.exec's return
status so artifact/archive steps run when only SLURM logs are present.
- Around line 217-224: The slurm log filenames in slurmLogFiles are being
interpolated directly into slurmLogSources and passed to
Utils.exec/scpFromRemoteCmd, allowing shell injection; instead, change the logic
that builds slurmLogSources to iterate slurmLogFiles and invoke scpFromRemoteCmd
once per file (using the individual file path constructed from
perfResultsBasePath and the filename) so each remote path is strictly
quoted/escaped before being handed to the shell, and call Utils.exec(sc
pFromRemoteCmd(remote, singleSource, "${stageName}/slurm-logs/"), returnStatus:
true, numRetries: 3) for each file to avoid expanding an untrusted list in a
single shell interpolation.

---

Nitpick comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor_creator.py`:
- Around line 896-898: The code currently sets kv_cache_creator._skip_est
directly after checking issubclass(kv_cache_creator.kv_cache_manager_cls(),
KVCacheManagerV2); instead, add a public setter on the KvCacheCreator class
(e.g., def set_skip_est(self, skip_est: bool) -> None: self._skip_est =
skip_est) and replace the direct assignment with
kv_cache_creator.set_skip_est(skip_est) so the private field is updated via a
defined interface.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1f80affb-9512-4994-ab63-c6d0cea1bd4d

📥 Commits

Reviewing files that changed from the base of the PR and between 520262d and d4f5566.

📒 Files selected for processing (3)
  • jenkins/L0_Test.groovy
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py

Comment thread jenkins/L0_Test.groovy
Comment on lines +217 to +224
def slurmLogFiles = slurmLogListOutput.split(/\s+/).collect { it.trim() }.findAll { it }
echo "Slurm Log Files: ${slurmLogFiles}"
if (slurmLogFiles) {
sh "mkdir -p ${stageName}/slurm-logs"
def slurmLogSources = slurmLogFiles.size() == 1
? "${perfResultsBasePath}/${slurmLogFiles[0]}"
: "{${slurmLogFiles.collect { "${perfResultsBasePath}/${it}" }.join(',')}}"
Utils.exec(pipeline, script: scpFromRemoteCmd(remote, slurmLogSources, "${stageName}/slurm-logs/"), returnStatus: true, numRetries: 3)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Unsafe shell interpolation in SCP source construction can enable command injection.

At Line 217 and Lines 221-224, filenames returned from remote find are interpolated into a shell command without robust quoting/escaping. A crafted log filename (e.g., containing shell metacharacters) can execute unintended commands on the Jenkins side when scpFromRemoteCmd() builds the command string.

Suggested hardening (copy one file at a time with strict quoting)
-            def slurmLogFiles = slurmLogListOutput.split(/\s+/).collect { it.trim() }.findAll { it }
+            def slurmLogFiles = slurmLogListOutput
+                .split('\n')
+                .collect { it.trim() }
+                .findAll { it }

             echo "Slurm Log Files: ${slurmLogFiles}"
             if (slurmLogFiles) {
                 sh "mkdir -p ${stageName}/slurm-logs"
-                def slurmLogSources = slurmLogFiles.size() == 1
-                    ? "${perfResultsBasePath}/${slurmLogFiles[0]}"
-                    : "{${slurmLogFiles.collect { "${perfResultsBasePath}/${it}" }.join(',')}}"
-                Utils.exec(pipeline, script: scpFromRemoteCmd(remote, slurmLogSources, "${stageName}/slurm-logs/"), returnStatus: true, numRetries: 3)
+                slurmLogFiles.each { logName ->
+                    def remoteFile = "${perfResultsBasePath}/${logName}".replace("'", "'\"'\"'")
+                    def localDir = "${stageName}/slurm-logs/".replace("'", "'\"'\"'")
+                    Utils.exec(
+                        pipeline,
+                        script: scpFromRemoteCmd(remote, "'${remoteFile}'", "'${localDir}'"),
+                        returnStatus: true,
+                        numRetries: 3
+                    )
+                }
             }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@jenkins/L0_Test.groovy` around lines 217 - 224, The slurm log filenames in
slurmLogFiles are being interpolated directly into slurmLogSources and passed to
Utils.exec/scpFromRemoteCmd, allowing shell injection; instead, change the logic
that builds slurmLogSources to iterate slurmLogFiles and invoke scpFromRemoteCmd
once per file (using the individual file path constructed from
perfResultsBasePath and the filename) so each remote path is strictly
quoted/escaped before being handed to the shell, and call Utils.exec(sc
pFromRemoteCmd(remote, singleSource, "${stageName}/slurm-logs/"), returnStatus:
true, numRetries: 3) for each file to avoid expanding an untrusted list in a
single shell interpolation.

Comment thread jenkins/L0_Test.groovy
Comment on lines +219 to 229
if (slurmLogFiles) {
sh "mkdir -p ${stageName}/slurm-logs"
def slurmLogSources = slurmLogFiles.size() == 1
? "${perfResultsBasePath}/${slurmLogFiles[0]}"
: "{${slurmLogFiles.collect { "${perfResultsBasePath}/${it}" }.join(',')}}"
Utils.exec(pipeline, script: scpFromRemoteCmd(remote, slurmLogSources, "${stageName}/slurm-logs/"), returnStatus: true, numRetries: 3)
}

echo "hasTimeoutTest: ${hasTimeoutTest}, downloadResultSucceed: ${downloadResultSucceed}, downloadPerfResultSucceed: ${downloadPerfResultSucceed}"

if (hasTimeoutTest || downloadResultSucceed || downloadPerfResultSucceed) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

SLURM logs can be downloaded but never uploaded as artifacts.

At Line 229, artifact upload is gated only by XML/perf flags. The new SLURM log path has no corresponding success flag in this condition, so runs with only SLURM logs still hit “No results xml to submit” and skip archive/upload.

Suggested fix (include SLURM log download status in upload gate)
         def downloadPerfResultSucceed = false
+        def downloadSlurmLogsSucceed = false
@@
             if (slurmLogFiles) {
                 sh "mkdir -p ${stageName}/slurm-logs"
@@
-                Utils.exec(pipeline, script: scpFromRemoteCmd(remote, slurmLogSources, "${stageName}/slurm-logs/"), returnStatus: true, numRetries: 3)
+                downloadSlurmLogsSucceed =
+                    Utils.exec(
+                        pipeline,
+                        script: scpFromRemoteCmd(remote, slurmLogSources, "${stageName}/slurm-logs/"),
+                        returnStatus: true,
+                        numRetries: 3
+                    ) == 0
             }
@@
-            if (hasTimeoutTest || downloadResultSucceed || downloadPerfResultSucceed) {
+            if (hasTimeoutTest || downloadResultSucceed || downloadPerfResultSucceed || downloadSlurmLogsSucceed) {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@jenkins/L0_Test.groovy` around lines 219 - 229, The upload gate currently
checks only hasTimeoutTest, downloadResultSucceed, and downloadPerfResultSucceed
so SLURM-only runs are skipped; modify the download block that handles
slurmLogFiles (where slurmLogSources is built) to set a boolean flag (e.g.,
downloadSlurmLogsSucceed) when the scp/download succeeds, and then include that
flag in the upload condition (replace if (hasTimeoutTest ||
downloadResultSucceed || downloadPerfResultSucceed) with a check that also ORs
downloadSlurmLogsSucceed). Ensure the new flag is initialized false and updated
by Utils.exec's return status so artifact/archive steps run when only SLURM logs
are present.

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #52532 [ run ] completed with state FAILURE. Commit: d4f5566
/LLM/main/L0_MergeRequest_PR pipeline #41818 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@HuiGao-NV
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #52609 [ run ] triggered by Bot. Commit: d4f5566 Link to invocation

@HuiGao-NV
Copy link
Copy Markdown
Collaborator Author

/bot run --stage-list="B300-PyTorch-2,DGX_H100-PyTorch-5" --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #52680 [ run ] triggered by Bot. Commit: 7ac775b Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #52609 [ run ] completed with state ABORTED. Commit: d4f5566

Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #52680 [ run ] completed with state FAILURE. Commit: 7ac775b
/LLM/main/L0_MergeRequest_PR pipeline #41949 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@HuiGao-NV
Copy link
Copy Markdown
Collaborator Author

/bot run --stage-list="B300-PyTorch-2,DGX_H100-PyTorch-5" --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #52705 [ run ] triggered by Bot. Commit: 7ac775b Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #52705 [ run ] completed with state FAILURE. Commit: 7ac775b
/LLM/main/L0_MergeRequest_PR pipeline #41972 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants