[None][chore] Enable skip memory estimation on kv cache manager v2#15052
[None][chore] Enable skip memory estimation on kv cache manager v2#15052HuiGao-NV wants to merge 2 commits into
Conversation
Signed-off-by: HuiGao <huig@nvidia.com>
|
/bot run --disable-fail-fast |
|
PR_Github #52532 [ run ] triggered by Bot. Commit: |
📝 WalkthroughWalkthroughJenkins post-test result collection now downloads SLURM log artifacts. Python executor creation refines KV cache estimation skip behavior to check both an environment variable default (now true) and KV cache manager implementation type, with conditional validation updates. ChangesCI Test Artifact Collection
KV Cache Estimation Behavior
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)
896-898: 💤 Low valueConsider adding a setter method instead of direct private attribute assignment.
The logic is correct:
skip_estis enabled only when both the environment variable allows it AND the manager is aKVCacheManagerV2subclass. However, directly assigning tokv_cache_creator._skip_estmodifies a private attribute from outside the class.Consider adding a setter method to
KvCacheCreator:def set_skip_est(self, skip_est: bool) -> None: """Update skip estimation flag after manager class is determined.""" self._skip_est = skip_estThen use:
kv_cache_creator.set_skip_est(skip_est)This makes the intent clearer and provides a defined interface for updating the flag.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/py_executor_creator.py` around lines 896 - 898, The code currently sets kv_cache_creator._skip_est directly after checking issubclass(kv_cache_creator.kv_cache_manager_cls(), KVCacheManagerV2); instead, add a public setter on the KvCacheCreator class (e.g., def set_skip_est(self, skip_est: bool) -> None: self._skip_est = skip_est) and replace the direct assignment with kv_cache_creator.set_skip_est(skip_est) so the private field is updated via a defined interface.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@jenkins/L0_Test.groovy`:
- Around line 219-229: The upload gate currently checks only hasTimeoutTest,
downloadResultSucceed, and downloadPerfResultSucceed so SLURM-only runs are
skipped; modify the download block that handles slurmLogFiles (where
slurmLogSources is built) to set a boolean flag (e.g., downloadSlurmLogsSucceed)
when the scp/download succeeds, and then include that flag in the upload
condition (replace if (hasTimeoutTest || downloadResultSucceed ||
downloadPerfResultSucceed) with a check that also ORs downloadSlurmLogsSucceed).
Ensure the new flag is initialized false and updated by Utils.exec's return
status so artifact/archive steps run when only SLURM logs are present.
- Around line 217-224: The slurm log filenames in slurmLogFiles are being
interpolated directly into slurmLogSources and passed to
Utils.exec/scpFromRemoteCmd, allowing shell injection; instead, change the logic
that builds slurmLogSources to iterate slurmLogFiles and invoke scpFromRemoteCmd
once per file (using the individual file path constructed from
perfResultsBasePath and the filename) so each remote path is strictly
quoted/escaped before being handed to the shell, and call Utils.exec(sc
pFromRemoteCmd(remote, singleSource, "${stageName}/slurm-logs/"), returnStatus:
true, numRetries: 3) for each file to avoid expanding an untrusted list in a
single shell interpolation.
---
Nitpick comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor_creator.py`:
- Around line 896-898: The code currently sets kv_cache_creator._skip_est
directly after checking issubclass(kv_cache_creator.kv_cache_manager_cls(),
KVCacheManagerV2); instead, add a public setter on the KvCacheCreator class
(e.g., def set_skip_est(self, skip_est: bool) -> None: self._skip_est =
skip_est) and replace the direct assignment with
kv_cache_creator.set_skip_est(skip_est) so the private field is updated via a
defined interface.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1f80affb-9512-4994-ab63-c6d0cea1bd4d
📒 Files selected for processing (3)
jenkins/L0_Test.groovytensorrt_llm/_torch/pyexecutor/_util.pytensorrt_llm/_torch/pyexecutor/py_executor_creator.py
| def slurmLogFiles = slurmLogListOutput.split(/\s+/).collect { it.trim() }.findAll { it } | ||
| echo "Slurm Log Files: ${slurmLogFiles}" | ||
| if (slurmLogFiles) { | ||
| sh "mkdir -p ${stageName}/slurm-logs" | ||
| def slurmLogSources = slurmLogFiles.size() == 1 | ||
| ? "${perfResultsBasePath}/${slurmLogFiles[0]}" | ||
| : "{${slurmLogFiles.collect { "${perfResultsBasePath}/${it}" }.join(',')}}" | ||
| Utils.exec(pipeline, script: scpFromRemoteCmd(remote, slurmLogSources, "${stageName}/slurm-logs/"), returnStatus: true, numRetries: 3) |
There was a problem hiding this comment.
Unsafe shell interpolation in SCP source construction can enable command injection.
At Line 217 and Lines 221-224, filenames returned from remote find are interpolated into a shell command without robust quoting/escaping. A crafted log filename (e.g., containing shell metacharacters) can execute unintended commands on the Jenkins side when scpFromRemoteCmd() builds the command string.
Suggested hardening (copy one file at a time with strict quoting)
- def slurmLogFiles = slurmLogListOutput.split(/\s+/).collect { it.trim() }.findAll { it }
+ def slurmLogFiles = slurmLogListOutput
+ .split('\n')
+ .collect { it.trim() }
+ .findAll { it }
echo "Slurm Log Files: ${slurmLogFiles}"
if (slurmLogFiles) {
sh "mkdir -p ${stageName}/slurm-logs"
- def slurmLogSources = slurmLogFiles.size() == 1
- ? "${perfResultsBasePath}/${slurmLogFiles[0]}"
- : "{${slurmLogFiles.collect { "${perfResultsBasePath}/${it}" }.join(',')}}"
- Utils.exec(pipeline, script: scpFromRemoteCmd(remote, slurmLogSources, "${stageName}/slurm-logs/"), returnStatus: true, numRetries: 3)
+ slurmLogFiles.each { logName ->
+ def remoteFile = "${perfResultsBasePath}/${logName}".replace("'", "'\"'\"'")
+ def localDir = "${stageName}/slurm-logs/".replace("'", "'\"'\"'")
+ Utils.exec(
+ pipeline,
+ script: scpFromRemoteCmd(remote, "'${remoteFile}'", "'${localDir}'"),
+ returnStatus: true,
+ numRetries: 3
+ )
+ }
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@jenkins/L0_Test.groovy` around lines 217 - 224, The slurm log filenames in
slurmLogFiles are being interpolated directly into slurmLogSources and passed to
Utils.exec/scpFromRemoteCmd, allowing shell injection; instead, change the logic
that builds slurmLogSources to iterate slurmLogFiles and invoke scpFromRemoteCmd
once per file (using the individual file path constructed from
perfResultsBasePath and the filename) so each remote path is strictly
quoted/escaped before being handed to the shell, and call Utils.exec(sc
pFromRemoteCmd(remote, singleSource, "${stageName}/slurm-logs/"), returnStatus:
true, numRetries: 3) for each file to avoid expanding an untrusted list in a
single shell interpolation.
| if (slurmLogFiles) { | ||
| sh "mkdir -p ${stageName}/slurm-logs" | ||
| def slurmLogSources = slurmLogFiles.size() == 1 | ||
| ? "${perfResultsBasePath}/${slurmLogFiles[0]}" | ||
| : "{${slurmLogFiles.collect { "${perfResultsBasePath}/${it}" }.join(',')}}" | ||
| Utils.exec(pipeline, script: scpFromRemoteCmd(remote, slurmLogSources, "${stageName}/slurm-logs/"), returnStatus: true, numRetries: 3) | ||
| } | ||
|
|
||
| echo "hasTimeoutTest: ${hasTimeoutTest}, downloadResultSucceed: ${downloadResultSucceed}, downloadPerfResultSucceed: ${downloadPerfResultSucceed}" | ||
|
|
||
| if (hasTimeoutTest || downloadResultSucceed || downloadPerfResultSucceed) { |
There was a problem hiding this comment.
SLURM logs can be downloaded but never uploaded as artifacts.
At Line 229, artifact upload is gated only by XML/perf flags. The new SLURM log path has no corresponding success flag in this condition, so runs with only SLURM logs still hit “No results xml to submit” and skip archive/upload.
Suggested fix (include SLURM log download status in upload gate)
def downloadPerfResultSucceed = false
+ def downloadSlurmLogsSucceed = false
@@
if (slurmLogFiles) {
sh "mkdir -p ${stageName}/slurm-logs"
@@
- Utils.exec(pipeline, script: scpFromRemoteCmd(remote, slurmLogSources, "${stageName}/slurm-logs/"), returnStatus: true, numRetries: 3)
+ downloadSlurmLogsSucceed =
+ Utils.exec(
+ pipeline,
+ script: scpFromRemoteCmd(remote, slurmLogSources, "${stageName}/slurm-logs/"),
+ returnStatus: true,
+ numRetries: 3
+ ) == 0
}
@@
- if (hasTimeoutTest || downloadResultSucceed || downloadPerfResultSucceed) {
+ if (hasTimeoutTest || downloadResultSucceed || downloadPerfResultSucceed || downloadSlurmLogsSucceed) {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@jenkins/L0_Test.groovy` around lines 219 - 229, The upload gate currently
checks only hasTimeoutTest, downloadResultSucceed, and downloadPerfResultSucceed
so SLURM-only runs are skipped; modify the download block that handles
slurmLogFiles (where slurmLogSources is built) to set a boolean flag (e.g.,
downloadSlurmLogsSucceed) when the scp/download succeeds, and then include that
flag in the upload condition (replace if (hasTimeoutTest ||
downloadResultSucceed || downloadPerfResultSucceed) with a check that also ORs
downloadSlurmLogsSucceed). Ensure the new flag is initialized false and updated
by Utils.exec's return status so artifact/archive steps run when only SLURM logs
are present.
|
PR_Github #52532 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #52609 [ run ] triggered by Bot. Commit: |
Signed-off-by: HuiGao <huig@nvidia.com>
|
/bot run --stage-list="B300-PyTorch-2,DGX_H100-PyTorch-5" --disable-fail-fast |
|
PR_Github #52680 [ run ] triggered by Bot. Commit: |
|
PR_Github #52609 [ run ] completed with state |
|
PR_Github #52680 [ run ] completed with state
|
|
/bot run --stage-list="B300-PyTorch-2,DGX_H100-PyTorch-5" --disable-fail-fast |
|
PR_Github #52705 [ run ] triggered by Bot. Commit: |
|
PR_Github #52705 [ run ] completed with state
|
Summary by CodeRabbit
Tests
Chores
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either
api-compatibleorapi-breaking. Forapi-breaking, includeBREAKINGin the PR title.Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.