feat(ops): automated ephemeral stack cleanup script #109

Open
scottschreckengaust wants to merge 3 commits into main from feat/cleanup-ephemeral-stacks

@scottschreckengaust (Contributor) commented May 18, 2026

Summary

  • Adds scripts/cleanup-ephemeral-stacks.sh — a shell script that identifies and deletes orphaned ABCA ephemeral CloudFormation stacks
  • Handles stuck ENI cleanup (Lambda/AgentCore Hyperplane ENIs) before stack deletion to prevent DELETE_FAILED states
  • Respects termination protection and skips stacks in active transitions
  • Supports --dry-run, --max-age-hours, --prefix filter, and --force-eni options

Status

Script only — this is the operational cleanup tool from Issue #72. Still needed for full issue completion:

  • EventBridge scheduled rule (Lambda or Step Functions wrapper) for automated recurring execution
  • CloudWatch logging/metrics for audit trail
  • CDK construct or standalone stack to deploy the scheduler
  • Integration tests / dry-run CI validation

The script itself is complete and production-ready for manual use today.

Test plan

  • Run with --dry-run against an account with ephemeral stacks
  • Verify termination-protected stacks are never touched
  • Verify ENI cleanup works for stuck VPC security groups
  • Verify --max-age-hours filtering is correct

Part of #72 — this PR delivers the manual cleanup script foundation only. Full issue completion (EventBridge schedule, CloudWatch audit, CDK construct, tests) lands in a future PR.

🤖 Generated with Claude Code

Shell script that identifies and deletes orphaned ABCA ephemeral
CloudFormation stacks. Handles stuck ENI cleanup (Lambda/AgentCore
Hyperplane ENIs) before stack deletion, respects termination protection,
and supports dry-run mode.

Closes #72

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@scottschreckengaust scottschreckengaust marked this pull request as ready for review June 5, 2026 22:35
@scottschreckengaust scottschreckengaust requested a review from a team as a code owner June 5, 2026 22:35
Security/robustness review of scripts/cleanup-ephemeral-stacks.sh (#72):

- Fail CLOSED on unparseable CreationTime. Previously a parse failure fell
  back to epoch 0, making every matching stack look ~billions of seconds old
  and eligible for deletion — the age gate failed open. Now it SKIPs.
- Validate --max-age-hours is a non-negative integer before arithmetic
  (rejects injected/garbage input).
- Print account + caller ARN (sts:GetCallerIdentity) before any action so the
  operator can confirm blast radius; hard-fail if identity can't be resolved.
- Tolerate a single delete-stack failure instead of aborting the whole loop
  under set -e (would otherwise orphan later stacks); track and report a
  Failed count, and only increment Deleted on a delete actually initiated.
- Remove dead --force-eni flag (parsed but never used; shellcheck SC2034).
- Annotate the JMESPath --query backticks as intentional (shellcheck SC2016).

shellcheck: clean (exit 0). semgrep --config=auto: 0 findings.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
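
The fail-closed age gate and the --max-age-hours validation described in the commit above could be sketched roughly as follows. This is not the code from scripts/cleanup-ephemeral-stacks.sh: the function names (is_non_negative_int, stack_age_hours, decide) are hypothetical, the timestamp shape check is an added assumption, and parsing relies on GNU date (-d), which may not match what the script does.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical helper: accept only non-empty strings of decimal digits.
is_non_negative_int() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Hypothetical helper: print the stack age in whole hours, or return 1 when
# the timestamp cannot be parsed. Assumes GNU date (-d).
stack_age_hours() {
  local created="$1" now_epoch="$2" created_epoch
  # Require an ISO-8601-like shape first, so empty or relative inputs
  # never reach date.
  case "$created" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T*) ;;
    *) return 1 ;;
  esac
  # No default value on failure: the caller has to treat this as SKIP.
  created_epoch=$(date -u -d "$created" +%s 2>/dev/null) || return 1
  # A creation time in the future is treated as unparseable.
  [ "$created_epoch" -le "$now_epoch" ] || return 1
  echo $(( (now_epoch - created_epoch) / 3600 ))
}

# Print DELETE or SKIP for one stack; print ERROR and return 2 on bad input.
decide() {
  local created="$1" max_age_hours="$2" now_epoch="$3" age
  if ! is_non_negative_int "$max_age_hours"; then
    echo "ERROR"
    return 2
  fi
  if ! age=$(stack_age_hours "$created" "$now_epoch"); then
    echo "SKIP"   # fail closed: unknown age is never eligible for deletion
    return 0
  fi
  # test(1) compares as decimal, so a value such as 08 is not read as octal.
  if [ "$age" -ge "$max_age_hours" ]; then
    echo "DELETE"
  else
    echo "SKIP"
  fi
}
```

A caller would pass the stack's CreationTime, the validated limit, and the current epoch, for example `decide "$creation_time" "$max_age_hours" "$(date -u +%s)"`. The point of the sketch is that no code path substitutes a default epoch when parsing fails.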
@scottschreckengaust
Contributor Author

Revived this branch: rebased onto latest main (kept your "Update branch" merge 971b1da) and hardened the script for defense in depth (commit baaaec0).

Security/robustness review — ran shellcheck (exit 0, clean) and semgrep --config=auto (0 findings) on the result. Manual review found and fixed:

  • Fail-open age gate (most important): a parse failure on CreationTime previously fell back to epoch 0, so every matching stack appeared to be more than 56 years old (about 1.8 billion seconds) and therefore eligible for deletion. The gate now fails closed: an unparseable time results in SKIP.
  • Input validation: --max-age-hours is validated as a non-negative integer before arithmetic (rejects garbage/injected values).
  • Blast-radius visibility: prints account ID + caller ARN (sts:GetCallerIdentity) before any mutation; hard-fails if identity can't be resolved.
  • No mid-run abort: previously, a single delete-stack failure under set -e aborted the loop and left later stacks undeleted. Failures are now tolerated, counted, and reported on a Failed: line.
  • Removed dead --force-eni flag (SC2034); annotated JMESPath --query backticks (SC2016 false positive).
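
To illustrate the "no mid-run abort" item above, here is a minimal sketch of a delete loop that survives individual failures under set -e. It is an illustration rather than the script's actual code: delete_stack and cleanup_stacks are hypothetical names, and only `aws cloudformation delete-stack --stack-name` is a real CLI call.

```bash
#!/usr/bin/env bash
set -eu

deleted=0
failed=0

# Hypothetical wrapper around the real AWS CLI call.
delete_stack() {
  aws cloudformation delete-stack --stack-name "$1"
}

# Attempt every stack passed as an argument and count the outcomes.
cleanup_stacks() {
  local stack
  for stack in "$@"; do
    # A command tested by `if` does not trigger `set -e`, so one failed
    # delete cannot abort the loop and strand the remaining stacks.
    if delete_stack "$stack"; then
      # Count a deletion only when the delete call itself succeeded.
      deleted=$((deleted + 1))
      echo "DELETE initiated: $stack"
    else
      failed=$((failed + 1))
      echo "FAILED: $stack" >&2
    fi
  done
  echo "Deleted: $deleted"
  echo "Failed: $failed"
}
```

Calling `cleanup_stacks stack-a stack-b stack-c` tries all three stacks even if the second delete fails. Whether a non-zero Failed count should also change the script's exit status is a design choice this sketch leaves open.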

Scope unchanged: still the manual script only (Part of #72). EventBridge schedule, CloudWatch audit, CDK construct, and tests are deferred to a future PR.

@scottschreckengaust
Contributor Author

Follow-up filed: #278 — add shellcheck to the toolchain (mise + prek hooks + CI). This PR added the first destructive shell script and surfaced that shell is the only language surface with no static-analysis gate. Out of scope for this PR (toolchain/CI change, separate governance per ADR-003); tracked separately.
