Skip to content

Fix summary step to wait for retry workers before reporting#393

Merged
ianks merged 5 commits intomainfrom
ianks/fix-summary-wait-for-retry
Apr 8, 2026
Merged

Fix summary step to wait for retry workers before reporting#393
ianks merged 5 commits intomainfrom
ianks/fix-summary-wait-for-retry

Conversation

@ianks
Copy link
Copy Markdown
Contributor

@ianks ianks commented Apr 8, 2026

Two fixes for the retry summary step.

The race condition: previously, wait_for_workers would see the queue as exhausted and exit immediately on retry runs -- before the retry worker had a chance to re-run the failed tests and clear error-reports. This meant the summary step would report failures that were about to be fixed. wait_for_workers now stays open on retry runs (BUILDKITE_RETRY_COUNT > 0) until error-reports is empty or inactive_workers_timeout expires.

The misleading stats line: when workers die before flushing their per-test stats, the summary line would show "0 failures, 0 errors" in green while simultaneously printing a FAILED TESTS SUMMARY below it. aggregates now uses error-reports as a floor for the failure count, so the line can never be misleadingly green when failures are known.

ianks added 4 commits April 8, 2026 09:20
When Buildkite retries a job, the main queue is already exhausted from
the original run. A retry worker may find unresolved failures via the
error-reports fallback and start re-running them via the Retry queue.
But those tests are not in the Redis running set, so active_workers?
returns false and the summary's wait_for_workers loop exits immediately
on exhausted? — canceling the retry worker before it can clear the
error-report.

Fix: after the main loop exits due to exhausted?, if this is a retry
run (BUILDKITE_RETRY_COUNT > 0) and error-reports are still non-empty,
wait up to inactive_workers_timeout for retry workers to clear them.
Covers the case where wait_for_workers must stay open on a retry run
until retry workers clear error-reports. Also verifies the secondary
wait is skipped when BUILDKITE_RETRY_COUNT is not set.
…ailures

When workers die before flushing per-test stats, error-reports has the
authoritative failure data but the aggregated stats show 0. This caused
the summary line to say "0 failures, 0 errors" in green while simultaneously
printing a FAILED TESTS SUMMARY below it.

Also moves FakeEntry/fake_entry test helper to a shared support file so it's
available across all test files without cross-file dependencies.
@ianks
Copy link
Copy Markdown
Contributor Author

ianks commented Apr 8, 2026

Verified end-to-end on https://buildkite.com/shopify/world-shopify-selective-tests/builds/903009 -- CiQueueRetrySmokeTest#test_passes_only_on_manual_retry ran through the retry path and passed with BUILDKITE_RETRY_TYPE=manual. Final aggregate: 0 failures, 0 errors across 602k tests.

@ianks ianks merged commit 3dc8157 into main Apr 8, 2026
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants