Fix run attribution loss after waitpoint resume in managed workers#3724
Fix run attribution loss after waitpoint resume in managed workers#3724darshitp091 wants to merge 2 commits into
Conversation
Addresses issue triggerdotdev#3674 by avoiding unconditional secure=true on HTTP ClickHouse URLs in the webapp entrypoint.
Addresses issue triggerdotdev#3672 by re-enabling task context when a waitpoint resumes in the same process, so metrics emitted after resumption keep RUN_ID, TASK_SLUG, and ATTEMPT_NUMBER attributes.
|
|
Hi @darshitp091, thanks for your interest in contributing! This project requires that pull request authors are vouched, and you are not in the list of vouched users. This PR will be closed automatically. See https://github.com/triggerdotdev/trigger.dev/blob/main/CONTRIBUTING.md for more details. |
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
WalkthroughThis PR extracts explicit task context enablement into a dedicated Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
Fixes a telemetry attribution bug in Trigger.dev managed/dev workers where metrics emitted after resuming from a waitpoint could lose run-scoped attributes (e.g. RUN_ID) by ensuring the task context is re-enabled when a waitpoint resolves.
Changes:
- Added
TaskContextAPI.enable()and reused it insetGlobalTaskContext()to explicitly re-enable run attribution. - Re-enabled task context in both managed and dev worker
RESOLVE_WAITPOINTIPC handlers. - Added a unit regression test around
disable()→enable()behavior (but see review comment re: coverage strength). - Updated Docker ClickHouse migration DSN normalization logic in the entrypoint script.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/core/src/v3/taskContext/index.ts | Adds enable() and uses it to restore run attribution state. |
| packages/core/src/v3/taskContext/index.test.ts | Adds a regression test for disable/enable flow. |
| packages/cli-v3/src/entryPoints/managed-run-worker.ts | Calls taskContext.enable() when handling RESOLVE_WAITPOINT. |
| packages/cli-v3/src/entryPoints/dev-run-worker.ts | Calls taskContext.enable() when handling RESOLVE_WAITPOINT. |
| docker/scripts/entrypoint.sh | Changes ClickHouse migration connection string handling. |
| it("re-enables run attribution after disable()", () => { | ||
| const api = TaskContextAPI.getInstance(); | ||
| api.setGlobalTaskContext({ ctx: FAKE_CTX, worker: FAKE_WORKER }); | ||
|
|
||
| api.disable(); | ||
|
|
||
| expect(api.isRunDisabled).toBe(true); | ||
|
|
||
| api.enable(); | ||
|
|
||
| expect(api.isRunDisabled).toBe(false); | ||
| expect(api.attributes[SemanticInternalAttributes.RUN_ID]).toBe(FAKE_CTX.run.id); | ||
| }); |
| # Run ClickHouse migrations | ||
| echo "Running ClickHouse migrations..." | ||
| export GOOSE_DRIVER=clickhouse | ||
|
|
||
| # Ensure secure=true is in the connection string | ||
| if echo "$CLICKHOUSE_URL" | grep -q "secure="; then | ||
| # secure parameter already exists, use as is | ||
| export GOOSE_DBSTRING="$CLICKHOUSE_URL" | ||
| elif echo "$CLICKHOUSE_URL" | grep -q "?"; then | ||
| # URL has query parameters, append secure=true | ||
| export GOOSE_DBSTRING="${CLICKHOUSE_URL}&secure=true" | ||
| else | ||
| # URL has no query parameters, add secure=true | ||
| export GOOSE_DBSTRING="${CLICKHOUSE_URL}?secure=true" | ||
| fi | ||
|
|
||
| # Goose derives TLS from the URL scheme. Strip any existing secure query | ||
| # parameter and only set secure=true for https URLs. | ||
| export GOOSE_DBSTRING="$(node -e 'const url = new URL(process.env.CLICKHOUSE_URL); url.searchParams.delete("secure"); if (url.protocol === "https:") { url.searchParams.set("secure", "true"); } process.stdout.write(url.toString());')" |
This PR fixes the telemetry attribution bug reported in issue #3672, where metrics emitted after a task resumes from a waitpoint were being ingested with empty RUN_ID, TASK_SLUG, and ATTEMPT_NUMBER fields.
Problem
On managed Trigger Cloud, long-running tasks that hit a waitpoint and then resume in the same worker process would continue emitting process and Node.js auto-metrics, but those metrics were no longer attributable to the original run. They still reached ClickHouse, but only the environment, project, and machine-level attributes were preserved, so per-run queries like this would miss the post-waitpoint portion of the execution:
Root Cause
When the worker flushes telemetry at a waitpoint, it sends
FLUSH { disableContext: true }, and the task context marks the run as disabled. That behavior is intentional for the between-runs state so the exporter can strip run-specific fields.The problem is that when the waitpoint is resolved, the worker only resolves the promise for the runtime and never re-enables the task context. Because the same Node process continues executing the resumed task, the exporter stays in the disabled branch for the rest of that run and keeps dropping the run-specific attributes.
Fix
This change makes the task context explicitly re-enable when a waitpoint is resolved:
taskContext.enable()method alongside the existingdisable()method.taskContext.enable()in both managed and dev workerRESOLVE_WAITPOINThandlers.disable()followed byenable()restores run attribution, includingRUN_ID.Why this fixes it
The waitpoint flush still disables context during the suspension boundary, which preserves the original behavior for between-runs telemetry. The new enable step restores the run-scoped context immediately when execution resumes, so all subsequent metrics emitted by the resumed task keep their original attribution fields and remain queryable by run.
Validation