fix: null-safe getTime() calls in replication services (#3519)#3697
fix: null-safe getTime() calls in replication services (#3519)#3697deepshekhardas wants to merge 1 commit into
Conversation
…#3519) Add optional chaining to getTime() calls on createdAt, updatedAt fields in runs and sessions replication services. When CDC sends rows before timestamps are fully populated, calling .getTime() on undefined crashes the replication service with a TypeError. - runsReplicationService.server.ts: null-safe updatedAt/createdAt in #prepareTaskRunInsert and #preparePayloadInsert - sessionsReplicationService.server.ts: null-safe createdAt/updatedAt in toSessionInsertArray
|
|
Hi @deepshekhardas, thanks for your interest in contributing! This project requires that pull request authors are vouched, and you are not in the list of vouched users. This PR will be closed automatically. See https://github.com/triggerdotdev/trigger.dev/blob/main/CONTRIBUTING.md for more details. |
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
WalkthroughThis pull request adds defensive timestamp handling to two replication service files. In Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🚩 REPLICA IDENTITY FULL not set in production migrations — delete events may crash before reaching these fallbacks
Tests explicitly run ALTER TABLE public."TaskRun" REPLICA IDENTITY FULL and ALTER TABLE public."Session" REPLICA IDENTITY FULL before each test case, but there is no corresponding production migration in internal-packages/database/prisma/migrations/ that sets this. With the default replica identity, PostgreSQL only sends key columns for delete events. In the service code at runsReplicationService.server.ts:415, message.old would be null for key-based deletes, and the as TaskRun cast doesn't protect at runtime. The code would then crash at line 828 (run.environmentType) before ever reaching the timestamp lines changed in this PR. This means the ?. fallbacks for timestamps don't actually help with the delete-event null scenario — the crash happens earlier. If REPLICA IDENTITY FULL is indeed set out-of-band (e.g., via a manual migration or deploy script), this is fine, but it's worth verifying that the production database has it configured.
Was this helpful? React with 👍 or 👎 to provide feedback.
| run.updatedAt.getTime(), // updated_at | ||
| run.createdAt.getTime(), // created_at | ||
| run.updatedAt?.getTime() ?? Date.now(), // updated_at | ||
| run.createdAt?.getTime() ?? Date.now(), // created_at |
There was a problem hiding this comment.
🔴 Date.now() fallback for created_at corrupts ClickHouse partition/sort key, breaking ReplacingMergeTree deduplication
When createdAt is null and falls back to Date.now(), the record is inserted into a ClickHouse partition determined by the current month (toYYYYMM(created_at)), not the record's actual creation month. Both task_runs_v2 (internal-packages/clickhouse/schema/004_create_task_runs_v2.sql:75-76) and sessions_v1 (internal-packages/clickhouse/schema/030_create_sessions_v1.sql:37-38) use PARTITION BY toYYYYMM(created_at) and include created_at in the ORDER BY key with ReplacingMergeTree(_version, _is_deleted). Since ReplacingMergeTree only deduplicates within the same partition and sort key, if a record's created_at changes between inserts (e.g., a previous insert/update used the real timestamp, but a later event uses Date.now()), the records land in different partitions and will never be deduplicated — creating permanent ghost rows. Additionally, no warning is logged when the fallback is triggered, making this silent data corruption.
Affected locations using Date.now() as created_at fallback
runsReplicationService.server.ts:879— task run insertcreated_atrunsReplicationService.server.ts:939— payload insertcreated_atsessionsReplicationService.server.ts:753— session insertcreated_at
Prompt for agents
The created_at field is used as the ClickHouse PARTITION BY key (toYYYYMM(created_at)) and is part of the ORDER BY sort key in both task_runs_v2 and sessions_v1. Using Date.now() as a fallback when createdAt is null means the record could land in a different partition than its other versions, permanently breaking ReplacingMergeTree deduplication.
The fix should either:
1. Skip the insert entirely (return early) when created_at is null, since a record without a valid creation timestamp cannot be correctly placed in ClickHouse. This matches the pattern already used at line 828-830 for missing environmentType/organizationId.
2. Or at minimum, log a prominent warning when the fallback is triggered so operators can detect the issue.
Affected locations:
- runsReplicationService.server.ts line 879: run.createdAt?.getTime() ?? Date.now() in #prepareTaskRunInsert
- runsReplicationService.server.ts line 939: run.createdAt?.getTime() ?? Date.now() in #preparePayloadInsert
- sessionsReplicationService.server.ts line 753: session.createdAt?.getTime() ?? Date.now() in toSessionInsertArray
Note: updated_at is NOT part of the partition/sort key, so Date.now() fallback for updated_at (lines 878 and 754) is less critical but still produces incorrect timestamps.
Was this helpful? React with 👍 or 👎 to provide feedback.
| run.updatedAt?.getTime() ?? Date.now(), // updated_at | ||
| run.createdAt?.getTime() ?? Date.now(), // created_at |
There was a problem hiding this comment.
🚩 The ?. null guards are likely unnecessary for insert/update events given non-nullable Prisma schema
Both createdAt and updatedAt are declared as non-nullable DateTime in the Prisma schema (internal-packages/database/prisma/schema.prisma:924-925 for TaskRun, lines 824-825 for Session). PostgreSQL enforces NOT NULL on these columns, and the pgoutput replication stream always includes all columns in the new tuple for insert and update messages (as seen in the readTuple method at internal-packages/replication/src/pgoutput.ts:433-466). The only scenario where these could be null is if there's a bug in the parser or the data is somehow corrupted in transit. The defensive ?. might be motivated by a production incident — if so, the root cause (why a non-nullable column appeared null) deserves investigation rather than just adding fallbacks.
Was this helpful? React with 👍 or 👎 to provide feedback.
|
Hi team, this fixes #3519 — replication service crash when CDC sends rows before timestamps are populated. Would appreciate a review and merge. Thanks! |
Adds null-safe access to getTime() calls on createdAt/updatedAt in runs/sessions replication services. Fixes crash when CDC sends rows before timestamps are populated.