Skip to content

[opt](job) delay Kafka read committed zero-row retries#64046

Open
sollhui wants to merge 1 commit into
apache:masterfrom
sollhui:opt_read_commit_retry
Open

[opt](job) delay Kafka read committed zero-row retries#64046
sollhui wants to merge 1 commit into
apache:masterfrom
sollhui:opt_read_commit_retry

Conversation

@sollhui
Copy link
Copy Markdown
Contributor

@sollhui sollhui commented Jun 3, 2026

What problem does this PR solve?

Kafka routine load with isolation.level=read_committed can finish a task with 0 consumed rows while the task still has positive lag, for example when upstream transactional records are not committed and therefore invisible. PR #63664 added an OtherMsg hint for this case, but the renewed task could still be scheduled immediately when the normal EOF heuristic did not apply, causing repeated retries. This change reuses the same read_committed zero-row lag detection to mark the next Kafka routine load task for delayed scheduling, so it follows the existing max_batch_interval delay path used by EOF tasks.

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@sollhui
Copy link
Copy Markdown
Contributor Author

sollhui commented Jun 3, 2026

run buildall

@sollhui
Copy link
Copy Markdown
Contributor Author

sollhui commented Jun 3, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review result: no blocking issues found in the changed code.

Critical checkpoint conclusions:

  • Goal and proof: The PR delays renewed Kafka routine-load tasks when read_committed commits consume zero rows while task lag remains positive. The implementation reuses the existing zero-row positive-lag predicate and adds a focused unit test covering propagation to the renewed task.
  • Scope and clarity: The change is small and focused on RoutineLoadTaskInfo/KafkaTaskInfo/KafkaRoutineLoadJob scheduling state.
  • Concurrency and locking: The new state is set in the existing transaction status handling path under the routine-load job write lock and copied during the existing renew path. No new lock ordering or heavy locked operation was introduced.
  • Lifecycle/static state: No new static initialization dependency or special ownership lifecycle was introduced.
  • Configuration/compatibility: No new config, persisted format, FE-BE protocol field, or incompatible storage/API change was introduced.
  • Parallel paths: Non-Kafka routine-load tasks keep the default false delay predicate; Kafka renew now preserves both caller-requested delay and the task-derived delay. Kinesis behavior is unchanged.
  • Conditional checks: The new condition matches the existing read_committed zero-row positive-lag hint logic.
  • Tests: A unit test was added for the new delayed-renew behavior. I did not run FE tests locally because the required FE build prerequisite thirdparty/installed/bin/protoc is missing in this runner.
  • Observability: Existing OtherMsg observability for this condition remains in place; no additional metrics/logs appear necessary for this scheduling-only change.
  • Transaction/persistence/data correctness: The change affects retry timing only and does not alter data visibility, offset advancement semantics, or transaction commit rules.
  • Performance: The added predicate reuses small in-memory maps and does not introduce material overhead.

User focus: No additional user-provided review focus was specified.

Copy link
Copy Markdown
Contributor

@liaoxin01 liaoxin01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@liaoxin01
Copy link
Copy Markdown
Contributor

run buildall

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label Jun 3, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 3, 2026

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 3, 2026

PR approved by anyone and no changes requested.

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 28809 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 1bbf97d3c8f762f490457aa6726eb733072bfbcf, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17659	3969	4022	3969
q2	q3	10806	1360	826	826
q4	4685	471	342	342
q5	7600	857	589	589
q6	182	174	151	151
q7	777	858	632	632
q8	9859	1432	1520	1432
q9	6257	4464	4428	4428
q10	6864	1830	1534	1534
q11	433	270	246	246
q12	632	434	292	292
q13	18188	3724	2702	2702
q14	262	258	244	244
q15	q16	830	777	698	698
q17	955	949	955	949
q18	6820	5730	5638	5638
q19	1321	1386	956	956
q20	525	402	258	258
q21	6121	2816	2603	2603
q22	448	389	320	320
Total cold run time: 101224 ms
Total hot run time: 28809 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4794	4743	4735	4735
q2	q3	5134	5230	4597	4597
q4	2135	2206	1396	1396
q5	4883	4802	4710	4710
q6	246	183	128	128
q7	1826	1742	1627	1627
q8	2445	2187	2075	2075
q9	7358	7404	7310	7310
q10	4750	4674	4168	4168
q11	523	381	358	358
q12	744	743	525	525
q13	2939	3361	2816	2816
q14	273	289	254	254
q15	q16	682	701	605	605
q17	1276	1256	1239	1239
q18	7244	6855	6812	6812
q19	1108	1083	1115	1083
q20	2222	2211	1934	1934
q21	5270	4588	4486	4486
q22	516	463	396	396
Total cold run time: 56368 ms
Total hot run time: 51254 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 169296 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 1bbf97d3c8f762f490457aa6726eb733072bfbcf, data reload: false

query5	4309	640	472	472
query6	450	196	178	178
query7	4845	538	293	293
query8	385	222	212	212
query9	8818	4015	4040	4015
query10	438	327	258	258
query11	5804	2348	2164	2164
query12	160	105	101	101
query13	1276	604	410	410
query14	6419	5390	5053	5053
query14_1	4422	4394	4452	4394
query15	211	200	179	179
query16	1038	428	448	428
query17	1121	705	600	600
query18	2612	475	350	350
query19	205	212	148	148
query20	116	111	111	111
query21	211	139	124	124
query22	13659	13603	13403	13403
query23	17386	16547	16133	16133
query23_1	16255	16204	16350	16204
query24	7587	1763	1329	1329
query24_1	1341	1268	1323	1268
query25	596	496	380	380
query26	1293	329	166	166
query27	2669	575	327	327
query28	4441	2020	2011	2011
query29	1046	587	469	469
query30	309	249	211	211
query31	1132	1071	958	958
query32	127	63	60	60
query33	536	315	244	244
query34	1173	1123	648	648
query35	753	773	701	701
query36	1383	1399	1214	1214
query37	149	102	92	92
query38	3211	3142	3033	3033
query39	933	913	901	901
query39_1	880	885	865	865
query40	219	118	97	97
query41	64	63	64	63
query42	102	93	94	93
query43	313	315	278	278
query44	
query45	205	187	182	182
query46	1106	1201	763	763
query47	2366	2424	2225	2225
query48	408	421	310	310
query49	649	473	403	403
query50	1005	361	254	254
query51	4355	4455	4199	4199
query52	87	87	83	83
query53	246	263	192	192
query54	269	228	196	196
query55	77	75	67	67
query56	224	233	224	224
query57	1464	1408	1297	1297
query58	244	214	206	206
query59	1550	1661	1412	1412
query60	281	254	237	237
query61	157	152	159	152
query62	698	664	589	589
query63	220	189	180	180
query64	2541	779	662	662
query65	
query66	1769	471	352	352
query67	29726	29718	29513	29513
query68	
query69	423	309	271	271
query70	949	950	950	950
query71	304	231	211	211
query72	2900	2654	2386	2386
query73	819	737	454	454
query74	5157	4942	4815	4815
query75	2679	2567	2273	2273
query76	2373	1160	763	763
query77	358	375	292	292
query78	12371	12441	11915	11915
query79	1422	1040	799	799
query80	603	505	474	474
query81	456	278	240	240
query82	567	159	121	121
query83	351	277	249	249
query84	264	137	112	112
query85	868	515	433	433
query86	372	307	280	280
query87	3372	3403	3185	3185
query88	3623	2763	2695	2695
query89	437	375	328	328
query90	1908	183	193	183
query91	178	167	134	134
query92	60	64	57	57
query93	1588	1391	895	895
query94	536	353	289	289
query95	665	376	473	376
query96	1093	799	350	350
query97	2720	2686	2558	2558
query98	211	207	199	199
query99	1146	1173	1029	1029
Total cold run time: 251218 ms
Total hot run time: 169296 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 77.78% (7/9) 🎉
Increment coverage report
Complete coverage report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/3.1.x dev/4.0.x dev/4.1.x reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants