Skip to content

[Enhancement](pyudf) Enhance Python process poll init and repair#64039

Open
linrrzqqq wants to merge 2 commits into
apache:masterfrom
linrrzqqq:enhance-python-process-pool-init
Open

[Enhancement](pyudf) Enhance Python process poll init and repair#64039
linrrzqqq wants to merge 2 commits into
apache:masterfrom
linrrzqqq:enhance-python-process-pool-init

Conversation

@linrrzqqq
Copy link
Copy Markdown
Collaborator

Problem Summary:

Python UDF process pool initialization previously required the whole pool to finish initialization before BE could continue serving the query.

In abnormal environments, Python process startup may hang or take a very long time in paths such as:

  • fork / child process creation
  • waiting for the Python Flight socket to appear
  • terminating and waiting for a failed child process to exit

When one process slot gets stuck, the whole process pool initialization can be blocked. As a result, FE may hit the send fragments RPC timeout before BE returns a meaningful Python UDF error: RpcException, msg: timeout when waiting for send fragments rpc, query timeout:900, left timeout for this operation:30.

be.log:

Initializing Python process pool for version 3.8.19 with 8 processes
Python process pool initialization progress for version 3.8.19: waiting_slot=4/8, success=3, failed=0, elapsed_ms=20508
Python process pool initialization progress for version 3.8.19: waiting_slot=4/8, success=3, failed=0, elapsed_ms=40508
Python process pool initialization progress for version 3.8.19: waiting_slot=4/8, success=3, failed=0, elapsed_ms=60508
Python process pool initialization progress for version 3.8.19: waiting_slot=4/8, success=3, failed=0, elapsed_ms=80508
Python process pool initialization progress for version 3.8.19: waiting_slot=4/8, success=3, failed=0, elapsed_ms=100508
Python process pool initialization progress for version 3.8.19: waiting_slot=4/8, success=3, failed=0, elapsed_ms=120508

Solution

Change Python process pool initialization from "wait until all processes are created" to "return once at least one usable process is available".

The pool no longer treats full-size initialization as a prerequisite for serving queries. Once one Python process is alive, the current query can proceed. Missing or failed process slots are repaired asynchronously by the existing health check / repair path.

  • Bound process pool initialization time, so BE can return SERVICE_UNAVAILABLE before FE send fragments RPC timeout.
  • Allow partial pool availability: initialization succeeds as long as one usable Python process exists.
  • Mark the first initialization round as completed after success or timeout, then rely on health check / repair to fill missing slots.
  • Add bounded wait/reap logic for Python child shutdown to avoid blocking indefinitely in wait.
  • Protect late init / repair workers from writing back after shutdown, and discard late duplicate processes safely.
  • Share repair guarding between foreground repair and health check to avoid duplicate repair pressure.

@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@linrrzqqq
Copy link
Copy Markdown
Collaborator Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found two lifecycle/concurrency issues that should be fixed before merge.

Critical checkpoint conclusions:

  • Goal: The PR bounds Python UDF process-pool initialization and allows partial availability. The main flow and new tests cover stuck startup and repair behavior, but shutdown races are not covered.
  • Scope: The implementation is focused on Python UDF process management, but the detached worker lifecycle adds shutdown coordination requirements.
  • Concurrency: Blocking issue. Detached init/repair workers and health-check startup can race with shutdown, leaving unmanaged Python processes and racing on _health_check_thread.
  • Lifecycle: Blocking issue. Pools created during shutdown are not marked stopped and detached workers can publish processes after the manager snapshot has already been drained.
  • Configuration/compatibility: No new config or storage/protocol compatibility concern found.
  • Parallel paths: Init, foreground repair, and health-check repair are mostly aligned, but shutdown is not synchronized with pool creation/start.
  • Tests: Positive and timeout cases are covered, but concurrent shutdown during initialization/health-check start is missing.
  • Observability/performance: Existing logs are sufficient for the reviewed paths; no additional blocking observability issue found.
  • User focus: No additional user-provided review focus was specified.

Comment thread be/src/udf/python/python_server.cpp
Comment thread be/src/udf/python/python_server.cpp
@linrrzqqq linrrzqqq force-pushed the enhance-python-process-pool-init branch from 03dc422 to 4f8e5ae Compare June 2, 2026 23:18
@linrrzqqq
Copy link
Copy Markdown
Collaborator Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one additional blocking lifecycle issue in the new repair path. The existing review threads already cover the health-check thread race and the pool-creation/shutdown race, so I did not duplicate those.

Critical checkpoint conclusions:

  • Goal/test: The PR aims to bound Python UDF pool initialization and allow partial availability; the added tests cover timeout and partial-success cases, but not shutdown racing with repair wait.
  • Scope/focus: The change is focused on Python UDF process lifecycle, but the new asynchronous repair/init model introduces extra lifecycle edges.
  • Concurrency: Still problematic; _get_process() can return a process after shutdown() marks the pool stopped. Existing comments cover two other shutdown races.
  • Lifecycle/static init: No new static initialization issue found; process ownership/lifecycle has the issue noted inline.
  • Config/compatibility/protocol: No new config or FE/BE protocol compatibility issue found.
  • Parallel paths: Health-check repair and foreground repair are mostly unified by repairing; the stopped-pool return path is the uncovered parallel edge.
  • Tests: Missing a concurrent shutdown-vs-repair wait regression for the inline issue.
  • Observability/performance: No additional blocking observability or performance issue found beyond lifecycle correctness.

User focus: no additional user-provided focus was supplied.

Comment thread be/src/udf/python/python_server.cpp
@linrrzqqq linrrzqqq force-pushed the enhance-python-process-pool-init branch from 4f8e5ae to 1be429f Compare June 2, 2026 23:55
@linrrzqqq
Copy link
Copy Markdown
Collaborator Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review completed for PR 64039.

No additional inline findings from my pass. I treated the existing inline threads as known context and did not duplicate them; the current head appears to include changes addressing the previously raised shutdown/pool-creation/health-check synchronization concerns.

Critical checkpoint conclusions:

  • Goal and tests: The PR aims to make Python UDF process pool initialization/repair bounded and more resilient to stuck process startup. The code generally implements that through detached per-slot initialization, bounded socket startup/termination waits, pool repair, and unit coverage for stuck/partial initialization and repair paths.
  • Scope and focus: Changes are concentrated in the Python UDF process manager/runtime and related BE unit tests.
  • Concurrency: The modified code is concurrency-heavy. Current head uses per-pool mutex/CV state, _pools_mutex, _health_check_mutex, and atomics to coordinate initialization, repair, health check, and shutdown. I did not find an additional distinct race beyond the already-known review context.
  • Lifecycle: Process and health-check lifecycle paths were reviewed, including shutdown and detached init/repair workers. No new lifecycle blocker found in this pass.
  • Configuration: No new config item is added. Existing max_python_process_num and Python memory limit are used consistently with the pre-existing behavior.
  • Compatibility: No FE/BE protocol, storage format, function symbol, or persisted metadata compatibility issue identified.
  • Parallel paths: UDF, UDAF, UDTF client acquisition all share PythonServerManager::get_client, so the changed pool behavior applies consistently.
  • Tests: BE unit tests were added/updated for bounded stuck startup, partial pool initialization, repair, and shutdown-related behavior. I did not run tests in this review runner.
  • Observability: Existing/new logs include version and process counts/errors for initialization and repair; sufficient for this change.
  • Transaction/persistence/data correctness: Not applicable; no transaction, storage visibility, or persistence path changed.
  • Performance: The design avoids waiting for every process during initial pool creation and keeps slow forks outside per-pool locks. No additional obvious hot-path performance issue found.
  • User focus: No additional user-provided review focus was specified.

@linrrzqqq
Copy link
Copy Markdown
Collaborator Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 29032 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 1be429fd8e9cac0788b7693b9925bd9241dbdabc, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17829	4083	4010	4010
q2	q3	10746	1368	800	800
q4	4686	480	343	343
q5	7604	888	586	586
q6	192	168	135	135
q7	768	823	638	638
q8	9776	1581	1605	1581
q9	6977	4520	4536	4520
q10	6735	1840	1518	1518
q11	449	274	255	255
q12	640	428	293	293
q13	18183	3389	2791	2791
q14	275	262	241	241
q15	q16	824	793	714	714
q17	1337	972	843	843
q18	6854	5774	5584	5584
q19	1725	1265	1179	1179
q20	487	417	261	261
q21	5916	2611	2441	2441
q22	440	349	299	299
Total cold run time: 102443 ms
Total hot run time: 29032 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4368	4276	4283	4276
q2	q3	4529	4923	4311	4311
q4	2081	2216	1394	1394
q5	4422	4352	4334	4334
q6	235	178	127	127
q7	1754	2108	1656	1656
q8	2617	2200	2110	2110
q9	7984	7967	7943	7943
q10	4866	4787	4327	4327
q11	582	625	392	392
q12	742	765	539	539
q13	3347	3510	3050	3050
q14	321	343	287	287
q15	q16	725	726	651	651
q17	1366	1310	1346	1310
q18	7986	7429	6953	6953
q19	1124	1125	1119	1119
q20	2222	2219	1960	1960
q21	5271	4559	4454	4454
q22	517	463	396	396
Total cold run time: 57059 ms
Total hot run time: 51589 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 169510 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 1be429fd8e9cac0788b7693b9925bd9241dbdabc, data reload: false

query5	4357	639	482	482
query6	442	198	177	177
query7	5027	560	304	304
query8	365	218	206	206
query9	8774	4012	3999	3999
query10	436	312	265	265
query11	5815	2376	2190	2190
query12	162	105	109	105
query13	1295	598	440	440
query14	6487	5432	5082	5082
query14_1	4470	4411	4422	4411
query15	201	198	191	191
query16	1010	459	433	433
query17	1098	691	567	567
query18	2625	491	347	347
query19	202	179	140	140
query20	112	111	103	103
query21	218	133	123	123
query22	13693	13628	13480	13480
query23	17359	16609	16192	16192
query23_1	16262	16343	16262	16262
query24	8087	1763	1320	1320
query24_1	1337	1312	1308	1308
query25	544	445	392	392
query26	1295	325	165	165
query27	2647	547	351	351
query28	4449	2015	1990	1990
query29	1074	628	475	475
query30	323	247	205	205
query31	1106	1070	954	954
query32	112	63	61	61
query33	539	332	276	276
query34	1165	1178	646	646
query35	756	790	694	694
query36	1407	1425	1228	1228
query37	159	107	92	92
query38	3196	3166	2996	2996
query39	941	931	896	896
query39_1	891	888	892	888
query40	224	124	111	111
query41	73	70	66	66
query42	110	100	97	97
query43	321	321	283	283
query44	
query45	205	190	184	184
query46	1093	1222	738	738
query47	2374	2374	2302	2302
query48	393	404	307	307
query49	653	484	374	374
query50	980	352	256	256
query51	4345	4533	4283	4283
query52	92	91	80	80
query53	254	271	203	203
query54	287	231	217	217
query55	82	79	75	75
query56	246	250	236	236
query57	1442	1432	1329	1329
query58	253	240	225	225
query59	1653	1674	1488	1488
query60	293	270	252	252
query61	184	184	179	179
query62	697	672	590	590
query63	236	194	186	186
query64	2598	866	620	620
query65	
query66	1784	472	351	351
query67	30193	29116	29603	29116
query68	
query69	430	306	266	266
query70	960	931	928	928
query71	298	224	212	212
query72	3000	2723	2362	2362
query73	856	749	432	432
query74	5111	4940	4791	4791
query75	2655	2587	2233	2233
query76	2311	1157	790	790
query77	359	360	289	289
query78	12444	12445	11904	11904
query79	1442	1130	777	777
query80	575	483	397	397
query81	490	281	242	242
query82	589	156	125	125
query83	364	272	247	247
query84	261	145	111	111
query85	867	537	442	442
query86	375	302	272	272
query87	3405	3346	3204	3204
query88	3590	2728	2704	2704
query89	431	386	329	329
query90	1986	191	184	184
query91	179	165	143	143
query92	67	63	55	55
query93	1465	1585	867	867
query94	533	369	300	300
query95	677	469	350	350
query96	1089	788	348	348
query97	2713	2719	2596	2596
query98	227	209	212	209
query99	1166	1177	1059	1059
Total cold run time: 252888 ms
Total hot run time: 169510 ms

@linrrzqqq linrrzqqq force-pushed the enhance-python-process-pool-init branch from 1be429f to 09765bf Compare June 3, 2026 11:09
@linrrzqqq
Copy link
Copy Markdown
Collaborator Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 5.03% (19/378) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.87% (21021/39020)
Line Coverage 37.52% (199696/532181)
Region Coverage 33.62% (156654/465978)
Branch Coverage 34.58% (68532/198178)

@linrrzqqq
Copy link
Copy Markdown
Collaborator Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated review summary for PR 64039 at head 09765bf4c122f438a1ea22939475f612989600e7.

I reviewed the current diff and the existing inline review context. The previously raised shutdown/init races around pool creation after shutdown, starting the health-check thread during shutdown, and returning a repaired process after stopped is set are already-known review items and appear to be addressed in the current head, so I am not resubmitting duplicate inline comments.

Critical checkpoint conclusions:

  • Goal/test proof: The PR changes Python UDF process pool initialization to return once one usable process is available, bounds startup/termination waits, and relies on repair/health-check to fill missing slots. The added BE unit tests cover stuck socket creation, partial pool availability, invalid versions, repair, and multi-version initialization.
  • Scope/minimality: The changes are focused on Python UDF process lifecycle and its unit tests, though the lifecycle code is inherently concurrency-heavy.
  • Concurrency: The current head serializes pool-map creation with shutdown through _pools_mutex and _shutdown_flag, protects health-check thread ownership with _health_check_mutex, uses per-pool locks for pool state, and rechecks stopped after repair waits. I did not find an additional distinct race that should block this PR.
  • Lifecycle/static initialization: No new cross-translation-unit static initialization dependency was identified. Detached init/repair workers capture the per-version pool and use static fork/repair helpers, which avoids direct manager lifetime access after shutdown.
  • Configuration: No new configuration item is added. Existing max_python_process_num is still read dynamically for init/repair sizing.
  • Compatibility: No FE-BE protocol, storage format, or function symbol compatibility issue was identified.
  • Parallel paths: Foreground repair and health-check repair now share the repairing guard; current head applies the main lifecycle checks across init, foreground get, health check, repair, and shutdown paths.
  • Conditional checks: The new stopped, initialized, has_available_process, and repairing checks have local rationale and follow the lifecycle being introduced.
  • Test coverage/results: Unit tests were added/updated for the changed behavior. I did not run the test suite in this review environment.
  • Observability: Existing and new logs include Python version, process counts, pid, and failure status in the relevant init/repair/shutdown paths; no additional required metric was identified.
  • Transaction/persistence/data writes: Not applicable.
  • FE-BE variable passing: Not applicable.
  • Performance: The change avoids blocking request startup on a fully populated pool and keeps fork work out of pool locks. I did not identify an obvious new hot-path performance blocker.

User focus response: No additional user-provided review focus was specified.

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 29768 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 09765bf4c122f438a1ea22939475f612989600e7, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17904	4168	4117	4117
q2	q3	10759	1483	834	834
q4	4684	484	353	353
q5	7534	872	576	576
q6	193	174	140	140
q7	777	851	638	638
q8	9341	1578	1582	1578
q9	5828	4522	4531	4522
q10	6750	1823	1575	1575
q11	459	272	264	264
q12	636	436	292	292
q13	18130	3455	2799	2799
q14	272	263	251	251
q15	q16	813	788	707	707
q17	1013	1044	981	981
q18	7087	5742	5612	5612
q19	1352	1206	1134	1134
q20	528	412	270	270
q21	6283	2898	2803	2803
q22	474	395	322	322
Total cold run time: 100817 ms
Total hot run time: 29768 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	5213	4857	4991	4857
q2	q3	4875	5368	4649	4649
q4	2123	2193	1412	1412
q5	4775	4917	4720	4720
q6	237	173	128	128
q7	1891	1768	1604	1604
q8	2493	2196	2168	2168
q9	7948	7506	7436	7436
q10	4735	4710	4196	4196
q11	527	385	355	355
q12	734	752	521	521
q13	3032	3446	2818	2818
q14	277	288	247	247
q15	q16	688	708	615	615
q17	1313	1292	1281	1281
q18	7265	6764	6893	6764
q19	1111	1096	1101	1096
q20	2197	2225	1935	1935
q21	5298	4673	4471	4471
q22	517	466	406	406
Total cold run time: 57249 ms
Total hot run time: 51679 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 169113 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 09765bf4c122f438a1ea22939475f612989600e7, data reload: false

query5	4313	639	484	484
query6	462	199	185	185
query7	4826	547	299	299
query8	381	236	216	216
query9	8763	3999	3972	3972
query10	470	304	257	257
query11	5955	2342	2206	2206
query12	166	109	101	101
query13	1265	613	417	417
query14	6327	5371	5048	5048
query14_1	4377	4380	4397	4380
query15	205	201	178	178
query16	1009	452	409	409
query17	1122	697	583	583
query18	2445	472	340	340
query19	206	182	141	141
query20	115	105	105	105
query21	214	141	124	124
query22	13894	13609	13349	13349
query23	17321	16509	16132	16132
query23_1	16326	16244	16245	16244
query24	7783	1815	1326	1326
query24_1	1365	1313	1312	1312
query25	538	463	379	379
query26	1290	306	157	157
query27	2724	584	336	336
query28	4489	2024	2029	2024
query29	1072	620	494	494
query30	308	235	198	198
query31	1133	1086	962	962
query32	108	64	58	58
query33	516	319	256	256
query34	1182	1118	629	629
query35	759	792	670	670
query36	1381	1385	1238	1238
query37	152	105	96	96
query38	3194	3146	3050	3050
query39	923	917	891	891
query39_1	903	891	889	889
query40	221	127	102	102
query41	86	76	64	64
query42	98	95	95	95
query43	315	328	274	274
query44	
query45	193	188	186	186
query46	1072	1198	763	763
query47	2297	2374	2256	2256
query48	370	421	283	283
query49	630	473	358	358
query50	1034	340	271	271
query51	4387	4299	4241	4241
query52	87	90	80	80
query53	247	270	199	199
query54	291	240	228	228
query55	82	81	73	73
query56	248	239	234	234
query57	1436	1398	1325	1325
query58	256	226	222	222
query59	1589	1674	1445	1445
query60	309	265	251	251
query61	189	188	181	181
query62	702	676	596	596
query63	239	191	192	191
query64	2622	855	717	717
query65	
query66	1825	479	369	369
query67	29759	29629	29522	29522
query68	
query69	442	322	272	272
query70	987	972	969	969
query71	300	226	217	217
query72	3248	2666	2350	2350
query73	859	743	415	415
query74	5146	4979	4758	4758
query75	2663	2601	2246	2246
query76	2323	1140	761	761
query77	350	391	279	279
query78	12389	12403	11825	11825
query79	1438	1072	773	773
query80	954	481	402	402
query81	496	282	241	241
query82	719	160	124	124
query83	350	279	251	251
query84	251	139	112	112
query85	915	524	430	430
query86	414	294	287	287
query87	3369	3358	3196	3196
query88	3653	2749	2729	2729
query89	444	381	331	331
query90	1847	176	177	176
query91	175	167	137	137
query92	68	64	56	56
query93	1433	1524	959	959
query94	627	370	318	318
query95	676	469	350	350
query96	1051	772	369	369
query97	2717	2709	2574	2574
query98	213	207	209	207
query99	1193	1179	1030	1030
Total cold run time: 252678 ms
Total hot run time: 169113 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 38.38% (142/370) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.82% (27460/38235)
Line Coverage 55.41% (294170/530903)
Region Coverage 52.05% (244895/470485)
Branch Coverage 53.29% (106039/198967)

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one blocking process-lifecycle issue in the new bounded shutdown path. I did not repeat the existing review threads; the current head appears to address the previously raised pool-creation/shutdown races and the stopped recheck after repair wait.

Critical checkpoint conclusions:

  • Goal/test coverage: the change targets bounded Python UDF pool initialization and partial availability, with focused BE unit tests for timeout, partial success, repair, and shutdown paths.
  • Scope/clarity: the implementation is mostly focused on Python UDF process lifecycle, but the unreaped-detach fallback below leaves a correctness/resource leak gap in the failure path.
  • Concurrency/lifecycle: pool mutex, manager mutex, health-check mutex, and stopped flags now cover the main pool races already discussed; detached init/repair workers are guarded before publishing. The remaining issue is child-process ownership after timed-out SIGKILL waits.
  • Configuration: no new config items. Existing max_python_process_num remains used.
  • Compatibility/protocol/storage/transactions: no protocol, storage format, transaction, or FE-BE compatibility changes found.
  • Parallel paths: init, foreground repair, health-check repair, and shutdown paths were reviewed together; the child-reaping concern applies to shutdown and failed-fork cleanup semantics.
  • Observability: timeout and kill paths log useful pids, but logging alone does not clean up late-exiting children.
  • User focus: no additional user-provided review focus was specified.


if (int exit_code = _child.exit_code(); exit_code > 128 && exit_code <= 255) {
int signal = exit_code - 128;
LOG(INFO) << "Python process was killed by signal " << signal;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Detaching here after both SIGTERM and SIGKILL waits time out leaves no owner that will ever reap this pid. A concrete path is a Python process stuck in uninterruptible I/O longer than terminate_timeout: shutdown() sends SIGKILL, wait_child_exit() returns TIMEOUT, _child.detach() drops boost's handle, and _is_shutdown is set. If the process exits later, it remains as a zombie under Doris because neither PythonServerManager nor a background reaper tracks the pid. Please keep a reaping owner, for example by enqueueing the pid to a bounded background reaper, or avoid detaching/marking shutdown until the child has actually been reaped.

@linrrzqqq
Copy link
Copy Markdown
Collaborator Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 28610 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c922b15e2d27bca1ad3a0d708bb50cb77d62120f, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17720	3935	3946	3935
q2	q3	10769	1357	813	813
q4	4678	480	352	352
q5	7581	865	573	573
q6	177	168	133	133
q7	795	844	635	635
q8	9786	1670	1626	1626
q9	7089	4535	4464	4464
q10	6805	1840	1516	1516
q11	439	271	246	246
q12	636	430	287	287
q13	18178	3334	2756	2756
q14	269	253	239	239
q15	q16	814	776	707	707
q17	975	1003	888	888
q18	6681	5646	5487	5487
q19	1379	1291	1053	1053
q20	521	413	255	255
q21	5948	2656	2335	2335
q22	440	356	310	310
Total cold run time: 101680 ms
Total hot run time: 28610 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4344	4219	4218	4218
q2	q3	4525	4933	4379	4379
q4	2081	2198	1379	1379
q5	4408	4315	4297	4297
q6	221	172	127	127
q7	1752	1640	1984	1640
q8	2591	2175	2148	2148
q9	7937	7946	7933	7933
q10	4820	4749	4264	4264
q11	615	433	398	398
q12	754	757	551	551
q13	3452	3605	2976	2976
q14	323	330	303	303
q15	q16	727	761	672	672
q17	1385	1357	1339	1339
q18	7935	7513	7121	7121
q19	1112	1076	1130	1076
q20	2209	2228	1950	1950
q21	5259	4518	4420	4420
q22	508	463	398	398
Total cold run time: 56958 ms
Total hot run time: 51589 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 169868 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c922b15e2d27bca1ad3a0d708bb50cb77d62120f, data reload: false

query5	4326	625	481	481
query6	447	198	184	184
query7	4830	574	294	294
query8	370	228	215	215
query9	8799	4105	4092	4092
query10	434	324	256	256
query11	5859	2359	2205	2205
query12	152	103	100	100
query13	1256	625	430	430
query14	6399	5434	5069	5069
query14_1	4394	4410	4375	4375
query15	212	198	178	178
query16	1042	465	459	459
query17	1161	724	596	596
query18	2698	508	365	365
query19	210	190	147	147
query20	118	110	107	107
query21	220	144	117	117
query22	13686	13527	13390	13390
query23	17341	16411	16087	16087
query23_1	16330	16188	16317	16188
query24	7682	1782	1320	1320
query24_1	1313	1320	1335	1320
query25	578	480	416	416
query26	1328	306	165	165
query27	2650	523	354	354
query28	4412	2075	2038	2038
query29	1104	639	511	511
query30	324	235	203	203
query31	1139	1081	958	958
query32	112	65	62	62
query33	530	334	284	284
query34	1156	1157	672	672
query35	769	793	721	721
query36	1372	1407	1240	1240
query37	150	112	91	91
query38	3278	3170	3125	3125
query39	952	957	940	940
query39_1	914	917	912	912
query40	220	127	102	102
query41	69	65	66	65
query42	106	96	94	94
query43	325	328	285	285
query44	
query45	205	186	180	180
query46	1131	1188	773	773
query47	2444	2441	2354	2354
query48	412	429	303	303
query49	631	485	379	379
query50	1060	359	264	264
query51	4303	4291	4234	4234
query52	85	88	78	78
query53	249	273	195	195
query54	270	216	201	201
query55	80	78	71	71
query56	246	228	217	217
query57	1426	1418	1337	1337
query58	245	217	212	212
query59	1584	1686	1447	1447
query60	284	240	234	234
query61	162	161	157	157
query62	696	663	585	585
query63	232	185	182	182
query64	2517	777	636	636
query65	
query66	1787	468	383	383
query67	29742	29654	29531	29531
query68	
query69	426	326	260	260
query70	934	948	950	948
query71	296	224	213	213
query72	3000	2701	2463	2463
query73	829	808	445	445
query74	5150	4950	4780	4780
query75	2644	2602	2222	2222
query76	2341	1159	777	777
query77	357	377	293	293
query78	12551	12615	11887	11887
query79	2677	1052	733	733
query80	1671	464	404	404
query81	536	282	247	247
query82	611	156	119	119
query83	323	282	249	249
query84	266	144	113	113
query85	934	523	437	437
query86	452	310	263	263
query87	3527	3354	3200	3200
query88	3706	2770	2769	2769
query89	442	378	329	329
query90	1867	187	189	187
query91	187	171	139	139
query92	67	64	59	59
query93	1662	1502	960	960
query94	720	356	315	315
query95	676	469	347	347
query96	1112	850	359	359
query97	2708	2703	2556	2556
query98	213	209	203	203
query99	1188	1167	1031	1031
Total cold run time: 255029 ms
Total hot run time: 169868 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 38.38% (142/370) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.84% (27487/38263)
Line Coverage 55.41% (294520/531486)
Region Coverage 52.11% (245431/470997)
Branch Coverage 53.31% (106242/199308)

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 38.38% (142/370) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.83% (27486/38263)
Line Coverage 55.41% (294512/531486)
Region Coverage 52.11% (245423/470997)
Branch Coverage 53.31% (106246/199308)

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 38.38% (142/370) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.83% (27485/38263)
Line Coverage 55.41% (294503/531486)
Region Coverage 52.14% (245563/470997)
Branch Coverage 53.31% (106256/199308)

Comment thread be/src/udf/python/python_server.cpp
Comment thread be/src/udf/python/python_udf_runtime.cpp
Comment thread be/src/udf/python/python_server.cpp
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants