[Enhancement](pyudf) Enhance Python process poll init and repair#64039
[Enhancement](pyudf) Enhance Python process poll init and repair#64039linrrzqqq wants to merge 2 commits into
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
There was a problem hiding this comment.
I found two lifecycle/concurrency issues that should be fixed before merge.
Critical checkpoint conclusions:
- Goal: The PR bounds Python UDF process-pool initialization and allows partial availability. The main flow and new tests cover stuck startup and repair behavior, but shutdown races are not covered.
- Scope: The implementation is focused on Python UDF process management, but the detached worker lifecycle adds shutdown coordination requirements.
- Concurrency: Blocking issue. Detached init/repair workers and health-check startup can race with shutdown, leaving unmanaged Python processes and racing on
_health_check_thread. - Lifecycle: Blocking issue. Pools created during shutdown are not marked stopped and detached workers can publish processes after the manager snapshot has already been drained.
- Configuration/compatibility: No new config or storage/protocol compatibility concern found.
- Parallel paths: Init, foreground repair, and health-check repair are mostly aligned, but shutdown is not synchronized with pool creation/start.
- Tests: Positive and timeout cases are covered, but concurrent shutdown during initialization/health-check start is missing.
- Observability/performance: Existing logs are sufficient for the reviewed paths; no additional blocking observability issue found.
- User focus: No additional user-provided review focus was specified.
03dc422 to
4f8e5ae
Compare
|
/review |
There was a problem hiding this comment.
I found one additional blocking lifecycle issue in the new repair path. The existing review threads already cover the health-check thread race and the pool-creation/shutdown race, so I did not duplicate those.
Critical checkpoint conclusions:
- Goal/test: The PR aims to bound Python UDF pool initialization and allow partial availability; the added tests cover timeout and partial-success cases, but not shutdown racing with repair wait.
- Scope/focus: The change is focused on Python UDF process lifecycle, but the new asynchronous repair/init model introduces extra lifecycle edges.
- Concurrency: Still problematic;
_get_process()can return a process aftershutdown()marks the pool stopped. Existing comments cover two other shutdown races. - Lifecycle/static init: No new static initialization issue found; process ownership/lifecycle has the issue noted inline.
- Config/compatibility/protocol: No new config or FE/BE protocol compatibility issue found.
- Parallel paths: Health-check repair and foreground repair are mostly unified by
repairing; the stopped-pool return path is the uncovered parallel edge. - Tests: Missing a concurrent shutdown-vs-repair wait regression for the inline issue.
- Observability/performance: No additional blocking observability or performance issue found beyond lifecycle correctness.
User focus: no additional user-provided focus was supplied.
4f8e5ae to
1be429f
Compare
|
/review |
There was a problem hiding this comment.
Code review completed for PR 64039.
No additional inline findings from my pass. I treated the existing inline threads as known context and did not duplicate them; the current head appears to include changes addressing the previously raised shutdown/pool-creation/health-check synchronization concerns.
Critical checkpoint conclusions:
- Goal and tests: The PR aims to make Python UDF process pool initialization/repair bounded and more resilient to stuck process startup. The code generally implements that through detached per-slot initialization, bounded socket startup/termination waits, pool repair, and unit coverage for stuck/partial initialization and repair paths.
- Scope and focus: Changes are concentrated in the Python UDF process manager/runtime and related BE unit tests.
- Concurrency: The modified code is concurrency-heavy. Current head uses per-pool mutex/CV state,
_pools_mutex,_health_check_mutex, and atomics to coordinate initialization, repair, health check, and shutdown. I did not find an additional distinct race beyond the already-known review context. - Lifecycle: Process and health-check lifecycle paths were reviewed, including shutdown and detached init/repair workers. No new lifecycle blocker found in this pass.
- Configuration: No new config item is added. Existing
max_python_process_numand Python memory limit are used consistently with the pre-existing behavior. - Compatibility: No FE/BE protocol, storage format, function symbol, or persisted metadata compatibility issue identified.
- Parallel paths: UDF, UDAF, UDTF client acquisition all share
PythonServerManager::get_client, so the changed pool behavior applies consistently. - Tests: BE unit tests were added/updated for bounded stuck startup, partial pool initialization, repair, and shutdown-related behavior. I did not run tests in this review runner.
- Observability: Existing/new logs include version and process counts/errors for initialization and repair; sufficient for this change.
- Transaction/persistence/data correctness: Not applicable; no transaction, storage visibility, or persistence path changed.
- Performance: The design avoids waiting for every process during initial pool creation and keeps slow forks outside per-pool locks. No additional obvious hot-path performance issue found.
- User focus: No additional user-provided review focus was specified.
|
run buildall |
TPC-H: Total hot run time: 29032 ms |
TPC-DS: Total hot run time: 169510 ms |
1be429f to
09765bf
Compare
|
run buildall |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
Automated review summary for PR 64039 at head 09765bf4c122f438a1ea22939475f612989600e7.
I reviewed the current diff and the existing inline review context. The previously raised shutdown/init races around pool creation after shutdown, starting the health-check thread during shutdown, and returning a repaired process after stopped is set are already-known review items and appear to be addressed in the current head, so I am not resubmitting duplicate inline comments.
Critical checkpoint conclusions:
- Goal/test proof: The PR changes Python UDF process pool initialization to return once one usable process is available, bounds startup/termination waits, and relies on repair/health-check to fill missing slots. The added BE unit tests cover stuck socket creation, partial pool availability, invalid versions, repair, and multi-version initialization.
- Scope/minimality: The changes are focused on Python UDF process lifecycle and its unit tests, though the lifecycle code is inherently concurrency-heavy.
- Concurrency: The current head serializes pool-map creation with shutdown through
_pools_mutexand_shutdown_flag, protects health-check thread ownership with_health_check_mutex, uses per-pool locks for pool state, and rechecksstoppedafter repair waits. I did not find an additional distinct race that should block this PR. - Lifecycle/static initialization: No new cross-translation-unit static initialization dependency was identified. Detached init/repair workers capture the per-version pool and use static fork/repair helpers, which avoids direct manager lifetime access after shutdown.
- Configuration: No new configuration item is added. Existing
max_python_process_numis still read dynamically for init/repair sizing. - Compatibility: No FE-BE protocol, storage format, or function symbol compatibility issue was identified.
- Parallel paths: Foreground repair and health-check repair now share the
repairingguard; current head applies the main lifecycle checks across init, foreground get, health check, repair, and shutdown paths. - Conditional checks: The new
stopped,initialized,has_available_process, andrepairingchecks have local rationale and follow the lifecycle being introduced. - Test coverage/results: Unit tests were added/updated for the changed behavior. I did not run the test suite in this review environment.
- Observability: Existing and new logs include Python version, process counts, pid, and failure status in the relevant init/repair/shutdown paths; no additional required metric was identified.
- Transaction/persistence/data writes: Not applicable.
- FE-BE variable passing: Not applicable.
- Performance: The change avoids blocking request startup on a fully populated pool and keeps fork work out of pool locks. I did not identify an obvious new hot-path performance blocker.
User focus response: No additional user-provided review focus was specified.
TPC-H: Total hot run time: 29768 ms |
TPC-DS: Total hot run time: 169113 ms |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
There was a problem hiding this comment.
I found one blocking process-lifecycle issue in the new bounded shutdown path. I did not repeat the existing review threads; the current head appears to address the previously raised pool-creation/shutdown races and the stopped recheck after repair wait.
Critical checkpoint conclusions:
- Goal/test coverage: the change targets bounded Python UDF pool initialization and partial availability, with focused BE unit tests for timeout, partial success, repair, and shutdown paths.
- Scope/clarity: the implementation is mostly focused on Python UDF process lifecycle, but the unreaped-detach fallback below leaves a correctness/resource leak gap in the failure path.
- Concurrency/lifecycle: pool mutex, manager mutex, health-check mutex, and stopped flags now cover the main pool races already discussed; detached init/repair workers are guarded before publishing. The remaining issue is child-process ownership after timed-out SIGKILL waits.
- Configuration: no new config items. Existing max_python_process_num remains used.
- Compatibility/protocol/storage/transactions: no protocol, storage format, transaction, or FE-BE compatibility changes found.
- Parallel paths: init, foreground repair, health-check repair, and shutdown paths were reviewed together; the child-reaping concern applies to shutdown and failed-fork cleanup semantics.
- Observability: timeout and kill paths log useful pids, but logging alone does not clean up late-exiting children.
- User focus: no additional user-provided review focus was specified.
|
|
||
| if (int exit_code = _child.exit_code(); exit_code > 128 && exit_code <= 255) { | ||
| int signal = exit_code - 128; | ||
| LOG(INFO) << "Python process was killed by signal " << signal; |
There was a problem hiding this comment.
Detaching here after both SIGTERM and SIGKILL waits time out leaves no owner that will ever reap this pid. A concrete path is a Python process stuck in uninterruptible I/O longer than terminate_timeout: shutdown() sends SIGKILL, wait_child_exit() returns TIMEOUT, _child.detach() drops boost's handle, and _is_shutdown is set. If the process exits later, it remains as a zombie under Doris because neither PythonServerManager nor a background reaper tracks the pid. Please keep a reaping owner, for example by enqueueing the pid to a bounded background reaper, or avoid detaching/marking shutdown until the child has actually been reaped.
|
run buildall |
TPC-H: Total hot run time: 28610 ms |
TPC-DS: Total hot run time: 169868 ms |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
Problem Summary:
Python UDF process pool initialization previously required the whole pool to finish initialization before BE could continue serving the query.
In abnormal environments, Python process startup may hang or take a very long time in paths such as:
fork/ child process creationWhen one process slot gets stuck, the whole process pool initialization can be blocked. As a result, FE may hit the send fragments RPC timeout before BE returns a meaningful Python UDF error:
RpcException, msg: timeout when waiting for send fragments rpc, query timeout:900, left timeout for this operation:30.be.log:
Solution
Change Python process pool initialization from "wait until all processes are created" to "return once at least one usable process is available".
The pool no longer treats full-size initialization as a prerequisite for serving queries. Once one Python process is alive, the current query can proceed. Missing or failed process slots are repaired asynchronously by the existing health check / repair path.
SERVICE_UNAVAILABLEbefore FE send fragments RPC timeout.wait.