
gh-146313: Fix multiprocessing ResourceTracker deadlock after os.fork() #146316

Merged
gpshead merged 6 commits into python:main from gpshead:gh-146313-single
Apr 12, 2026


@gpshead
Member

@gpshead gpshead commented Mar 23, 2026


Problem

ResourceTracker.__del__ (added in gh-88887) calls os.waitpid(pid, 0),
which blocks indefinitely if a process created via os.fork() still
holds the tracker pipe's write end. The tracker never sees EOF, never
exits, and the parent hangs at interpreter shutdown.
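For context, the hang comes from ordinary POSIX pipe semantics: a reader only sees EOF once every copy of the write end has been closed, including copies inherited across os.fork(). A minimal standalone sketch (POSIX only; `eof_seen_within` is a hypothetical helper, not part of multiprocessing) that shows the effect without involving the tracker:

```python
import os
import select
import time


def eof_seen_within(timeout, child_keeps_fd):
    """Return True if the parent sees EOF on a pipe within `timeout` seconds.

    The parent closes its own write end right away, so whether EOF arrives
    depends only on whether the forked child still holds a copy of it.
    """
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:  # child
        if not child_keeps_fd:
            os.close(w)  # what an at-fork cleanup would do
        time.sleep(timeout * 3)  # outlive the parent's wait below
        os._exit(0)
    os.close(w)  # parent drops its copy of the write end
    ready, _, _ = select.select([r], [], [], timeout)
    # A readable pipe whose read() returns b"" means EOF.
    eof = bool(ready) and os.read(r, 1) == b""
    os.close(r)
    os.waitpid(pid, 0)  # reap the child so the sketch leaves no zombie
    return eof
```

With `child_keeps_fd=False` the parent sees EOF almost immediately; with `child_keeps_fd=True` the select() times out, which is the position the tracker is in (except that it has no timeout).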

Root cause

Three requirements conflict:

- gh-88887 wants the parent to reap the tracker to prevent zombies
- gh-80849 wants mp.Process(fork) children to reuse the parent's
  tracker via the inherited pipe fd
- gh-146313 shows the parent can't block in waitpid() if a child
  holds the fd -- the tracker won't see EOF until all copies close

Fix

Two layers:

Timeout safety-net. _stop_locked() gains a wait_timeout parameter.
When called from __del__, it polls with WNOHANG using exponential
backoff for up to 1 second instead of blocking indefinitely.
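The polling loop described above can be sketched with plain os.waitpid(pid, os.WNOHANG). This illustrates the technique and is not the code in _stop_locked(); the function name `waitpid_with_timeout`, the 1 ms starting delay and the 100 ms cap are assumptions chosen for the example.

```python
import os
import time


def waitpid_with_timeout(pid, timeout=1.0, initial_delay=0.001, max_delay=0.1):
    """Poll for `pid` to exit without blocking indefinitely.

    Returns the raw wait status once the child is reaped, or None if it is
    still running when `timeout` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while True:
        # With WNOHANG, waitpid returns (0, 0) while the child is still running.
        reaped_pid, status = os.waitpid(pid, os.WNOHANG)
        if reaped_pid != 0:
            return status
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None
        # Exponential backoff, never sleeping past the deadline.
        time.sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay)
```

If it returns None the child has not been reaped yet; what to do next (leave it, signal it, retry later) is left to the caller in this sketch.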

At-fork handler. An os.register_at_fork(after_in_child=...) handler
closes the inherited pipe fd in the child unless a preserve flag is
set. popen_fork.Popen._launch() sets the flag before its fork so
mp.Process(fork) children keep the fd and reuse the parent's tracker
(preserving gh-80849). Raw os.fork() children close the fd, letting
the parent reap promptly.
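A toy version of the at-fork layer, assuming a hypothetical `ToyTracker` object that owns a pipe write end and a hypothetical `preserve_fd_in_child` flag (the real names in multiprocessing.resource_tracker may differ). Raw os.fork() children drop the fd in the after_in_child hook, while `fork_preserving_fd()` plays the role of popen_fork.Popen._launch() by setting the flag around its own fork.

```python
import os


class ToyTracker:
    """Hypothetical stand-in for a tracker client holding a pipe write end."""

    def __init__(self):
        self.read_fd, self.write_fd = os.pipe()
        self.preserve_fd_in_child = False
        # Runs in the child after every os.fork() in this process.
        os.register_at_fork(after_in_child=self._after_fork_in_child)

    def _after_fork_in_child(self):
        if self.preserve_fd_in_child:
            return  # launcher asked to keep sharing the parent's tracker
        if self.write_fd is not None:
            os.close(self.write_fd)  # don't keep the reader from seeing EOF
            self.write_fd = None

    def fork_preserving_fd(self):
        """Fork a child that keeps the inherited write end."""
        self.preserve_fd_in_child = True
        try:
            return os.fork()
        finally:
            # Reset in both parent and child; the hook has already run.
            self.preserve_fd_in_child = False
```

A single instance-level flag like this is not thread-safe: a raw os.fork() in another thread while the flag is set would also keep the fd. The sketch ignores that. Note too that os.register_at_fork() hooks cannot be unregistered, so the instance stays referenced for the life of the process.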

Result

  Scenario                                       Before     After
  Raw os.fork(), parent exits while child alive  deadlock   ~30ms reap
  mp.Process(fork), parent joins then exits      ~30ms reap ~30ms reap
  mp.Process(fork), parent exits abnormally      deadlock   1s bounded wait
  No fork (gh-88887 scenario)                    ~30ms reap ~30ms reap

The at-fork handler makes the timeout unreachable in all well-behaved
paths. The timeout remains as a safety net for abnormal shutdowns.
Contributor

@itamaro itamaro left a comment


mostly nits, overall looks good, thanks!

@bedevere-app

bedevere-app bot commented Mar 23, 2026

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@gpshead gpshead marked this pull request as ready for review April 12, 2026 04:35
@gpshead gpshead added needs backport to 3.13 bugs and security fixes needs backport to 3.14 bugs and security fixes labels Apr 12, 2026
@gpshead gpshead enabled auto-merge (squash) April 12, 2026 04:49
@gpshead gpshead disabled auto-merge April 12, 2026 04:49
@gpshead gpshead added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Apr 12, 2026
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @gpshead for commit b7d3b38 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F146316%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Apr 12, 2026
@gpshead
Member Author

gpshead commented Apr 12, 2026

While waiting for buildbots before hitting merge... stashing my proposed commit message:

commit message body

ResourceTracker.__del__ (added in gh-88887 circa Python 3.12) calls
os.waitpid(pid, 0) which blocks indefinitely if a process created via os.fork()
still holds the tracker pipe's write end. The tracker never sees EOF, never
exits, and the parent hangs at interpreter shutdown.

Fix with two layers:

  • At-fork handler. An os.register_at_fork(after_in_child=...)
    handler closes the inherited pipe fd in the child unless a preserve
    flag is set. popen_fork.Popen._launch() sets the flag before its
    fork so mp.Process(fork) children keep the fd and reuse the parent's
    tracker (preserving gh-80849). Raw os.fork() children close the fd,
    letting the parent reap promptly.

  • Timeout safety-net. _stop_locked() gains a wait_timeout
    parameter. When called from __del__, it polls with WNOHANG using
    exponential backoff for up to 1 second instead of blocking
    indefinitely. The at-fork handler makes this unreachable in
    well-behaved paths; it remains for abnormal shutdowns.

Co-authored-by: Itamar Oren <itamarost@gmail.com>

@gpshead gpshead merged commit 3a7df63 into python:main Apr 12, 2026
53 checks passed
@miss-islington-app

Thanks @gpshead for the PR 🌮🎉.. I'm working now to backport this PR to: 3.13, 3.14.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Apr 12, 2026
…s.fork() (pythonGH-146316)

`ResourceTracker.__del__` (added in pythongh-88887 circa Python 3.12) calls
os.waitpid(pid, 0) which blocks indefinitely if a process created via os.fork()
still holds the tracker pipe's write end. The tracker never sees EOF, never
exits, and the parent hangs at interpreter shutdown.

Fix with two layers:

- **At-fork handler.** An os.register_at_fork(after_in_child=...)
  handler closes the inherited pipe fd in the child unless a preserve
  flag is set. popen_fork.Popen._launch() sets the flag before its
  fork so mp.Process(fork) children keep the fd and reuse the parent's
  tracker (preserving pythongh-80849). Raw os.fork() children close the fd,
  letting the parent reap promptly.

- **Timeout safety-net.** _stop_locked() gains a wait_timeout
  parameter. When called from `__del__`, it polls with WNOHANG using
  exponential backoff for up to 1 second instead of blocking
  indefinitely. The at-fork handler makes this unreachable in
  well-behaved paths; it remains for abnormal shutdowns.
(cherry picked from commit 3a7df63)

Co-authored-by: Gregory P. Smith <68491+gpshead@users.noreply.github.com>
Co-authored-by: Itamar Oren <itamarost@gmail.com>
@bedevere-app

bedevere-app bot commented Apr 12, 2026

GH-148425 is a backport of this pull request to the 3.14 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.14 bugs and security fixes label Apr 12, 2026
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Apr 12, 2026
…s.fork() (pythonGH-146316)

(cherry picked from commit 3a7df63)

Co-authored-by: Gregory P. Smith <68491+gpshead@users.noreply.github.com>
Co-authored-by: Itamar Oren <itamarost@gmail.com>
@bedevere-app

bedevere-app bot commented Apr 12, 2026

GH-148426 is a backport of this pull request to the 3.13 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.13 bugs and security fixes label Apr 12, 2026
gpshead added a commit that referenced this pull request Apr 12, 2026
…os.fork() (GH-146316) (#148425)

gh-146313: Fix multiprocessing ResourceTracker deadlock after os.fork() (GH-146316)

(cherry picked from commit 3a7df63)

Co-authored-by: Gregory P. Smith <68491+gpshead@users.noreply.github.com>
Co-authored-by: Itamar Oren <itamarost@gmail.com>
gpshead added a commit that referenced this pull request Apr 12, 2026
…os.fork() (GH-146316) (#148426)

gh-146313: Fix multiprocessing ResourceTracker deadlock after os.fork() (GH-146316)

(cherry picked from commit 3a7df63)

Co-authored-by: Gregory P. Smith <68491+gpshead@users.noreply.github.com>
Co-authored-by: Itamar Oren <itamarost@gmail.com>