Add GraphNode identity cache for stable object round-trips by Andy-Jost · Pull Request #1853 · NVIDIA/cuda-python

Andy-Jost · 2026-04-02T19:56:18Z

Summary

Ensures that GraphNode objects round-tripped through GraphDef.nodes(), edges(), or pred/succ traversal return the same Python object (preserving is identity). This avoids surprising behavior and enables idiomatic Python.

Depends on #1850.

Example

After this change, the following will pass:

# For nodes a, b belonging to graph g
assert any(x is a for x in g.nodes())

a.succ = {b}    # set successors of a to {b}
b2, = a.succ    # get successors of a
assert b2 is b  # the set contains b

Changes

C++ HandleRegistry for CUgraphNode (resource_handles.cpp): Added a graph_node_registry (matching the existing kernel_registry pattern) that deduplicates GraphNodeHandles so the same CUgraphNode always maps to the same shared_ptr.
Cython WeakValueDictionary (_graph_node.pyx): Added a module-level _node_registry and a _registered() helper. All node creation paths — GN_create (for driver round-trips) and every builder function (GN_launch, GN_join, GN_alloc, etc.) — register their results.
Fix HandleRegistry::unregister_handle (resource_handles.cpp): Removed the expired() guard that prevented registry cleanup when the shared_ptr was still alive. After destroy(), the Python object still holds the handle, so expired() returned false and the stale entry remained. If the driver reused the CUgraphNode pointer value for a new node, the registry returned the invalidated handle (handle=0x0), causing CUDA_ERROR_INVALID_VALUE. This was confirmed by CI failures on CUDA 12.9.1 runners where pointer reuse was more aggressive.
Rename invalidate_graph_node_handle → invalidate_graph_node: For clarity.
Identity assertions (test_graphdef.py): test_node_type_preserved_by_nodes and test_node_type_preserved_by_pred_succ now assert is identity, not just equality.
is not in test (test_graphdef_mutation.py): test_convert_linear_to_fan_in now uses node is not reduce_node instead of node != reduce_node.

Test Coverage

test_node_type_preserved_by_nodes — parameterized over all node types, checks is identity through nodes().
test_node_type_preserved_by_pred_succ — checks is identity through pred/succ traversal.
test_convert_linear_to_fan_in — exercises is not in a real graph rewiring scenario.

Related Work

Part of CUDA graph phase N - graph updates #1330 (CUDA graph updates)
Depends on Add edge mutation and MutableSet interface for graph nodes #1850 (edge mutation)

Rename test files to reflect what they actually test: - test_basic -> test_graph_builder (stream capture tests) - test_conditional -> test_graph_builder_conditional - test_advanced -> test_graph_update (moved child_graph and stream_lifetime tests into test_graph_builder) - test_capture_alloc -> test_graph_memory_resource - test_explicit* -> test_graphdef* Made-with: Cursor

- Extend Graph.update() to accept both GraphBuilder and GraphDef sources - Surface CUgraphExecUpdateResultInfo details on update failure instead of a generic CUDA_ERROR_GRAPH_EXEC_UPDATE_FAILURE message - Release the GIL during cuGraphExecUpdate via nogil block - Add parametrized happy-path test covering both GraphBuilder and GraphDef - Add error-case tests: unfinished builder, topology mismatch, wrong type Made-with: Cursor

Replace cached tuple-based pred/succ with mutable AdjacencySet backed by direct CUDA driver calls. Add GraphNode.remove() wrapping cuGraphDestroyNode. Made-with: Cursor

…cencies Enable adding/removing edges between graph nodes via AdjacencySet (a MutableSet proxy on GraphNode.pred/succ), node removal via discard(), and property setters for bulk edge replacement. Includes comprehensive mutation and interface tests. Closes part of NVIDIA#1330 (step 2: edge mutation on GraphDef). Made-with: Cursor

Replace inline skipif version check with requires_module(np, "2.1") from the shared test helpers, consistent with other test files. Made-with: Cursor

Made-with: Cursor

Rename class and file to AdjacencySetProxy to clarify write-through semantics. Add bulk-efficient clear(), __isub__(), __ior__() overrides and remove_edges() on the Cython core. Guard GraphNode.discard() against double-destroy via membership check. Filter duplicates in update(). Add error-path tests for wrong types, cross-graph edges, and self-edges. Made-with: Cursor

copy-pr-bot · 2026-04-02T19:56:22Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Andy-Jost · 2026-04-02T19:59:48Z

/ok to test

github-actions · 2026-04-02T20:15:41Z

Doc Preview CI
🚀 View preview at https://nvidia.github.io/cuda-python/pr-preview/pr-1853/
https://nvidia.github.io/cuda-python/pr-preview/pr-1853/cuda-core/
https://nvidia.github.io/cuda-python/pr-preview/pr-1853/cuda-bindings/
https://nvidia.github.io/cuda-python/pr-preview/pr-1853/cuda-pathfinder/
Preview will be ready when the GitHub Pages deployment is complete.

…INEL Replace discard() with destroy() which calls cuGraphDestroyNode and then zeroes the CUgraphNode resource in the handle box via invalidate_graph_node_handle. This prevents stale memory access on destroyed nodes. Properties (type, pred, succ, handle) degrade gracefully to None/empty for destroyed nodes. Remove the GRAPH_NODE_SENTINEL (0x1) approach in favor of using NULL for both sentinels and destroyed nodes, which is simpler and avoids the risk of passing 0x1 to driver APIs that treat it as a valid pointer. Made-with: Cursor

Nodes retrieved via GraphDef.nodes(), edges(), or pred/succ traversal now return the same Python object that was originally created, enabling identity checks with `is`. A C++ HandleRegistry deduplicates CUgraphNode handles, and a Cython WeakValueDictionary caches the Python wrapper objects. Made-with: Cursor

Made-with: Cursor

Andy-Jost · 2026-04-02T22:23:10Z

/ok to test

…sion Sentinel (entry) nodes use NULL as their CUgraphNode, so caching them under a NULL key caused all sentinels across different graphs to share the same handle. This made nodes built from the wrong graph's entry point, causing CUDA_ERROR_INVALID_VALUE for conditional nodes and hash collisions in equality tests. Made-with: Cursor

Andy-Jost · 2026-04-02T23:34:25Z

/ok to test

When a node is destroyed, the driver may reuse its CUgraphNode pointer for a new node. Without unregistering the old entry, the registry returns a stale handle pointing to the wrong node type and graph. Made-with: Cursor

Andy-Jost · 2026-04-03T17:26:37Z

/ok to test

Andy-Jost · 2026-04-03T17:36:30Z

/ok to test

Made-with: Cursor

Andy-Jost · 2026-04-03T17:41:34Z

/ok to test

Resolve conflicts: keep _cached() wrappers and _node_cache.pop() from graph-node-identity, keep identity comparison (is not) in test. Made-with: Cursor

Andy-Jost · 2026-04-03T20:19:52Z

/ok to test

Aligns Python-side terminology with the C++ graph_node_registry. Made-with: Cursor

Andy-Jost · 2026-04-03T20:28:56Z

/ok to test

unregister_handle: remove the expired() guard that prevented erasure when the shared_ptr was still alive. This caused stale registry entries after destroy(), leading to CUDA_ERROR_INVALID_VALUE when the driver reused CUgraphNode pointer values. Rename invalidate_graph_node_handle -> invalidate_graph_node for consistency with the rest of the graph node API. Made-with: Cursor

Andy-Jost · 2026-04-03T20:45:36Z

/ok to test

Andy-Jost · 2026-04-03T20:48:07Z

cuda_core/cuda/core/_cpp/resource_handles.cpp

-        try {
-            std::lock_guard<std::mutex> lock(mutex_);
-            auto it = map_.find(key);
-            if (it != map_.end() && it->second.expired()) {
-                map_.erase(it);
-            }
-        } catch (...) {}
+        std::lock_guard<std::mutex> lock(mutex_);
+        map_.erase(key);


The expired() check was always unnecessary. For GraphNodes it is harmful. By not purging invalidated GraphNode handles, the driver reusing an old pointer could cause a false hit.

Andy-Jost · 2026-04-03T20:50:27Z

cuda_core/cuda/core/_cpp/resource_handles.hpp


 // Zero the CUgraphNode resource inside the handle, marking it invalid.
-void invalidate_graph_node_handle(const GraphNodeHandle& h) noexcept;
+void invalidate_graph_node(const GraphNodeHandle& h) noexcept;


Following on this discussion, this is a better name.

mdboom

LGTM, modulo the tests passing. My suggestions are not blocking.

mdboom · 2026-04-03T20:56:11Z

cuda_core/tests/graph/test_graphdef.py

+    ((a2, b2),) = g.edges()
+    assert a2 is a
+    assert b2 is b
+


Since a potential failure mode here might be unintentional leaking of references, maybe it's also worth testing something like:

del a, a2, b, b2

and then asserting that the registry is empty.

mdboom · 2026-04-03T20:57:24Z

cuda_core/cuda/core/_graph/_graph_def/_graph_node.pyx

 from cuda.core._graph._graph_def._adjacency_set_proxy import AdjacencySetProxy
 from cuda.core._utils.cuda_utils import driver, handle_return

+_node_registry = weakref.WeakValueDictionary()


I think you explained this live in one of the PR review meetings, but since it's subtle enough, it might be worth documenting why there is a registry on both the C++ and Python sides and how they interact (if at all).

Andy-Jost added 7 commits April 2, 2026 09:43

Add AdjacencySet proxy for pred/succ and GraphNode.remove()

7854b76

Replace cached tuple-based pred/succ with mutable AdjacencySet backed by direct CUDA driver calls. Add GraphNode.remove() wrapping cuGraphDestroyNode. Made-with: Cursor

Use requires_module mark for numpy version checks in mutation tests

aa84e26

Replace inline skipif version check with requires_module(np, "2.1") from the shared test helpers, consistent with other test files. Made-with: Cursor

Fix empty-graph return type: return set() instead of () for nodes/edges

b27dd93

Made-with: Cursor

Andy-Jost added this to the cuda.core v1.0.0 milestone Apr 2, 2026

Andy-Jost added P0 High priority - Must do! feature New feature or request cuda.core Everything related to the cuda.core module labels Apr 2, 2026

Andy-Jost self-assigned this Apr 2, 2026

Andy-Jost added 3 commits April 2, 2026 15:12

Purge node cache on destroy to prevent stale identity lookups

7a3dbb4

Made-with: Cursor

Andy-Jost force-pushed the graph-node-identity branch from 0424a46 to 7a3dbb4 Compare April 2, 2026 22:17

Unregister destroyed nodes from C++ graph_node_registry

1b7743d

When a node is destroyed, the driver may reuse its CUgraphNode pointer for a new node. Without unregistering the old entry, the registry returns a stale handle pointing to the wrong node type and graph. Made-with: Cursor

Andy-Jost force-pushed the graph-node-identity branch from 242b602 to 72e5d54 Compare April 3, 2026 17:36

Andy-Jost added 2 commits April 3, 2026 10:40

Add dedicated test for node identity preservation through round-trips

84f0b30

Made-with: Cursor

Merge branch 'main' into graph-node-identity

64d6c2d

Andy-Jost force-pushed the graph-node-identity branch from 72e5d54 to 64d6c2d Compare April 3, 2026 17:41

Merge branch 'main' into graph-node-identity

a40be9a

Resolve conflicts: keep _cached() wrappers and _node_cache.pop() from graph-node-identity, keep identity comparison (is not) in test. Made-with: Cursor

Rename _node_cache/_cached to _node_registry/_registered

729af49

Aligns Python-side terminology with the C++ graph_node_registry. Made-with: Cursor

Andy-Jost commented Apr 3, 2026

View reviewed changes

Andy-Jost marked this pull request as ready for review April 3, 2026 20:51

mdboom self-requested a review April 3, 2026 20:54

mdboom approved these changes Apr 3, 2026

View reviewed changes

Conversation

Andy-Jost commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Example

Changes

Test Coverage

Related Work

Uh oh!

copy-pr-bot bot commented Apr 2, 2026

Uh oh!

Andy-Jost commented Apr 2, 2026

Uh oh!

github-actions bot commented Apr 2, 2026

Preview will be ready when the GitHub Pages deployment is complete.

Uh oh!

Andy-Jost commented Apr 2, 2026

Uh oh!

Andy-Jost commented Apr 2, 2026

Uh oh!

Andy-Jost commented Apr 3, 2026

Uh oh!

Andy-Jost commented Apr 3, 2026

Uh oh!

Andy-Jost commented Apr 3, 2026

Uh oh!

Andy-Jost commented Apr 3, 2026

Uh oh!

Andy-Jost commented Apr 3, 2026

Uh oh!

Andy-Jost commented Apr 3, 2026

Uh oh!

Andy-Jost Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Andy-Jost Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

mdboom left a comment

Choose a reason for hiding this comment

Uh oh!

mdboom Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

mdboom Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Andy-Jost commented Apr 2, 2026 •

edited

Loading