Add GraphNode identity cache for stable object round-trips#1853
Add GraphNode identity cache for stable object round-trips#1853Andy-Jost wants to merge 17 commits intoNVIDIA:mainfrom
Conversation
Rename test files to reflect what they actually test: - test_basic -> test_graph_builder (stream capture tests) - test_conditional -> test_graph_builder_conditional - test_advanced -> test_graph_update (moved child_graph and stream_lifetime tests into test_graph_builder) - test_capture_alloc -> test_graph_memory_resource - test_explicit* -> test_graphdef* Made-with: Cursor
- Extend Graph.update() to accept both GraphBuilder and GraphDef sources - Surface CUgraphExecUpdateResultInfo details on update failure instead of a generic CUDA_ERROR_GRAPH_EXEC_UPDATE_FAILURE message - Release the GIL during cuGraphExecUpdate via nogil block - Add parametrized happy-path test covering both GraphBuilder and GraphDef - Add error-case tests: unfinished builder, topology mismatch, wrong type Made-with: Cursor
Replace cached tuple-based pred/succ with mutable AdjacencySet backed by direct CUDA driver calls. Add GraphNode.remove() wrapping cuGraphDestroyNode. Made-with: Cursor
…cencies Enable adding/removing edges between graph nodes via AdjacencySet (a MutableSet proxy on GraphNode.pred/succ), node removal via discard(), and property setters for bulk edge replacement. Includes comprehensive mutation and interface tests. Closes part of NVIDIA#1330 (step 2: edge mutation on GraphDef). Made-with: Cursor
Replace inline skipif version check with requires_module(np, "2.1") from the shared test helpers, consistent with other test files. Made-with: Cursor
Rename class and file to AdjacencySetProxy to clarify write-through semantics. Add bulk-efficient clear(), __isub__(), __ior__() overrides and remove_edges() on the Cython core. Guard GraphNode.discard() against double-destroy via membership check. Filter duplicates in update(). Add error-path tests for wrong types, cross-graph edges, and self-edges. Made-with: Cursor
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
/ok to test |
|
…INEL Replace discard() with destroy() which calls cuGraphDestroyNode and then zeroes the CUgraphNode resource in the handle box via invalidate_graph_node_handle. This prevents stale memory access on destroyed nodes. Properties (type, pred, succ, handle) degrade gracefully to None/empty for destroyed nodes. Remove the GRAPH_NODE_SENTINEL (0x1) approach in favor of using NULL for both sentinels and destroyed nodes, which is simpler and avoids the risk of passing 0x1 to driver APIs that treat it as a valid pointer. Made-with: Cursor
Nodes retrieved via GraphDef.nodes(), edges(), or pred/succ traversal now return the same Python object that was originally created, enabling identity checks with `is`. A C++ HandleRegistry deduplicates CUgraphNode handles, and a Cython WeakValueDictionary caches the Python wrapper objects. Made-with: Cursor
Made-with: Cursor
0424a46 to
7a3dbb4
Compare
|
/ok to test |
…sion Sentinel (entry) nodes use NULL as their CUgraphNode, so caching them under a NULL key caused all sentinels across different graphs to share the same handle. This made nodes built from the wrong graph's entry point, causing CUDA_ERROR_INVALID_VALUE for conditional nodes and hash collisions in equality tests. Made-with: Cursor
|
/ok to test |
When a node is destroyed, the driver may reuse its CUgraphNode pointer for a new node. Without unregistering the old entry, the registry returns a stale handle pointing to the wrong node type and graph. Made-with: Cursor
|
/ok to test |
242b602 to
72e5d54
Compare
|
/ok to test |
72e5d54 to
64d6c2d
Compare
|
/ok to test |
Resolve conflicts: keep _cached() wrappers and _node_cache.pop() from graph-node-identity, keep identity comparison (is not) in test. Made-with: Cursor
|
/ok to test |
Aligns Python-side terminology with the C++ graph_node_registry. Made-with: Cursor
|
/ok to test |
unregister_handle: remove the expired() guard that prevented erasure when the shared_ptr was still alive. This caused stale registry entries after destroy(), leading to CUDA_ERROR_INVALID_VALUE when the driver reused CUgraphNode pointer values. Rename invalidate_graph_node_handle -> invalidate_graph_node for consistency with the rest of the graph node API. Made-with: Cursor
|
/ok to test |
| try { | ||
| std::lock_guard<std::mutex> lock(mutex_); | ||
| auto it = map_.find(key); | ||
| if (it != map_.end() && it->second.expired()) { | ||
| map_.erase(it); | ||
| } | ||
| } catch (...) {} | ||
| std::lock_guard<std::mutex> lock(mutex_); | ||
| map_.erase(key); |
There was a problem hiding this comment.
The expired() check was always unnecessary. For GraphNodes it is harmful. By not purging invalidated GraphNode handles, the driver reusing an old pointer could cause a false hit.
|
|
||
| // Zero the CUgraphNode resource inside the handle, marking it invalid. | ||
| void invalidate_graph_node_handle(const GraphNodeHandle& h) noexcept; | ||
| void invalidate_graph_node(const GraphNodeHandle& h) noexcept; |
There was a problem hiding this comment.
Following on this discussion, this is a better name.
mdboom
left a comment
There was a problem hiding this comment.
LGTM, modulo the tests passing. My suggestions are not blocking.
| ((a2, b2),) = g.edges() | ||
| assert a2 is a | ||
| assert b2 is b | ||
|
|
There was a problem hiding this comment.
Since a potential failure mode here might be unintentional leaking of references, maybe it's also worth testing something like:
del a, a2, b, b2
and then asserting that the registry is empty.
| from cuda.core._graph._graph_def._adjacency_set_proxy import AdjacencySetProxy | ||
| from cuda.core._utils.cuda_utils import driver, handle_return | ||
|
|
||
| _node_registry = weakref.WeakValueDictionary() |
There was a problem hiding this comment.
I think you explained this live in one of the PR review meetings, but since it's subtle enough, it might be worth documenting why there is a registry on both the C++ and Python sides and how they interact (if at all).
Summary
Ensures that
GraphNodeobjects round-tripped throughGraphDef.nodes(),edges(), orpred/succtraversal return the same Python object (preservingisidentity). This avoids surprising behavior and enables idiomatic Python.Depends on #1850.
Example
After this change, the following will pass:
Changes
resource_handles.cpp): Added agraph_node_registry(matching the existingkernel_registrypattern) that deduplicatesGraphNodeHandles so the sameCUgraphNodealways maps to the sameshared_ptr._graph_node.pyx): Added a module-level_node_registryand a_registered()helper. All node creation paths —GN_create(for driver round-trips) and every builder function (GN_launch,GN_join,GN_alloc, etc.) — register their results.HandleRegistry::unregister_handle(resource_handles.cpp): Removed theexpired()guard that prevented registry cleanup when theshared_ptrwas still alive. Afterdestroy(), the Python object still holds the handle, soexpired()returnedfalseand the stale entry remained. If the driver reused theCUgraphNodepointer value for a new node, the registry returned the invalidated handle (handle=0x0), causingCUDA_ERROR_INVALID_VALUE. This was confirmed by CI failures on CUDA 12.9.1 runners where pointer reuse was more aggressive.invalidate_graph_node_handle→invalidate_graph_node: For clarity.test_graphdef.py):test_node_type_preserved_by_nodesandtest_node_type_preserved_by_pred_succnow assertisidentity, not just equality.is notin test (test_graphdef_mutation.py):test_convert_linear_to_fan_innow usesnode is not reduce_nodeinstead ofnode != reduce_node.Test Coverage
test_node_type_preserved_by_nodes— parameterized over all node types, checksisidentity throughnodes().test_node_type_preserved_by_pred_succ— checksisidentity throughpred/succtraversal.test_convert_linear_to_fan_in— exercisesis notin a real graph rewiring scenario.Related Work