
gh-95004: specialize access to enums and fix scaling on free-threading #148184

Merged: kumaraditya303 merged 7 commits into python:main from kumaraditya303:enums, Apr 7, 2026

Conversation

@kumaraditya303 (Contributor) commented Apr 6, 2026

@Fidget-Spinner (Member) left a comment

Thanks for doing this, I have just 3 comments.

Comment threads (outdated): Lib/test/test_opcache.py (two threads), Python/specialize.c
@Fidget-Spinner (Member) left a comment

Pretty close, just one question and one minor nit.

Comment threads: Misc/NEWS.d/next/Core_and_Builtins/2026-04-06-18-25-53.gh-issue-95004.CQeT_H.rst (outdated), Python/specialize.c
@colesbury (Contributor) left a comment

I just tried this on a 96-core AWS machine and the scaling is not great: ~2.9x faster.

  • Given that, I'm pretty ambivalent about the change.
  • I don't like to add benchmarks to Tools/ftscalingbench/ftscalingbench.py that don't scale well.

@kumaraditya303 (Contributor, Author) commented:

> I just tried this on a 96 core AWS machine and the scaling is not great: ~2.9x faster.

On my Mac I see 4.2x faster:

./python.exe Tools/ftscalingbench/ftscalingbench.py
Running benchmarks with 10 threads
object_cfunction           3.2x faster
cmodule_function           2.8x faster
object_lookup_special      4.3x faster
context_manager            4.9x faster
mult_constant              2.3x faster
generator                  2.3x faster
pymethod                   3.0x faster
pyfunction                 2.7x faster
module_function            2.7x faster
load_string_const          4.1x faster
load_tuple_const           3.6x faster
create_pyobject            4.4x faster
create_closure             4.4x faster
create_dict                3.8x faster
create_frozendict          4.0x faster
thread_local_read          3.5x faster
method_caller              3.3x faster
instantiate_dataclass      4.9x faster
instantiate_namedtuple     4.9x faster
instantiate_typing_namedtuple  4.9x faster
super_call                 4.6x faster
classmethod_call           3.8x faster
staticmethod_call          3.5x faster
deepcopy                   1.8x slower
setattr_non_interned       4.5x faster
enum_attr                  4.2x faster

> I don't like to add benchmarks to Tools/ftscalingbench/ftscalingbench.py that don't scale well

I can remove the benchmark if you prefer.
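For readers without the CPython tree handy, here is a minimal standalone sketch of the kind of workload the enum_attr benchmark exercises: many threads repeatedly reading enum class attributes. The structure and the WORK_ITERATIONS knob below are made up for illustration and are not taken from ftscalingbench.

```python
# Illustrative sketch (not the real ftscalingbench code): time repeated
# enum attribute loads on one thread vs. several threads.
import enum
import threading
import time

class Color(enum.Enum):
    RED = 1
    GREEN = 2
    BLUE = 3

WORK_ITERATIONS = 100_000  # hypothetical knob, chosen for this sketch

def work():
    # Each Color.<MEMBER> access is a LOAD_GLOBAL + LOAD_ATTR pair that the
    # specializing interpreter can optimize; on free-threaded builds this is
    # where reference-count contention on the enum class shows up.
    total = 0
    for _ in range(WORK_ITERATIONS):
        total += Color.RED.value + Color.GREEN.value + Color.BLUE.value
    return total

def run(num_threads):
    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    one = run(1)
    many = run(8)
    # On a free-threaded build with good scaling, `many` stays close to `one`.
    print(f"1 thread: {one:.3f}s, 8 threads: {many:.3f}s")
```

On a GIL build the 8-thread run is expected to take roughly 8x the single-thread time; the point of this PR is to bring the free-threaded build close to constant time instead.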

@Fidget-Spinner (Member) commented Apr 6, 2026

Could it be that the Enum itself is not using deferred refcounting? It's a LOAD_GLOBAL_MODULE which increfs and then decrefs at each LOAD_ATTR.
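The instruction pair being discussed is easy to see from pure Python. This example is mine, not from the PR: disassembling a function that reads an enum member shows a LOAD_GLOBAL for the enum class followed by a LOAD_ATTR for the member.

```python
# Illustration: the bytecode for an enum member access, showing the
# LOAD_GLOBAL + LOAD_ATTR pair discussed above.
import dis
import enum
import io

class Color(enum.Enum):
    RED = 1

def f():
    return Color.RED  # LOAD_GLOBAL Color, then LOAD_ATTR RED

buf = io.StringIO()
dis.dis(f, file=buf)   # write the disassembly into a string for inspection
listing = buf.getvalue()
print(listing)
```

On free-threaded builds, the LOAD_GLOBAL incref/decref traffic on the `Color` object is exactly the contention point Fidget-Spinner is asking about.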

@Fidget-Spinner (Member) commented:
I found the problem:

It seems that this perpetually deopts at the first _GUARD_TYPE_VERSION. That then triggers a re-specialization, which is bottlenecked on a lot of things. So the current specialization/deopt logic needs to be fixed.

Investigating now.

@Fidget-Spinner (Member) commented Apr 6, 2026

Before:

taskset -c 0,2,4,6,8,10 ./python Tools/ftscalingbench/ftscalingbench.py enum_attr -t 6 --scale 10000
Running benchmarks with 6 threads
enum_attr                  4.2x faster

After:

Running benchmarks with 6 threads
enum_attr                  6.0x faster

Seems the pre-existing specialization for LOAD_ATTR_CLASS_WITH_METACLASS_CHECK was bugged: the class's tp_version and the metaclass's version were written into the wrong cache fields. Diff to fix this:

diff --git a/Python/specialize.c b/Python/specialize.c
index 355b6eabdb7..bfa7b8148e4 100644
--- a/Python/specialize.c
+++ b/Python/specialize.c
@@ -1220,13 +1220,14 @@ specialize_class_load_attr(PyObject *owner, _Py_CODEUNIT *instr,
 #ifdef Py_GIL_DISABLED
             maybe_enable_deferred_ref_count(descr);
 #endif
-            write_u32(cache->type_version, tp_version);
             write_ptr(cache->descr, descr);
             if (metaclass_check) {
-                write_u32(cache->keys_version, meta_version);
+                write_u32(cache->keys_version, tp_version);
+                write_u32(cache->type_version, meta_version);
                 specialize(instr, LOAD_ATTR_CLASS_WITH_METACLASS_CHECK);
             }
             else {
+                write_u32(cache->type_version, tp_version);
                 specialize(instr, LOAD_ATTR_CLASS);
             }
             Py_XDECREF(descr);

It seems this is bugged in 3.14 as well: 5d3201f

@Fidget-Spinner (Member) commented:

@colesbury can you please try this again? Thank you!

@kumaraditya303 (Contributor, Author) commented Apr 7, 2026

I see much better scaling now on an AMD Ryzen Threadripper 3970X (32 cores):

taskset -c 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30 ./python Tools/ftscalingbench/ftscalingbench.py enum_attr -t 16 --scale 10000
Running benchmarks with 16 threads
enum_attr                 15.5x faster

Thanks @Fidget-Spinner for fixing this

@colesbury (Contributor) left a comment

LGTM

@kumaraditya303 kumaraditya303 merged commit e371ce1 into python:main Apr 7, 2026
54 checks passed
@kumaraditya303 kumaraditya303 deleted the enums branch April 7, 2026 16:23

Labels

performance (Performance or resource usage), topic-free-threading

Projects

None yet


3 participants