JIT: Fix disasm of reg to reg SIMD narrowing/widening instructions#126371
saucecontrol wants to merge 2 commits into dotnet:main
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Pull request overview
This PR updates the xarch JIT instruction disassembly formatting to correctly display operand register widths for SIMD widening/narrowing instructions (e.g., showing ymm sources for zmm widening, and ymm targets for zmm narrowing), addressing incorrect/invalid-looking disasm output reported in #126363.
Changes:
- Adjust disasm operand-size selection for reg-to-reg SIMD widening and narrowing/conversion instructions by deriving the source/target `emitAttr` from the instruction size.
- Refactor the `IF_RRW_*` reg/reg display logic to compute `tgtAttr`/`srcAttr` and print via a single shared path (preserving special-case handling for `bt` and APX NDD shift forms).
- Add `vcvtps2ph` handling in the reg/reg/imm display path to account for differing source/target widths.
cc @dotnet/jit-contrib. No code diffs, just disasm text fixes.
```cpp
printf("%s", emitRegName(id->idReg1(), attr));
printf(", %s", emitRegName(id->idReg2(), attr));
emitDispShift(ins, (BYTE)0);
srcAttr = static_cast<emitAttr>(std::max(16U, static_cast<uint32_t>(EA_SIZE(attr)) / 2U));
```
Two nits:

- This touches a whole lot more than just what the title says it does. That is fine, but the description should likely be more accurate since it's doing a broader cleanup/simplification of this area.
- The widening ones look like something that can rather be handled via the Tuple Type information from the instruction, as `TT_HALF_MEM`, `TT_QUARTER_MEM`, etc. cover this already and avoid the need to handle a ton of additional instructions in the switch. The narrowing could likely be handled between the `Input_*Bit` and `KMask_Base*` information in turn.
I started down that path, but I couldn't come up with a rule that worked for everything, e.g. older instructions like `cvtpd2ps`, which is `TT_FULL`, `Input_64bit`, `KMask_Base2`, giving no indication that it's narrowing.
I'm not sure it's any cleaner to generalize it once you account for the exceptions, though the newer instructions do tend to be consistent, so the exception count may be low.
I think the biggest thing is that it reduces risk that we build up new cases in the future or at least limits it to the narrowing scenarios. It also streamlines the switch table and so should be better for throughput when disasm is used (although that's likely not super impactful).
Ok, I'll see what I can come up with.
Fixes #126363
Diff for the current vector byte multiply implementation:
```diff
  62F17C481002   vmovups       zmm0, zmmword ptr [rdx]
  62F17C4828C8   vmovaps       zmm1, zmm0
- 62F27D4820C9   vpmovsxbw     zmm1, zmm1
+ 62F27D4820C9   vpmovsxbw     zmm1, ymm1
  62D17C481010   vmovups       zmm2, zmmword ptr [r8]
  62F17C4828DA   vmovaps       zmm3, zmm2
- 62F27D4820DB   vpmovsxbw     zmm3, zmm3
+ 62F27D4820DB   vpmovsxbw     zmm3, ymm3
  62F16548D5C9   vpmullw       zmm1, zmm3, zmm1
- 62F27E4830C9   vpmovwb       zmm1, zmm1
+ 62F27E4830C9   vpmovwb       ymm1, zmm1
  62F37D483BC001 vextracti32x8 ymm0, zmm0, 1
- 62F27D4820C0   vpmovsxbw     zmm0, zmm0
+ 62F27D4820C0   vpmovsxbw     zmm0, ymm0
  62F37D483BD201 vextracti32x8 ymm2, zmm2, 1
- 62F27D4820D2   vpmovsxbw     zmm2, zmm2
+ 62F27D4820D2   vpmovsxbw     zmm2, ymm2
  62F16D48D5C0   vpmullw       zmm0, zmm2, zmm0
- 62F27E4830C0   vpmovwb       zmm0, zmm0
+ 62F27E4830C0   vpmovwb       ymm0, zmm0
  62F375483AC001 vinserti32x8  zmm0, zmm1, ymm0, 1
  62F17C481101   vmovups       zmmword ptr [rcx], zmm0
  488BC1         mov           rax, rcx
  C5F877         vzeroupper
  C3             ret
```