lexer: implement \x and \u numeric escape sequences by gabrielhnf · Pull Request #25 · uutils/awk

gabrielhnf · 2026-05-22T00:50:02Z

Implements the two todo!() stubs in parse_escape for numeric escape sequences:

\xNN produces the byte with that hex value. Non-POSIX, gated on !posix_strict.
\uNNNN produces the UTF-8 encoding of the codepoint.

Both follow gawk's behavior of passing the letter through literally when no hex digits follow.

Also adds tests covering single/double digit hex, uppercase digits, multi-byte unicode, and the no-digits edge case for both escapes.
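
As a rough illustration of the `\xNN` behavior described above (not the code in this PR), here is a minimal sketch. The helper name `parse_hex_byte_escape`, its return shape, and the two-digit limit are assumptions made for the example.

```rust
/// Hypothetical helper: decodes the hex digits that follow `\x` in `rest`.
/// Returns the byte to emit and the number of input bytes consumed.
/// Assumption: at most two hex digits are read.
fn parse_hex_byte_escape(rest: &[u8]) -> (u8, usize) {
    // Count the leading hex digits, capped at two.
    let n = rest
        .iter()
        .take(2)
        .take_while(|b| b.is_ascii_hexdigit())
        .count();

    if n == 0 {
        // No digits follow: pass the letter through literally.
        return (b'x', 0);
    }

    // Two hex digits never exceed 0xFF, so this cannot overflow a u8.
    let value = rest[..n]
        .iter()
        .fold(0u8, |acc, &b| acc * 16 + (b as char).to_digit(16).unwrap() as u8);
    (value, n)
}
```

With this sketch, `parse_hex_byte_escape(b"41")` yields `(0x41, 2)`, while `parse_hex_byte_escape(b"zz")` yields `(b'x', 0)`, leaving the following bytes for the caller to handle as ordinary text.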

Alonely0 · 2026-05-22T08:41:18Z

Amazing, tysm! Let me get it reviewed this afternoon and get back to you. Do you mind fixing the rustfmt CI in the meantime?

oech3 · 2026-05-22T09:53:42Z

Why doesn't pre-commit.ci run cargo fmt? I might look at how the coreutils repo handles it.

gabrielhnf · 2026-05-22T12:06:59Z

Thanks! Should be fixed now; I always forget about it.

Alonely0

Overall, it's pretty good. However, a few things need to be changed. Please take a look at the docs. As for the UTF-8-specific bits, keep them (we can handle that in a subsequent PR), but add FIXME comments in the relevant places so we keep track of them. Please also add tests that exercise the behavior of both flags for POSIX conformance.

Alonely0 · 2026-05-22T15:22:14Z

+
+                let mut buf = [0u8; 4];
+                let encoded = c.encode_utf8(&mut buf);
+                out.extend_from_slice(encoded.as_bytes());


Consider switching to extend_from_slice_copy. Hopefully this method is better at telling LLVM to do it in-place.

Alonely0 · 2026-05-22T15:26:03Z

+                }) as char
+            }
+        }
+        'u' => {


It appears this is also gated under non-POSIX.

Alonely0 · 2026-05-22T15:38:08Z

+                let c = char::from_u32(codepoint).ok_or(LexingError::Unknown)?; // or a more specific error
+
+                let mut buf = [0u8; 4];
+                let encoded = c.encode_utf8(&mut buf);


Assumes the system locale is UTF-8. This is fine for now; we can do this in a subsequent PR given its breadth. However, note that there are quite a few assumptions about UTF-8 in the code.

Alonely0 · 2026-05-22T15:39:03Z

+                    .is_some_and(|&x| (x as char).is_ascii_hexdigit())
+            };
+
+            let num_digits = (2..=5).take_while(|&i| is_hex(i)).count();


Note that this takes up to 4 hex digits, whereas awk takes up to 8 (because the escape is locale-dependent rather than tied to UTF-8). Reproducer: "\u00000032" should output 2, the same as "\u32".
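
To make the digit-limit point concrete, here is a small sketch with a configurable cap. The helper name `take_hex` and its signature are hypothetical and not part of the PR.

```rust
/// Hypothetical helper: reads up to `max_digits` hex digits from `rest`.
/// Returns the accumulated value and the number of digits consumed.
/// Assumption: `max_digits` is at most 8, so the value always fits in a u32.
fn take_hex(rest: &[u8], max_digits: usize) -> (u32, usize) {
    debug_assert!(max_digits <= 8);
    let mut value = 0u32;
    let mut used = 0;
    for &b in rest.iter().take(max_digits) {
        match (b as char).to_digit(16) {
            Some(d) => {
                value = value * 16 + d;
                used += 1;
            }
            // Stop at the first byte that is not a hex digit.
            None => break,
        }
    }
    (value, used)
}
```

With a cap of 8, both `take_hex(b"00000032", 8)` and `take_hex(b"32", 8)` produce the value `0x32`, which is the codepoint of the character `2`. With a cap of 4, the first call stops after `0000` and leaves `0032` behind as literal text.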

Alonely0 · 2026-05-22T15:39:53Z

+                        + match digit {
+                            b'0'..=b'9' => (digit - b'0') as u32,
+                            b'a'..=b'f' => (digit - b'a' + 10) as u32,
+                            b'A'..=b'F' => (digit - b'A' + 10) as u32,


nitpick: I'm not thrilled about the unreachable but it should be OK enough for LLVM to optimize away.

I'm not sure what change you have in mind for the unreachable!, so I left it as is for now. Happy to update if you have a preference.

Alonely0 · 2026-05-22T15:42:50Z

+                            b'0'..=b'9' => digit - b'0',
+                            b'a'..=b'f' => digit - b'a' + 10,
+                            b'A'..=b'F' => digit - b'A' + 10,
+                            _ => unreachable!(),


Ditto for the comments on the other match. Consider moving this to a freestanding utility function?

Alonely0 · 2026-05-22T15:43:33Z

-        'x' if !posix_strict => todo!(),
-        'u' => todo!(),
+        'x' if !posix_strict => {
+            let is_hex = |i: usize| {


It would be best to move this to the top of the function to be reused by \u, or as a freestanding function. Do what you think fits better.
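
One possible shape for such a shared helper is sketched below. The names `hex_value` and `is_hex_at` are hypothetical. Returning an `Option` lets one function serve as both the digit test and the conversion, so no `unreachable!()` arm is needed.

```rust
/// Hypothetical freestanding helper: numeric value of one ASCII hex digit,
/// or `None` if the byte is not a hex digit.
fn hex_value(digit: u8) -> Option<u8> {
    match digit {
        b'0'..=b'9' => Some(digit - b'0'),
        b'a'..=b'f' => Some(digit - b'a' + 10),
        b'A'..=b'F' => Some(digit - b'A' + 10),
        _ => None,
    }
}

/// Hypothetical helper: true if `input[i]` exists and is a hex digit.
/// This plays the role of the `is_hex` closure for both `\x` and `\u`.
fn is_hex_at(input: &[u8], i: usize) -> bool {
    input.get(i).copied().and_then(hex_value).is_some()
}
```

Whether this reads better than a closure at the top of the function is a judgment call; the sketch only shows that both escapes can share the same digit logic.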

Alonely0 · 2026-05-22T15:53:55Z

+                        }
+                });
+
+                let c = char::from_u32(codepoint).ok_or(LexingError::Unknown)?; // or a more specific error


Note: this also assumes UTF-8. Also, \u in gawk never errors: it inserts the locale's replacement character (U+FFFD in UTF-8, fwiw), or ? if the locale has none.

Since we're assuming UTF-8 for now, I took the liberty of going with U+FFFD on error. Let me know if you prefer something different
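
A minimal sketch of that fallback, assuming UTF-8 output and a plain `Vec<u8>` buffer (the helper name `push_codepoint` is hypothetical, and the PR's actual buffer type may differ):

```rust
/// Hypothetical helper: appends the UTF-8 encoding of `codepoint` to `out`.
/// Values that are not valid Unicode scalar values (surrogates, or anything
/// above U+10FFFF) are replaced with U+FFFD instead of raising an error.
fn push_codepoint(out: &mut Vec<u8>, codepoint: u32) {
    let c = char::from_u32(codepoint).unwrap_or(char::REPLACEMENT_CHARACTER);
    // A char encodes to at most four bytes in UTF-8.
    let mut buf = [0u8; 4];
    out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
}
```

For example, `0xE9` appends the two bytes `C3 A9`, while the surrogate `0xD800` appends `EF BF BD`, the UTF-8 encoding of U+FFFD.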

Alonely0 · 2026-05-23T14:04:20Z

That looks great, tysm!

gabrielhnf force-pushed the add-lexer-numeric-escapes branch from 409168d to 555877e Compare May 22, 2026 12:05

Alonely0 requested changes May 22, 2026

lexer: implement \x and \u numeric escape sequences

9104b5e

gabrielhnf force-pushed the add-lexer-numeric-escapes branch from 555877e to 9104b5e Compare May 22, 2026 23:16

Alonely0 approved these changes May 23, 2026

Alonely0 merged commit e1cf3bf into uutils:main May 23, 2026
13 checks passed
