Skip to content

lexer: implement \x and \u numeric escape sequences#25

Merged
Alonely0 merged 1 commit into
uutils:mainfrom
gabrielhnf:add-lexer-numeric-escapes
May 23, 2026
Merged

lexer: implement \x and \u numeric escape sequences#25
Alonely0 merged 1 commit into
uutils:mainfrom
gabrielhnf:add-lexer-numeric-escapes

Conversation

@gabrielhnf
Copy link
Copy Markdown
Contributor

Implements the two todo!() stubs in parse_escape for numeric escape sequences:

  • \xNN produces the byte with that hex value. Non-POSIX, gated on !posix_strict.
  • \uNNNN produces the UTF-8 encoding of the codepoint.

Both follow gawk's behavior of passing the letter through literally when no hex digits follow.

Also adds tests covering single/double digit hex, uppercase digits, multi-byte unicode, and the no-digits edge case for both escapes.

@Alonely0
Copy link
Copy Markdown
Collaborator

Amazing, tysm! Let me get it reviewed this afternoon and get back to you. Do you mind fixing the rustfmt CI in the meantime?

@oech3
Copy link
Copy Markdown
Contributor

oech3 commented May 22, 2026

Why does not pre-commit.ci cargo fmt? I might look at coreutils's repo.

@gabrielhnf gabrielhnf force-pushed the add-lexer-numeric-escapes branch from 409168d to 555877e Compare May 22, 2026 12:05
@gabrielhnf
Copy link
Copy Markdown
Contributor Author

Amazing, tysm! Let me get it reviewed this afternoon and get back to you. Do you mind fixing the rustfmt CI in the meantime?

Thanks! Should be fixed now, I always forget about it

Copy link
Copy Markdown
Collaborator

@Alonely0 Alonely0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, it's pretty good. However, there are a few things that need to be changed. Please, take a look at the docs. As for the UTF-8 specific bits; keep them (we can do it in a subsequent PR), but add FIXME comments about it in the relevant places so we know about this. Please, also add tests that test the behavior of both flags for POSIX conformance.

Comment thread lexer/src/lib.rs Outdated

let mut buf = [0u8; 4];
let encoded = c.encode_utf8(&mut buf);
out.extend_from_slice(encoded.as_bytes());
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider switching to extend_from_slice_copy. Hopefully this method is better at telling LLVM to do it in-place.

Comment thread lexer/src/lib.rs Outdated
}) as char
}
}
'u' => {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears this is also gated under non-POSIX.

Comment thread lexer/src/lib.rs
let c = char::from_u32(codepoint).ok_or(LexingError::Unknown)?; // or a more specific error

let mut buf = [0u8; 4];
let encoded = c.encode_utf8(&mut buf);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assumes the system locale is UTF-8. This is fine for now; we can do this in a subsequent PR given its breadth. However, note that there are quite a few assumptions about UTF-8 in the code.

Comment thread lexer/src/lib.rs Outdated
.is_some_and(|&x| (x as char).is_ascii_hexdigit())
};

let num_digits = (2..=5).take_while(|&i| is_hex(i)).count();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this takes up to 4 characters and awk takes up to 8 (related to it being locale-dependant and not tied to UTF-8). Reproducible string: "\u00000032"; should output 2, as in "\u32".

Comment thread lexer/src/lib.rs Outdated
+ match digit {
b'0'..=b'9' => (digit - b'0') as u32,
b'a'..=b'f' => (digit - b'a' + 10) as u32,
b'A'..=b'F' => (digit - b'A' + 10) as u32,
Copy link
Copy Markdown
Collaborator

@Alonely0 Alonely0 May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: I'm not thrilled about the unreachable but it should be OK enough for LLVM to optimize away.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what change you have in mind for the unreachable!, so I left it as is for now, happy to update if you have a preference

Comment thread lexer/src/lib.rs Outdated
b'0'..=b'9' => digit - b'0',
b'a'..=b'f' => digit - b'a' + 10,
b'A'..=b'F' => digit - b'A' + 10,
_ => unreachable!(),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto for the comments on the other match. Consider moving this to an utility freestanding function?

Comment thread lexer/src/lib.rs Outdated
'x' if !posix_strict => todo!(),
'u' => todo!(),
'x' if !posix_strict => {
let is_hex = |i: usize| {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be best to move this to the top of the function to be reused by \u, or as a freestanding function. Do what you think fits better.

Comment thread lexer/src/lib.rs Outdated
}
});

let c = char::from_u32(codepoint).ok_or(LexingError::Unknown)?; // or a more specific error
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this also makes UTF-8 assumptions; and \u in gawk never errors, it inserts the locale's replacement character (U+FFFD in UTF-8, fwiw) or ? for lack thereof.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're assuming UTF-8 for now, I took the liberty of going with U+FFFD on error. Let me know if you prefer something different

@gabrielhnf gabrielhnf force-pushed the add-lexer-numeric-escapes branch from 555877e to 9104b5e Compare May 22, 2026 23:16
@Alonely0
Copy link
Copy Markdown
Collaborator

That looks great, tysm!

@Alonely0 Alonely0 merged commit e1cf3bf into uutils:main May 23, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants