lexer: implement \x and \u numeric escape sequences#25
Conversation
|
Amazing, tysm! Let me get it reviewed this afternoon and get back to you. Do you mind fixing the rustfmt CI in the meantime? |
|
Why does not pre-commit.ci |
409168d to
555877e
Compare
Thanks! Should be fixed now, I always forget about it |
Alonely0
left a comment
There was a problem hiding this comment.
Overall, it's pretty good. However, there are a few things that need to be changed. Please, take a look at the docs. As for the UTF-8 specific bits; keep them (we can do it in a subsequent PR), but add FIXME comments about it in the relevant places so we know about this. Please, also add tests that test the behavior of both flags for POSIX conformance.
|
|
||
| let mut buf = [0u8; 4]; | ||
| let encoded = c.encode_utf8(&mut buf); | ||
| out.extend_from_slice(encoded.as_bytes()); |
There was a problem hiding this comment.
Consider switching to extend_from_slice_copy. Hopefully this method is better at telling LLVM to do it in-place.
| }) as char | ||
| } | ||
| } | ||
| 'u' => { |
There was a problem hiding this comment.
It appears this is also gated under non-POSIX.
| let c = char::from_u32(codepoint).ok_or(LexingError::Unknown)?; // or a more specific error | ||
|
|
||
| let mut buf = [0u8; 4]; | ||
| let encoded = c.encode_utf8(&mut buf); |
There was a problem hiding this comment.
Assumes the system locale is UTF-8. This is fine for now; we can do this in a subsequent PR given its breadth. However, note that there are quite a few assumptions about UTF-8 in the code.
| .is_some_and(|&x| (x as char).is_ascii_hexdigit()) | ||
| }; | ||
|
|
||
| let num_digits = (2..=5).take_while(|&i| is_hex(i)).count(); |
There was a problem hiding this comment.
Note this takes up to 4 characters and awk takes up to 8 (related to it being locale-dependant and not tied to UTF-8). Reproducible string: "\u00000032"; should output 2, as in "\u32".
| + match digit { | ||
| b'0'..=b'9' => (digit - b'0') as u32, | ||
| b'a'..=b'f' => (digit - b'a' + 10) as u32, | ||
| b'A'..=b'F' => (digit - b'A' + 10) as u32, |
There was a problem hiding this comment.
nitpick: I'm not thrilled about the unreachable but it should be OK enough for LLVM to optimize away.
There was a problem hiding this comment.
I'm not sure what change you have in mind for the unreachable!, so I left it as is for now, happy to update if you have a preference
| b'0'..=b'9' => digit - b'0', | ||
| b'a'..=b'f' => digit - b'a' + 10, | ||
| b'A'..=b'F' => digit - b'A' + 10, | ||
| _ => unreachable!(), |
There was a problem hiding this comment.
Ditto for the comments on the other match. Consider moving this to an utility freestanding function?
| 'x' if !posix_strict => todo!(), | ||
| 'u' => todo!(), | ||
| 'x' if !posix_strict => { | ||
| let is_hex = |i: usize| { |
There was a problem hiding this comment.
It would be best to move this to the top of the function to be reused by \u, or as a freestanding function. Do what you think fits better.
| } | ||
| }); | ||
|
|
||
| let c = char::from_u32(codepoint).ok_or(LexingError::Unknown)?; // or a more specific error |
There was a problem hiding this comment.
Note: this also makes UTF-8 assumptions; and \u in gawk never errors, it inserts the locale's replacement character (U+FFFD in UTF-8, fwiw) or ? for lack thereof.
There was a problem hiding this comment.
Since we're assuming UTF-8 for now, I took the liberty of going with U+FFFD on error. Let me know if you prefer something different
555877e to
9104b5e
Compare
|
That looks great, tysm! |
Implements the two
todo!()stubs inparse_escapefor numeric escape sequences:\xNNproduces the byte with that hex value. Non-POSIX, gated on !posix_strict.\uNNNNproduces the UTF-8 encoding of the codepoint.Both follow gawk's behavior of passing the letter through literally when no hex digits follow.
Also adds tests covering single/double digit hex, uppercase digits, multi-byte unicode, and the no-digits edge case for both escapes.