-
Notifications
You must be signed in to change notification settings - Fork 723
Parser: fix exponential parse time on speculative prefix parsing #2352
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -15,6 +15,7 @@ | |
| #[cfg(not(feature = "std"))] | ||
| use alloc::{ | ||
| boxed::Box, | ||
| collections::BTreeMap, | ||
| format, | ||
| string::{String, ToString}, | ||
| vec, | ||
|
|
@@ -24,6 +25,9 @@ use core::{ | |
| fmt::{self, Display}, | ||
| str::FromStr, | ||
| }; | ||
| #[cfg(feature = "std")] | ||
| use std::collections::BTreeMap; | ||
|
|
||
| use helpers::attached_token::AttachedToken; | ||
|
|
||
| use log::debug; | ||
|
|
@@ -359,6 +363,12 @@ pub struct Parser<'a> { | |
| options: ParserOptions, | ||
| /// Ensures the stack does not overflow by limiting recursion depth. | ||
| recursion_counter: RecursionCounter, | ||
| /// Cached errors from `parse_prefix` calls that returned `Err`. See | ||
| /// [`Parser::parse_prefix`] for the 2^N patterns this guards. | ||
| failed_prefix_positions: BTreeMap<usize, ParserError>, | ||
| /// Cached errors from the speculative reserved-word prefix arm. See | ||
| /// [`Parser::parse_prefix`] for the 2^N patterns this guards. | ||
| failed_reserved_word_prefix_positions: BTreeMap<usize, ParserError>, | ||
| } | ||
|
|
||
| impl<'a> Parser<'a> { | ||
|
|
@@ -385,6 +395,8 @@ impl<'a> Parser<'a> { | |
| dialect, | ||
| recursion_counter: RecursionCounter::new(DEFAULT_REMAINING_DEPTH), | ||
| options: ParserOptions::new().with_trailing_commas(dialect.supports_trailing_commas()), | ||
| failed_prefix_positions: BTreeMap::new(), | ||
| failed_reserved_word_prefix_positions: BTreeMap::new(), | ||
| } | ||
| } | ||
|
|
||
|
|
@@ -446,6 +458,8 @@ impl<'a> Parser<'a> { | |
| pub fn with_tokens_with_locations(mut self, tokens: Vec<TokenWithSpan>) -> Self { | ||
| self.tokens = tokens; | ||
| self.index = 0; | ||
| self.failed_prefix_positions.clear(); | ||
| self.failed_reserved_word_prefix_positions.clear(); | ||
| self | ||
| } | ||
|
|
||
|
|
@@ -1717,6 +1731,23 @@ impl<'a> Parser<'a> { | |
| return prefix; | ||
| } | ||
|
|
||
| // Memoize parse_prefix failures to break 2^N speculation when both | ||
| // prefix arms fail at every level (e.g. `IF(current_time(...x`). | ||
| // The per-arm cache in `parse_prefix_inner` complements this for | ||
| // chains where the reserved arm fails but the unreserved fallback | ||
| // succeeds (e.g. `case-case-...c`). | ||
| let start_index = self.index; | ||
| if let Some(cached) = self.failed_prefix_positions.get(&start_index) { | ||
| return Err(cached.clone()); | ||
| } | ||
| let result = self.parse_prefix_inner(); | ||
| if let Err(ref e) = result { | ||
| self.failed_prefix_positions.insert(start_index, e.clone()); | ||
| } | ||
| result | ||
| } | ||
|
|
||
| fn parse_prefix_inner(&mut self) -> Result<Expr, ParserError> { | ||
| // PostgreSQL allows any string literal to be preceded by a type name, indicating that the | ||
| // string literal represents a literal of that type. Some examples: | ||
| // | ||
|
|
@@ -1801,7 +1832,21 @@ impl<'a> Parser<'a> { | |
| // We first try to parse the word and following tokens as a special expression, and if that fails, | ||
| // we rollback and try to parse it as an identifier. | ||
| let w = w.clone(); | ||
| match self.try_parse(|parser| parser.parse_expr_prefix_by_reserved_word(&w, span)) { | ||
| // Memoize failed speculative reserved-word parses. When | ||
| // the reserved arm (CASE, CURRENT_TIME, etc.) does | ||
| // exponential work but the unreserved fallback ultimately | ||
| // succeeds, the overall `parse_prefix` returns `Ok` and the | ||
| // outer cache never fires. Chains like `case-case-...c` | ||
| // need this per-arm cache to break the doubling. | ||
|
Comment on lines
+1835
to
+1840
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wasn't able to follow this comment, what is 'outer cache' and 'break doubling' referring to?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Outer cache is the cache in the parent call before the recursive call, and doubling is that at each layer of depth it went in before, the operations doubled (hence the previous 2^layers complexity from before) |
||
| let try_parse_result = if let Some(cached) = self | ||
| .failed_reserved_word_prefix_positions | ||
| .get(&next_token_index) | ||
| { | ||
| Err(cached.clone()) | ||
| } else { | ||
| self.try_parse(|parser| parser.parse_expr_prefix_by_reserved_word(&w, span)) | ||
| }; | ||
| match try_parse_result { | ||
| // This word indicated an expression prefix and parsing was successful | ||
| Ok(Some(expr)) => Ok(expr), | ||
|
|
||
|
|
@@ -1815,6 +1860,8 @@ impl<'a> Parser<'a> { | |
| // we rollback and return the parsing error we got from trying to parse a | ||
| // special expression (to maintain backwards compatibility of parsing errors). | ||
| Err(e) => { | ||
| self.failed_reserved_word_prefix_positions | ||
| .insert(next_token_index, e.clone()); | ||
| if !self.dialect.is_reserved_for_identifier(w.keyword) { | ||
| if let Ok(Some(expr)) = self.maybe_parse(|parser| { | ||
| parser.parse_expr_prefix_by_unreserved_word(&w, span) | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wondering about memory usage, how large do we expect the caches to get in the worse case scenarios?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a ~2GB limit on the fuzzer processes and after many billions of iterations it did not manage to OOM - it hit before the ASAN thread number limit than an OOM, so at the very least I believe it is not exponential. I can try and get a scaling though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm yeah I'm mostly worried about the increased memory usage. A lot of deployments don't have e.g. 2GB to allocate to parsing a sql query. I think the main issue is that the map contains strings, maybe if its a much cheaper/copy object somehow we were looking to store, it might be feasible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got a scaling for the cache memory. The two caches are keyed by token index, so each entry is about 40 bytes (8 byte usize key plus a 32 byte ParserError). Panels plot entries, heap String bytes, and total memory vs chain length N for valid SQL, the nested if(current_time(...x chain, and the wide case-...c chain.
So it is linear and never exponential, and valid SQL costs nothing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that e.g. on the
parse_expr_prefix_by_reserved_wordpath, each unique attempt will add an error string into the map, if so then we're potentially looking at one entry in that map per word roughly. is that the the case?To be clear, its not about a particular amount of memory, main thing is that we're not increasing memory usage of the parser significantly - if the additional memory usage grows as a function of the sql string, already that is problematic, then to improve it we would like to have each entry as minimal as we can, or potentially consider other solutions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I follow, currently the runtime is exponential, so the parser is just failing on these inputs. With the fix, the time requirement becomes linear and the the memory requirements remain linear with input size even for pathological cases