feat: Support multiple external table locations#22695
Conversation
|
Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch). Details |
| } | ||
|
|
||
| fn schemas_have_same_fields(left: &SchemaRef, right: &SchemaRef) -> bool { | ||
| left.fields() == right.fields() |
There was a problem hiding this comment.
https://github.com/apache/datafusion/pull/22695/changes#diff-8ce463e0d56c67a237fce2942f0759984d56764b13b62156472132ab84c78192R226-R227 says Schema metadata may differ between files ... but here the equality check compares the fields' metadata too - https://docs.rs/arrow-schema/58.3.0/src/arrow_schema/field.rs.html#113
Is this intentional ?
| #[prost(message, optional, tag = "9")] | ||
| pub name: ::core::option::Option<TableReference>, | ||
| #[prost(string, tag = "2")] | ||
| pub location: ::prost::alloc::string::String, |
There was a problem hiding this comment.
What is the lifetime of a serialized CreateExternalTableNode ?
If there is a use case where a CreateExternalTableNode serialized with DF v.X and deserialized with v. X+1 then it would be better to keep location around until v. X+2 (assuming users don't upgrade from X to X+2).
I.e. deserialize should do:
- if
locationsis not empty then use it - otherwise if
locationis not an empty string then insert it tolocations
Serialize should: 1) write non-empty locations and location=""
| "CreateExternalTable", | ||
| )?, | ||
| create_extern_table.location.clone(), | ||
| create_extern_table |
There was a problem hiding this comment.
If backward compatibility is needed, i.e. if a CreateExternalTableNode is serialized with v. X and it has to be read with v. X+1 then you need to keep location around for at least one major version.
If such compatibility is not needed then you can just use "" here. It is overwritten by .with_locations(...) below
| { | ||
| return plan_err!( | ||
| "All locations of a CREATE EXTERNAL TABLE must have the \ | ||
| same schema, but the provided locations have differing schemas" |
There was a problem hiding this comment.
This error message does not help the user to find the problematic schema/location.
| ) -> Result<Arc<dyn TableProvider>> { | ||
| Ok(Arc::new(TestTableProvider { | ||
| url: cmd.location.to_string(), | ||
| url: cmd.locations.first().cloned().unwrap_or_default(), |
There was a problem hiding this comment.
This could silently use "" if no locations are provided. Better return an error.
| } | ||
| write!(f, ")") | ||
| } else { | ||
| write!(f, "LOCATION {}", self.location) |
There was a problem hiding this comment.
I see that this is how it worked before but shouldn't the locations be quoted ?
|
Thank you @martin-g for great feedback. I have addressed the concerns. |
Which issue does this PR close?
Part of #16303.
Rationale for this change
CREATE EXTERNAL TABLEcan reference only one location today. This adds support for listing multiple explicit locations and reading them as one table.What changes are included in this PR?
LOCATION ('a.parquet', 'b.parquet')syntax.LOCATION 'a,b.parquet'as a single path, so literal commas still work.Are these changes tested?
Yes
Are there any user-facing changes?
Yes.
CREATE EXTERNAL TABLEnow accepts a parenthesized list of locations.There is also a public API change:
CreateExternalTable.locationis replaced byCreateExternalTable.locations.