feat: Additional Canonical Extension Types#21291
feat: Additional Canonical Extension Types#21291tobixdev wants to merge 11 commits intoapache:mainfrom
Conversation
datafusion/common/src/types/canonical_extensions/timestamp_with_offset.rs
Outdated
Show resolved
Hide resolved
paleolimbot
left a comment
There was a problem hiding this comment.
This looks great!
I may be missing something about the arrow-rs ExtensionType, but I'm not sure we need to impl it for these (which can be constrained to datafusion-specific behaviour).
I think datafusion-common is a great place for these now...there might be type-specific crates in the future that include the dependencies required to actually do things (e.g., the datafusion JSON contribution being discussed and perhaps a future variant), and when that's the case the type definition could maybe live there too.
| impl ExtensionType for DFBool8 { | ||
| const NAME: &'static str = Bool8::NAME; | ||
| type Metadata = <Bool8 as ExtensionType>::Metadata; |
There was a problem hiding this comment.
I'm not sure this is needed (only the impl DFExtensionType bit). Does the fact that these objects implement ExtensionType allow something to be done with them that isn't otherwise possible?
There was a problem hiding this comment.
Yep, this was also a thorn in my eye. I think the only thing is the default registration that requires ExtensionType + DFExtensionType.
I've removed this requirement from the default registration. I think it's better now, especially because now there is only a single ExtensionType implementation that can be used for reading/writing metadata. The DFBool8 etc. types can then be ignored by most users, except those that want to customize the registered extension types.
There was a problem hiding this comment.
After this change, I think the example is a bit weird as it implements both traits in the same struct.
For example, ExtensionType::supports_data_type now returns true for all supported data types (as it should for the arrow-rs API), but it may seem unexpected as it just stores a single sotrage type. So some users could assume that it should only return true for this specific type.
Do you think it warrants splitting the example into TemperatureExtensionType and DFTemperatureExtensionType too to avoid this?
Which issue does this PR close?
DFExtensionTypefor Arrow's Canonical Extension Types #21144 .Rationale for this change
Implements
DFExtensionTypefor the other canonical extension types (except Parquet types).What changes are included in this PR?
DFExtensionTypefor:arrow.bool8(custom formatting with true/false)arrow.fixed_shape_tensor(nu custom formatting for now)arrow.json(no custom formatting for now)arrow.opaque(no custom formatting for now)arrow.timestamp_with_offset(custom formatting as timestamp)arrow.variable_shape_tensor(no custom formatting)DFUuidforUuidso that it is consistent with the other extension types. I don't know whether we truly need the wrapper structs for extension types that only have a single valid storage type based on their metadata (e.g., Bool8, Uuid) . Open for any opinions.@paleolimbot Is this the kind of wrapper structs you imagined?
Should we also add end-to-end tests for the other extension types? Currently, we only have the UUID one from the last PR.
I think for variant supports it's best to wait for general Variant support in DataFusion as this is currently done in https://github.com/datafusion-contrib/datafusion-variant .
Are these changes tested?
Custom formatters are tested. The rest is mainly boiler plate code.
Are there any user-facing changes?
Yes more extension types and renames
UuidtoDFUuid. If we plan to keep our own wrapper structs, we should merge that before releasing a version with the oldUuidso we have no breaking changes.