GH-46374: [Python][Docs] Adds type checks for source in read_table#46330
GH-46374: [Python][Docs] Adds type checks for source in read_table#46330Mottl wants to merge 7 commits into
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also: |
raulcd
left a comment
There was a problem hiding this comment.
This is tricky because if the dataset module is enabled, yes we can use a list of strings but if we've built pyarrow without dataset a list of strings won't work because instead of using ParquetDataset->source we will be using ParquetFile->source.
I don't think we want to mislead users thinking a list of strings can be used when it might not be the case. cc @AlenkaF
|
@raulcd I suppose we want to document this conditional behavior? |
That makes sense, then something in the lines: |
|
I agree that the conditional behavior is worth documenting.
This is great. I would just suggest changing "can also be used" to "can also be passed" to align more closely with the existing phrasing. One more thing to note: this will require a test that would use the dataset mark. Because of that, the PR no longer fits into the MINOR category and will need a corresponding issue to track the change. |
|
Agreed with the proposed text (as amended by Alenka). For the dataset marked test - might as well be a separate PR? |
|
Can we just |
|
I don't think we want to go in that direction—concatenation would introduce complexity that's already handled in the Dataset C++ layer (schema mismatches is a good example). Instead, we could consider adding a warning and suggest the use of the dataset module when passing files as a list (or a dict) here: arrow/python/pyarrow/parquet/core.py Line 1828 in a2e6136 |
|
Okay, close this PR as soon as you finish with the discussion. Thanks all! |
|
Have a look please |
Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
|
|
be263e0 to
a781890
Compare
|
|
Sure! Sorry for being unresponsive. My reply |
|
Thanks for the update @Mottl ! Could you also rebase or at least resolve the merge conflict in |
| raise ValueError( | ||
| "source should be a file name, a pyarrow.NativeFile or a file-like " | ||
| "object when the pyarrow.dataset module is not available" | ||
| ) |
There was a problem hiding this comment.
It appears we had merged a PR that covered this issue in part.
arrow/python/pyarrow/parquet/core.py
Lines 1940 to 1947 in 63813b6
Can you see if your check can be consolidated with the existing one?
Rationale for this change
pyarrow.parquet.read_tableactually supportssourceparameter as a list of stringsAre these changes tested?
No, an issue to test those has been opened as this code path is currently not being tested. See:
Are there any user-facing changes?
no, only docstring