geotiff: push down byte-affecting validation into array-level writers (#2138)#2147
Merged
Conversation
…#2138) Hardens the lower-level write entry points so direct callers cannot silently bypass the checks that ``to_geotiff`` runs upstream. Six byte-affecting gaps now fire at ``_write`` / ``_write_streaming``: compression-name validation against the canonical list, JPEG opt-in gate (#1845), ``max_z_error`` sign + LERC pairing, ``crs_epsg`` bool rejection, defensive copy on the NaN-to-sentinel rewrite, and ``float16`` / ``bool_`` auto-promotion. Renames ``write`` / ``write_streaming`` / ``read_to_array`` to underscore-prefixed canonical names, with the old names kept as aliases to avoid breaking the ~50 internal call sites that import them. Adds 26 push-down + byte-parity tests and updates module docstrings to spell out the private contract.
…xport (#2138) Two follow-ups uncovered by self-review of the push-down work: * ``_write_vrt_tiled`` / ``_write_single_tile`` now forward ``allow_internal_only_jpeg`` and ``allow_unparseable_crs`` to the per-tile ``_write`` call. Without this, ``to_geotiff(da, '...vrt', compression='jpeg', allow_internal_only_jpeg=True)`` would clear the upstream gate then trip the new push-down gate inside ``_write``. * ``xrspatial/geotiff/__init__.py`` no longer does ``from ._writer import write``. The import bound ``write`` as an attribute of ``xrspatial.geotiff`` even though the name was not in ``__all__`` and not part of the documented public API. Mirrors the ``read_to_array`` cleanup from #1708. Adds a namespace-no-leak regression test alongside the existing ``read_to_array`` pin.
brendancol
commented
May 19, 2026
Contributor
Author
brendancol
left a comment
There was a problem hiding this comment.
PR Review: geotiff: push down byte-affecting validation into array-level writers (#2138)
Read the PR head at issue-2138. The hybrid approach matches the issue's recommendation: byte-affecting checks pushed down into _write / _write_streaming, DataArray-state-dependent checks left in to_geotiff, and the three array-level entry points renamed to underscore-prefixed canonical names with public-named aliases kept to avoid disturbing ~50 internal call sites.
Blockers (must fix before merge)
None.
Suggestions (should fix, not blocking)
-
xrspatial/geotiff/_writers/eager.py:776,1063,1080--_write_single_tiledid not forward the newallow_internal_only_jpeg/allow_unparseable_crskwargs to its innerwrite(...)call. A direct caller ofto_geotiff(da, 'foo.vrt', compression='jpeg', allow_internal_only_jpeg=True)would clear the upstream gate then trip the new push-down gate inside_write. Fixed in commit b81d298 by threading the kwargs through_write_vrt_tiled->_write_single_tile->write. -
xrspatial/geotiff/__init__.py:112--from ._writer import writeleaks the array-level entry point ontoxrspatial.geotiff.writeeven thoughwriteis not in__all__. Mirrors theread_to_arraycleanup #1708 already did. Fixed in commit b81d298 by dropping the unused re-export and adding a namespace-no-leak regression test.
Nits (optional improvements)
xrspatial/geotiff/_writer.py:1711-1714-- the newTypeErrorbranch on non-stringcompressionis reachable only by direct callers (the publicto_geotiffwrapper rejects non-stringcompressionupstream with the same type-only check). The downstream_compression_tagcall would still raiseAttributeErroron the bare string-method access, so the explicit guard is defensible. Worth a one-line comment noting that the branch is unreachable fromto_geotiffand exists only for direct callers.xrspatial/geotiff/_writer.py:1920-1924-- the float16 / bool_ promotion guards onisinstance(data, np.ndarray)so dask arrays fall through._writeis documented as numpy-only (the dask path goes through_write_streaming), so the guard is correct, but a one-line comment explaining why the promotion is skipped for dask arrays would help the next reader.
What looks good
- The hybrid scope matches the issue exactly. Eight write-side validation gaps are inventoried; the six byte-affecting ones are pushed down, the two that need DataArray state (3D dim-order, fail-closed transform) stay in
to_geotiff. Read-side gaps are correctly identified as DataArray-attrs-dependent and left inopen_geotiff. - The
_validate_lowlevel_write_kwargshelper centralises the gates so_writeand_write_streamingcannot drift. Theentry_pointargument makes the error source unambiguous. restore_sentinel-gated NaN-to-sentinel rewrite with defensive copy at_writer.py:1933-1941matches the eager-path semantics exactly and respects the#1988opt-out.- Byte-parity tests at
tests/test_lowlevel_write_pushdown_2138.py:243-273are the right shape: every entry in the lossless codec set rather than spot-checking one. JPEG is correctly excluded with an explicit reason in a comment. - The aliases
write = _write,write_streaming = _write_streaming,read_to_array = _read_to_arrayare object-identity preserving, whichtest_aliases_match_underscore_namespins so a future refactor cannot drift the alias out of sync. - Test count is right for the scope: 26 new push-down + byte-parity tests, three existing tests updated for the renamed entry-point name in error messages.
Checklist
- Algorithm matches issue gap inventory
- All implemented backends produce consistent results (numpy via
_write, dask via_write_streaming; byte-parity tests pass on every lossless codec) - NaN handling is correct (defensive copy preserves caller buffer; sentinel-on-disk verified)
- Edge cases are covered by tests (bool crs_epsg, negative max_z_error, float16, bool_, unknown compression)
- Dask chunk boundaries handled correctly (push-down validation fires before any tile-row math)
- No premature materialization or unnecessary copies (the new
arr.copy()only runs when there is a NaN to rewrite) - Benchmark exists or is not needed (validation-only PR; no perf-sensitive path)
- README feature matrix updated (if applicable) -- N/A, no public API change
- Docstrings present and accurate (module docstrings spell out the private contract;
_write/_write_streamingdocument the new kwargs and the DataArray-state gap)
…uards (#2138) Two comments clarifying the reasoning for guards that are intentionally narrower than they look: * The ``TypeError`` branch on non-string ``compression`` is unreachable from ``to_geotiff`` (which only forwards ``str``); it exists so direct callers get a typed error instead of an ``AttributeError`` from ``compression.lower()``. * The ``isinstance(data, np.ndarray)`` gate on the float16 / bool_ promotion intentionally skips dask arrays because ``_write_streaming`` handles dtype promotion separately on the dask path.
brendancol
commented
May 19, 2026
Contributor
Author
brendancol
left a comment
There was a problem hiding this comment.
Follow-up review pass after addressing review feedback
Second pass against the head of issue-2138 after the merge from main and the review nits.
Disposition of the original review findings
- Suggestion:
_write_single_tiledoes not forward JPEG / CRS opt-ins -- fixed in commit b81d298. The kwargs now thread throughto_geotiff->_write_vrt_tiled->_write_single_tile->_write. - Suggestion:
__init__.pyre-exportswriteeven though it is not in__all__-- fixed in commit b81d298. Dropped the unused re-export and added a namespace-no-leak test pinning the contract. - Nit: comment the
TypeErrorbranch as unreachable fromto_geotiff-- fixed in commit adbfd38. - Nit: comment the
isinstance(data, np.ndarray)gate as intentional -- fixed in commit adbfd38.
Re-review notes
- The merge from
origin/main(commits b0c6ff9 and 24b014c, bothallow_rotatedCRS handling) touched__init__.py. Confirmed no interaction with the #2138 changes -- the_writer.writere-export drop sat in the writer-imports block, the rotated-CRS work sits in the reader-side block. - Full geotiff suite: 4285 passed, 25 skipped, 1 xfailed after the merge.
mergeStateStatusis still computing on the GitHub side; CI is running.
No new findings.
CI on Ubuntu / Windows / macOS does not include the optional ``lz4`` and ``lerc`` codec packages in every matrix slot. The byte-parity sweep and the LERC lossless round-trip test now probe the codec import the way ``_compression`` itself does and skip cleanly instead of raising ``ImportError``. The push-down validation tests (reject negative ``max_z_error``, reject ``max_z_error`` with non-LERC codecs) do not need the codecs because they raise before any encode step runs.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #2138.
Summary
to_geotiffdown into_writeand_write_streamingso direct callers cannot bypass it: compression-name list, JPEG opt-in gate (write_geotiff_gpu emits JPEG TIFFs that other readers reject #1845),max_z_errorsign + LERC-only pairing,crs_epsgbool rejection,crs_wktfallback gate (geotiff: CRS resolution should fail closed on unparseable strings by default #1929), defensive copy on the NaN-to-sentinel rewrite, andfloat16/bool_auto-promotion._write,_write_streaming,_read_to_array). The non-underscored names are kept as aliases so the ~50 internal call sites that import them still work. Module docstrings on_writer.pyand_reader.pynow spell out the private contract.to_geotiff): the four checks that needdata.dims/data.attrsstate -- 3D dim-order validation, band-first reorder, fail-closed georeferenced-transform, and the read-sidemasked_nodatahandling. The issue's hybrid recommendation calls these out as DataArray-level by construction.Backend coverage
CPU writer (numpy / dask-via-
_write_streaming). The GPU writer already runs its own JPEG + CRS-fallback gates before calling into_write, so its CPU-fallback path now inherits the same set of checks without further changes.Test plan
pytest xrspatial/geotiff/tests/test_lowlevel_write_pushdown_2138.py-- 26 new tests covering each push-down gap plus byte parity between_writeandto_geotifffor every lossless codec in_VALID_COMPRESSIONS.pytest xrspatial/geotiff/tests/-- full geotiff suite: 4271 passed, 25 skipped.pytest xrspatial/tests/test_geotiff_streaming_bigtiff_threshold_1785.py xrspatial/tests/test_rasterize.py-- dependent tests outsidegeotiff/.