Skip to content

refactor(remote): faster, safer, simpler OCI pull path#2855

Closed
dgageot wants to merge 4 commits into
docker:mainfrom
dgageot:refactor/oci-pull-faster-and-safer
Closed

refactor(remote): faster, safer, simpler OCI pull path#2855
dgageot wants to merge 4 commits into
docker:mainfrom
dgageot:refactor/oci-pull-faster-and-safer

Conversation

@dgageot
Copy link
Copy Markdown
Member

@dgageot dgageot commented May 21, 2026

Reviewing how we pull agent YAMLs from OCI artifacts surfaced a few wins. Each commit is independently reviewable.

What changes

4242b42b4 refactor(remote): detect proxy socket errors via type, not strings

isProxySocketError decides whether the Docker Desktop proxy is unavailable so the transport can fall back to a direct connection. It used substring matching on err.Error() ("no such file or directory", "connection refused", "dial unix", …), which breaks across Go versions and locales and gives false positives whenever any error text happens to contain those phrases.

Use errors.Is against syscall.ENOENT / syscall.ECONNREFUSED and errors.As against *net.OpError (Op == "proxyconnect" or unix-socket dial). Tests now construct realistic typed error chains, including the *url.Error wrapper http.Client adds in production.

169d6dff8 refactor(remote): collapse OCI agent read path into PullAgent

The flow to load an agent YAML from an OCI reference was split across three places: cmd/root/pull.go (Pull + open store + GetArtifact), ociSource.Read (Pull + load + corruption-aware retry + cache fallback), and remote.Pull itself (its own cache check). Each duplicated parts of the others.

Introduce remote.PullAgent(ctx, ref, force) ([]byte, digest, error) as the single entry point: it handles the digest-ref shortcut, the network-error fallback to cache, and the one-shot recovery on store corruption. ociSource.Read shrinks to a single call.

Also: skip the upfront crane.Digest HEAD when the cache is empty. With no local copy to compare against we'd have to crane.Pull anyway, so the HEAD is pure latency. Cold pulls drop from two manifest round-trips to one.

0ffbe5895 fix(remote): keep share pull strict on registry errors

Routing the share pull command through PullAgent meant a registry 404 or auth failure would silently overwrite the on-disk YAML with a stale cached copy. PullAgent's cache-fallback policy is the right behaviour for ociSource.Read (running an agent shouldn't break when the registry blips) but the wrong behaviour for an explicit user-driven refresh. share pull now calls remote.Pull directly so registry failures surface as errors.

385816cb5 fix(remote): treat unannotated cache as cache-miss, not as terminal error

Two related defense-in-depth fixes:

  1. Pull bailed out with a permanent error if local metadata lacked the cagent annotation, before contacting the registry. A user with a stale or bad cache entry on a tag was stuck until they manually wiped the store, even after the remote tag was updated to a valid agent artifact. Treat an unannotated cache entry as a cache miss so Pull falls through to crane.Pull, which re-validates the freshly fetched manifest and overwrites the bad entry.
  2. PullAgent's cache fast-paths (digest-ref shortcut and the network-error fallback) returned cached bytes without re-checking the annotation, while Pull validates it on the registry path. readCached now refuses entries missing the annotation, so the cache cannot be used to bypass that validation.

Tests:

  • TestReadCached — missing / unannotated / annotated cases.
  • TestPull_RecoversFromUnannotatedCache — uses an in-memory OCI registry to verify a poisoned cache entry no longer blocks a valid re-pull.
  • TestPullAgent_DigestRefDoesNotServeUnannotatedCache — verifies the digest-ref fast path no longer trusts an unannotated cache entry.

Validation

  • task lint — clean
  • task test — clean (130 packages)

dgageot added 4 commits May 21, 2026 13:51
isProxySocketError used substring matching on err.Error() to decide
whether to fall back from the Docker Desktop proxy to a direct
connection. That breaks across Go versions and locales, and gives
false positives whenever any error text happens to contain "dial unix"
or "connection refused".

Use errors.Is against syscall.ENOENT/ECONNREFUSED and errors.As against
*net.OpError (Op == "proxyconnect" or unix-socket dial) instead. Tests
now construct realistic typed error chains, including the url.Error
wrapper http.Client adds in production.
The flow to load an agent YAML from an OCI reference was split across
three places: cmd/root/pull.go (Pull + open store + GetArtifact),
ociSource.Read (Pull + load + corruption-aware retry + cache fallback),
and remote.Pull itself (its own cache check). Each duplicated parts of
the others.

Introduce remote.PullAgent(ctx, ref, force) ([]byte, digest, error)
as the single entry point: it handles the digest-ref shortcut, the
network-error fallback to cache, and the one-shot recovery on store
corruption. ociSource.Read and the share-pull command now each shrink
to a single call.

Skip the upfront crane.Digest HEAD when the cache is empty: with no
local copy to compare against we'd have to crane.Pull anyway, so the
HEAD is pure latency. Cold pulls drop from two manifest round-trips
to one.
Routing the share-pull command through PullAgent meant a registry 404
or auth failure would silently overwrite the on-disk YAML with a stale
cached copy. PullAgent's cache-fallback policy is the right behaviour
for `ociSource.Read` (running an agent shouldn't break when the
registry blips) but the wrong behaviour for an explicit user-driven
refresh.

Restore the direct remote.Pull call so registry failures surface as
errors instead of being papered over with whatever happens to be in
the local store.
…rror

Two related issues caught in review of the previous commits:

1. Pull bailed out with a permanent error if local metadata lacked the
   cagent annotation, before even contacting the registry. A user with
   a stale or bad cache entry on a tag was stuck until they manually
   wiped the store, even after the remote tag was updated to a valid
   agent artifact. Treat an unannotated cache entry as a cache miss so
   Pull falls through to crane.Pull, which re-validates the freshly
   fetched manifest and overwrites the bad entry.

2. PullAgent's cache fast-paths (digest-ref shortcut and the
   network-error fallback) returned cached bytes without re-checking
   the annotation, while Pull validates it on the registry path.
   Defense-in-depth fix: readCached now refuses entries missing the
   annotation, so the cache cannot be used to bypass the validation
   Pull would otherwise apply.

Tests:
  - TestReadCached covers the three cases (missing / unannotated /
    annotated).
  - TestPull_RecoversFromUnannotatedCache uses an in-memory OCI
    registry to verify a poisoned cache entry no longer blocks a valid
    re-pull.
  - TestPullAgent_DigestRefDoesNotServeUnannotatedCache verifies the
    digest-ref fast path no longer trusts an unannotated cache entry.
@dgageot dgageot requested a review from a team as a code owner May 21, 2026 12:18
Copy link
Copy Markdown

@docker-agent docker-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assessment: 🟡 NEEDS ATTENTION

Reviewed the OCI pull path refactor. Found 2 issues in the changed code.

Summary of changes reviewed:

  • isProxySocketError: typed error detection via errors.Is/errors.As (replaces string matching)
  • PullAgent: new unified entry point with cache fallback on network errors + corruption recovery
  • share pull: reverted to use Pull directly for strict registry error propagation
  • readCached: annotation guard added to prevent bypassing validation via cache

Comment thread pkg/remote/pull.go
}

// Store went missing right after a successful pull. Re-pull once.
if !errors.Is(err, content.ErrStoreCorrupted) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] Corruption recovery re-pull is skipped for non-ErrStoreCorrupted store errors

After a successful Pull, PullAgent calls store.GetArtifact(storeKey) and only triggers the forced re-pull when the error satisfies errors.Is(err, content.ErrStoreCorrupted). However, GetArtifact passes through errors from GetArtifactImage without wrapping them — specifically, resolveIdentifier can return ErrInvalidDigest (or another non-ErrStoreCorrupted error) if the refs file is corrupt. In that case:

if !errors.Is(err, content.ErrStoreCorrupted) {
    return nil, "", fmt.Errorf("loading artifact from store: %w", err)  // fatal — no recovery re-pull
}

A user whose refs file is corrupted would be permanently stuck even though a forced re-pull would regenerate the file and fix the situation. The recovery guard should be broader — either use errors.Is(err, content.ErrStoreCorrupted) || errors.As(err, &content.ErrInvalidDigest), or simply attempt the re-pull for any post-pull GetArtifact error (since Pull just succeeded, the only plausible cause is store corruption).

Comment thread pkg/remote/transport.go
// failure dialing a Unix socket. Detection is type-based (net.OpError /
// syscall errno) rather than string matching, so it stays correct across
// Go versions and locales.
func isProxySocketError(err error) bool {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[LOW] isProxySocketError refactor may drop coverage for two legacy error patterns

The old string-based implementation matched two patterns that the new typed implementation no longer covers:

  1. "unix socket" — any error whose message contained this substring was caught. This is not covered by net.OpError{Op: "dial", Net: "unix"} (which only matches when the error arrives as a typed *net.OpError).
  2. "proxyconnect tcp" as a plain string — if a custom http.RoundTripper or proxy middleware returns a plain errors.New("proxyconnect tcp: ..."), it will not be unwrapped to a *net.OpError and will be missed.

For standard Go net/http proxy errors these paths are likely unreachable (the stdlib always wraps in *net.OpError). However, if any code in the transport stack (e.g., WSL path, custom DialContext) ever produces a bare error string, the proxy fallback will silently fail to trigger, causing all subsequent requests to fail instead of falling back to direct.

Suggest adding a test that generates a net.OpError{Op: "dial", Net: "unix"} (without a proxyconnect wrapper) to confirm the typed path is exercised, or adding a comment explaining why "unix socket" string errors cannot occur in practice.

@dgageot dgageot closed this May 21, 2026
@aheritier aheritier added area/distribution Agent registry, packaging, distribution, sharing kind/refactor PR refactors code without behavior change labels May 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/distribution Agent registry, packaging, distribution, sharing kind/refactor PR refactors code without behavior change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants