refactor(remote): faster, safer, simpler OCI pull path#2855
Conversation
isProxySocketError used substring matching on err.Error() to decide whether to fall back from the Docker Desktop proxy to a direct connection. That breaks across Go versions and locales, and gives false positives whenever any error text happens to contain "dial unix" or "connection refused". Use errors.Is against syscall.ENOENT/ECONNREFUSED and errors.As against *net.OpError (Op == "proxyconnect" or unix-socket dial) instead. Tests now construct realistic typed error chains, including the url.Error wrapper http.Client adds in production.
The flow to load an agent YAML from an OCI reference was split across three places: cmd/root/pull.go (Pull + open store + GetArtifact), ociSource.Read (Pull + load + corruption-aware retry + cache fallback), and remote.Pull itself (its own cache check). Each duplicated parts of the others. Introduce remote.PullAgent(ctx, ref, force) ([]byte, digest, error) as the single entry point: it handles the digest-ref shortcut, the network-error fallback to cache, and the one-shot recovery on store corruption. ociSource.Read and the share-pull command now each shrink to a single call. Skip the upfront crane.Digest HEAD when the cache is empty: with no local copy to compare against we'd have to crane.Pull anyway, so the HEAD is pure latency. Cold pulls drop from two manifest round-trips to one.
Routing the share-pull command through PullAgent meant a registry 404 or auth failure would silently overwrite the on-disk YAML with a stale cached copy. PullAgent's cache-fallback policy is the right behaviour for `ociSource.Read` (running an agent shouldn't break when the registry blips) but the wrong behaviour for an explicit user-driven refresh. Restore the direct remote.Pull call so registry failures surface as errors instead of being papered over with whatever happens to be in the local store.
…rror
Two related issues caught in review of the previous commits:
1. Pull bailed out with a permanent error if local metadata lacked the
cagent annotation, before even contacting the registry. A user with
a stale or bad cache entry on a tag was stuck until they manually
wiped the store, even after the remote tag was updated to a valid
agent artifact. Treat an unannotated cache entry as a cache miss so
Pull falls through to crane.Pull, which re-validates the freshly
fetched manifest and overwrites the bad entry.
2. PullAgent's cache fast-paths (digest-ref shortcut and the
network-error fallback) returned cached bytes without re-checking
the annotation, while Pull validates it on the registry path.
Defense-in-depth fix: readCached now refuses entries missing the
annotation, so the cache cannot be used to bypass the validation
Pull would otherwise apply.
Tests:
- TestReadCached covers the three cases (missing / unannotated /
annotated).
- TestPull_RecoversFromUnannotatedCache uses an in-memory OCI
registry to verify a poisoned cache entry no longer blocks a valid
re-pull.
- TestPullAgent_DigestRefDoesNotServeUnannotatedCache verifies the
digest-ref fast path no longer trusts an unannotated cache entry.
docker-agent
left a comment
There was a problem hiding this comment.
Assessment: 🟡 NEEDS ATTENTION
Reviewed the OCI pull path refactor. Found 2 issues in the changed code.
Summary of changes reviewed:
isProxySocketError: typed error detection viaerrors.Is/errors.As(replaces string matching)PullAgent: new unified entry point with cache fallback on network errors + corruption recoveryshare pull: reverted to usePulldirectly for strict registry error propagationreadCached: annotation guard added to prevent bypassing validation via cache
| } | ||
|
|
||
| // Store went missing right after a successful pull. Re-pull once. | ||
| if !errors.Is(err, content.ErrStoreCorrupted) { |
There was a problem hiding this comment.
[MEDIUM] Corruption recovery re-pull is skipped for non-ErrStoreCorrupted store errors
After a successful Pull, PullAgent calls store.GetArtifact(storeKey) and only triggers the forced re-pull when the error satisfies errors.Is(err, content.ErrStoreCorrupted). However, GetArtifact passes through errors from GetArtifactImage without wrapping them — specifically, resolveIdentifier can return ErrInvalidDigest (or another non-ErrStoreCorrupted error) if the refs file is corrupt. In that case:
if !errors.Is(err, content.ErrStoreCorrupted) {
return nil, "", fmt.Errorf("loading artifact from store: %w", err) // fatal — no recovery re-pull
}A user whose refs file is corrupted would be permanently stuck even though a forced re-pull would regenerate the file and fix the situation. The recovery guard should be broader — either use errors.Is(err, content.ErrStoreCorrupted) || errors.As(err, &content.ErrInvalidDigest), or simply attempt the re-pull for any post-pull GetArtifact error (since Pull just succeeded, the only plausible cause is store corruption).
| // failure dialing a Unix socket. Detection is type-based (net.OpError / | ||
| // syscall errno) rather than string matching, so it stays correct across | ||
| // Go versions and locales. | ||
| func isProxySocketError(err error) bool { |
There was a problem hiding this comment.
[LOW] isProxySocketError refactor may drop coverage for two legacy error patterns
The old string-based implementation matched two patterns that the new typed implementation no longer covers:
"unix socket"— any error whose message contained this substring was caught. This is not covered bynet.OpError{Op: "dial", Net: "unix"}(which only matches when the error arrives as a typed*net.OpError)."proxyconnect tcp"as a plain string — if a customhttp.RoundTripperor proxy middleware returns a plainerrors.New("proxyconnect tcp: ..."), it will not be unwrapped to a*net.OpErrorand will be missed.
For standard Go net/http proxy errors these paths are likely unreachable (the stdlib always wraps in *net.OpError). However, if any code in the transport stack (e.g., WSL path, custom DialContext) ever produces a bare error string, the proxy fallback will silently fail to trigger, causing all subsequent requests to fail instead of falling back to direct.
Suggest adding a test that generates a net.OpError{Op: "dial", Net: "unix"} (without a proxyconnect wrapper) to confirm the typed path is exercised, or adding a comment explaining why "unix socket" string errors cannot occur in practice.
Reviewing how we pull agent YAMLs from OCI artifacts surfaced a few wins. Each commit is independently reviewable.
What changes
4242b42b4refactor(remote): detect proxy socket errors via type, not stringsisProxySocketErrordecides whether the Docker Desktop proxy is unavailable so the transport can fall back to a direct connection. It used substring matching onerr.Error()("no such file or directory","connection refused","dial unix", …), which breaks across Go versions and locales and gives false positives whenever any error text happens to contain those phrases.Use
errors.Isagainstsyscall.ENOENT/syscall.ECONNREFUSEDanderrors.Asagainst*net.OpError(Op == "proxyconnect"or unix-socket dial). Tests now construct realistic typed error chains, including the*url.Errorwrapperhttp.Clientadds in production.169d6dff8refactor(remote): collapse OCI agent read path intoPullAgentThe flow to load an agent YAML from an OCI reference was split across three places:
cmd/root/pull.go(Pull + open store + GetArtifact),ociSource.Read(Pull + load + corruption-aware retry + cache fallback), andremote.Pullitself (its own cache check). Each duplicated parts of the others.Introduce
remote.PullAgent(ctx, ref, force) ([]byte, digest, error)as the single entry point: it handles the digest-ref shortcut, the network-error fallback to cache, and the one-shot recovery on store corruption.ociSource.Readshrinks to a single call.Also: skip the upfront
crane.DigestHEAD when the cache is empty. With no local copy to compare against we'd have tocrane.Pullanyway, so the HEAD is pure latency. Cold pulls drop from two manifest round-trips to one.0ffbe5895fix(remote): keepshare pullstrict on registry errorsRouting the
share pullcommand throughPullAgentmeant a registry 404 or auth failure would silently overwrite the on-disk YAML with a stale cached copy.PullAgent's cache-fallback policy is the right behaviour forociSource.Read(running an agent shouldn't break when the registry blips) but the wrong behaviour for an explicit user-driven refresh.share pullnow callsremote.Pulldirectly so registry failures surface as errors.385816cb5fix(remote): treat unannotated cache as cache-miss, not as terminal errorTwo related defense-in-depth fixes:
Pullbailed out with a permanent error if local metadata lacked the cagent annotation, before contacting the registry. A user with a stale or bad cache entry on a tag was stuck until they manually wiped the store, even after the remote tag was updated to a valid agent artifact. Treat an unannotated cache entry as a cache miss soPullfalls through tocrane.Pull, which re-validates the freshly fetched manifest and overwrites the bad entry.PullAgent's cache fast-paths (digest-ref shortcut and the network-error fallback) returned cached bytes without re-checking the annotation, whilePullvalidates it on the registry path.readCachednow refuses entries missing the annotation, so the cache cannot be used to bypass that validation.Tests:
TestReadCached— missing / unannotated / annotated cases.TestPull_RecoversFromUnannotatedCache— uses an in-memory OCI registry to verify a poisoned cache entry no longer blocks a valid re-pull.TestPullAgent_DigestRefDoesNotServeUnannotatedCache— verifies the digest-ref fast path no longer trusts an unannotated cache entry.Validation
task lint— cleantask test— clean (130 packages)