fix: detect service unavailability and fail fast with clear error#118
Merged
When the Claude API returns persistent 500s, the SDK exhausts retries and returns a result with subtype 'success' but is_error: true. Our code only checked subtype, so it treated the error as success and proceeded with validation retries, burning ~9 minutes on 30 hopeless API calls before showing a raw JSON error.

Now:
- handleSDKMessage checks is_error on result messages
- 500/server_error/internal_error are classified as SERVICE_UNAVAILABLE
- abortRetries flag skips validation retries on fatal SDK errors
- CLI adapter shows "AI service temporarily unavailable" instead of raw JSON
- Headless adapter emits service_unavailable error code
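A minimal sketch of the is_error check described in this commit. The `ResultMessage` shape and the `isSuccessfulResult` helper are hypothetical; only the `subtype` and `is_error` fields come from the description, and the real handleSDKMessage may be structured differently.

```typescript
// Hypothetical shape of an SDK result message. Only `subtype` and `is_error`
// are taken from the commit message; the other fields are illustrative.
interface ResultMessage {
  type: "result";
  subtype: string;
  is_error?: boolean;
  result?: string;
}

// Treat a result as successful only when the subtype says so AND
// is_error is not set. Checking subtype alone was the original bug.
function isSuccessfulResult(msg: ResultMessage): boolean {
  return msg.subtype === "success" && msg.is_error !== true;
}

// What the SDK reportedly returns once its own retries are exhausted.
const exhausted: ResultMessage = {
  type: "result",
  subtype: "success",
  is_error: true,
  result: 'API Error: 500 {"error":{"type":"internal_error"}}',
};

// An ordinary successful result without the is_error flag.
const healthy: ResultMessage = { type: "result", subtype: "success", result: "done" };

console.log(isSuccessfulResult(exhausted)); // false
console.log(isSuccessfulResult(healthy)); // true
```

Comparing against `true` explicitly means a missing `is_error` field is treated the same as `false`, which is an assumption about how the SDK omits the flag on healthy results.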
…essages

Extend error classification to cover additional failure modes:
- 429/rate limit: "AI service is currently rate-limited"
- ECONNREFUSED/ETIMEDOUT/ENOTFOUND: "Could not connect to the AI service"
- Process exit: "AI agent process exited unexpectedly"

Rate limits also abort validation retries (same as 500s).
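The mapping in this commit could look roughly like the sketch below. The `friendlyMessage` function and its regular expressions are assumptions made for illustration; only the three message strings and the error names are taken from the commit message.

```typescript
// Hypothetical helper: maps raw error text to the friendly messages quoted
// in the commit. The matching patterns are illustrative, not the project's.
function friendlyMessage(raw: string): string | undefined {
  if (/\b429\b|rate.?limit/i.test(raw)) {
    return "AI service is currently rate-limited";
  }
  if (/ECONNREFUSED|ETIMEDOUT|ENOTFOUND/.test(raw)) {
    return "Could not connect to the AI service";
  }
  if (/process exited/i.test(raw)) {
    return "AI agent process exited unexpectedly";
  }
  // Unrecognized errors are left for the caller to report as-is.
  return undefined;
}

console.log(friendlyMessage("API Error: 429 Too Many Requests"));
console.log(friendlyMessage("connect ECONNREFUSED 127.0.0.1:443"));
console.log(friendlyMessage("process exited with code 1"));
```

Returning `undefined` for unknown input keeps the original error text available, so nothing is hidden when classification misses.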
P1: The adapter regex /service.unavailable/ only matched a single char
between "service" and "unavailable", so it missed our own friendly
message "The AI service is temporarily unavailable". Fixed to
/service.*unavailable/. Also removed the "Agent SDK error:" prefix
from all framework integrations so user-friendly messages pass through
cleanly.
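The difference between the two patterns can be checked directly. This snippet uses only the two regexes and the friendly message quoted above; the variable names are illustrative.

```typescript
// The friendly message quoted in the commit.
const friendly = "The AI service is temporarily unavailable";

// `.` matches exactly one character, so this only matches strings such as
// "service unavailable" or "service_unavailable".
const tooNarrow = /service.unavailable/;

// `.*` matches any run of characters (except line breaks), so the words
// " is temporarily " between "service" and "unavailable" are allowed.
const widened = /service.*unavailable/;

console.log(tooNarrow.test(friendly)); // false
console.log(tooNarrow.test("service_unavailable")); // true
console.log(widened.test(friendly)); // true
```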
P2: 429 rate limits were folded into SERVICE_UNAVAILABLE_PREFIX, which
rewrote them to "temporarily unavailable" before adapters could see the
rate-limit signal. Now 429s get a separate RATE_LIMITED_PREFIX with
distinct messaging ("currently rate-limited"), while still aborting
validation retries.
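One way to keep the two signals distinct is sketched below. The prefix string values, the `tagError` and `errorCodeFor` helpers, and the `unknown` fallback code are all hypothetical; the commit names the two constants and the error codes but does not show their contents or how adapters read them.

```typescript
// Hypothetical prefix values; the commit names the constants only.
const SERVICE_UNAVAILABLE_PREFIX = "SERVICE_UNAVAILABLE: ";
const RATE_LIMITED_PREFIX = "RATE_LIMITED: ";

interface TaggedError {
  message: string;
  abortRetries: boolean;
}

// Core side: tag the error with a prefix that preserves its kind.
// Both kinds abort validation retries, as described in the commit.
function tagError(raw: string): TaggedError {
  if (/\b429\b|rate.?limit/i.test(raw)) {
    return {
      message: RATE_LIMITED_PREFIX + "The AI service is currently rate-limited.",
      abortRetries: true,
    };
  }
  if (/\b500\b|server_error|internal_error/.test(raw)) {
    return {
      message: SERVICE_UNAVAILABLE_PREFIX + "The AI service is temporarily unavailable.",
      abortRetries: true,
    };
  }
  return { message: raw, abortRetries: false };
}

// Adapter side: the prefix still carries the rate-limit signal.
function errorCodeFor(message: string): string {
  if (message.startsWith(RATE_LIMITED_PREFIX)) return "rate_limited";
  if (message.startsWith(SERVICE_UNAVAILABLE_PREFIX)) return "service_unavailable";
  return "unknown"; // hypothetical fallback code
}

const limited = tagError("API Error: 429 Too Many Requests");
console.log(errorCodeFor(limited.message), limited.abortRetries); // rate_limited true
```

Checking the 429 case first reflects the bug described above: once a rate limit is rewritten as "temporarily unavailable", downstream code can no longer tell the two apart.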
Summary
Fixes the installer hanging for ~9 minutes with no useful feedback when the Claude API is down (reported by MG in Slack).
Root cause: The SDK returns subtype: 'success' with is_error: true when API retries are exhausted. Our code only checked subtype, so it treated the error as success and proceeded with validation retries: 3 cycles of 10 retries each (30 total API calls, ~9 minutes) before surfacing a raw JSON error.

Before: 30 API calls over ~9 min, user sees:
Claude Code returned an error result: API Error: 500 {"error":{"type":"internal_error","message":"An unexpected error occurred"}}

After: 10 API calls over ~3 min (first cycle only), user sees:
The AI service is temporarily unavailable. Please try again in a few minutes.

Also adds clear error messages for other failure modes that previously showed raw/cryptic errors:
- ECONNREFUSED/ETIMEDOUT
- process exited with code 1
- --debug hint

Changes
- agent-interface.ts: Check is_error on result messages, classify 500/server_error/429 as SERVICE_UNAVAILABLE, add abortRetries flag to skip validation retries on fatal errors
- cli-adapter.ts: Detect service, rate limit, network, and process exit errors; show clean messages instead of raw API JSON
- headless-adapter.ts: Emit structured error codes (service_unavailable, rate_limited, network_error, process_error) for JSON consumers
- agent-interface.spec.ts: 4 new tests covering is_error detection, service classification, and retry abort

Test plan
- pnpm test passes (1538 tests)
- pnpm typecheck passes
- pnpm build passes
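The before/after call counts in the summary (30 versus 10) follow from whether the validation loop honors the abortRetries flag. The simulation below is a sketch under stated assumptions: `AgentResult`, `simulateOutageRun`, and `runWithValidationRetries` are hypothetical names, and the figures of 3 cycles and 10 SDK retries per cycle are taken from the summary.

```typescript
// Hypothetical result of one agent run, including how many API calls the
// SDK spent on its own internal retries.
interface AgentResult {
  ok: boolean;
  abortRetries: boolean;
  apiCalls: number;
}

// Simulates one run during an outage: the SDK uses 10 API calls, then fails.
// `flagFatal` controls whether the failure is marked as fatal.
function simulateOutageRun(flagFatal: boolean): AgentResult {
  return { ok: false, abortRetries: flagFatal, apiCalls: 10 };
}

// Validation retry loop: stops on success or when a fatal error is flagged.
// Returns the total number of API calls spent.
function runWithValidationRetries(run: () => AgentResult, maxCycles = 3): number {
  let totalCalls = 0;
  for (let cycle = 0; cycle < maxCycles; cycle++) {
    const result = run();
    totalCalls += result.apiCalls;
    if (result.ok || result.abortRetries) break;
  }
  return totalCalls;
}

console.log(runWithValidationRetries(() => simulateOutageRun(false))); // 30
console.log(runWithValidationRetries(() => simulateOutageRun(true))); // 10
```

The remaining 10 calls come from the SDK's own retries inside the first cycle, which this change does not shorten.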