fix: detect service unavailability and fail fast with clear error #118

Merged
nicknisi merged 4 commits into main from
fix/service-unavailable-error-handling
Apr 3, 2026

nicknisi commented Apr 2, 2026

Summary

Fixes the installer hanging for ~9 minutes with no useful feedback when the Claude API is down (reported by MG in Slack).

Root cause: The SDK returns subtype: 'success' with is_error: true when API retries are exhausted. Our code only checked subtype, so it treated the error as success and proceeded with validation retries — 3 cycles of 10 retries each (30 total API calls, ~9 minutes) before surfacing a raw JSON error.
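The check described above can be sketched roughly as follows. The `ResultMessage` interface is a simplified, hypothetical stand-in for the SDK's result message, and `isSuccessfulResult` is an illustrative name, not the function in this PR.

```typescript
// Simplified, hypothetical shape of an SDK result message (assumption).
interface ResultMessage {
  type: 'result';
  subtype: string;
  is_error?: boolean;
  result?: string;
}

// Treat a result as successful only when subtype says so AND is_error is not set.
// Checking subtype alone is what let exhausted-retry errors through.
function isSuccessfulResult(msg: ResultMessage): boolean {
  return msg.subtype === 'success' && msg.is_error !== true;
}

// What the SDK reports once its API retries are exhausted, per the root cause above.
const exhausted: ResultMessage = {
  type: 'result',
  subtype: 'success',
  is_error: true,
  result:
    'API Error: 500 {"error":{"type":"internal_error","message":"An unexpected error occurred"}}',
};

console.log(isSuccessfulResult(exhausted));
```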

Before: 30 API calls over ~9 min, user sees: Claude Code returned an error result: API Error: 500 {"error":{"type":"internal_error","message":"An unexpected error occurred"}}

After: 10 API calls over ~3 min (first cycle only), user sees: The AI service is temporarily unavailable. Please try again in a few minutes.

Also adds clear error messages for other failure modes that previously showed raw/cryptic errors:

| Error | Before | After |
| --- | --- | --- |
| API 500/503 | Raw JSON after 9 min | "AI service temporarily unavailable" (~3 min) |
| 429 rate limit | Raw JSON | "AI service is currently rate-limited" |
| Network failure | ECONNREFUSED / ETIMEDOUT | "Could not connect to the AI service" |
| Process crash | process exited with code 1 | "AI agent process exited unexpectedly" + --debug hint |
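One way to picture the mapping in the table is a small classifier that turns a raw error string into a user-facing message. This is a sketch only: `classifyFailure`, `friendlyMessage`, the regexes, and the wording of the `--debug` hint are assumptions, not the code in cli-adapter.ts.

```typescript
type FailureKind = 'service' | 'rate_limit' | 'network' | 'process' | 'unknown';

// Hypothetical classifier; the patterns are illustrative, not the PR's.
function classifyFailure(raw: string): FailureKind {
  if (/\b429\b|rate.?limit/i.test(raw)) return 'rate_limit';
  if (/\b(500|503)\b|server_error|internal_error/i.test(raw)) return 'service';
  if (/ECONNREFUSED|ETIMEDOUT|ENOTFOUND/.test(raw)) return 'network';
  if (/process exited with code \d+/i.test(raw)) return 'process';
  return 'unknown';
}

// Hint wording is an assumption; the PR only says a --debug hint is added.
const DEBUG_HINT = 'Re-run with --debug for details.';

function friendlyMessage(raw: string): string {
  switch (classifyFailure(raw)) {
    case 'service':
      return 'The AI service is temporarily unavailable. Please try again in a few minutes.';
    case 'rate_limit':
      return 'AI service is currently rate-limited';
    case 'network':
      return 'Could not connect to the AI service';
    case 'process':
      return 'AI agent process exited unexpectedly. ' + DEBUG_HINT;
    default:
      // Unrecognised errors pass through unchanged.
      return raw;
  }
}

console.log(friendlyMessage('connect ECONNREFUSED 127.0.0.1:443'));
```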

Changes

  • agent-interface.ts: Check is_error on result messages, classify 500/server_error/internal_error as SERVICE_UNAVAILABLE and 429 as rate-limited (separate RATE_LIMITED_PREFIX), add abortRetries flag to skip validation retries on fatal errors
  • cli-adapter.ts: Detect service, rate limit, network, and process exit errors; show clean messages instead of raw API JSON
  • headless-adapter.ts: Emit structured error codes (service_unavailable, rate_limited, network_error, process_error) for JSON consumers
  • agent-interface.spec.ts: 4 new tests covering is_error detection, service classification, and retry abort
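For JSON consumers, the structured codes listed for headless-adapter.ts could be derived along these lines. The four code strings come from this PR; the `toErrorCode` and `toJsonEvent` helpers, the matching patterns, and the event shape (`type`, `code`, `message`) are assumptions made for illustration.

```typescript
type ErrorCode = 'service_unavailable' | 'rate_limited' | 'network_error' | 'process_error';

// Hypothetical mapping from an error message to a structured code.
function toErrorCode(message: string): ErrorCode | undefined {
  if (/service.*unavailable/i.test(message)) return 'service_unavailable';
  if (/rate.?limit/i.test(message)) return 'rate_limited';
  if (/could not connect|ECONNREFUSED|ETIMEDOUT|ENOTFOUND/i.test(message)) return 'network_error';
  if (/process exited/i.test(message)) return 'process_error';
  return undefined;
}

// Hypothetical event shape: one JSON object per error, with the code omitted
// when the message is not recognised.
function toJsonEvent(message: string): string {
  const code = toErrorCode(message);
  return JSON.stringify(code ? { type: 'error', code, message } : { type: 'error', message });
}

console.log(toJsonEvent('AI service is currently rate-limited'));
```

A consumer can then branch on `code` instead of pattern-matching the human-readable text.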

Test plan

  • pnpm test passes (1538 tests)
  • pnpm typecheck passes
  • pnpm build passes
  • New tests cover: is_error detection, SERVICE_UNAVAILABLE classification, non-service error fallback, validation retry skip

nicknisi added 4 commits April 2, 2026 16:59
When the Claude API returns persistent 500s, the SDK exhausts retries
and returns a result with subtype 'success' but is_error: true. Our
code only checked subtype, so it treated the error as success and
proceeded with validation retries — burning ~9 minutes on 30 hopeless
API calls before showing a raw JSON error.

Now:
- handleSDKMessage checks is_error on result messages
- 500/server_error/internal_error classified as SERVICE_UNAVAILABLE
- abortRetries flag skips validation retries on fatal SDK errors
- CLI adapter shows "AI service temporarily unavailable" instead of raw JSON
- Headless adapter emits service_unavailable error code
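The effect of the abortRetries flag on the call count can be illustrated with a toy model. It assumes 3 validation cycles with 10 SDK-level API calls each, matching the numbers in the PR description; `runWithValidation`, `RunResult`, and `makeFailingAgent` are hypothetical names, and the real loop in agent-interface.ts may be structured differently.

```typescript
interface RunResult {
  success: boolean;
  abortRetries: boolean;
}

const SDK_RETRIES_PER_CYCLE = 10; // API calls the SDK makes before giving up (assumed)
const MAX_VALIDATION_CYCLES = 3; // validation retry cycles (assumed)

// Runs the agent up to MAX_VALIDATION_CYCLES times and returns how many cycles ran.
// A fatal SDK error (abortRetries) stops the loop after the current cycle.
function runWithValidation(runAgent: () => RunResult): number {
  let cycles = 0;
  for (let i = 0; i < MAX_VALIDATION_CYCLES; i++) {
    cycles++;
    const result = runAgent();
    if (result.success || result.abortRetries) break;
  }
  return cycles;
}

// Stub agent that always fails and counts the API calls it "makes".
function makeFailingAgent(abortRetries: boolean) {
  const counter = { apiCalls: 0 };
  const run = (): RunResult => {
    counter.apiCalls += SDK_RETRIES_PER_CYCLE;
    return { success: false, abortRetries };
  };
  return { counter, run };
}

const before = makeFailingAgent(false);
runWithValidation(before.run);
const after = makeFailingAgent(true);
runWithValidation(after.run);

console.log(before.counter.apiCalls, after.counter.apiCalls); // 30 10
```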
…essages

Extend error classification to cover additional failure modes:
- 429/rate limit: "AI service is currently rate-limited"
- ECONNREFUSED/ETIMEDOUT/ENOTFOUND: "Could not connect to the AI service"
- Process exit: "AI agent process exited unexpectedly"

Rate limits also abort validation retries (same as 500s).

P1: The adapter regex /service.unavailable/ only matched a single char
between "service" and "unavailable", so it missed our own friendly
message "The AI service is temporarily unavailable". Fixed to
/service.*unavailable/. Also removed the "Agent SDK error:" prefix
from all framework integrations so user-friendly messages pass through
cleanly.
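The regex difference in P1 is easy to reproduce in isolation. The two patterns and the friendly message are taken from this PR; the variable names are illustrative only.

```typescript
const friendly =
  'The AI service is temporarily unavailable. Please try again in a few minutes.';

// "." matches exactly one character, so this only matches forms such as
// "service_unavailable" or "service unavailable".
const singleChar = /service.unavailable/;

// ".*" matches any run of characters (on one line), so " is temporarily " is allowed.
const anyRun = /service.*unavailable/;

console.log(singleChar.test(friendly)); // false
console.log(anyRun.test(friendly)); // true
console.log(singleChar.test('service_unavailable')); // true
```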

P2: 429 rate limits were folded into SERVICE_UNAVAILABLE_PREFIX, which
rewrote them to "temporarily unavailable" before adapters could see the
rate-limit signal. Now 429s get a separate RATE_LIMITED_PREFIX with
distinct messaging ("currently rate-limited"), while still aborting
validation retries.
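A rough sketch of the P2 split follows. The constant names come from this PR, but their string values, the `classifySdkError` helper, and the matching patterns are assumptions. The key point is that 429s are checked first and keep their own prefix, while both categories still set abortRetries.

```typescript
// Constant names are from the PR; the values here are placeholders (assumption).
const SERVICE_UNAVAILABLE_PREFIX = 'SERVICE_UNAVAILABLE:';
const RATE_LIMITED_PREFIX = 'RATE_LIMITED:';

interface Classified {
  message: string;
  abortRetries: boolean;
}

// Hypothetical classifier: rate limits get a distinct prefix so adapters can
// tell them apart from 5xx failures.
function classifySdkError(raw: string): Classified {
  if (/\b429\b|rate.?limit/i.test(raw)) {
    return { message: RATE_LIMITED_PREFIX + ' ' + raw, abortRetries: true };
  }
  if (/\b500\b|server_error|internal_error/i.test(raw)) {
    return { message: SERVICE_UNAVAILABLE_PREFIX + ' ' + raw, abortRetries: true };
  }
  // Anything else is left alone and may still go through validation retries.
  return { message: raw, abortRetries: false };
}

console.log(classifySdkError('API Error: 429 rate_limit_error').message);
```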
nicknisi merged commit 524c709 into main on Apr 3, 2026
5 checks passed
nicknisi deleted the fix/service-unavailable-error-handling branch on April 3, 2026 15:38