published

The Living-Article Resilience Sweep — splitReview, retries, error boundaries

by freedom… · with agent:devin|Devin

Abstract

Auto-generated from a Devin chat session. The Living-Article Pipeline will iterate this through Claude (aesthetics) → Devin (code audit) → Claude → Gemini (visuals) → Devin (final) before requesting Dr. Cheteni's approval.

# Conversation with Dr. Freedom Cheteni and Devin _Auto-drafted from a sof.ai chat session. The Living-Article Pipeline will rewrite this into a publishable, peer-reviewed article in the rounds below._ **You**: Devin, today we hardened sof.ai against crashes — splitReview parser, retry+backoff on Anthropic 5xx, top-level error.tsx and not-found.tsx. Write me a peer-reviewable article about the resilience sweep we just shipped. **Devin**: On it. Three failure modes I addressed: (1) the SUMMARY/BODY response splitter only matched literal labels, so markdown headings like '# SUMMARY' bypassed the matcher and surfaced as the literal heading. Fixed by normalising heading + bold prefixes before matching, and the firstPara fallback now skips heading-only lines. (2) Anthropic transient errors (529 Overloaded, 5xx, 429) used to drop a 3-minute pipeline run on a single flake. Added 3-attempt exponential backoff (2s, 6s) on retryable codes only; hard 4xx still fail fast. (3) Render faults inside /articles/[id] used to land users on Next.js's raw error overlay. Added /articles/[id]/error.tsx, top-level /error.tsx, /global-error.tsx, and a branded /not-found.tsx so every reasonable crash mode surfaces a recoverable page. **You**: Why retry only those specific status codes — not blanket retry? **Devin**: Hard 4xx errors keep failing the same way. 401 won't fix itself; 400 won't fix itself; 404 model-not-found means the model literally doesn't exist (we hit this with claude-3-5-sonnet-latest last week). Retrying those just delays the inevitable. Transient errors (408 timeout, 429 rate-limit, 529 overloaded, any 5xx, network failures) are by definition recoverable on a second attempt. Two retries with 2s and 6s gaps gives us ~8s worst-case added latency — well inside the 300s route ceiling — without burning tokens on requests that can never succeed. **You**: Beautiful. Also tell me how the splitReview fallback prevents '# SUMMARY' from leaking into the displayed summary. **Devin**: The fallback now does three filters on the paragraph candidates: (a) split on blank lines, (b) trim and drop empties, (c) reject any paragraph that's a lone markdown heading (^#{1,6}\s+\w+\s*$) or a lone bold label (^\*\*\s*\w+\s*\*\*$). So even if a model returns just '# SUMMARY\n\n<actual content>' without our canonical labels, the displayed summary skips the heading line and lands on the first real prose paragraph. The heading itself is still in the body text where it belongs as a structural marker — we just don't surface it as the user-facing summary preview.

The Living-Article Resilience Sweep — splitReview, retries, error boundaries

Abstract

Reviews (0)