Group CLI crash reports on structured error signals#7901
Group CLI crash reports on structured error signals#7901stephanie-shopify wants to merge 2 commits into
Conversation
Replaces the single catch-all Bugsnag bucket (~1,170 unrelated GraphQL/API errors across app/theme/store/hydrogen) with grouping derived from the original typed error's HTTP status and GraphQL extensions.code, rather than regex- reparsing a stringified message. - New error-grouping.ts: structured-first decision table (THROTTLED/429->rate_limit, 401->authentication, 403/ACCESS_DENIED->permission, 5xx->server), shared keyword categorizer as fallback, and undefined for unknown errors so Bugsnag stack-trace grouping is preserved. - error-handler.ts: sets event.groupingHash only when a real category resolves, and emits error_grouping metadata (slice_name, http_status, error_code, error_class) so backend routing works regardless of CLI version. - error.ts: raw graphql-request ClientError 401/429/THROTTLED is now treated as expected and kept out of crash reporting. Shared error-categorizer.ts / storage.ts (Monorail taxonomy) are untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Knip flagged the interface as an unused export — it is only used internally as the return type of errorGroupingSignals; callers destructure inline. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Differences in type declarationsWe detected differences in the type declarations generated by Typescript for this branch compared to the baseline ('main' branch). Please, review them to ensure they are backward-compatible. Here are some important things to keep in mind:
New type declarationspackages/cli-kit/dist/private/node/analytics/error-grouping.d.ts/**
* Structured signals extracted from an error, used to group it into a meaningful Bugsnag bucket.
*
* These are read from the *original* typed error (before it is flattened to a generic `Error`
* for reporting), so we group on facts the error already carries rather than by re-parsing a
* stringified message.
*/
interface ErrorGroupingSignals {
httpStatus?: number;
code?: string;
errorClass?: string;
}
/**
* Extracts structured grouping signals from an error object.
*
* @param error - The original error (any type).
* @returns The HTTP status, GraphQL error code, and error class name when available.
*/
export declare function errorGroupingSignals(error: unknown): ErrorGroupingSignals;
/**
* Builds a Bugsnag grouping hash of the form `${slice}:${category}:${signature}`.
*
* Categories are resolved structured-first (HTTP status / GraphQL code), falling back to the
* shared keyword categorizer only for untyped errors. When no meaningful category can be
* resolved, returns `undefined` so the caller leaves `event.groupingHash` unset and Bugsnag's
* default stack-trace grouping applies — this avoids merging genuinely distinct unknown bugs.
*
* @param error - The original error (any type).
* @param sliceName - The product slice (`app`, `theme`, `store`, `hydrogen`, or `cli`).
* @returns The grouping hash, or `undefined` to fall back to stack-trace grouping.
*/
export declare function errorGroupingHash(error: unknown, sliceName: string): string | undefined;
export {};
Existing type declarationsWe found no diffs with existing type declarations |
There was a problem hiding this comment.
should this be included? it has language that is specific to the current issue and not necessarily for long term use.
| const message = errorMessage(error) | ||
| // Object signals are authoritative; message-derived signals fill gaps (e.g. 5xx errors that the | ||
| // API layer flattens into an AbortError, dropping the structured status field). | ||
| const signals: ErrorGroupingSignals = {...signalsFromMessage(message), ...errorGroupingSignals(error)} |
There was a problem hiding this comment.
above in errorGroupingSignals only assign if there's a value i.e. never assign null or undefined.
🐛 Bug: The comment says message-derived signals fill gaps, but object spread copies keys even when their value is explicitly undefined. errorGroupingSignals can assign httpStatus or code as undefined; spreading those values over signalsFromMessage(message) can erase a valid status or code recovered from the message fallback. That makes the fallback less reliable for flattened API errors.
Suggestion: Consider merging field-by-field with nullish coalescing, or only assigning properties to signals when the extracted value is not undefined. For example, derive fromMessage and fromError separately, then build httpStatus, code, and errorClass as fromError.value ?? fromMessage.value.
| function categoryFromSignals(signals: ErrorGroupingSignals): string | undefined { | ||
| const {httpStatus, code} = signals | ||
|
|
||
| if (code === 'THROTTLED' || httpStatus === 429) return 'rate_limit' |
There was a problem hiding this comment.
🐛 Bug: The structured decision table handles HTTP 429 and extensions.code === 'THROTTLED', but existing retry logic also treats extensions.code === '429' as a GraphQL rate-limit signal, even when the HTTP status is 200. A raw ClientError with that established shape can currently miss the rate_limit bucket.
Suggestion: Consider adding code === '429' to the rate-limit branch and covering the HTTP-200/code-429 shape with a regression test.
| const {httpStatus, code} = signals | ||
|
|
||
| if (code === 'THROTTLED' || httpStatus === 429) return 'rate_limit' | ||
| if (httpStatus === 401) return 'authentication' |
There was a problem hiding this comment.
💬 Suggestion: The documented decision table treats ACCESS_DENIED as a permission signal, but the implementation checks httpStatus === 401 before code === 'ACCESS_DENIED'. If an error carries both HTTP 401 and ACCESS_DENIED, it will be grouped as authentication rather than permission.
Suggestion: Consider moving the httpStatus === 403 || code === 'ACCESS_DENIED' branch above the 401 branch if access-denied should win. If 401 is intentionally higher precedence, add a focused test or documentation note so that routing choice is explicit.
| */ | ||
| function firstExtensionCode(errors: unknown): string | undefined { | ||
| if (!Array.isArray(errors)) return undefined | ||
| const code = (errors[0] as {extensions?: {code?: unknown}} | undefined)?.extensions?.code |
There was a problem hiding this comment.
🐛 Bug: firstExtensionCode only reads errors[0].extensions.code, but GraphQL responses can include multiple errors and existing code in this package treats those arrays as multi-entry structures. A later error may carry THROTTLED, 429, or ACCESS_DENIED, and App Management errors can carry nested extensions.app_errors.errors[].category === 'access_denied'. Missing those shapes can route known API conditions to a fallback or default bucket instead of the intended structured grouping.
Suggestion: Consider replacing the first-error lookup with a small normalization helper that scans every GraphQL error, recognizes both THROTTLED and 429 as rate-limit signals, recognizes ACCESS_DENIED and nested App Management access_denied as permission signals, and is shared with the reporting-suppression logic where possible.
| if (groupingHash) { | ||
| event.groupingHash = groupingHash | ||
| } | ||
| const {httpStatus, code, errorClass} = errorGroupingSignals(error) |
There was a problem hiding this comment.
🐛 Bug: event.groupingHash is derived through errorGroupingHash(error, sliceName), which can combine typed error signals, message-derived HTTP/code signals, and fallback keyword categorization. The error_grouping metadata immediately below is built from errorGroupingSignals(error) alone, so it can omit facts that the hash used, such as a message-derived http_status for flattened errors or a fallback category. It also omits the semantic category field that the design document says will be emitted, which weakens backend routing that is supposed to depend on stable metadata tags rather than just the hash.
Suggestion: Consider exposing a single helper from error-grouping.ts that returns the fully resolved grouping result, for example {hash, category, signals}, and use that result for both event.groupingHash and event.addMetadata('error_grouping', ...). That would also avoid recalculating errorGroupingSignals(error) separately for the hash and the metadata.
| return false | ||
| } | ||
| const status = error.response?.status | ||
| if (status === 401 || status === 429) { |
| const errors = error.response?.errors | ||
| if (Array.isArray(errors)) { | ||
| const code = (errors[0] as {extensions?: {code?: unknown}} | undefined)?.extensions?.code | ||
| return code === 'THROTTLED' |
There was a problem hiding this comment.
🐛 Bug: isTransientApiError suppresses HTTP 429 and first-error THROTTLED, but it misses the established GraphQL extensions.code === '429' rate-limit shape and any transient code that appears after the first error entry. After retries are exhausted, those raw ClientErrors could still be treated as unexpected and sent to Bugsnag even though the retry layer classifies them as transient rate limiting.
Suggestion: Consider scanning all entries in error.response.errors and treating both THROTTLED and 429 as transient. A regression test like shouldReportErrorAsUnexpected(clientError(200, '429')) === false would keep the suppression logic aligned with the retry-layer fixture shape.
WHY are these changes introduced?
Fixes #7891. which comes from resiliency issue
A single Bugsnag bucket (
groupingHash 16864652937831232783) had become a catch-all for ~1,170 unrelated GraphQL/API errors spanningapp/theme/store/hydrogen. That makes the bucket un-routable (five owning teams in one group), mixes expected/transient noise with genuine regressions, and manufactures false P1s.The earlier fix attempt (#7894) addressed the symptom by regex-reparsing the stringified error message to reconstruct type information, and removed expected-error suppression entirely (reporting every
AbortError). This PR is the redesign: it groups on the structured signals the error already carries.The key insight: in
sendErrorToBugsnag, only the reportable copy is flattened tonew Error(error.message)— the original typed error is still in scope and already carriesstatusCode+errors[].extensions.code. So we group on facts, not on a string we just serialized.WHAT is this pull request doing?
error-grouping.ts— derives{httpStatus, code, errorClass}from the original error (GraphQLClientErrorand raw graphql-requestClientError) and maps them via an explicit decision table:THROTTLED/429→rate_limit401→authentication403/ACCESS_DENIED→permission5xx→servergroupingHashunset so Bugsnag's stack-trace grouping applies and distinct unknown bugs aren't merged.error-handler.ts— setsevent.groupingHashonly when a real category resolves, and emitserror_groupingmetadata (slice_name,http_status,error_code,error_class) so backend dashboards/routing work regardless of CLI version (the hash is client-side; old versions keep the old hash until adoption rolls).error.ts— rawClientErrorwith401/429/THROTTLEDis now treated as expected and kept out of crash reporting.error-categorizer.ts/storage.ts(Monorail analytics taxonomy) are untouched — their tests stay green, proving theerror:${category}:${signature}events are unaffected.error-grouping.DESIGN.mdrecords the problem, constraints, and the three decisions.How to test your changes?
pnpm vitest run packages/cli-kit/src/private/node/analytics/error-grouping.test.ts pnpm vitest run packages/cli-kit/src/public/node/error-handler.test.ts pnpm vitest run packages/cli-kit/src/public/node/error.test.ts pnpm vitest run packages/cli-kit/src/private/node/analytics/ # categorizer + storage stay greenDecision-table matrix (asserted in tests): 403
ACCESS_DENIED→theme:permission:http-403-access-denied; 401 →*:authentication:http-401;THROTTLED→*:rate_limit:*and not reported; 5xx →server; rawClientError; genericError→groupingHashunset (stack grouping).Rollout note
Setting
groupingHashre-buckets all CLI errors once — a one-time grouping migration across CLI error dashboards. This is intended and an improvement, but should be flagged to whoever owns the Bugsnag/Observe dashboards before merge. The structured metadata tags give backend routing that works across all CLI versions immediately, which is the mitigation while client adoption rolls.Follow-ups (out of scope)
category/codeon theFatalErrorhierarchy + API wrappers (the "proper" version of this).error-categorizer.ts(needs a coordinatedstorage.tsupdate + pinned tests).*:unknown:*andcli:*bucket growth so the catch-all surfaces itself.Checklist
error_groupingmetadata is additive.patchbump added via changeset.🤖 Generated with Claude Code