Skip to content

Group CLI crash reports on structured error signals#7901

Open
stephanie-shopify wants to merge 2 commits into
mainfrom
fix/error-grouping-structured
Open

Group CLI crash reports on structured error signals#7901
stephanie-shopify wants to merge 2 commits into
mainfrom
fix/error-grouping-structured

Conversation

@stephanie-shopify

@stephanie-shopify stephanie-shopify commented Jun 23, 2026

Copy link
Copy Markdown

WHY are these changes introduced?

Fixes #7891. which comes from resiliency issue

A single Bugsnag bucket (groupingHash 16864652937831232783) had become a catch-all for ~1,170 unrelated GraphQL/API errors spanning app/theme/store/hydrogen. That makes the bucket un-routable (five owning teams in one group), mixes expected/transient noise with genuine regressions, and manufactures false P1s.

The earlier fix attempt (#7894) addressed the symptom by regex-reparsing the stringified error message to reconstruct type information, and removed expected-error suppression entirely (reporting every AbortError). This PR is the redesign: it groups on the structured signals the error already carries.

The key insight: in sendErrorToBugsnag, only the reportable copy is flattened to new Error(error.message) — the original typed error is still in scope and already carries statusCode + errors[].extensions.code. So we group on facts, not on a string we just serialized.

WHAT is this pull request doing?

  • New error-grouping.ts — derives {httpStatus, code, errorClass} from the original error (GraphQLClientError and raw graphql-request ClientError) and maps them via an explicit decision table:
    • THROTTLED / 429rate_limit
    • 401authentication
    • 403 / ACCESS_DENIEDpermission
    • 5xxserver
    • else → shared keyword categorizer (fallback only)
    • unknown → leave groupingHash unset so Bugsnag's stack-trace grouping applies and distinct unknown bugs aren't merged.
  • error-handler.ts — sets event.groupingHash only when a real category resolves, and emits error_grouping metadata (slice_name, http_status, error_code, error_class) so backend dashboards/routing work regardless of CLI version (the hash is client-side; old versions keep the old hash until adoption rolls).
  • error.ts — raw ClientError with 401/429/THROTTLED is now treated as expected and kept out of crash reporting.
  • The shared error-categorizer.ts / storage.ts (Monorail analytics taxonomy) are untouched — their tests stay green, proving the error:${category}:${signature} events are unaffected.
  • A short error-grouping.DESIGN.md records the problem, constraints, and the three decisions.

How to test your changes?

pnpm vitest run packages/cli-kit/src/private/node/analytics/error-grouping.test.ts
pnpm vitest run packages/cli-kit/src/public/node/error-handler.test.ts
pnpm vitest run packages/cli-kit/src/public/node/error.test.ts
pnpm vitest run packages/cli-kit/src/private/node/analytics/   # categorizer + storage stay green

Decision-table matrix (asserted in tests): 403 ACCESS_DENIEDtheme:permission:http-403-access-denied; 401 → *:authentication:http-401; THROTTLED*:rate_limit:* and not reported; 5xx → server; raw ClientError; generic ErrorgroupingHash unset (stack grouping).

Rollout note

Setting groupingHash re-buckets all CLI errors once — a one-time grouping migration across CLI error dashboards. This is intended and an improvement, but should be flagged to whoever owns the Bugsnag/Observe dashboards before merge. The structured metadata tags give backend routing that works across all CLI versions immediately, which is the mitigation while client adoption rolls.

Follow-ups (out of scope)

  • At-source category/code on the FatalError hierarchy + API wrappers (the "proper" version of this).
  • Folding status/code grouping into error-categorizer.ts (needs a coordinated storage.ts update + pinned tests).
  • Alert on *:unknown:* and cli:* bucket growth so the catch-all surfaces itself.
  • Tie Bugsnag severity to SLO/error-budget burn rather than aggregate volume.

Checklist

  • I've considered possible cross-platform impacts (Mac, Linux, Windows) — pure TS, no platform-specific paths.
  • I've considered possible documentation changes — none needed (internal crash-reporting behavior).
  • I've considered analytics changes to measure impact — the shared Monorail taxonomy is deliberately untouched; new Bugsnag error_grouping metadata is additive.
  • The change is user-facing — patch bump added via changeset.

🤖 Generated with Claude Code

Replaces the single catch-all Bugsnag bucket (~1,170 unrelated GraphQL/API
errors across app/theme/store/hydrogen) with grouping derived from the original
typed error's HTTP status and GraphQL extensions.code, rather than regex-
reparsing a stringified message.

- New error-grouping.ts: structured-first decision table
  (THROTTLED/429->rate_limit, 401->authentication, 403/ACCESS_DENIED->permission,
  5xx->server), shared keyword categorizer as fallback, and undefined for unknown
  errors so Bugsnag stack-trace grouping is preserved.
- error-handler.ts: sets event.groupingHash only when a real category resolves,
  and emits error_grouping metadata (slice_name, http_status, error_code,
  error_class) so backend routing works regardless of CLI version.
- error.ts: raw graphql-request ClientError 401/429/THROTTLED is now treated as
  expected and kept out of crash reporting.

Shared error-categorizer.ts / storage.ts (Monorail taxonomy) are untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@stephanie-shopify stephanie-shopify requested review from a team as code owners June 23, 2026 00:48
@github-actions github-actions Bot added the Area: @shopify/cli @shopify/cli package issues label Jun 23, 2026
Knip flagged the interface as an unused export — it is only used internally
as the return type of errorGroupingSignals; callers destructure inline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown
Contributor

Differences in type declarations

We detected differences in the type declarations generated by Typescript for this branch compared to the baseline ('main' branch). Please, review them to ensure they are backward-compatible. Here are some important things to keep in mind:

  • Some seemingly private modules might be re-exported through public modules.
  • If the branch is behind main you might see odd diffs, rebase main into this branch.

New type declarations

packages/cli-kit/dist/private/node/analytics/error-grouping.d.ts
/**
 * Structured signals extracted from an error, used to group it into a meaningful Bugsnag bucket.
 *
 * These are read from the *original* typed error (before it is flattened to a generic `Error`
 * for reporting), so we group on facts the error already carries rather than by re-parsing a
 * stringified message.
 */
interface ErrorGroupingSignals {
    httpStatus?: number;
    code?: string;
    errorClass?: string;
}
/**
 * Extracts structured grouping signals from an error object.
 *
 * @param error - The original error (any type).
 * @returns The HTTP status, GraphQL error code, and error class name when available.
 */
export declare function errorGroupingSignals(error: unknown): ErrorGroupingSignals;
/**
 * Builds a Bugsnag grouping hash of the form `${slice}:${category}:${signature}`.
 *
 * Categories are resolved structured-first (HTTP status / GraphQL code), falling back to the
 * shared keyword categorizer only for untyped errors. When no meaningful category can be
 * resolved, returns `undefined` so the caller leaves `event.groupingHash` unset and Bugsnag's
 * default stack-trace grouping applies — this avoids merging genuinely distinct unknown bugs.
 *
 * @param error - The original error (any type).
 * @param sliceName - The product slice (`app`, `theme`, `store`, `hydrogen`, or `cli`).
 * @returns The grouping hash, or `undefined` to fall back to stack-trace grouping.
 */
export declare function errorGroupingHash(error: unknown, sliceName: string): string | undefined;
export {};

Existing type declarations

We found no diffs with existing type declarations

@stephanie-shopify stephanie-shopify requested a review from mbarak June 24, 2026 18:55

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be included? it has language that is specific to the current issue and not necessarily for long term use.

const message = errorMessage(error)
// Object signals are authoritative; message-derived signals fill gaps (e.g. 5xx errors that the
// API layer flattens into an AbortError, dropping the structured status field).
const signals: ErrorGroupingSignals = {...signalsFromMessage(message), ...errorGroupingSignals(error)}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

above in errorGroupingSignals only assign if there's a value i.e. never assign null or undefined.


🐛 Bug: The comment says message-derived signals fill gaps, but object spread copies keys even when their value is explicitly undefined. errorGroupingSignals can assign httpStatus or code as undefined; spreading those values over signalsFromMessage(message) can erase a valid status or code recovered from the message fallback. That makes the fallback less reliable for flattened API errors.

Suggestion: Consider merging field-by-field with nullish coalescing, or only assigning properties to signals when the extracted value is not undefined. For example, derive fromMessage and fromError separately, then build httpStatus, code, and errorClass as fromError.value ?? fromMessage.value.

function categoryFromSignals(signals: ErrorGroupingSignals): string | undefined {
const {httpStatus, code} = signals

if (code === 'THROTTLED' || httpStatus === 429) return 'rate_limit'

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐛 Bug: The structured decision table handles HTTP 429 and extensions.code === 'THROTTLED', but existing retry logic also treats extensions.code === '429' as a GraphQL rate-limit signal, even when the HTTP status is 200. A raw ClientError with that established shape can currently miss the rate_limit bucket.

Suggestion: Consider adding code === '429' to the rate-limit branch and covering the HTTP-200/code-429 shape with a regression test.

const {httpStatus, code} = signals

if (code === 'THROTTLED' || httpStatus === 429) return 'rate_limit'
if (httpStatus === 401) return 'authentication'

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💬 Suggestion: The documented decision table treats ACCESS_DENIED as a permission signal, but the implementation checks httpStatus === 401 before code === 'ACCESS_DENIED'. If an error carries both HTTP 401 and ACCESS_DENIED, it will be grouped as authentication rather than permission.

Suggestion: Consider moving the httpStatus === 403 || code === 'ACCESS_DENIED' branch above the 401 branch if access-denied should win. If 401 is intentionally higher precedence, add a focused test or documentation note so that routing choice is explicit.

*/
function firstExtensionCode(errors: unknown): string | undefined {
if (!Array.isArray(errors)) return undefined
const code = (errors[0] as {extensions?: {code?: unknown}} | undefined)?.extensions?.code

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐛 Bug: firstExtensionCode only reads errors[0].extensions.code, but GraphQL responses can include multiple errors and existing code in this package treats those arrays as multi-entry structures. A later error may carry THROTTLED, 429, or ACCESS_DENIED, and App Management errors can carry nested extensions.app_errors.errors[].category === 'access_denied'. Missing those shapes can route known API conditions to a fallback or default bucket instead of the intended structured grouping.

Suggestion: Consider replacing the first-error lookup with a small normalization helper that scans every GraphQL error, recognizes both THROTTLED and 429 as rate-limit signals, recognizes ACCESS_DENIED and nested App Management access_denied as permission signals, and is shared with the reporting-suppression logic where possible.

if (groupingHash) {
event.groupingHash = groupingHash
}
const {httpStatus, code, errorClass} = errorGroupingSignals(error)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐛 Bug: event.groupingHash is derived through errorGroupingHash(error, sliceName), which can combine typed error signals, message-derived HTTP/code signals, and fallback keyword categorization. The error_grouping metadata immediately below is built from errorGroupingSignals(error) alone, so it can omit facts that the hash used, such as a message-derived http_status for flattened errors or a fallback category. It also omits the semantic category field that the design document says will be emitted, which weakens backend routing that is supposed to depend on stable metadata tags rather than just the hash.

Suggestion: Consider exposing a single helper from error-grouping.ts that returns the fully resolved grouping result, for example {hash, category, signals}, and use that result for both event.groupingHash and event.addMetadata('error_grouping', ...). That would also avoid recalculating errorGroupingSignals(error) separately for the hash and the metadata.

return false
}
const status = error.response?.status
if (status === 401 || status === 429) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is 401 transient?

const errors = error.response?.errors
if (Array.isArray(errors)) {
const code = (errors[0] as {extensions?: {code?: unknown}} | undefined)?.extensions?.code
return code === 'THROTTLED'

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐛 Bug: isTransientApiError suppresses HTTP 429 and first-error THROTTLED, but it misses the established GraphQL extensions.code === '429' rate-limit shape and any transient code that appears after the first error entry. After retries are exhausted, those raw ClientErrors could still be treated as unexpected and sent to Bugsnag even though the retry layer classifies them as transient rate limiting.

Suggestion: Consider scanning all entries in error.response.errors and treating both THROTTLED and 429 as transient. A regression test like shouldReportErrorAsUnexpected(clientError(200, '429')) === false would keep the suppression logic aligned with the retry-layer fixture shape.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area: @shopify/cli @shopify/cli package issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

split observe analytics group for cleaner resiliency tracking

2 participants