
bug: ExecSandbox hangs indefinitely after the command exits when a supervisor session resets mid-exec #1990

Description

@Gal-Zaidman

Agent Diagnostic

Investigated an OpenShell gateway running the Kubernetes compute driver under sustained
high concurrency (~20+ sandboxes), where agent trials hung after the agent had already
finished its work.

Tools / approach:

  • kubectl for pod state, in-sandbox supervisor logs (/var/log/openshell.*), and ps.
  • Gateway log analysis (supervisor sessions, relay channels, RPC latency).
  • openshell sandbox exec timing probes (quick, long-idle, and background-child commands).
  • Source review of the gateway exec/relay path (grpc/sandbox.rs, supervisor_session.rs,
    multiplex.rs) and the in-sandbox SSH server (openshell-sandbox/src/ssh.rs), plus the
    russh 0.57 channel-close behavior.

Findings:

  • On a stuck pod the agent process had exited with no orphan processes, yet the
    agent's ExecSandbox RPC never returned. A fresh exec into the same sandbox
    returned in <1s — so the supervisor session/relay was healthy and reconnected; only the
    original in-flight exec channel was wedged.
  • The gateway itself was healthy throughout (0 restarts, single-digit-ms RPC latency), so
    this is not saturation.
  • Observed recurring bursts of supervisor sessions resetting with
    h2 protocol error: error reading a body from connection, each followed by
    relay stream: inbound errored. A controlled long-running exec died at the exact instant
    of one such burst; others in the same situation hung instead.
  • The in-sandbox supervisor bounds its own teardown (it sends eof, exit-status, and
    close), but on the gateway side the exec loop blocks on channel.wait() with no
    idle/liveness backstop, and an orphaned relay channel (left over from a reset
    session) never produces another message.
  • Confirmed not driver-specific: exec runs over the same supervisor reverse-relay path
    for every compute driver (local containers, VMs, Kubernetes). The defect is latent
    everywhere; it manifests in practice on networked / higher-concurrency deployments
    (where supervisor sessions actually reset often enough to orphan an in-flight exec) and
    is rarely seen on a local single-host gateway.

Tried: isolated long channel-silent execs and background-child execs both return fine on a
healthy session; the hang only appears when a supervisor session resets while an exec is in
flight, leaving the relay channel orphaned and the gateway exec loop parked.

Description

The bug occurred constantly when scale-testing OpenShell on a Kubernetes environment, specifically when running SWE-bench with the Harbor framework, which requires a long-running exec command.

Actual behavior: When a supervisor session resets while an ExecSandbox is in flight,
the relay channel carrying that exec is orphaned. The supervisor reconnects a fresh session
(new execs work immediately), but the gateway's exec loop stays blocked on channel.wait()
waiting for an SSH close/exit that will never arrive — there is no idle or liveness timeout
to break out. The streaming RPC never returns and the caller hangs until its own deadline.
It is most visible for long-running, channel-silent commands (e.g. an agent that redirects
stdout to a file), and wedged channels accumulate until a sandbox stops responding to new
execs.
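
One way to picture the missing link between a session reset and the execs riding on that session is a small registry keyed by session. The sketch below is illustrative only and is not taken from the OpenShell source: `InFlightExecs`, `ExecEvent`, and the `u64` session id are hypothetical names, and plain `std::sync::mpsc` channels stand in for whatever the gateway actually uses. It shows the idea that ending a session should actively wake every exec loop registered on it rather than leaving them parked.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Mutex;

/// Hypothetical event delivered to a parked exec loop.
#[derive(Debug, PartialEq)]
pub enum ExecEvent {
    SessionReset,
}

/// Hypothetical registry: which in-flight execs ride on which supervisor session.
#[derive(Default)]
pub struct InFlightExecs {
    by_session: Mutex<HashMap<u64, Vec<Sender<ExecEvent>>>>,
}

impl InFlightExecs {
    /// Register one exec on `session_id`. The exec loop keeps the receiver
    /// and watches it alongside its SSH channel.
    pub fn register(&self, session_id: u64) -> Receiver<ExecEvent> {
        let (tx, rx) = channel();
        self.by_session
            .lock()
            .unwrap()
            .entry(session_id)
            .or_default()
            .push(tx);
        rx
    }

    /// Call when a supervisor session ends or resets. Every exec still
    /// registered on it is told so; returns how many were notified.
    pub fn session_ended(&self, session_id: u64) -> usize {
        let waiters = self
            .by_session
            .lock()
            .unwrap()
            .remove(&session_id)
            .unwrap_or_default();
        waiters
            .into_iter()
            // A failed send means that exec already finished and dropped its receiver.
            .filter(|tx| tx.send(ExecEvent::SessionReset).is_ok())
            .count()
    }
}
```

In this model, the handler that logs `supervisor session: ended` would call `session_ended`, and each exec loop would treat `ExecEvent::SessionReset` as a terminal error for its stream. A real implementation would also need to deregister execs that finish normally, which is omitted here.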

Expected behavior: A completed or orphaned exec must always terminate the ExecSandbox
stream promptly — with a result or an error — regardless of which SSH control messages are
lost or whether the underlying supervisor session was reset.
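
The expected behavior can be illustrated with a wait loop that never blocks unconditionally. This is a minimal sketch under stated assumptions, not the gateway's code: `ChannelMsg`, `ExecOutcome`, `wait_for_exec`, the `session_alive` callback, and the `poll` / `close_grace` parameters are all hypothetical, and a `std::sync::mpsc` receiver stands in for the russh channel. A sender that is kept alive but silent models the orphaned relay channel.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the messages an exec channel can deliver.
#[derive(Debug, PartialEq)]
pub enum ChannelMsg {
    Data(Vec<u8>),
    ExitStatus(i32),
    Close,
}

/// Hypothetical terminal result of one exec.
#[derive(Debug, PartialEq)]
pub enum ExecOutcome {
    Exited(i32),
    ClosedWithoutStatus,
    SessionLost,
}

/// Wait for an exec to finish without ever blocking indefinitely.
/// `poll` is how often liveness is re-checked; `close_grace` is how long to
/// wait for a close after an exit status has already arrived.
pub fn wait_for_exec(
    rx: &Receiver<ChannelMsg>,
    poll: Duration,
    close_grace: Duration,
    session_alive: impl Fn() -> bool,
) -> ExecOutcome {
    let mut exit: Option<(i32, Instant)> = None;
    loop {
        match rx.recv_timeout(poll) {
            // Output bytes carry no completion information in this sketch.
            Ok(ChannelMsg::Data(_)) => {}
            Ok(ChannelMsg::ExitStatus(code)) => exit = Some((code, Instant::now())),
            // Normal teardown, or the channel itself went away.
            Ok(ChannelMsg::Close) | Err(RecvTimeoutError::Disconnected) => {
                return match exit {
                    Some((code, _)) => ExecOutcome::Exited(code),
                    None => ExecOutcome::ClosedWithoutStatus,
                };
            }
            // Silence: decide whether continuing to wait can still succeed.
            Err(RecvTimeoutError::Timeout) => match exit {
                // Exit status known; the close was lost or the session died.
                Some((code, seen_at))
                    if seen_at.elapsed() >= close_grace || !session_alive() =>
                {
                    return ExecOutcome::Exited(code);
                }
                // No exit status and the carrying session is gone: orphaned.
                None if !session_alive() => return ExecOutcome::SessionLost,
                // Session still alive: a quiet command is legitimate, keep waiting.
                _ => {}
            },
        }
    }
}
```

The sketch deliberately keys on session liveness rather than a plain idle timeout, because the workloads described here are channel-silent for many minutes while still healthy; a fixed idle timeout would cut those off unless it were set very generously.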

Reproduction Steps

The defect is in the shared gateway relay path (every compute driver uses it), but it only
reproduced reliably on a networked, concurrent Kubernetes deployment — that is where
supervisor sessions actually reset often enough to orphan an in-flight exec. The steps that
worked:

  1. Deploy the (unfixed) OpenShell gateway on a Kubernetes cluster via Helm: single gateway
    replica, default ClusterIP Service (so sandbox→gateway traffic is NAT'd through
    kube-proxy/conntrack; this idle-eviction path is the key ingredient).
  2. Run a high-concurrency agent workload — ~20 sandboxes at once, each executing a
    long-running, channel-silent command (an agent that runs for many minutes and
    redirects its stdout to a file, so the exec SSH channel carries zero bytes). We used the
    Harbor harness with an agent CLI; any harness that holds ~20 long quiet ExecSandbox
    calls open against one gateway works.
  3. Let it run. In the gateway logs you'll see periodic bursts of supervisor-session
    resets (in our runs roughly every 40–50 min):
    supervisor session: stream error … "h2 protocol error: error reading a body from connection",
    followed by relay stream: inbound errored and supervisor session: ended. The
    supervisors reconnect immediately, but a subset of the in-flight execs are now orphaned.
  4. Result: a large fraction of trials hang after the agent has already finished — the
    ExecSandbox call never returns and the trial blocks until the harness's task timeout
    (~tens of minutes), even though the task often succeeded.

The indefinite hang specifically requires a half-open / silently-dropped idle relay flow
(NAT/conntrack eviction under load) where no FIN/RST reaches the gateway — which arises
naturally in the clustered setup above but not on a local loopback gateway.

Environment

  • OpenShell: 0.0.65 (gateway + supervisor); the defect predates this version.
  • Compute driver: Kubernetes (observed).
    The defect is in the shared gateway relay path, so it also applies to the Docker/Podman/VM drivers.
  • Gateway: single replica, plaintext h2c.

Logs

Around the time an exec wedges, the gateway logs a supervisor-session reset and the relay
channels under it erroring out:

WARN supervisor_session: supervisor session: stream error sandbox_id=… error=… "h2 protocol error: error reading a body from connection"
WARN supervisor_session: relay stream: inbound errored channel_id=…
INFO supervisor_session: supervisor session: ended sandbox_id=…

The affected `ExecSandbox` logs no completion after this — it simply stops, with no error
and no exit, while the caller keeps waiting. A fresh exec into the same sandbox succeeds.
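
To check whether a given gateway log contains these reset bursts, a small scanner over the three message fragments above may help. This is a rough sketch: `ResetSummary` and `summarize_resets` are hypothetical helper names, and the matching relies only on the substrings quoted in this report, so it assumes the log wording is unchanged in other versions.

```rust
/// Hypothetical tally of the three log messages that accompany a session reset.
#[derive(Debug, Default, PartialEq)]
pub struct ResetSummary {
    pub stream_errors: usize,
    pub relay_inbound_errors: usize,
    pub sessions_ended: usize,
}

/// Count reset-related lines in gateway log text by substring match.
pub fn summarize_resets(log: &str) -> ResetSummary {
    let mut summary = ResetSummary::default();
    for line in log.lines() {
        if line.contains("supervisor session: stream error") {
            summary.stream_errors += 1;
        } else if line.contains("relay stream: inbound errored") {
            summary.relay_inbound_errors += 1;
        } else if line.contains("supervisor session: ended") {
            summary.sessions_ended += 1;
        }
    }
    summary
}
```

Feeding it the contents of a saved gateway log (for example via `std::fs::read_to_string`) gives a quick count of how many sessions reset and how many relay channels errored under them. A nonzero `relay_inbound_errors` alongside callers that are still waiting is the pattern described in this issue.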

Agent-First Checklist

  • I pointed my agent at the repo and had it investigate this issue
  • I loaded relevant skills (e.g., debug-openshell-cluster, debug-inference, openshell-cli)
  • My agent could not resolve this — the diagnostic above explains why
