bug: ExecSandbox hangs indefinitely after the command exits when a supervisor session resets mid-exec

### Agent Diagnostic


Investigated an OpenShell gateway running the Kubernetes compute driver under sustained
high concurrency (~20+ sandboxes), where agent trials hung after the agent had already
finished its work.

Tools / approach:
- `kubectl` for pod state, in-sandbox supervisor logs (`/var/log/openshell.*`), and `ps`.
- Gateway log analysis (supervisor sessions, relay channels, RPC latency).
- `openshell sandbox exec` timing probes (quick, long-idle, and background-child commands).
- Source review of the gateway exec/relay path (`grpc/sandbox.rs`, `supervisor_session.rs`,
  `multiplex.rs`) and the in-sandbox SSH server (`openshell-sandbox/src/ssh.rs`), plus the
  russh 0.57 channel-close behavior.

Findings:
- On a stuck pod the agent process had exited with **no orphan processes**, yet the
  agent's `ExecSandbox` RPC never returned. A **fresh** exec into the same sandbox
  returned in <1s — so the supervisor session/relay was healthy and reconnected; only the
  original in-flight exec channel was wedged.
- The gateway itself was healthy throughout (0 restarts, single-digit-ms RPC latency), so
  this is not saturation.
- Observed recurring **bursts of supervisor sessions resetting** with
  `h2 protocol error: error reading a body from connection`, each followed by
  `relay stream: inbound errored`. A controlled long-running exec died at the exact instant
  of one such burst; others in the same situation hung instead.
- The in-sandbox supervisor bounds its own teardown — it sends `eof` → `exit-status` →
  `close` wit
  the exec loop blocks on `channel.wait()` with no idle/liveness backstop, and an orphaned
  relay channel (left over from a reset session) never produces another message.
- Confirmed **not driver-specific**: exec runs over the same supervisor reverse-relay path
  for every compute driver (local containers, VMs, Kubernetes). The defect is latent
  everywhere; it manifests in practice on networked / higher-concurrency deployments
  (where supervisor sessions actually reset often enough to orphan an in-flight exec) and
  is rarely seen on a local single-host gateway.

Tried: isolated long channel-silent execs and background-child execs both return fine on a
healthy session; the hang only appears when a supervisor session resets while an exec is in
flight, leaving the relay channel orphaned and the gateway exec loop parked.


### Description

The bug occured constantly when trying to scale test openshell on K8S environment, specicily running swe-bench with Harbor framework which requires having a long running exec command.

**Actual behavior:** When a supervisor session resets while an `ExecSandbox` is in flight,
the relay channel carrying that exec is orphaned. The supervisor reconnects a fresh session
(new execs work immediately), but the gateway's exec loop stays blocked on `channel.wait()`
waiting for an SSH close/exit that will never arrive — there is no idle or liveness timeout
to break out. The streaming RPC never returns and the caller hangs until its own deadline.
It is most visible for long-running, channel-silent commands (e.g. an agent that redirects
stdout to a file), and wedged channels accumulate until a sandbox stops responding to new
execs.

**Expected behavior:** A completed or orphaned exec must always terminate the `ExecSandbox`
stream promptly — with a result or an error — regardless of which SSH control messages are
lost or whether the underlying supervisor session was reset.

### Reproduction Steps

The defect is in the shared gateway relay path (every compute driver uses it), but it only
reproduced reliably on a **networked, concurrent Kubernetes deployment** — that is where
supervisor sessions actually reset often enough to orphan an in-flight exec. The steps that
worked:

1. Deploy the (unfixed) OpenShell gateway on a Kubernetes cluster via Helm: **single gateway
   replica**, default **ClusterIP** Service (so sandbox→gateway traffic is NAT'd through
   kube-proxy/conntrack — this idle-eviction path is the key ingredient).
2. Run a **high-concurrency** agent workload — ~20 sandboxes at once, each executing a
   **long-running, channel-silent** command (an agent that runs for many minutes and
   redirects its stdout to a file, so the exec SSH channel carries zero bytes). We used the
   Harbor harness with an agent CLI; any harness that holds ~20 long quiet `ExecSandbox`
   calls open against one gateway works.
3. Let it run. In the gateway logs you'll see periodic **bursts of supervisor-session
   resets** (in our runs roughly every 40–50 min):
   `supervisor session: stream error … "h2 protocol error: error reading a body from connection"`,
   followed by `relay stream: inbound errored` and `supervisor session: ended`. The
   supervisors reconnect immediately, but a subset of the in-flight execs are now orphaned.
4. Result: a large fraction of trials **hang after the agent has already finished** — the
   `ExecSandbox` call never returns and the trial blocks until the harness's task timeout
   (~tens of minutes), even though the task often succeeded.

The indefinite hang specifically requires a half-open / silently-dropped idle relay flow
(NAT/conntrack eviction under load) where no FIN/RST reaches the gateway — which arises
naturally in the clustered setup above but not on a local loopback gateway.

### Environment

- OpenShell: 0.0.65 (gateway + supervisor); the defect predates this version.
- Compute driver: Kubernetes (observed).
**Defect is in the shared gateway relay path, so it applies to the Docker/Podman/VM drivers as well.
- Gateway: single replica, plaintext h2c.

### Logs

```shell
Around the time an exec wedges, the gateway logs a supervisor-session reset and the relay
channels under it erroring out:

WARN supervisor_session: supervisor session: stream error sandbox_id=… error=… "h2 protocol error: error reading a body from connection"
WARN supervisor_session: relay stream: inbound errored channel_id=…
INFO supervisor_session: supervisor session: ended sandbox_id=…

The affected `ExecSandbox` logs no completion after this — it simply stops, with no error
and no exit, while the caller keeps waiting. A fresh exec into the same sandbox succeeds.
```

### Agent-First Checklist

- [x] I pointed my agent at the repo and had it investigate this issue
- [x] I loaded relevant skills (e.g., `debug-openshell-cluster`, `debug-inference`, `openshell-cli`)
- [x] My agent could not resolve this — the diagnostic above explains why

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

bug: ExecSandbox hangs indefinitely after the command exits when a supervisor session resets mid-exec #1990

Agent Diagnostic

Description

Reproduction Steps

Environment

Logs

Agent-First Checklist

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

bug: ExecSandbox hangs indefinitely after the command exits when a supervisor session resets mid-exec #1990

Description

Agent Diagnostic

Description

Reproduction Steps

Environment

Logs

Agent-First Checklist

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions