Agent Diagnostic
Investigated an OpenShell gateway running the Kubernetes compute driver under sustained
high concurrency (~20+ sandboxes), where agent trials hung after the agent had already
finished its work.
Tools / approach:
kubectl for pod state, in-sandbox supervisor logs (/var/log/openshell.*), and ps.
- Gateway log analysis (supervisor sessions, relay channels, RPC latency).
openshell sandbox exec timing probes (quick, long-idle, and background-child commands).
- Source review of the gateway exec/relay path (
grpc/sandbox.rs, supervisor_session.rs,
multiplex.rs) and the in-sandbox SSH server (openshell-sandbox/src/ssh.rs), plus the
russh 0.57 channel-close behavior.
Findings:
- On a stuck pod the agent process had exited with no orphan processes, yet the
agent's ExecSandbox RPC never returned. A fresh exec into the same sandbox
returned in <1s — so the supervisor session/relay was healthy and reconnected; only the
original in-flight exec channel was wedged.
- The gateway itself was healthy throughout (0 restarts, single-digit-ms RPC latency), so
this is not saturation.
- Observed recurring bursts of supervisor sessions resetting with
h2 protocol error: error reading a body from connection, each followed by
relay stream: inbound errored. A controlled long-running exec died at the exact instant
of one such burst; others in the same situation hung instead.
- The in-sandbox supervisor bounds its own teardown — it sends
eof → exit-status →
close wit
the exec loop blocks on channel.wait() with no idle/liveness backstop, and an orphaned
relay channel (left over from a reset session) never produces another message.
- Confirmed not driver-specific: exec runs over the same supervisor reverse-relay path
for every compute driver (local containers, VMs, Kubernetes). The defect is latent
everywhere; it manifests in practice on networked / higher-concurrency deployments
(where supervisor sessions actually reset often enough to orphan an in-flight exec) and
is rarely seen on a local single-host gateway.
Tried: isolated long channel-silent execs and background-child execs both return fine on a
healthy session; the hang only appears when a supervisor session resets while an exec is in
flight, leaving the relay channel orphaned and the gateway exec loop parked.
Description
The bug occured constantly when trying to scale test openshell on K8S environment, specicily running swe-bench with Harbor framework which requires having a long running exec command.
Actual behavior: When a supervisor session resets while an ExecSandbox is in flight,
the relay channel carrying that exec is orphaned. The supervisor reconnects a fresh session
(new execs work immediately), but the gateway's exec loop stays blocked on channel.wait()
waiting for an SSH close/exit that will never arrive — there is no idle or liveness timeout
to break out. The streaming RPC never returns and the caller hangs until its own deadline.
It is most visible for long-running, channel-silent commands (e.g. an agent that redirects
stdout to a file), and wedged channels accumulate until a sandbox stops responding to new
execs.
Expected behavior: A completed or orphaned exec must always terminate the ExecSandbox
stream promptly — with a result or an error — regardless of which SSH control messages are
lost or whether the underlying supervisor session was reset.
Reproduction Steps
The defect is in the shared gateway relay path (every compute driver uses it), but it only
reproduced reliably on a networked, concurrent Kubernetes deployment — that is where
supervisor sessions actually reset often enough to orphan an in-flight exec. The steps that
worked:
- Deploy the (unfixed) OpenShell gateway on a Kubernetes cluster via Helm: single gateway
replica, default ClusterIP Service (so sandbox→gateway traffic is NAT'd through
kube-proxy/conntrack — this idle-eviction path is the key ingredient).
- Run a high-concurrency agent workload — ~20 sandboxes at once, each executing a
long-running, channel-silent command (an agent that runs for many minutes and
redirects its stdout to a file, so the exec SSH channel carries zero bytes). We used the
Harbor harness with an agent CLI; any harness that holds ~20 long quiet ExecSandbox
calls open against one gateway works.
- Let it run. In the gateway logs you'll see periodic bursts of supervisor-session
resets (in our runs roughly every 40–50 min):
supervisor session: stream error … "h2 protocol error: error reading a body from connection",
followed by relay stream: inbound errored and supervisor session: ended. The
supervisors reconnect immediately, but a subset of the in-flight execs are now orphaned.
- Result: a large fraction of trials hang after the agent has already finished — the
ExecSandbox call never returns and the trial blocks until the harness's task timeout
(~tens of minutes), even though the task often succeeded.
The indefinite hang specifically requires a half-open / silently-dropped idle relay flow
(NAT/conntrack eviction under load) where no FIN/RST reaches the gateway — which arises
naturally in the clustered setup above but not on a local loopback gateway.
Environment
- OpenShell: 0.0.65 (gateway + supervisor); the defect predates this version.
- Compute driver: Kubernetes (observed).
**Defect is in the shared gateway relay path, so it applies to the Docker/Podman/VM drivers as well.
- Gateway: single replica, plaintext h2c.
Logs
Around the time an exec wedges, the gateway logs a supervisor-session reset and the relay
channels under it erroring out:
WARN supervisor_session: supervisor session: stream error sandbox_id=… error=… "h2 protocol error: error reading a body from connection"
WARN supervisor_session: relay stream: inbound errored channel_id=…
INFO supervisor_session: supervisor session: ended sandbox_id=…
The affected `ExecSandbox` logs no completion after this — it simply stops, with no error
and no exit, while the caller keeps waiting. A fresh exec into the same sandbox succeeds.
Agent-First Checklist
Agent Diagnostic
Investigated an OpenShell gateway running the Kubernetes compute driver under sustained
high concurrency (~20+ sandboxes), where agent trials hung after the agent had already
finished its work.
Tools / approach:
kubectlfor pod state, in-sandbox supervisor logs (/var/log/openshell.*), andps.openshell sandbox exectiming probes (quick, long-idle, and background-child commands).grpc/sandbox.rs,supervisor_session.rs,multiplex.rs) and the in-sandbox SSH server (openshell-sandbox/src/ssh.rs), plus therussh 0.57 channel-close behavior.
Findings:
agent's
ExecSandboxRPC never returned. A fresh exec into the same sandboxreturned in <1s — so the supervisor session/relay was healthy and reconnected; only the
original in-flight exec channel was wedged.
this is not saturation.
h2 protocol error: error reading a body from connection, each followed byrelay stream: inbound errored. A controlled long-running exec died at the exact instantof one such burst; others in the same situation hung instead.
eof→exit-status→closewitthe exec loop blocks on
channel.wait()with no idle/liveness backstop, and an orphanedrelay channel (left over from a reset session) never produces another message.
for every compute driver (local containers, VMs, Kubernetes). The defect is latent
everywhere; it manifests in practice on networked / higher-concurrency deployments
(where supervisor sessions actually reset often enough to orphan an in-flight exec) and
is rarely seen on a local single-host gateway.
Tried: isolated long channel-silent execs and background-child execs both return fine on a
healthy session; the hang only appears when a supervisor session resets while an exec is in
flight, leaving the relay channel orphaned and the gateway exec loop parked.
Description
The bug occured constantly when trying to scale test openshell on K8S environment, specicily running swe-bench with Harbor framework which requires having a long running exec command.
Actual behavior: When a supervisor session resets while an
ExecSandboxis in flight,the relay channel carrying that exec is orphaned. The supervisor reconnects a fresh session
(new execs work immediately), but the gateway's exec loop stays blocked on
channel.wait()waiting for an SSH close/exit that will never arrive — there is no idle or liveness timeout
to break out. The streaming RPC never returns and the caller hangs until its own deadline.
It is most visible for long-running, channel-silent commands (e.g. an agent that redirects
stdout to a file), and wedged channels accumulate until a sandbox stops responding to new
execs.
Expected behavior: A completed or orphaned exec must always terminate the
ExecSandboxstream promptly — with a result or an error — regardless of which SSH control messages are
lost or whether the underlying supervisor session was reset.
Reproduction Steps
The defect is in the shared gateway relay path (every compute driver uses it), but it only
reproduced reliably on a networked, concurrent Kubernetes deployment — that is where
supervisor sessions actually reset often enough to orphan an in-flight exec. The steps that
worked:
replica, default ClusterIP Service (so sandbox→gateway traffic is NAT'd through
kube-proxy/conntrack — this idle-eviction path is the key ingredient).
long-running, channel-silent command (an agent that runs for many minutes and
redirects its stdout to a file, so the exec SSH channel carries zero bytes). We used the
Harbor harness with an agent CLI; any harness that holds ~20 long quiet
ExecSandboxcalls open against one gateway works.
resets (in our runs roughly every 40–50 min):
supervisor session: stream error … "h2 protocol error: error reading a body from connection",followed by
relay stream: inbound erroredandsupervisor session: ended. Thesupervisors reconnect immediately, but a subset of the in-flight execs are now orphaned.
ExecSandboxcall never returns and the trial blocks until the harness's task timeout(~tens of minutes), even though the task often succeeded.
The indefinite hang specifically requires a half-open / silently-dropped idle relay flow
(NAT/conntrack eviction under load) where no FIN/RST reaches the gateway — which arises
naturally in the clustered setup above but not on a local loopback gateway.
Environment
**Defect is in the shared gateway relay path, so it applies to the Docker/Podman/VM drivers as well.
Logs
Agent-First Checklist
debug-openshell-cluster,debug-inference,openshell-cli)