fix(supervisor): drop sandbox child capability bounding set#2001
fix(supervisor): drop sandbox child capability bounding set#2001alangou wants to merge 1 commit into
Conversation
|
|
||
| #[cfg(target_os = "linux")] | ||
| fn kernel_cap_last_cap() -> u32 { | ||
| std::fs::read_to_string(PROC_CAP_LAST_CAP_PATH) |
There was a problem hiding this comment.
My first thought here was whether there's an existing crate that would provide the functionality we need. Codex suggests that https://crates.io/crates/capctl may be applicable:
- Best fit: capctl (https://crates.io/crates/capctl) / GitHub (https://github.com/cptpcrd/capctl)
Pure Rust, direct Linux prctl() capability API, and it has the exact bounding-set surface we need: bounding::drop,
bounding::ensure_dropped, bounding::clear, bounding::clear_unknown, bounding::read/probe, plus effective-set inspection through
CapState. The important bit: it explicitly handles newer kernel capabilities via clear_unknown(), which matches the PR’s current /proc/
sys/kernel/cap_last_cap intent better than most wrappers.
There was a problem hiding this comment.
I agree that using a dedicated crate would make the parsing cleaner.
For this patch, I intentionally reused the same low-level mechanism that the codebase already uses for Linux process hardening. The goal was to keep the change narrow, avoid adding a new dependency, and avoid rewriting working capability-related code as part of a defense-in-depth fix.
My concern with adding a capabilities crate only for parsing is that it would be a bit inconsistent: if we introduce a crate that can also wrap the prctl capability operations, then we should probably use it consistently for the whole Linux capabilities path instead of mixing manual prctl calls with crate-based parsing.
So I’d prefer to keep this patch minimal, then consider a separate cleanup/refactor later if we want to move all Linux capability handling to a dedicated crate.
If we are ok to delay this fix for the NeMoClaw team I'm happy to add the crate and rewrite the capability handling with that crate. If the fix can't wait more then I can tackle this in a separate PR as well
There was a problem hiding this comment.
I'm ok to do this as a follow-up and apply it consistently.
elezar
left a comment
There was a problem hiding this comment.
A codex-assisted review:
I think this still fails closed incorrectly for the Podman path. drop_capability_bounding_set() returns Ok(()) when CAP_SETPCAP is
not effective, and also when the first PR_CAPBSET_DROP returns EPERM. But the Podman driver currently drops SETPCAP from the
supervisor container, so that path leaves the child bounding set unchanged while still spawning the workload/connect shell.
Could we either keep SETPCAP available to the supervisor until child setup, or fail the spawn when the bounding set cannot be cleared
and is still nonempty? This may also be a good place to use capctl rather than custom /proc parsing and raw prctl;
capctl::caps::bounding::clear() plus an explicit “EPERM is only OK if the bounding set is already empty” check would make the invariant
clearer.
The current regression test skips when CAP_SETPCAP is unavailable, so it would not catch the Podman-relevant failure mode.
+1. Seems like different drivers can do different drops on their own which would impact this common code that runs driver-agnostically. |
4316fff to
0c0c87e
Compare
Reduce the Linux capability bounding set in the common privilege-drop path before executing sandbox workloads or connect shells and use capctl Signed-off-by: Adrien Langou <alangou@nvidia.com>
0c0c87e to
b629752
Compare
|
@johntmyers @elezar I have implemented the requested changes |
|
|
||
| let driver = | ||
| driver_sandbox_spec_from_public(&public, None).expect("driver spec should map"); | ||
| driver_sandbox_spec_from_public(&public, "docker").expect("driver spec should map"); |
Summary
Drop the Linux capability bounding set in the common sandbox child privilege-drop path so workloads and
openshell connectshells cannot regain container-granted capabilities after exec.Related Issue
Closes #1452
Changes
drop_privileges()afterinitgroups/setgidand beforesetuid.0..=cap_last_cap, withCAP_SETPCAPdropped last and a graceful fallback whenCAP_SETPCAPis unavailable.architecture/sandbox.md.Testing
mise run pre-commitpassesAdditional checks run:
mise run fmt && mise run e2e && mise run e2e:kubernetes && mise run e2e:dockercargo test -p openshell-supervisor-process --lib process::tests -- --nocapturecargo test -p openshell-supervisor-process --lib ssh::tests::pre_exec_always_calls_drop_privileges -- --nocaptureChecklist