6 changes: 3 additions & 3 deletions .agents/skills/debug-openshell-cluster/SKILL.md
Original file line number Diff line number Diff line change
@@ -101,7 +101,7 @@ Common findings:
- Sandbox image missing or pull denied: verify image reference and registry credentials.
- Docker driver cannot initialize because it cannot find `openshell-sandbox`: verify `OPENSHELL_DOCKER_SUPERVISOR_BIN`, the sibling binary next to `openshell-gateway`, or the configured supervisor image contains `/openshell-sandbox`.
- Sandbox never registers: check gateway logs and supervisor callback endpoint.
- Supervisor image exits before printing `openshell-sandbox --version`: the image should be the scratch supervisor image from `deploy/docker/Dockerfile.supervisor` and must contain a static executable at `/openshell-sandbox`.
- Supervisor image exits before printing `openshell-sandbox --version`: the image should be the scratch supervisor image from `deploy/container/Dockerfile.supervisor` and must contain a static executable at `/openshell-sandbox`.
- `mise run e2e:docker:gpu` fails with `docker info --format json did not report any discovered NVIDIA CDI GPU devices`: Docker may report `CDISpecDirs` while still having no generated NVIDIA CDI specs. Verify `.DiscoveredDevices` contains entries such as `nvidia.com/gpu=all`, verify `/etc/cdi` or `/var/run/cdi` contains a generated NVIDIA spec, and check that `nvidia-cdi-refresh.service` and `nvidia-cdi-refresh.path` from NVIDIA Container Toolkit are enabled and healthy. The service is a one-shot unit, so `inactive (dead)` can be normal after a successful run; use `systemctl status` and `journalctl` to distinguish success from a skipped or failed refresh. NVIDIA recommends enabling the path and service units, and restarting `nvidia-cdi-refresh.service` to regenerate missing or stale CDI specs. If specs are generated but Docker still reports no discovered devices, restart Docker or reload the daemon and re-check `docker info`.
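
The CDI check in the last finding can be sketched as a small shell helper. This is a minimal sketch, not part of OpenShell: `has_nvidia_cdi_devices` is a hypothetical name, and it greps the whole `docker info --format json` document instead of parsing `.DiscoveredDevices`, so treat a positive result as a hint only.

```shell
#!/usr/bin/env bash
# Hypothetical helper (assumption, not an OpenShell script): succeeds when the
# JSON on stdin mentions any NVIDIA CDI device ID such as nvidia.com/gpu=all.
# It does not verify that the ID sits under .DiscoveredDevices.
has_nvidia_cdi_devices() {
  grep -q 'nvidia\.com/gpu='
}

# Usage on a Docker host (left commented so the sketch runs without Docker):
#   docker info --format json | has_nvidia_cdi_devices \
#     || echo "no NVIDIA CDI devices discovered"
#   ls /etc/cdi /var/run/cdi
#   systemctl status nvidia-cdi-refresh.service nvidia-cdi-refresh.path
#   journalctl -u nvidia-cdi-refresh.service
```

If the helper fails while a generated spec exists under `/etc/cdi` or `/var/run/cdi`, the restart steps described in the finding are the next thing to try.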

For source checkout development, restart the local gateway with:
@@ -170,11 +170,11 @@ kubectl -n openshell get statefulset openshell -o jsonpath="{.spec.template.spec
helm -n openshell get values openshell | grep -E 'repository|tag|supervisorImage'
```

The gateway image built from `deploy/docker/Dockerfile.gateway` and the scratch supervisor image built from `deploy/docker/Dockerfile.supervisor` should use the same build tag in branch and E2E deploys. A stale supervisor image can make sandbox behavior lag behind gateway policy or proto changes.
The gateway image built from `deploy/container/Dockerfile.gateway` and the scratch supervisor image built from `deploy/container/Dockerfile.supervisor` should use the same build tag in branch and E2E deploys. A stale supervisor image can make sandbox behavior lag behind gateway policy or proto changes.
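
A quick way to spot a stale supervisor image is to compare the tags of the two image references reported by the commands above. The helpers below are a hedged sketch: `image_tag` and `same_build_tag` are hypothetical names, and the image references in the usage comment are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (assumption, not part of the repository).

# Print the tag of an image reference. A reference without a tag, including a
# digest-only reference, is reported as "latest".
image_tag() {
  ref="${1%%@*}"      # drop any @sha256:... digest
  last="${ref##*/}"   # last path component, so a registry port is not read as a tag
  case "$last" in
    *:*) printf '%s\n' "${last##*:}" ;;
    *)   printf 'latest\n' ;;
  esac
}

# Succeed only when both references carry the same tag.
same_build_tag() {
  [ "$(image_tag "$1")" = "$(image_tag "$2")" ]
}

# Usage (placeholder references):
#   same_build_tag "$GATEWAY_IMAGE" "$SUPERVISOR_IMAGE" || echo "tag mismatch"
```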

For local/external pull mode (the default local path via `mise run cluster`), local images are tagged to the configured local registry base, pushed to that registry, and pulled by k3s via the `registries.yaml` mirror endpoint. The `cluster` task pushes prebuilt local tags (`openshell/*:dev`, falling back to `localhost:5000/openshell/*:dev` or `127.0.0.1:5000/openshell/*:dev`).

Gateway image builds stage a partial Rust workspace from `deploy/docker/Dockerfile.images`. If cargo fails with a missing manifest under `/build/crates/...`, or an imported symbol exists locally but is missing in the image build, verify that every current gateway dependency crate, including `openshell-driver-docker`, `openshell-driver-kubernetes`, and `openshell-ocsf`, is copied into the staged workspace there.
Gateway image builds stage a partial Rust workspace from `deploy/container/Dockerfile.images`. If cargo fails with a missing manifest under `/build/crates/...`, or an imported symbol exists locally but is missing in the image build, verify that every current gateway dependency crate, including `openshell-driver-docker`, `openshell-driver-kubernetes`, and `openshell-ocsf`, is copied into the staged workspace there.

For plaintext local evaluation, confirm the chart has:

2 changes: 1 addition & 1 deletion .claude/agent-memory/arch-doc-writer/MEMORY.md
@@ -62,7 +62,7 @@
- Four runtime images: sandbox (5 stages), gateway (2 stages), cluster (k3s base), pki-job (Alpine)
- Two build-only images: python-wheels (Linux multi-arch), python-wheels-macos (osxcross cross-compile)
- CI image: Dockerfile.ci (Ubuntu 24.04, pre-installs docker/buildx/aws/kubectl/helm/mise/uv/sccache/socat)
- Cross-compilation: `deploy/docker/cross-build.sh` shared by sandbox + gateway Dockerfiles
- Cross-compilation: `deploy/container/cross-build.sh` shared by sandbox + gateway Dockerfiles
- Sandbox image has coding-agents stage: Claude CLI (native installer), OpenCode, Codex (npm)
- Helm chart deploys a StatefulSet (NOT Deployment), PVC 1Gi at /var/openshell
- Cluster image does NOT bundle image tarballs -- components pulled at runtime from distribution registry
4 changes: 2 additions & 2 deletions .github/workflows/ci-image.yml
@@ -4,7 +4,7 @@ on:
push:
branches: [main]
paths:
- 'deploy/docker/Dockerfile.ci'
- 'deploy/container/Dockerfile.ci'
- 'mise.toml'
- 'mise.lock'
- 'tasks/**'
@@ -72,7 +72,7 @@ jobs:
--cache-to "type=gha,mode=max,scope=ci-image-${{ matrix.arch }}" \
--push \
-t "$ARCH_IMAGE" \
-f deploy/docker/Dockerfile.ci \
-f deploy/container/Dockerfile.ci \
.

- name: Smoke check CI image
4 changes: 2 additions & 2 deletions .github/workflows/docker-build.yml
@@ -222,7 +222,7 @@ jobs:
set -euo pipefail
binary="${{ needs.resolve.outputs.binary_name }}"
download_dir="prebuilt-rust-binary"
stage="deploy/docker/.build/prebuilt-binaries/${{ matrix.arch }}"
stage="deploy/container/.build/prebuilt-binaries/${{ matrix.arch }}"
found="$(find "$download_dir" -type f -name "$binary" -print -quit)"
if [[ -z "$found" ]]; then
echo "missing downloaded artifact file: $binary" >&2
@@ -238,7 +238,7 @@ jobs:
DOCKER_BUILDER: openshell
run: |
set -euo pipefail
mise exec -- tasks/scripts/docker-build-image.sh "${{ inputs.component }}" \
mise exec -- tasks/scripts/container-build-image.sh "${{ inputs.component }}" \
--cache-from "type=gha,scope=${{ inputs.component }}-${{ matrix.arch }}" \
--cache-to "type=gha,mode=max,scope=${{ inputs.component }}-${{ matrix.arch }}"

2 changes: 1 addition & 1 deletion .github/workflows/driver-vm-macos.yml
@@ -193,7 +193,7 @@ jobs:
run: |
set -euo pipefail
docker buildx build \
--file deploy/docker/Dockerfile.driver-vm-macos \
--file deploy/container/Dockerfile.driver-vm-macos \
--build-arg OPENSHELL_CARGO_VERSION="${{ inputs['cargo-version'] }}" \
--build-arg OPENSHELL_IMAGE_TAG="${{ inputs['image-tag'] }}" \
--build-arg CARGO_TARGET_CACHE_SCOPE="${{ github.sha }}" \
4 changes: 2 additions & 2 deletions .github/workflows/release-dev.yml
@@ -352,7 +352,7 @@ jobs:
run: |
set -euo pipefail
docker buildx build \
--file deploy/docker/Dockerfile.cli-macos \
--file deploy/container/Dockerfile.cli-macos \
--build-arg OPENSHELL_CARGO_VERSION="${{ needs.compute-versions.outputs.cargo_version }}" \
--build-arg OPENSHELL_IMAGE_TAG=dev \
--build-arg CARGO_TARGET_CACHE_SCOPE="${{ github.sha }}" \
@@ -512,7 +512,7 @@ jobs:
run: |
set -euo pipefail
docker buildx build \
--file deploy/docker/Dockerfile.gateway-macos \
--file deploy/container/Dockerfile.gateway-macos \
--build-arg OPENSHELL_CARGO_VERSION="${{ needs.compute-versions.outputs.cargo_version }}" \
--build-arg OPENSHELL_IMAGE_TAG=dev \
--build-arg CARGO_TARGET_CACHE_SCOPE="${{ github.sha }}" \
4 changes: 2 additions & 2 deletions .github/workflows/release-tag.yml
@@ -385,7 +385,7 @@ jobs:
run: |
set -euo pipefail
docker buildx build \
--file deploy/docker/Dockerfile.cli-macos \
--file deploy/container/Dockerfile.cli-macos \
--build-arg OPENSHELL_CARGO_VERSION="${{ needs.compute-versions.outputs.cargo_version }}" \
--build-arg OPENSHELL_IMAGE_TAG="${{ needs.compute-versions.outputs.semver }}" \
--build-arg CARGO_TARGET_CACHE_SCOPE="${{ github.sha }}" \
@@ -631,7 +631,7 @@ jobs:
run: |
set -euo pipefail
docker buildx build \
--file deploy/docker/Dockerfile.gateway-macos \
--file deploy/container/Dockerfile.gateway-macos \
--build-arg OPENSHELL_CARGO_VERSION="${{ needs.compute-versions.outputs.cargo_version }}" \
--build-arg OPENSHELL_IMAGE_TAG="${{ needs.compute-versions.outputs.semver }}" \
--build-arg CARGO_TARGET_CACHE_SCOPE="${{ github.sha }}" \
2 changes: 1 addition & 1 deletion .gitignore
@@ -185,7 +185,7 @@ _build/
rootfs/

# Docker build artifacts (image tarballs, packaged helm charts)
deploy/docker/.build/
deploy/container/.build/

# Helm subchart tarballs (regenerated by `helm dependency build`)
deploy/helm/openshell/charts/
4 changes: 2 additions & 2 deletions AGENTS.md
@@ -178,9 +178,9 @@ ocsf_emit!(event);

- Always use `uv` for Python commands (e.g., `uv pip install`, `uv run`, `uv venv`)

## Docker
## Containers

- Always prefer `mise` commands over direct docker builds (e.g., `mise run docker:build` instead of `docker build`)
- Always prefer `mise` commands over direct container builds (e.g., `mise run build:container` instead of `docker build` or `podman build`)

## Cluster Infrastructure Changes

2 changes: 1 addition & 1 deletion TESTING.md
@@ -154,7 +154,7 @@ Suites:
GPU device-selection tests compare OpenShell sandboxes against a plain Docker or
Podman container that requests `--device nvidia.com/gpu=all`. The probe image
defaults to the image used by the `gateway` stage in
`deploy/docker/Dockerfile.images`; set `OPENSHELL_E2E_GPU_PROBE_IMAGE` to
`deploy/container/Dockerfile.images`; set `OPENSHELL_E2E_GPU_PROBE_IMAGE` to
override it. Per-device checks run only for NVIDIA CDI device IDs reported by
the runtime's discovered devices list, so WSL2 hosts that expose only
`nvidia.com/gpu=all` skip the index-based cases.
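
The skip rule for index-based cases can be illustrated with a small filter. This is only a sketch under assumptions: `indexed_gpu_ids` is a hypothetical helper, and the real suite may classify device IDs differently (for example, it may also handle UUID-style IDs).

```shell
#!/usr/bin/env bash
# Hypothetical helper (assumption, not the suite's actual logic): reads CDI
# device IDs from stdin, one per line, and prints only index-based NVIDIA IDs
# such as nvidia.com/gpu=0. Prints nothing when only nvidia.com/gpu=all exists.
indexed_gpu_ids() {
  grep -E '^nvidia\.com/gpu=[0-9]+$' || true
}

# Usage: an empty result means the index-based cases would be skipped.
#   printf 'nvidia.com/gpu=all\n' | indexed_gpu_ids
```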
14 changes: 5 additions & 9 deletions architecture/build.md
@@ -12,8 +12,7 @@ OpenShell builds these main artifacts:
|---|---|
| Gateway binary | `crates/openshell-server` |
| CLI package and Python SDK | `python/openshell` plus Rust binaries where packaged |
| Gateway container image | `deploy/docker/Dockerfile.gateway` |
| Supervisor container image | `deploy/docker/Dockerfile.supervisor` |
| Gateway and supervisor container images | `deploy/container/Dockerfile.images` |
| Helm chart | `deploy/helm/openshell` |
| VM driver/runtime assets | `crates/openshell-driver-vm` |
| Published docs site | `docs/` rendered by Fern config in `fern/` |
@@ -31,12 +30,10 @@ glibc 2.31 floor.

## Container Builds

The Docker image pipeline is a two-step flow: build the Rust binary natively
for the target architecture, then assemble the container image from the
prebuilt binary. The gateway image is built from `deploy/docker/Dockerfile.gateway`
and the supervisor image from `deploy/docker/Dockerfile.supervisor`. Neither
Dockerfile compiles Rust — both copy a staged binary out of
`deploy/docker/.build/prebuilt-binaries/<arch>/` into the final image.
The container image pipeline stages prebuilt Rust binaries, then builds container
images from `deploy/container/Dockerfile.images`. CI builds native artifacts on the
target architecture, stages them under `deploy/container/.build/`, and then uses
Buildx to publish per-architecture images and multi-architecture tags.
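
To show how the two targets relate to the shared Dockerfile, the sketch below prints a direct Buildx invocation for one target. `image_build_cmd` is a hypothetical helper that only prints the command; the repository's guidance is to prefer `mise` tasks over direct builds, so this is for illustration, not a recommended workflow. The `gateway` and `supervisor` targets and the `BUILD_FROM_SOURCE` build argument come from `deploy/container/Dockerfile.images`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (assumption): prints, but does not run, a Buildx command
# for one target of deploy/container/Dockerfile.images.
image_build_cmd() {
  target="$1"             # gateway or supervisor
  from_source="${2:-0}"   # 0 = staged prebuilt binaries, 1 = compile inside the build
  printf 'docker buildx build --file deploy/container/Dockerfile.images --target %s --build-arg BUILD_FROM_SOURCE=%s .\n' \
    "$target" "$from_source"
}

# Usage:
#   image_build_cmd gateway        # prebuilt path
#   image_build_cmd supervisor 1   # source-built path for local dev
```

With `BUILD_FROM_SOURCE=0`, the build expects staged binaries under `deploy/container/.build/prebuilt-binaries/<arch>/`.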

Binary staging is driven by `tasks/scripts/stage-prebuilt-binaries.sh`. Gateway
binaries use `cargo zigbuild` with GNU targets pinned to glibc 2.31, including
@@ -59,7 +56,6 @@ Runtime layout:
Static linkage is required because the image is mounted/extracted into
sandbox environments (Docker extraction, Podman image volumes, Kubernetes
init-container copy-self) and cannot rely on a dynamic loader.

Gateway image builds bake the corresponding supervisor image tag into the
gateway binary so Docker sandboxes do not depend on `:latest` by default.
Package formulas also pin Docker supervisor extraction to the matching release
6 changes: 3 additions & 3 deletions crates/openshell-driver-podman/README.md
@@ -86,8 +86,8 @@ sequenceDiagram
C->>C: entrypoint: /opt/openshell/bin/openshell-sandbox
```

The supervisor image from `deploy/docker/Dockerfile.supervisor` copies the static
`openshell-sandbox` binary to `/openshell-sandbox`.
The `supervisor` target in `deploy/container/Dockerfile.images` copies the
`openshell-sandbox` binary to `/openshell-sandbox` in the supervisor image.
Mounting that image at `/opt/openshell/bin` makes the binary available as
`/opt/openshell/bin/openshell-sandbox`.

@@ -346,4 +346,4 @@ matter compared to cluster or rootful runtimes:
netns, proxy, and relay behavior shared by all drivers.
- Container engine abstraction: `tasks/scripts/container-engine.sh` for
build/deploy support across Docker and Podman.
- Supervisor image build: `deploy/docker/Dockerfile.supervisor`.
- Supervisor image build: `deploy/container/Dockerfile.images`.
File renamed without changes.
File renamed without changes.
Original file line number Diff line number Diff line change
@@ -8,7 +8,7 @@
# wheel wrapping.
#
# Usage:
# docker buildx build -f deploy/docker/Dockerfile.cli-macos \
# docker buildx build -f deploy/container/Dockerfile.cli-macos \
# --build-arg OPENSHELL_CARGO_VERSION=0.6.0 \
# --output type=local,dest=out/ .

Original file line number Diff line number Diff line change
@@ -13,7 +13,7 @@
# include_bytes!().
#
# Usage:
# docker buildx build -f deploy/docker/Dockerfile.driver-vm-macos \
# docker buildx build -f deploy/container/Dockerfile.driver-vm-macos \
# --build-arg OPENSHELL_CARGO_VERSION=0.6.0 \
# --build-context vm-runtime-compressed=/path/to/compressed-dir \
# --output type=local,dest=out/ .
122 changes: 122 additions & 0 deletions deploy/container/Dockerfile.images
@@ -0,0 +1,122 @@
# syntax=docker/dockerfile:1.4

# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

# Shared OpenShell image build graph.
#
# Targets:
# gateway Final gateway image
# supervisor Final supervisor image (Ubuntu base, supervisor binary)
#
# Rust binaries are built natively before the image build and staged at:
# deploy/container/.build/prebuilt-binaries/<arch>/openshell-{gateway,sandbox}
#
# For local dev (Skaffold), pass --build-arg BUILD_FROM_SOURCE=1 to compile
# binaries inside Docker instead. BuildKit only executes the selected binary
# staging stage, so missing prebuilt files do not cause a build failure.

# Controls binary source: 0 = prebuilt (release), 1 = compile in Docker (local dev).
# Must be declared here (global scope) so it can be used in FROM instructions below.
ARG BUILD_FROM_SOURCE=0

# ---------------------------------------------------------------------------
# Optional in-Docker Rust build (BUILD_FROM_SOURCE=1, local dev only)
# ---------------------------------------------------------------------------
FROM rust:1.95.0-slim-bookworm AS rust-builder

RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
cmake \
pkg-config \
libssl-dev \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*

WORKDIR /build

COPY Cargo.toml Cargo.lock ./
COPY crates/ crates/
COPY proto/ proto/
COPY providers/ providers/

RUN --mount=type=cache,target=/usr/local/cargo/registry \
--mount=type=cache,target=/build/target \
cargo build --release \
--features "openshell-core/dev-settings" \
--bin openshell-gateway \
--bin openshell-sandbox \
&& mkdir -p /build/out \
&& install -m 0755 target/release/openshell-gateway /build/out/openshell-gateway \
&& install -m 0755 target/release/openshell-sandbox /build/out/openshell-sandbox

# ---------------------------------------------------------------------------
# Per-arch binary stages
# ---------------------------------------------------------------------------

# Prebuilt path (release default, BUILD_FROM_SOURCE=0)
FROM scratch AS gateway-binary-0
ARG TARGETARCH
# --chmod=755 restores the executable bit, which actions/upload-artifact +
# download-artifact strip during the roundtrip.
COPY --chmod=755 deploy/container/.build/prebuilt-binaries/${TARGETARCH}/openshell-gateway /build/out/openshell-gateway

# Source-built path (local dev, BUILD_FROM_SOURCE=1)
FROM rust-builder AS gateway-binary-1

FROM gateway-binary-${BUILD_FROM_SOURCE} AS gateway-binary

# Prebuilt path (release default, BUILD_FROM_SOURCE=0)
FROM scratch AS supervisor-binary-0
ARG TARGETARCH
# --chmod=755 restores the executable bit, which actions/upload-artifact +
# download-artifact strip during the roundtrip.
COPY --chmod=755 deploy/container/.build/prebuilt-binaries/${TARGETARCH}/openshell-sandbox /build/out/openshell-sandbox

# Source-built path (local dev, BUILD_FROM_SOURCE=1)
FROM rust-builder AS supervisor-binary-1

FROM supervisor-binary-${BUILD_FROM_SOURCE} AS supervisor-binary

# ---------------------------------------------------------------------------
# Final gateway image
# ---------------------------------------------------------------------------
FROM nvcr.io/nvidia/base/ubuntu:noble-20251013 AS gateway

RUN apt-get update && apt-get install -y --no-install-recommends \
ca-certificates && \
apt-get install -y --only-upgrade gpgv && \
rm -rf /var/lib/apt/lists/*

RUN useradd --create-home --user-group openshell

WORKDIR /app

COPY --from=gateway-binary /build/out/openshell-gateway /usr/local/bin/

RUN mkdir -p /build/crates/openshell-server
COPY --chmod=755 crates/openshell-server/migrations /build/crates/openshell-server/migrations

USER openshell
EXPOSE 8080

ENTRYPOINT ["openshell-gateway"]
CMD ["--bind-address", "0.0.0.0", "--port", "8080"]

# ---------------------------------------------------------------------------
# Final supervisor image
# ---------------------------------------------------------------------------
# Supervisor image based on the same NVIDIA Ubuntu base used by the gateway.
#
# Used by:
# - Docker driver: binary is extracted from the image and run inside the
# agent container.
# - Podman driver: image is mounted as an OCI volume at /opt/openshell/bin.
# - Kubernetes driver: image runs as an init container that invokes the
# binary's `copy-self` subcommand to seed an emptyDir volume.
#
# An Ubuntu base provides glibc and the dynamic loader needed to exec the
# dynamically linked binary. `FROM scratch` would be smaller but cannot run
# the binary, breaking the Kubernetes init-container path.
FROM nvcr.io/nvidia/base/ubuntu:noble-20251013 AS supervisor
COPY --from=supervisor-binary /build/out/openshell-sandbox /openshell-sandbox
Original file line number Diff line number Diff line change
@@ -24,7 +24,7 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --default-toolchain 1.95.0
RUN pip install --no-cache-dir maturin

COPY deploy/docker/cross-build.sh /usr/local/bin/
COPY deploy/container/cross-build.sh /usr/local/bin/

FROM base AS builder

Original file line number Diff line number Diff line change
@@ -6,7 +6,7 @@
# Shared Rust cross-compilation helpers for multi-arch Docker builds.
#
# Source this script in Dockerfile RUN layers:
# COPY deploy/docker/cross-build.sh /usr/local/bin/
# COPY deploy/container/cross-build.sh /usr/local/bin/
# RUN . cross-build.sh && install_cross_toolchain && add_rust_target
# RUN . cross-build.sh && cargo_cross_build --release -p my-crate
#