khook DSL v1 — specification

Normative spec for apiVersion: khook.io/v1, kind: Khook. examples/*.yaml must always validate against this document; where they disagree, this document wins. Anything marked (roadmap) is not part of v1 core — see the command coverage matrix at the bottom and ROADMAP.md.

A machine-readable JSON Schema of this spec is committed at docs/schema/v1/khook.json (also printed by khook schema). For editor validation and autocomplete, point yaml-language-server at it from the first line of a spec:

# yaml-language-server: $schema=<path or URL to khook.json>

The examples use the canonical URL, https://khook.io/schema/v1/khook.json (the schema’s $id, served by the website); a repo-relative path to the committed artifact works too — offline, or before the site is reachable.

Document envelope

apiVersion: khook.io/v1          # required, fixed
kind: Khook                      # required, fixed
metadata:
  name: my-bootstrap             # required; used in logs/state record
defaults: { ... }                # optional
steps: [ ... ]                   # required, at least one
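As an illustration of how the envelope fits together, here is a minimal sketch of a complete document. The metadata name, namespace, and ConfigMap contents are invented for the example; only the field names come from this spec.

```yaml
apiVersion: khook.io/v1
kind: Khook
metadata:
  name: demo-bootstrap          # illustrative name
defaults:
  timeout: 5m                   # same as the built-in default, shown for clarity
steps:                          # at least one step is required
  - name: demo-config
    apply:
      namespace: demo           # hypothetical namespace
      createNamespace: true
      manifests:
        - inline: |
            apiVersion: v1
            kind: ConfigMap
            metadata:
              name: demo-settings
            data:
              greeting: hello
```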

Variables

${NAME} and ${NAME:-default} are substituted textually across the whole document before YAML parsing (v0-proven approach; a variable can therefore hold any scalar). Sources, by precedence:

  1. CLI --set NAME=value
  2. CLI --var-file vars.yaml (flat NAME: value map)
  3. Environment variables prefixed KHOOK_SECRET_ (--secret-prefix to override) — same substitution as below, but the value is redacted from all khook output (logs, plan, diff, errors); see docs/cli.md
  4. Environment variables prefixed KHOOK_VAR_ (KHOOK_VAR_FOO → ${FOO}); prefix configurable with --var-prefix
  5. ${NAME:-default} fallback written in the spec

A ${NAME} with no source and no default is a validation error; all missing variables are reported at once.
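A sketch of how the precedence list plays out, assuming it is ordered highest precedence first. The variable names, values, and the vars.yaml content are hypothetical; the header comments describe the inputs rather than anything khook prints.

```yaml
# Hypothetical inputs for this sketch:
#   vars.yaml, passed with --var-file:  CLUSTER: staging-eu
#   environment:                        KHOOK_VAR_REGION=eu-west-1
#   command line:                       --set CLUSTER=prod-eu
# Expected resolution under the precedence list above:
#   CLUSTER = prod-eu    (--set outranks the var file)
#   REGION  = eu-west-1  (KHOOK_VAR_ outranks the written fallback)
#   TIER    = standard   (no source, so the fallback applies)
apiVersion: khook.io/v1
kind: Khook
metadata:
  name: bootstrap-${CLUSTER}
steps:
  - name: settings
    apply:
      manifests:
        - inline: |
            apiVersion: v1
            kind: ConfigMap
            metadata:
              name: settings
            data:
              region: "${REGION:-us-east-1}"
              tier: "${TIER:-standard}"
```

Because substitution is textual and covers the whole document, references inside comments are substituted too, so the comments in this sketch avoid the reference syntax.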

Pipelines — sprig functions on values

A reference can pipe the resolved value through sprig functions, helm-template style:

metadata:
  name: ${APP | lower | trunc 63}
stringData:
  tokenB64: ${TOKEN | b64enc}
timeout: ${WINDOW:-5|printf "%sm"}     # default applies first, then the pipes

Grammar: ${NAME}, ${NAME:-default}, ${NAME|pipeline}, ${NAME:-default|pipeline}. The pipeline is ordinary sprig — fn, fn arg, chained with | — applied to the value (helm’s . | pipeline form). Rules:

defaults:

Fallbacks for the per-step fields of the same name.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| timeout | duration | 5m | per-step execution timeout |
| retries | int | 0 | retry attempts after a failed try |
| retryDelay | duration | 10s | pause between tries |
| onError | fail \| continue | fail | fail stops scheduling new steps; continue marks the step failed and keeps going |
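A sketch of defaults combined with per-step overrides. The step names and the chosen values are illustrative; the fields are the four listed above.

```yaml
defaults:
  timeout: 10m
  retries: 1
  retryDelay: 15s
  onError: fail
steps:
  - name: cni                  # inherits all four defaults
    helm:
      chart: cilium
      repo: https://helm.cilium.io/
      version: 1.18.4
      namespace: kube-system
  - name: smoke-check          # hypothetical optional step
    needs: [cni]
    timeout: 2m                # overrides defaults.timeout for this step only
    retries: 0
    onError: continue          # a failure is recorded, but scheduling continues
    wait:
      for: condition=Ready
      on: pods
      namespace: kube-system
```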

state: — the run-state record

Optional and off by default. When present, khook journals the run in an in-cluster Secret (a per-step input hash plus every step’s outcome), and a re-run resumes: steps whose inputs are unchanged since they last succeeded are skipped (reported skipped with reason "unchanged since it succeeded in a previous run (state record)") and still satisfy needs. Change detection is per step: editing one step re-runs that step and only that step, while the rest of the spec still resumes. The record is a journal, not an ownership ledger: delete the Secret and nothing breaks; the next run just re-converges everything.

state:
  enabled: true          # optional; writing `state: {}` already opts in.
                         # `enabled: ${USE_STATE:-false}` toggles per env.
  namespace: kube-system # optional; default "default"
  name: my-bootstrap     # optional; default "khook-state-<metadata.name>"

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| enabled | bool | true when the block is present | absent state: block = disabled |
| namespace | string | default | where the record Secret lives (RFC 1123 label) |
| name | string | khook-state-<metadata.name> | record Secret name (RFC 1123 subdomain) |

Semantics, precisely:

When to enable it

Enable it when a re-run costs real time or real disruption:

Optional for mid-size specs where the per-op skipIf checks already make re-runs cheap. State still adds two things: resume decisions without any cluster probing, including for step types with no natural existence check (patch:, wait:, rollout:), and khook status visibility into what the last run did.

Skip it when:

It is never required: khook without state is already idempotent per step. State is an optimization for resume speed and run observability, not a correctness requirement.

Tradeoffs of enabling it

Steps — common fields

steps:
  - name: cilium          # required, unique, DNS-label-ish ([a-z0-9-])
    needs: [other-step]   # optional; DAG edges, must reference existing names
    when: vars.ENV == "prod"   # optional; CEL condition, see below
    timeout: 10m          # optional, overrides defaults
    retries: 2            # optional, overrides defaults
    retryDelay: 30s       # optional, overrides defaults
    onError: continue     # optional, overrides defaults
    helm: { ... }         # exactly ONE action key per step:
                          #   helm | apply | delete | patch | wait | rollout | job

The action key determines the step type — there is no type: field. Zero or two+ action keys is a validation error.

Execution: steps are topologically sorted (cycle → validation error) and run in parallel levels; a step runs only when all needs succeeded. If a step fails with onError: fail, running steps finish, nothing new starts, and every not-yet-run step is reported skipped.
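To make the level scheduling concrete, here is a sketch of a small DAG. File paths, chart path, and names are hypothetical.

```yaml
steps:
  - name: crds                      # level 1: no needs
    apply:
      manifests:
        - file: ./manifests/crds.yaml
  - name: namespaces                # level 1: runs in parallel with crds
    apply:
      manifests:
        - file: ./manifests/namespaces.yaml
  - name: operator                  # level 2: starts once both level-1 steps succeeded
    needs: [crds, namespaces]
    helm:
      chart: ./charts/operator
      namespace: operators
  - name: operator-ready            # level 3
    needs: [operator]
    rollout:
      status: deployment/operator
      namespace: operators
```

Read against the rule above: if crds fails with onError: fail while namespaces is still running, namespaces finishes, and operator and operator-ready are reported skipped.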

when: — conditional steps

when: holds a CEL expression; the step runs only when it evaluates to true. Conditions see the merged variable map and nothing else — they are decided once, at spec load time, before anything touches the cluster (plan shows the outcome, offline included).

The expression environment, on top of the CEL standard library (&&, ||, !, ==, in, startsWith, endsWith, contains, matches, ternaries):

| Expression | Meaning |
| --- | --- |
| vars | the merged variable map; every value is a string |
| vars.NAME | the variable’s value; an error if unset (like a bare ${NAME}) |
| vars.get("NAME", "default") | the variable’s value, or the default when unset |
| has(vars.NAME) | whether the variable is set |

The expression must type-check to a bool; anything else is a validation error.

steps:
  - name: argocd
    when: vars.get("ENABLE_ARGOCD", "false") == "true"
    helm: { ... }
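A few more condition shapes, sketched with only the expressions listed in the table above plus CEL standard operators. Variable names, step names, and file paths are hypothetical; the single quotes are plain YAML quoting, used here so the brackets and ampersands cannot be misread.

```yaml
steps:
  - name: prod-policies
    when: vars.ENV == "prod"            # an error if ENV is unset
    apply:
      manifests:
        - file: ./manifests/policies.yaml
  - name: eu-addon
    # has() guards the access, so an unset REGION yields false instead of an error
    when: 'has(vars.REGION) && vars.REGION.startsWith("eu-")'
    apply:
      manifests:
        - file: ./manifests/eu-addon.yaml
  - name: shared-tooling
    # every value is a string, so membership is tested against string literals
    when: 'vars.get("ENV", "dev") in ["staging", "prod"]'
    apply:
      manifests:
        - file: ./manifests/tooling.yaml
```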

Semantics:

skipIf — skip a step that’s already done

One policy, one field name, across the step types that have a natural notion of “already done”. skipIf names a predicate checked against the cluster before the step runs; when it holds, the step is skipped as success (satisfying needs). Each type accepts exactly the one predicate that fits it — anything else fails validation, and the JSON schema autocompletes the right word per type:

| Step type | Predicate | Skips when |
| --- | --- | --- |
| helm: | skipIf: installed | the release already exists (any version, any values) |
| apply: | skipIf: exists | every manifest resource already exists |
| job: | skipIf: succeeded | this step’s Job already completed successfully |

The remaining types need no skip policy: delete: treats “already absent” as a no-op (ignoreNotFound defaults to true), and wait: / rollout: status already no-op when the condition holds. skipIf trades convergence for stability — the step stops enforcing the spec’s version/values/content once its target exists, which is exactly right for “don’t touch it if it’s there” steps and wrong for steps that must converge on every run.
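A sketch showing the three predicates side by side, one per step type. The manifest path and the job image are hypothetical.

```yaml
steps:
  - name: cni
    helm:
      chart: cilium
      repo: https://helm.cilium.io/
      version: 1.18.4
      namespace: kube-system
      skipIf: installed         # leave an existing release alone, whatever its version
  - name: seed-config
    apply:
      skipIf: exists            # skipped only when every resource in the file exists
      manifests:
        - file: ./manifests/seed.yaml
  - name: migrate
    needs: [seed-config]
    job:
      image: registry.example/tools/migrate:1.0   # hypothetical image
      namespace: apps
      skipIf: succeeded         # skipped once this step's Job has completed successfully
```

A skipped step still counts as success, so migrate runs even when seed-config was skipped.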

helm: — install or upgrade a chart release

Declarative install-or-upgrade (the release history decides which; v0-proven).

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| chart | string | yes | | chart source: bare name, oci:// reference, or local path; see Chart sources |
| repo | URL | bare-name charts | | HTTP(S) chart repository URL; no repo “name” needed |
| version | string | no | latest | exact chart version (recommended: always pin); may instead be written inline as chart: name:1.2.3 |
| auth | map | no | | username: + password: for a private repo/registry; see Private sources |
| release | string | no | step name | Helm release name |
| namespace | string | no | default | target namespace |
| createNamespace | bool | no | false | create namespace if missing |
| skipIf | installed | no | | skip (success) if the release already exists, regardless of version/values; see skipIf |
| atomic | bool | no | false | roll back on failure |
| wait | bool | no | false | wait for resources ready before step succeeds |
| values | map | no | | inline values (merged over chart defaults) |
| valuesFrom | list | no | | ordered list of sources, each exactly one of - file: path / - url: https://...; later entries and values override earlier ones |

helm:
  chart: cilium
  repo: https://helm.cilium.io/
  version: 1.18.4
  namespace: kube-system
  atomic: true
  valuesFrom:
    - file: ./values/cilium.yaml
  values:
    hubble:
      enabled: true
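The override order of valuesFrom and values can be sketched as follows. The file path and URL are hypothetical.

```yaml
helm:
  chart: cilium
  repo: https://helm.cilium.io/
  version: 1.18.4
  namespace: kube-system
  valuesFrom:
    - file: ./values/base.yaml                     # earliest source, lowest precedence
    - url: https://example.com/values/prod.yaml    # overrides base.yaml where keys collide
  values:                                          # inline values override both entries above
    hubble:
      enabled: false
```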

Chart sources

The shape of chart: implies where the chart comes from — there is no discriminator field:

chart: cilium                                  # bare name — resolved in repo: (required)
chart: cilium:1.18.4                           # same, version inline (names cannot contain ":")
chart: oci://ghcr.io/org/charts/my-app:1.2.3   # OCI registry reference; repo: forbidden
chart: ./charts/my-app                         # local directory
chart: ./charts/my-app-1.2.3.tgz               # packaged chart

Rules:

Private sources — auth

Credentials come either inline as URL userinfo or as an auth: block — never both. Pass secrets as variables (KHOOK_SECRET_* values are redacted from all output); khook strips inline credentials from every log, plan, diff, and error line regardless.

# auth: block — raw values, no encoding needed
helm:
  chart: oci://123456789.dkr.ecr.us-east-1.amazonaws.com/charts/my-app:1.2.3
  auth:
    username: AWS
    password: ${ECR_TOKEN}

# inline userinfo — URL-encode anything special: ${VAR|urlquery}
helm:
  chart: my-app
  repo: https://${CHART_USER}:${CHART_PASS|urlquery}@charts.corp.example

The inline form must be a full <username>:<password>@ pair, URL-encoded where the values contain URL-special characters (|urlquery does this; ECR tokens, being base64, always need it inline — prefer the auth: block for such tokens).

ECR recipe — khook never calls AWS itself (see roadmap non-goals); whatever runs khook mints the token:

export KHOOK_SECRET_ECR_TOKEN="$(aws ecr get-login-password)"

The token is ordinary basic auth with username AWS, as in the auth: example above.

apply: — declaratively apply manifests

Server-side-agnostic kubectl apply via the dynamic client. Multi-document YAML is supported in every source.

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| manifests | list | yes | | ordered list of sources, each exactly one of inline: (YAML string), file: (path), url: (HTTP(S)), kustomize: (local kustomization directory) |
| namespace | string | no | | default namespace for namespace-less namespaced resources |
| createNamespace | bool | no | false | create namespace if missing |
| skipIf | exists | no | | skip (success) if all resources already exist; see skipIf |
| serverSide | bool | no | false | server-side apply |
| waitFor | string | no | | block until every applied object meets this; wait.for’s grammar minus delete |

apply:
  namespace: argocd
  waitFor: condition=Established
  manifests:
    - inline: |
        apiVersion: argoproj.io/v1alpha1
        kind: Application
        ...
    - file: ./manifests/extra.yaml
    - url: https://example.com/manifest.yaml
    - kustomize: ./overlays/prod

waitFor — apply + wait in one step, the kubectl apply -f x && kubectl wait --for=... -f x equivalent: after the apply, the step polls exactly the objects it applied until each meets the condition, bounded by the step timeout. It also runs when skipIf: exists short-circuits — the condition must hold whether this run created the objects or found them, so re-runs behave like first runs. Waiting on other resources (or a subset) is a separate wait: step.
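A sketch of the apply-then-wait pattern for CustomResourceDefinitions, combined with skipIf. The file paths are hypothetical, and the first file is assumed to contain only CRDs, since Established is a CRD condition.

```yaml
steps:
  - name: crds
    apply:
      skipIf: exists
      waitFor: condition=Established     # still checked when skipIf short-circuits
      manifests:
        - file: ./manifests/crds.yaml
  - name: custom-resources
    needs: [crds]                        # starts only after every CRD is Established
    apply:
      manifests:
        - file: ./manifests/resources.yaml
```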

kustomize: sources are rendered in-process (sigs.k8s.io/kustomize) and must be local paths (./, ../, or /). Kustomizations referencing remote bases fail: kustomize shells out to git for those, which the zero-runtime-deps rule excludes.

delete: — remove resources

Three mutually exclusive forms. helm: owns presence, delete: owns absence — uninstalling a Helm release is the third form here, not a mode of helm:.

By manifests (delete what these define; same source list as apply.manifests, kustomize: included):

delete:
  manifests:
    - file: ./manifests/old.yaml

By reference/selector (the bootstrap-critical form — e.g. removing aws-node before installing Cilium):

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| resource | string | yes | type (pods) or kind/name (daemonset/aws-node) |
| namespace | string | no | mutually exclusive with allNamespaces |
| allNamespaces | bool | no | |
| selector | string | no | label selector |
| fieldSelector | string | no | field selector |
| ignoreNotFound | bool | no (true) | absent resources are success, not failure |

delete:
  resource: daemonset/aws-node
  namespace: kube-system
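The selector variant of the same form might look like this sketch; the label is hypothetical.

```yaml
delete:
  resource: pods
  allNamespaces: true                               # cannot be combined with namespace:
  selector: app.kubernetes.io/name=legacy-agent     # hypothetical label
  # ignoreNotFound defaults to true, so matching nothing is a success
```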

By release (helm uninstall):

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| release | string | yes | Helm release name |
| namespace | string | no (default) | the release’s namespace |
| ignoreNotFound | bool | no (true) | an absent release is success, not failure |

The step waits until the release’s resources are gone (like the other forms), bounded by the step timeout. selector, fieldSelector, and allNamespaces do not apply.

delete:
  release: nginx-ingress
  namespace: ingress

patch: — modify a resource in place

The kubectl patch equivalent: change fields of a resource whose full manifest this spec does not own (a chart-installed DaemonSet, a default StorageClass, a CR). When the spec does own the manifest, prefer re-applying it with apply:.

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| target | string | yes | | kind/name (daemonset/aws-node) |
| namespace | string | no | default | ignored for cluster-scoped kinds |
| type | string | no | strategic | strategic \| merge \| json |
| patch | map or list | yes | | the patch body: a mapping for strategic/merge, a list of operations for json |

patch:
  target: daemonset/aws-node
  namespace: kube-system
  patch:
    spec:
      template:
        spec:
          nodeSelector:
            khook.io/non-existing: "true"

Semantics:

# merge for CRs / cluster-scoped targets; json for removals
patch:
  target: storageclass/gp3
  type: merge
  patch:
    metadata:
      annotations:
        storageclass.kubernetes.io/is-default-class: "true"
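The json type has no example above, so here is a sketch of a removal that undoes the nodeSelector from the first patch example. It assumes khook passes the list through as a standard JSON Patch (RFC 6902), as kubectl patch --type=json does; in the path, the "/" inside the key is escaped as "~1" per JSON Pointer (RFC 6901).

```yaml
patch:
  target: daemonset/aws-node
  namespace: kube-system
  type: json
  patch:
    - op: remove
      path: /spec/template/spec/nodeSelector/khook.io~1non-existing
```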

wait: — block until a condition holds

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| for | string | yes | condition=<Name>[=<value>], jsonpath=<expr>[=<value>], or delete |
| on | string | yes | resource type (pods) or kind/name (deployment/argocd-server) |
| namespace | string | no | mutually exclusive with allNamespaces |
| allNamespaces | bool | no | |
| selector | string | no | label selector |
| fieldSelector | string | no | field selector |

The step-level timeout bounds the wait — there is no separate wait.timeout.

wait:
  for: condition=Ready
  on: pods
  allNamespaces: true

for: forms, matching kubectl wait:

wait:
  for: jsonpath={.status.readyReplicas}=2
  on: deployment/coredns
  namespace: kube-system
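The delete form is not shown above; a sketch of it, paired with a delete: step, follows. The pod label k8s-app=aws-node is an assumption about how the DaemonSet's pods are labelled in the target cluster.

```yaml
steps:
  - name: remove-aws-node
    delete:
      resource: daemonset/aws-node
      namespace: kube-system
  - name: aws-node-pods-gone
    needs: [remove-aws-node]
    timeout: 3m                      # the step timeout bounds the wait
    wait:
      for: delete
      on: pods
      namespace: kube-system
      selector: k8s-app=aws-node     # assumed label
```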

rollout: — imperative rollout commands

Exactly one of restart: / status:, each taking kind/name.

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| restart | string | one of | deployment/x, daemonset/x, statefulset/x |
| status | string | one of | same format; blocks until rollout complete (bounded by step timeout) |
| namespace | string | yes | |

rollout:
  restart: daemonset/eks-pod-identity-agent
  namespace: kube-system
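Since a step takes exactly one of restart: / status:, restarting and then waiting for completion takes two steps. A sketch, with illustrative step names:

```yaml
steps:
  - name: restart-agent
    rollout:
      restart: daemonset/eks-pod-identity-agent
      namespace: kube-system
  - name: agent-rolled-out
    needs: [restart-agent]
    timeout: 5m                      # bounds how long status: blocks
    rollout:
      status: daemonset/eks-pod-identity-agent
      namespace: kube-system
```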

job: — run a container to completion

The escape hatch: anything the DSL does not model runs as a batch/v1 Job — the “run an arbitrary script” slot of the bootstrap. khook creates the Job, waits for it to finish (bounded by the step timeout), and on failure surfaces the pod’s last log lines in the step error.

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| image | string | yes | | container image to run |
| command | list | no | image entrypoint | container command (entrypoint override) |
| args | list | no | | container args |
| env | map | no | | environment variables (NAME: value) |
| namespace | string | no | default | namespace the Job runs in |
| createNamespace | bool | no | false | create namespace if missing |
| serviceAccount | string | no | namespace default | ServiceAccount for the pod |
| skipIf | succeeded | no | | skip (success) if this step’s Job already completed successfully; see skipIf |

job:
  image: public.ecr.aws/aws-cli/aws-cli:2.17.0
  command: ["sh", "-c"]
  args: ["aws sts get-caller-identity"]
  env:
    AWS_REGION: us-east-1
  namespace: kube-system
  serviceAccount: bootstrap-admin

Semantics:

Jobs do not publish values back to the spec (non-goal: khook is not a data bus). To hand data to later steps, write a Secret/ConfigMap from the job and have consumers reference it by name (secretKeyRef, existingSecret-style chart values) — the name is static and plannable, the value flows through the API server.
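A sketch of that hand-off pattern. Everything named here is hypothetical: the image (assumed to ship sh, kubectl, and base64), the Secret name, the chart path, and the chart's existingSecret value. The ServiceAccount is assumed to be allowed to create Secrets in the namespace.

```yaml
steps:
  - name: mint-db-password
    job:
      image: registry.example/tools/kubectl:1.30   # hypothetical image
      namespace: apps
      serviceAccount: bootstrap-admin
      skipIf: succeeded            # do not mint a second password on re-runs
      command: ["sh", "-c"]
      args:
        - |
          kubectl -n apps create secret generic db-credentials \
            --from-literal=password="$(head -c 24 /dev/urandom | base64)"
  - name: my-app
    needs: [mint-db-password]
    helm:
      chart: ./charts/my-app
      namespace: apps
      values:
        existingSecret: db-credentials   # static name; the value never passes through khook
```

The shell uses the dollar-parenthesis form of command substitution, which the variable grammar does not match, so the script reaches the container unchanged.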

Command coverage matrix

What a cluster bootstrap needs, mapped to the DSL. Non-goals excluded (see ROADMAP.md).

| CLI equivalent | khook | Status |
| --- | --- | --- |
| helm repo add + helm install/upgrade | helm: | v1 core |
| helm install --atomic/--wait/--create-namespace | helm.atomic/wait/createNamespace | v1 core |
| helm install -f values.yaml --set k=v | helm.valuesFrom (file or url) / helm.values | v1 core |
| helm install oci://... / local chart | helm.chart: oci://... / path | v1 core |
| helm uninstall | delete.release | v1 core |
| helm install --username/--password / helm registry login | helm.auth / URL userinfo | v1 core |
| kubectl apply -f file/url/- | apply: | v1 core |
| kubectl apply --server-side | apply.serverSide | v1 core |
| kubectl apply -k (kustomize) | apply.manifests: - kustomize: (local) | v1 core |
| kubectl apply -f x && kubectl wait -f x | apply.waitFor | v1 core |
| kubectl delete -f / by selector | delete: | v1 core |
| kubectl patch (strategic/merge/json) | patch: | v1 core |
| kubectl wait --for=condition=... | wait: | v1 core |
| kubectl wait --for=jsonpath=... | wait.for: jsonpath=... / apply.waitFor | v1 core |
| kubectl rollout restart/status | rollout: | v1 core |
| kubectl create namespace | createNamespace: true / apply: | v1 core |
| arbitrary in-cluster commands | job: (container to completion) | v1 core |
| helm rollback | atomic: covers failed upgrades; re-applying the spec is the recovery path | non-goal |
| kubectl apply --prune | (none) pruning is reconciliation; Argo/Flux own it, delete: owns explicit absence | non-goal |
| kubectl label / annotate | patch: (or apply: a minimal manifest; existing objects are merge-patched) | v1 core |
| kubectl scale | patch: spec.replicas (or apply: a minimal manifest) | v1 core |
| kubectl exec / cp / port-forward | (none) interactive, out of scope | non-goal |
| kubectl get/describe as output | (none) read paths belong to plan/status | non-goal |
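The label and scale rows map onto patch: roughly as in this sketch. The targets, replica count, and label are illustrative.

```yaml
steps:
  - name: scale-coredns              # kubectl scale equivalent
    patch:
      target: deployment/coredns
      namespace: kube-system
      patch:
        spec:
          replicas: 3
  - name: label-apps-namespace       # kubectl label equivalent
    patch:
      target: namespace/apps         # cluster-scoped, so namespace: would be ignored
      type: merge
      patch:
        metadata:
          labels:
            environment: prod
```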

Appendix: design decisions vs the v0 prototype

The v1 DSL is a from-scratch redesign of a proven v0 prototype (its examples/ survive in-tree). Decisions made in the redesign, recorded so they are not re-litigated: