Omen

A lightweight Kubernetes chaos engineering operator with transparent target selection and optional manual approval.

Overview

Omen lets you declaratively define chaos experiments against your workloads. Each run:

  1. Selects a fixed set of target pods (preview)
  2. Optionally waits for manual approval
  3. Executes the chaos action against those exact targets
  4. Records per-target results and a summary
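
The four steps above can be read as a small state machine. The sketch below is illustrative only: the phase names are taken from the event reasons listed in this README, but the runState struct, the afterPreview function, and the exact transition rules are assumptions, not the controller's actual code.

```go
package main

import "fmt"

// Phase names mirror the event reasons mentioned in this README.
type Phase string

const (
	PreviewGenerated Phase = "PreviewGenerated"
	Running          Phase = "Running"
	Expired          Phase = "Expired"
	Skipped          Phase = "Skipped"
)

// runState is a hypothetical, simplified view of an ExperimentRun.
type runState struct {
	Targets          int  // pods fixed during the preview step
	ApprovalRequired bool // approval.required on the Experiment
	Approved         bool // spec.approved on the ExperimentRun
	TTLExpired       bool // approval TTL has elapsed
}

// afterPreview returns the phase that follows target selection (assumed rules).
func afterPreview(s runState) Phase {
	switch {
	case s.Targets == 0:
		return Skipped // nothing eligible to act on
	case !s.ApprovalRequired || s.Approved:
		return Running // execute against the exact previewed targets
	case s.TTLExpired:
		return Expired // approval never arrived in time
	default:
		return PreviewGenerated // keep waiting for approval
	}
}

func main() {
	fmt.Println(afterPreview(runState{Targets: 2, ApprovalRequired: true}))
	fmt.Println(afterPreview(runState{Targets: 2, ApprovalRequired: true, Approved: true}))
}
```

The point of the sketch is that the target set is fixed before the approval check, so whatever is approved is exactly what gets executed.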

Two CRDs are provided:

  • Experiment — defines the schedule, target selector, action, safety limits, and approval policy
  • ExperimentRun — a single execution instance created by the controller, holding the target preview, approval state, and results

Roadmap

Curious about what's coming next? Check out our Roadmap to see our plans for network chaos, advanced target filtering, ChatOps integrations, and more!

Install via Helm

helm install omen oci://ghcr.io/k-krew/charts/omen \
  --namespace omen-system \
  --create-namespace \
  --version <version>

To customise the installation:

helm install omen oci://ghcr.io/k-krew/charts/omen \
  --namespace omen-system \
  --create-namespace \
  --version <version> \
  --set manager.leaderElect=true \
  --set resources.limits.memory=256Mi \
  --set manager.webhookTimeout=30s \
  --set manager.protectedNamespaces="{kube-system,omen-system,kube-public,my-critical-ns}"

Controller flags

| Flag | Default | Description |
|------|---------|-------------|
| --webhook-timeout | 10s | Timeout for outgoing approval webhook HTTP requests. Transient failures are retried by the controller with exponential backoff; the run only fails if still undelivered when the approval TTL expires. |
| --leader-elect | false | Enable leader election for HA deployments. |
| --metrics-bind-address | 0 | Address for the metrics endpoint (0 disables it). |
| --health-probe-bind-address | :8081 | Address for liveness/readiness probes. |
| --protected-namespaces | kube-system,omen-system,kube-public | Comma-separated list of namespaces that cannot be targeted by any experiment. Enforced at both the validating webhook and target selection time. |
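
To make the retry behaviour of --webhook-timeout more concrete, here is a minimal sketch of an exponential backoff schedule bounded by the approval TTL. The retryDelays function, the doubling factor, and the base and cap values are assumptions chosen for illustration; the controller's real backoff parameters are not documented here, and the sketch ignores the time spent in each HTTP request.

```go
package main

import (
	"fmt"
	"time"
)

// retryDelays lists the waits between delivery attempts. Each delay doubles
// (capped at maxDelay) and the schedule stops once the next wait would push
// the total past the approval TTL.
func retryDelays(base, maxDelay, ttl time.Duration) []time.Duration {
	if base <= 0 || maxDelay < base {
		return nil // guard against a schedule that never advances
	}
	var delays []time.Duration
	var elapsed time.Duration
	for d := base; elapsed+d <= ttl; {
		delays = append(delays, d)
		elapsed += d
		if d *= 2; d > maxDelay {
			d = maxDelay
		}
	}
	return delays
}

func main() {
	// Hypothetical values: 1s base, 8s cap, 20s TTL.
	fmt.Println(retryDelays(time.Second, 8*time.Second, 20*time.Second))
	// prints [1s 2s 4s 8s]
}
```

With the 30m TTL used in the recurring example further down, a schedule like this would allow many delivery attempts before the run is marked as failed.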

Safety: Protected Namespaces

Omen enforces a list of protected namespaces that can never be targeted, regardless of what an Experiment specifies. The defaults are kube-system, omen-system, and kube-public.

The list is configured via the --protected-namespaces flag (comma-separated) and exposed in the Helm chart as manager.protectedNamespaces:

manager:
  protectedNamespaces:
    - kube-system
    - omen-system
    - kube-public
    - my-critical-namespace

Protection is enforced in two places:

  • Validating webhook — rejects Experiment objects whose spec.selector.namespace is in the protected list at admission time.
  • Controller — filters out any pods in protected namespaces during target selection, even for cluster-scoped selectors.
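
The controller-side check can be pictured as a simple filter over candidate pods. This is a sketch under assumptions: the Pod struct, parseProtected, and dropProtected are hypothetical names for illustration and do not come from the Omen codebase.

```go
package main

import (
	"fmt"
	"strings"
)

// Pod is a pared-down stand-in for a Kubernetes pod.
type Pod struct {
	Namespace string
	Name      string
}

// parseProtected turns the comma-separated flag value into a lookup set,
// ignoring surrounding whitespace and empty entries.
func parseProtected(flagValue string) map[string]bool {
	set := make(map[string]bool)
	for _, ns := range strings.Split(flagValue, ",") {
		if ns = strings.TrimSpace(ns); ns != "" {
			set[ns] = true
		}
	}
	return set
}

// dropProtected removes every pod that lives in a protected namespace.
func dropProtected(pods []Pod, protected map[string]bool) []Pod {
	var kept []Pod
	for _, p := range pods {
		if !protected[p.Namespace] {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	protected := parseProtected("kube-system,omen-system,kube-public")
	pods := []Pod{{"default", "api-1"}, {"kube-system", "coredns-1"}}
	fmt.Println(dropProtected(pods, protected)) // prints [{default api-1}]
}
```

Filtering at selection time matters because a selector that spans namespaces never names a protected namespace explicitly, so the webhook check alone would not catch it.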

Safety: Pod-level Opt-out

Individual pods can be excluded from all chaos experiments by adding the annotation chaos.kreicer.dev/ignore: "true". This is useful for pods running critical in-flight work (e.g., database migrations, stateful leaders) that must not be interrupted.

kubectl annotate pod <pod-name> chaos.kreicer.dev/ignore=true

Or in the pod template:

metadata:
  annotations:
    chaos.kreicer.dev/ignore: "true"

Annotated pods are removed from the eligible list before selection. If every matching pod carries the annotation, the run transitions to Skipped automatically.
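
A minimal sketch of the opt-out rule, assuming a simplified Pod type: only the annotation key and the Skipped outcome come from this README, while the eligible function and the struct are hypothetical.

```go
package main

import "fmt"

// ignoreAnnotation is the opt-out key documented above.
const ignoreAnnotation = "chaos.kreicer.dev/ignore"

// Pod is a pared-down stand-in for a Kubernetes pod.
type Pod struct {
	Name        string
	Annotations map[string]string
}

// eligible drops pods whose ignore annotation is set to "true".
func eligible(pods []Pod) []Pod {
	var out []Pod
	for _, p := range pods {
		if p.Annotations[ignoreAnnotation] == "true" {
			continue // opted out of all chaos experiments
		}
		out = append(out, p)
	}
	return out
}

func main() {
	pods := []Pod{
		{Name: "db-migrate", Annotations: map[string]string{ignoreAnnotation: "true"}},
		{Name: "api-1"},
	}
	left := eligible(pods)
	if len(left) == 0 {
		fmt.Println("Skipped") // nothing left to target
		return
	}
	fmt.Println(len(left), "eligible:", left[0].Name) // prints 1 eligible: api-1
}
```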

Run locally (against Kind or Minikube)

Prerequisites

  • Go 1.26+
  • kubebuilder v4
  • kubectl pointing at a local cluster

# Install CRDs
GOTOOLCHAIN=local make install

# Run the controller locally (uses ~/.kube/config)
GOTOOLCHAIN=local make run

The controller reads POD_NAMESPACE to exclude its own pods from target selection. Set it when running locally:

POD_NAMESPACE=omen-system GOTOOLCHAIN=local make run

Example Experiments

One-shot, no approval, fixed count

apiVersion: chaos.kreicer.dev/v1alpha1
kind: Experiment
metadata:
  name: kill-one-pod
  namespace: default
spec:
  runPolicy:
    type: Once
  selector:
    namespace: default
    labels:
      app: my-app
  mode:
    type: random
    count: 1
  action:
    type: delete_pod
  safety:
    maxTargets: 1

One-shot, percentage-based target selection

apiVersion: chaos.kreicer.dev/v1alpha1
kind: Experiment
metadata:
  name: kill-third-of-fleet
  namespace: default
spec:
  runPolicy:
    type: Once
  selector:
    namespace: default
    labels:
      app: my-app
  mode:
    type: random
    percent: 33   # kill ~33% of matching pods, minimum 1
  action:
    type: delete_pod
  safety:
    maxTargets: 5

percent is mutually exclusive with count. The calculated pod count is rounded up and never drops below 1, so the experiment always has an effect even against small replica sets. For example, percent: 33 against 10 matching pods resolves to 4 targets. safety.maxTargets is then applied as a hard cap on the resolved count.
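
The arithmetic can be sketched as follows. The resolveCount function is a hypothetical helper written for this README, not the controller's code, and it assumes percent has already been validated to lie between 1 and 100. Returning 0 when no pods match is also an assumption.

```go
package main

import "fmt"

// resolveCount converts a percentage into a pod count: round up, keep at
// least 1, then apply maxTargets as a hard cap (0 means no cap here).
func resolveCount(matching, percent, maxTargets int) int {
	if matching <= 0 {
		return 0 // assumption: nothing matches, so nothing is selected
	}
	n := (matching*percent + 99) / 100 // integer ceiling of matching*percent/100
	if n < 1 {
		n = 1 // minimum of one target
	}
	if maxTargets > 0 && n > maxTargets {
		n = maxTargets // safety.maxTargets wins over the percentage
	}
	return n
}

func main() {
	fmt.Println(resolveCount(10, 33, 5))  // 3.3 rounds up to 4
	fmt.Println(resolveCount(2, 33, 5))   // 0.66 rounds up to 1
	fmt.Println(resolveCount(100, 33, 5)) // 33 is capped at 5
}
```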

Recurring, with approval

apiVersion: chaos.kreicer.dev/v1alpha1
kind: Experiment
metadata:
  name: weekly-chaos
  namespace: default
spec:
  runPolicy:
    type: Repeat
    schedule: "0 10 * * 1"   # every Monday at 10:00
    cooldown: 24h
    concurrencyPolicy: Forbid
  selector:
    namespace: staging
    labels:
      app: api-server
  mode:
    type: random
    count: 2
  action:
    type: delete_pod
  approval:
    required: true
    ttl: 30m
    webhook:
      url: https://hooks.example.com/omen-approval
  safety:
    maxTargets: 2

To approve the run, patch the generated ExperimentRun:

kubectl patch experimentrun <run-name> \
  --type=merge \
  -p '{"spec":{"approved":true}}'

Dry run

Set dryRun: true on the Experiment to preview target selection without executing the action. Targets are recorded in ExperimentRun.status.previewTargets and results are marked Success without any pods being deleted.

Observability

Every phase transition of an ExperimentRun emits a standard Kubernetes Event on the object. Use kubectl describe to follow the lifecycle:

kubectl describe experimentrun <run-name>

Events use Normal type for successful transitions (PreviewGenerated, Approved, Running, Completed) and Warning for failure states (Failed, Expired).

The TOTAL column in kubectl get expruns is populated as soon as targets are selected during the PreviewGenerated phase, so you can see how many pods will be affected before the run executes.

Safe Deletion

Experiment objects carry a finalizer (chaos.omen.com/finalizer). When an Experiment is deleted, the controller first deletes all owned ExperimentRuns and waits for them to be removed before releasing the finalizer. This prevents orphaned runs from executing chaos actions after the parent is gone.

Development

# Regenerate CRDs and RBAC after editing types
GOTOOLCHAIN=local make manifests generate

# Build the binary
GOTOOLCHAIN=local make build

# Run tests (requires setup-envtest)
go install sigs.k8s.io/controller-runtime/tools/setup-envtest@latest
export KUBEBUILDER_ASSETS=$(setup-envtest use --print path)
go test ./... -v