From c04fe758b3703a840704276ca5e2a668ebed0660 Mon Sep 17 00:00:00 2001
From: Jim Schaff
Date: Wed, 6 May 2026 16:43:04 -0400
Subject: [PATCH] Document direct kubectl logs as fallback for /loki-query
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Loki is good at multi-pod sweeps, structured filters, and historical
queries — but not every investigation fits that mold. Add a section to
the /loki-query slash command covering when and how to read pod logs
directly:

- After a pod crash/exit (--previous), e.g. when JmsFailoverWatchdog
  recycles a wedged pod and we need the prior container's stack trace.
- For a real-time tail of a single pod, or for windows that have aged
  out of Loki retention.
- For the broker pods (activemqint, activemqsim, artemismq) — verified
  during the 2026-05-06 wedge investigation that these run supervisord
  as PID 1 and the actual broker logs to a file inside the pod
  (/var/log/activemq/activemq.log for ActiveMQ Classic). Neither
  `kubectl logs` nor Loki sees those events; the only access path is
  `kubectl exec ... cat`.

Document this gap explicitly so future investigators don't waste time
on a `kubectl logs` that returns ten supervisord lifecycle lines and
write it off as a dead end.

Mirrors the Loki kubeconfig convention (LOKI_KUBECONFIG →
~/.kube/kubeconfig_vxrails.yaml), provides a discovery command for
actual deployment/statefulset names, and notes the in-pod log rotation
horizon (~14h on activemqint) as a real operational gap worth fixing by
reconfiguring these brokers to log to stdout.
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .claude/commands/loki-query.md | 76 ++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)

diff --git a/.claude/commands/loki-query.md b/.claude/commands/loki-query.md
index 8b13741829..1feaa5f1c3 100644
--- a/.claude/commands/loki-query.md
+++ b/.claude/commands/loki-query.md
@@ -81,6 +81,82 @@ bash tools/loki/loki-query.sh --output=raw --limit=20000 --since=1h \
 
 For raw-output queries, log lines are JSON; useful fields include `["@timestamp"]`, `log_level`, `["log.logger"]`, `["process.thread.name"]`, `message`. Pipe through `jq -r '...'` to extract a clean digest.
 
+## Direct kubectl logs (when Loki isn't enough)
+
+Use `kubectl logs` directly when:
+- **A pod just crashed/restarted** — `--previous` shows the prior container's logs. Critical when a `JmsFailoverWatchdog` `System.exit` recycles a pod and you need to see why.
+- **Broker pods** — `activemqint` (ActiveMQ Classic, legacy daemon-to-daemon traffic — counterpart to client-side JMS errors in `data`/`db`/`sched`/`submit`) and `activemqsim` (sim/solver traffic). These may not be in Loki's collection set; check Loki first and fall back here.
+- **Real-time tail** of a single pod, or windows that have aged out of Loki retention.
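The retention cutoff in the last bullet can be checked mechanically before picking a tool. A minimal sketch (the 168h retention value and the sample window age are assumptions; substitute the cluster's actual Loki retention):

```shell
# Sketch: pick Loki vs direct kubectl from the incident window's age.
# LOKI_RETENTION_H=168 is an assumption -- substitute the real retention.
LOKI_RETENTION_H=168
window_age_h=200   # e.g. the incident window opened ~200h ago

if [ "$window_age_h" -gt "$LOKI_RETENTION_H" ]; then
  echo "aged out of Loki retention: read the pod directly"
else
  echo "within Loki retention: query Loki first"
fi
```

With the sample values above, the direct-read branch fires.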
+
+Setup (same kubeconfig as Loki):
+```bash
+KCFG="${LOKI_KUBECONFIG:-$HOME/.kube/kubeconfig_vxrails.yaml}"
+NS=prod   # or stage / dev
+```
+
+Discover the actual workload names — deployment vs statefulset, exact selectors:
+```bash
+kubectl --kubeconfig "$KCFG" -n "$NS" get deployments,statefulsets,pods \
+  | grep -iE "data|activemq|api|db|sched|submit"
+```
+
+Common log patterns:
+```bash
+# Last 200 lines from the data pod (the one that wedged on 2026-05-06)
+kubectl --kubeconfig "$KCFG" -n "$NS" logs deployment/data --tail=200
+
+# Real-time tail
+kubectl --kubeconfig "$KCFG" -n "$NS" logs -f deployment/data
+
+# Previous container instance — after a crash, OOM, or watchdog-driven exit
+kubectl --kubeconfig "$KCFG" -n "$NS" logs deployment/data --previous --tail=500
+
+# Time-bounded
+kubectl --kubeconfig "$KCFG" -n "$NS" logs deployment/data --since=10m
+kubectl --kubeconfig "$KCFG" -n "$NS" logs deployment/data --since-time="2026-05-06T06:30:00Z"
+```
+
+### Broker pods are special — supervisord hides the broker log
+
+`activemqint`, `activemqsim`, and `artemismq` all run as `Deployments` in `prod` (and the same naming applies in `stage`/`dev` if present), but their container PID 1 is **supervisord**. The actual broker logs to a file inside the pod, not stdout. Consequences:
+
+- `kubectl logs deployment/activemqint` returns ~10 lines of supervisord lifecycle, frozen at pod startup (44d old in prod). It does **not** show broker activity.
+- Loki's promtail isn't scraping these pods either (verified by `logcli series '{namespace="prod"}'` — no `activemq`/`artemis` containers appear).
+- The only path to broker events is `kubectl exec` against the in-pod log file.
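The in-pod `awk` window filter used in the exec examples below is easy to get wrong, and each retry costs an `exec` round-trip; the pattern can be sanity-checked locally first. The sample log lines here are invented for illustration (the real broker log format may differ):

```shell
# Invented sample broker log lines (real ActiveMQ log format may differ)
cat > /tmp/sample_broker.log <<'EOF'
2026-05-06 06:29:59 INFO  normal traffic
2026-05-06 06:35:12 WARN  Transport Connection failed: java.io.EOFException
2026-05-06 06:51:00 INFO  after the window
EOF

# /06:[34][0-9]/ matches minutes 30-49 of hour 06, i.e. a ~20-minute window
awk '/2026-05-06 06:[34][0-9]/' /tmp/sample_broker.log
```

Only the 06:35 line survives the filter; widen the character class to widen the window (e.g. `06:[0-5][0-9]` for the whole hour).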
+
+```bash
+# ActiveMQ Classic (activemqint, activemqsim) — log at /var/log/activemq/activemq.log
+kubectl --kubeconfig "$KCFG" -n "$NS" exec deployment/activemqint -- \
+  tail -200 /var/log/activemq/activemq.log
+
+# Around an incident window (in-pod awk filter — efficient on a multi-MB log)
+kubectl --kubeconfig "$KCFG" -n "$NS" exec deployment/activemqint -- \
+  bash -c 'awk "/2026-05-06 06:[34][0-9]/" /var/log/activemq/activemq.log'
+
+# Errors only
+kubectl --kubeconfig "$KCFG" -n "$NS" exec deployment/activemqint -- \
+  grep -E "WARN|ERROR" /var/log/activemq/activemq.log
+
+# Artemis (artemismq) — different layout; discover the log path
+kubectl --kubeconfig "$KCFG" -n "$NS" exec deployment/artemismq -- \
+  bash -c 'find / -name "*.log" -size +1k 2>/dev/null | head -10'
+```
+
+The in-pod ActiveMQ log file rotates after roughly 14h of activity. For older incidents, the broker side is not recoverable — a real operational gap when investigating wedges that took longer than that to manifest.
+
+If `deployment/` doesn't resolve (e.g. a broker is later deployed as a StatefulSet), discover the pod by label or by listing pods.
+
+**Worth fixing**: reconfigure these brokers to log to stdout (or have supervisord forward child stdout/stderr). Once stdout flows, both `kubectl logs` and Loki's promtail pick them up automatically — no more `exec` archaeology and no more 14h horizon.
+
+### Putting it together for a JMS-wedge incident
+
+Pull both sides around the same window:
+- **Client side** (`data`/`db`/`sched`/`submit`) via Loki — the `IllegalStateException: Timer already cancelled.` WARNs surface here, but the *triggering* network event happened earlier and silently.
+- **Broker side** (`activemqint`/`activemqsim`) via `kubectl exec` — `Transport Connection to: tcp://: failed: java.io.EOFException` lines tell you which client connection actually dropped, and at what timestamp.
+
+Compare to client-side reconnect attempts to confirm the wedge was already latent before the trigger.
+
+When to switch back to Loki: multi-pod sweeps, time-range queries spanning multiple containers, structured filters (`|~ "ERROR"`, regex), and historical incidents older than the kubelet's local log retention.
+
 ## Workflow
 
 1. Read the user's request and identify: namespace (prod/stage/dev), time window, suspected container(s), keywords.