Level: Advanced · Duration: 7h 27m

Klustered: Advanced Cluster Triage

Master the craft of rescuing broken clusters. This path walks through real war‑room incidents so you can debug the control plane, container runtime, and host OS when kubectl alone is useless.

Technologies Covered

Kubernetes, kubectl, containerd, etcd, systemd, journalctl, iptables, tc, Linux, GitOps

Your Guides

David Flanagan

Prerequisites

  • Deep understanding of Kubernetes architecture (control plane, worker nodes, etcd, static pods).
  • Proficiency with `kubectl`, `kubectl debug`, and common triage commands (`describe`, `logs`, `events`).
  • Comfort administering Linux servers with systemd, journalctl, iptables, tc, and process tooling (ps, lsof).
  • Experience with container runtimes (containerd, CRI) and troubleshooting with `crictl`/`ctr`.
  • Familiarity with etcd administration (`etcdctl`, snapshots, compaction) and TLS fundamentals.

Overview

Go beyond kubectl and develop a battle-tested methodology for diagnosing multi-layer outages. Each scenario highlights a different failure domain—nodes, networking, security, and host-level manipulation—so you can bring clusters back from the dead.

Learning Objectives

  • Adopt a repeatable triage framework: validate symptoms, isolate the failing layer, and confirm the fix (a command sketch follows this list).
  • Combine Kubernetes logs with host forensics (journalctl, systemctl, ps, lsof) when the API server is lying—or gone.
  • Repair control-plane failures by dissecting static pod manifests, TLS assets, and etcd health.
  • Detect malicious persistence (cron/systemd jobs, ld.so.preload, kubeconfig auth plugins) that reintroduce problems after a reboot.
  • Harden clusters after recovery by applying network controls, auditing admission webhooks, and documenting runbooks.
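
A minimal sketch of that triage cadence as shell commands. It assumes kubectl access plus sudo on a node; adapt it to your environment.

```bash
# A possible "first five minutes" sequence -- observe before you operate.
kubectl config current-context                                # right cluster, right credentials?
kubectl get nodes -o wide                                     # symptom sweep: node health
kubectl get pods -A | grep -v Running                         # anything crashing or pending?
kubectl get events -A --sort-by=.lastTimestamp | tail -n 20   # most recent state changes
sudo journalctl -u kubelet --since "15 minutes ago" | tail    # drop down a layer if needed
# After any fix, re-run the sweep above to confirm recovery before closing out.
```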

Videos in This Path

1. Marino Wijay & John Anderson

Why it matters: Kubelet crashes and CNI misfires are the bread-and-butter incidents that escalate quickly. This episode establishes a systematic triage cadence.

What you’ll learn:

  • Diagnose NodeNotReady states by correlating kubelet logs, CNI status, and certificate errors.
  • Inspect static pod manifests and component arguments under /etc/kubernetes/manifests.
  • Track down webhook failures and broken admission controllers that silently reject workloads (see the sketch after this list).
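
A hedged sketch of those checks on a kubeadm-style node (the manifest paths are kubeadm defaults; adjust for other distributions):

```bash
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations  # who can veto workloads?
kubectl -n kube-system get pods -o wide                 # static pods carry the node-name suffix
sudo ls -l /etc/kubernetes/manifests/                   # the manifests the kubelet actually runs
sudo journalctl -u kubelet --since "10 minutes ago" | grep -iE 'x509|certificate|cni'
```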

Key takeaways:

  • Trust but verify your cluster context and credentials before making changes.
  • Pod lifecycle issues often stem from lower layers—container runtime, CNI, or certificates—not the workload itself.

Hands-on drill: Reproduce a NodeNotReady incident by breaking the CNI config in a test cluster, then repair it using the workflow demonstrated.
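
One way to stage the drill, assuming a file-based CNI config (the filename below is a placeholder; list /etc/cni/net.d/ to find yours):

```bash
# Lab clusters only.
sudo mv /etc/cni/net.d/10-mycni.conflist /root/cni.bak   # break pod networking
kubectl get nodes -w                                     # watch the node flip to NotReady
sudo journalctl -u kubelet -f | grep -i cni              # confirm the root cause in kubelet logs
sudo mv /root/cni.bak /etc/cni/net.d/10-mycni.conflist   # repair, then re-verify readiness
```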

2. Hans Kristian Flaatten & Zach Wachtel

Why it matters: Slow control planes are harder than outright failures. This session shows how to hunt down etcd pain, I/O throttling, and sneaky persistence.

What you’ll learn:

  • Profile control-plane latency and leader-election behaviour when etcd is under duress.
  • Use low-level tools (crictl, journalctl, iostat) when kubectl no longer returns data.
  • Discover malicious cron/systemd jobs that reintroduce broken configs after each reboot (a sketch follows this list).
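
A sketch of that low-level sweep, assuming containerd as the runtime and sysstat installed for iostat:

```bash
sudo crictl ps -a | grep -E 'etcd|kube-apiserver'    # are the control-plane containers alive?
sudo crictl logs --tail 50 "$(sudo crictl ps -a --name etcd -q | head -1)"  # etcd's own logs
iostat -x 1 3                                        # is the disk saturating under etcd?
sudo systemctl list-timers --all                     # persistence via systemd timers
sudo crontab -l; ls /etc/cron.d/                     # persistence via cron entries
```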

Key takeaways:

  • etcd health directly governs API responsiveness; treat disk I/O and quotas as first-class SLOs.
  • Persistent host tasks can undo your fixes—eradicate them or they’ll respawn the outage.

Hands-on drill: Simulate etcd I/O starvation (e.g., tc or cgroup limits) and practice restoring health plus cleaning persistence hooks.
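
One cgroup-based way to run the simulation. Everything here is an assumption to verify first: cgroup v2 with the io controller enabled on kubepods.slice, and /dev/vda as etcd's data disk (check with lsblk):

```bash
# Lab clusters only: throttle writes on etcd's disk to ~1 MiB/s.
MAJMIN=$(lsblk -dno MAJ:MIN /dev/vda)                 # /dev/vda is a placeholder device
echo "$MAJMIN wbps=1048576" | sudo tee /sys/fs/cgroup/kubepods.slice/io.max
# Symptoms to watch: slow applies and leader elections in etcd's logs.
sudo crictl logs --tail 100 "$(sudo crictl ps --name etcd -q)" | grep -iE 'slow|apply|elect'
echo "$MAJMIN wbps=max" | sudo tee /sys/fs/cgroup/kubepods.slice/io.max   # restore
```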

3. Alex Jones & Alistair Hey

Why it matters: Networking mysteries rarely announce themselves. This duel demonstrates how to peel back NodePorts, policies, and host firewalls.

What you’ll learn:

  • Trace traffic with kubectl exec, iptables -t nat, and tc to find where packets die (see the sketch after this list).
  • Validate kube-controller-manager and kubelet configs when scheduling subtly fails.
  • Guard yourself against sabotaged tooling (rogue aliases, modified kubeconfigs) that hinder troubleshooting.
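
A layer-by-layer sketch of that trace. The service and pod names are hypothetical, and it assumes kube-proxy in iptables mode:

```bash
kubectl exec -it client-pod -- wget -qO- -T 2 http://my-svc:8080   # can a pod reach the Service?
kubectl get endpoints my-svc                            # does the Service have healthy backends?
sudo iptables -t nat -L KUBE-SERVICES -n | grep my-svc  # kube-proxy's DNAT rule (comment match)
tc qdisc show dev eth0                                  # any netem/shaping sabotage on the node?
```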

Key takeaways:

  • Multi-layer debugging beats guesswork—test from pod → node → service to pinpoint the break.
  • Always confirm your tools haven’t been tampered with; even kubectl can be weaponized.

Hands-on drill: Inject latency with tc qdisc, observe service behaviour, then revert settings and record a runbook entry.
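
A minimal version of the drill (eth0 is an assumption; check `ip route` for the node's real egress interface):

```bash
sudo tc qdisc add dev eth0 root netem delay 200ms   # inject 200 ms of latency
time kubectl get pods                               # observe the slowdown end to end
sudo tc qdisc del dev eth0 root                     # revert -- netem persists until deleted
```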

4. The Community vs. Rawkode

Why it matters: Security incidents often masquerade as outages. This group effort dissects a malicious kubeconfig helper and recovers the control plane.

What you’ll learn:

  • Audit kubeconfig exec plugins to detect hostile authentication helpers (see the sketch after this list).
  • Use classic Linux tooling (ps, at, env) to uncover suspicious activity on nodes.
  • Restore control-plane availability by editing static manifests and restarting kubelet safely.
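
A sketch of that audit; the jq query is one way to surface every exec plugin a kubeconfig would run:

```bash
kubectl config view --raw -o json \
  | jq -r '.users[] | select(.user.exec)
           | "\(.name): \(.user.exec.command) \((.user.exec.args // []) | join(" "))"'
# Pair it with host-side checks for out-of-place activity:
ps auxf | less   # odd parent/child process trees
atq              # pending at(1) jobs
env | sort       # hijacked PATH or KUBECONFIG variables
```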

Key takeaways:

  • kubeconfig files are executable attack surfaces—treat them like code.
  • Static pod manifests are single points of failure; secure and monitor them rigorously.

Hands-on drill: Craft a harmless kubeconfig exec helper, document how it executes, then disable it using RBAC and config policies.
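
One way to build the harmless helper (all names and paths here are hypothetical):

```bash
cat > /tmp/demo-auth.sh <<'EOF'
#!/usr/bin/env bash
echo "exec plugin invoked at $(date)" >> /tmp/demo-auth.log   # observable side effect
cat <<JSON
{"apiVersion":"client.authentication.k8s.io/v1beta1","kind":"ExecCredential","status":{"token":"not-a-real-token"}}
JSON
EOF
chmod +x /tmp/demo-auth.sh
kubectl config set-credentials exec-demo \
  --exec-api-version=client.authentication.k8s.io/v1beta1 \
  --exec-command=/tmp/demo-auth.sh
# Any request made as the exec-demo user now runs the helper -- watch the log grow.
```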

5. Klustered #11

Why it matters: The finale showcases deep Linux forensics—hidden processes, etcd DoS, and host firewalls—under extreme pressure.

What you’ll learn:

  • Detect tampering techniques (ld.so.preload, rogue systemd units) that hide processes and binaries (see the sketch after this list).
  • Recover from etcd quota exhaustion and network lockouts using snapshots, compaction, and targeted iptables rules.
  • Combine filesystem, network, and process evidence into a cohesive incident timeline.
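
A few of those detection checks as a sketch; none of them need anything beyond coreutils and systemd:

```bash
cat /etc/ld.so.preload 2>/dev/null                    # normally absent or empty
ls -la /etc/systemd/system/ /etc/systemd/system/multi-user.target.wants/
systemctl list-units --type=service --state=running   # anything you did not install?
# A preload rootkit can lie to ps; compare raw /proc against ps output:
ls /proc | grep -cE '^[0-9]+$'; ps -e --no-headers | wc -l   # counts should roughly match
```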

Key takeaways:

  • Complex outages demand patience, logs, and note-taking before touching anything.
  • Hardened control planes require both Kubernetes and OS-level defenses (firewalls, file integrity, RBAC).

Hands-on drill: Build a lab where etcd hits its quota, practice compaction/defrag, then add iptables rules to lock down peer traffic.
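
A sketch of the recovery steps, assuming kubeadm's default etcd certificate paths and jq for parsing (the peer-lockdown subnet is a placeholder):

```bash
export ETCDCTL_API=3
E="etcdctl --endpoints=https://127.0.0.1:2379 \
   --cacert=/etc/kubernetes/pki/etcd/ca.crt \
   --cert=/etc/kubernetes/pki/etcd/server.crt \
   --key=/etc/kubernetes/pki/etcd/server.key"
$E alarm list                                          # NOSPACE means the quota is exhausted
REV=$($E endpoint status -w json | jq -r '.[0].Status.header.revision')
$E compact "$REV" && $E defrag                         # reclaim history, then disk space
$E alarm disarm                                        # allow writes again
sudo iptables -A INPUT -p tcp --dport 2380 ! -s 10.0.0.0/24 -j DROP   # lock down peer traffic
```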


After-Action Suggestions

  1. Turn your notes into a written incident response playbook (symptom → hypothesis → command sequence).
  2. Automate recurring diagnostics with kubectl debug profiles or systemd/Ansible scripts (a starter sketch follows this list).
  3. Feed lessons learned into platform guardrails: enforce admission webhook health, lock down kubeconfig exec plugins, and monitor etcd quotas continuously.
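
A hypothetical starter for that automation: a script that snapshots cluster and host evidence before anyone intervenes.

```bash
#!/usr/bin/env bash
# triage-snapshot.sh (hypothetical): capture evidence before any intervention.
set -uo pipefail
out="triage-$(date +%Y%m%dT%H%M%S)"; mkdir -p "$out"
kubectl get nodes -o wide                           > "$out/nodes.txt"  2>&1
kubectl get pods -A -o wide                         > "$out/pods.txt"   2>&1
kubectl get events -A --sort-by=.lastTimestamp      > "$out/events.txt" 2>&1
sudo journalctl -u kubelet --since "30 minutes ago" > "$out/kubelet.log" 2>&1
echo "evidence written to $out/"
```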