AWS Incident Responder
Automated incident response where the runbook is a visual, version-controlled n8n workflow rather than glue code, with an LLM summarizing the incident in plain English.
/01 Problem
Incident remediation logic buried in Lambda code is hard to read, hard to change, and invisible to anyone who is not the author. The aim was a runbook that is both automated and legible: a workflow a responder can read, reason about, and version-control.
This project is the deliberate counterpart to an earlier Lambda-glued remediation project: it handles the same incident class, but with a visual runbook instead of code.
/02 Approach
- A CloudWatch alarm on a target EC2 instance (CPU at or above 80% for two 5-minute periods) publishes to an SNS topic, delivered over HTTPS to n8n.
- The workflow confirms its own SNS subscription programmatically, then asks Claude Haiku for a plain-English incident summary and a recommended next step.
- It posts an incident card to Slack, reboots the instance through a SigV4-signed EC2 API call, then waits and re-checks the alarm with DescribeAlarms to either resolve the incident or escalate it.
- Remediation runs under a least-privilege IAM user scoped to RebootInstances on the single target ARN plus read-only enrichment.
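
The branching in the steps above can be sketched in a few lines of Python. This is an illustration of the logic only, not the n8n workflow itself: the function names `route_sns_message` and `verify_outcome` are hypothetical, the alarm fields are the ones CloudWatch places in an SNS alarm notification, and the DescribeAlarms result is assumed to be in the JSON shape with a `MetricAlarms` list. Treating anything other than an `OK` state as an escalation is an assumption of this sketch.

```python
import json


def route_sns_message(headers: dict, body: str) -> dict:
    """Classify one SNS HTTPS delivery and extract what the workflow needs."""
    msg = json.loads(body)
    # SNS sends the message type in this header; the body repeats it as "Type".
    lowered = {k.lower(): v for k, v in headers.items()}
    msg_type = lowered.get("x-amz-sns-message-type") or msg.get("Type")

    if msg_type == "SubscriptionConfirmation":
        # A GET on SubscribeURL confirms the subscription.
        return {"action": "confirm", "url": msg["SubscribeURL"]}

    if msg_type == "Notification":
        # CloudWatch alarm notifications carry a JSON document in "Message".
        alarm = json.loads(msg["Message"])
        return {
            "action": "respond",
            "alarm_name": alarm["AlarmName"],
            "state": alarm["NewStateValue"],
            "reason": alarm["NewStateReason"],
        }

    return {"action": "ignore"}


def verify_outcome(describe_alarms_response: dict) -> str:
    """Decide the final branch from a DescribeAlarms result taken after the wait."""
    alarms = describe_alarms_response.get("MetricAlarms", [])
    # Assumption: only a confirmed OK state resolves; ALARM, INSUFFICIENT_DATA,
    # or a missing alarm all escalate to a human.
    if alarms and alarms[0].get("StateValue") == "OK":
        return "resolve"
    return "escalate"
```

A production handler would also verify the SNS message signature before confirming a subscription or acting on a notification; that step is omitted here to keep the sketch short.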
/03 Architecture
n8n runs on ECS Fargate behind an ALB with an ACM certificate on a Route 53 subdomain. Secrets are pulled from SSM SecureString parameters at task start, so nothing sensitive lives in Terraform state.
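
One way to read the secrets arrangement is as an ECS container definition whose `secrets` entries point at SSM parameter ARNs, so ECS resolves the values when the task starts. The sketch below builds such a definition as a plain Python dict. The helper name, the image tag, and the example ARN are hypothetical; `N8N_ENCRYPTION_KEY` and port 5678 are n8n defaults used for illustration. It also assumes the parameter values themselves are written outside Terraform, since only then does the definition carry references rather than secret values.

```python
import json


def n8n_container_definition(image: str, param_arns: dict) -> dict:
    """Build a container definition that references SSM parameters by ARN."""
    return {
        "name": "n8n",
        "image": image,
        "essential": True,
        "portMappings": [{"containerPort": 5678, "protocol": "tcp"}],
        # Each entry maps an environment variable name to a parameter ARN.
        # ECS fetches and decrypts the value at task start, so the
        # definition never contains the secret itself.
        "secrets": [
            {"name": env_name, "valueFrom": arn}
            for env_name, arn in sorted(param_arns.items())
        ],
    }


if __name__ == "__main__":
    definition = n8n_container_definition(
        "n8nio/n8n:latest",  # hypothetical image tag
        {
            # Hypothetical account ID and parameter path.
            "N8N_ENCRYPTION_KEY": "arn:aws:ssm:us-east-1:123456789012:parameter/n8n/encryption_key",
        },
    )
    print(json.dumps(definition, indent=2))
```

For this to work, the task execution role needs permission to read the referenced parameters (`ssm:GetParameters`, plus `kms:Decrypt` when a customer-managed key encrypts them).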
The whole stack is provisioned in Terraform and built to deploy, demo, and destroy for about a dollar a day.
/04 Outcome
A self-healing incident loop that detects, explains, notifies, remediates, and verifies, with the entire decision path visible as a workflow diagram rather than opaque code.
The project is paired with an Azure counterpart, built on Container Apps and Entra service principals, that demonstrates the same pattern across clouds.