anyshift-io/sre-skills

sre-skills

SRE methodology skills for AI agents. Each skill packages one reliability workflow (investigating a live incident, handing over oncall, writing a postmortem) as a self-contained module your agent loads and runs.

Built and maintained by Anyshift.

Why use these

Your agent already writes code and runs commands. It does not know how a seasoned SRE actually works an incident: which signals to correlate first, when a deploy is the prime suspect, when to stop digging and page a human. These skills encode that methodology so the agent follows a real playbook instead of improvising.

Every skill runs end-to-end with no Anyshift account and no external credentials. The methodology, the worked examples, and the replay tests all work offline against fixtures.

Skills

Each skill targets one real product and one job over it: audit an IAM policy, triage a Terraform plan, resolve an S3 bucket's effective access. It does that job end-to-end, offline, against fixtures. A skill is not a wrapper that dumps the API response back: each one carries the judgment a senior engineer applies to that one source, the thresholds and known-bad combinations that separate a real problem from a config that merely looks clean.

Then it stops. A single source only knows itself. The moment a question needs a join (this role to everything it can actually reach, this queue to its producers and consumers, this plan to the running infrastructure it will move) the data runs out. Each skill names exactly where that happens and what's missing, so the boundary is explicit instead of a silent wrong answer. That boundary is the same one every time: the join across resources, across sources, or across time.

| Skill | Domain | What it does |
|---|---|---|
| sqs-queue-auditor | AWS | Audits redrive/DLQ wiring, maxReceiveCount, retention ordering against the DLQ, and a visibility timeout left at the risky default: the queue-side config that silently drops or re-delivers messages while every attribute reads as fine. |
| iam-deceptive-escalation-auditor | AWS | Resolves the effective permission set across every policy on a principal (Allow minus blanket Deny), flags the cross-statement escalation combos (PassRole+compute-launch, policy-rewrite-in-place, trust-policy rewrite) that no single statement reveals, and stays symmetric: it does not flag an escalation already killed by a Deny, a scoped wildcard, or a sealed Condition. |
| sg-deceptive-reachability-auditor | AWS | Builds a directed reachability graph from SG-to-SG references plus internet edges, composes the transitive closure from a named entry point, and reports the shortest path to the crown-jewel tier and the bridging hub SGs that a per-rule read misses, reporting clean when a segmented fleet has no reachable path. |
| s3-estate-calibration-auditor | AWS | Resolves each bucket's effective verdict by composing all four layers (Block Public Access, bucket policy, ACL, access points), then calibrates across an estate: it names the one bucket that is genuinely public or cross-account exposed without over-flagging the many siblings that read as exposed but are neutralised, and reports clean when nothing is live. |
| terraform-plan-risk-reporter | IaC | Ranks plan changes by blast risk, isolating destroys and force-replacements of stateful or irreplaceable resources from the harmless in-place updates they hide among. |
| github-actions-flake-reporter | CI/CD | Detects flaky jobs (pass-on-rerun on an unchanged SHA), clusters failures by cause, and flags duration regressions across run history, not just the last red run. |
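
The reachability idea in the sg-deceptive-reachability-auditor row can be illustrated with a plain breadth-first search. This is a minimal sketch, not the skill's implementation: the rule shape (`source_sg`, `cidr`), the `INTERNET` label, and the sample group names are all hypothetical and much simpler than what the EC2 API returns, and ports and protocols are ignored.

```python
from collections import deque

INTERNET = "internet"  # hypothetical label for the public entry point


def build_graph(security_groups):
    """Return {source: {destinations}}; an edge A -> B means A may send traffic into B."""
    graph = {}
    for sg_id, sg in security_groups.items():
        for rule in sg["ingress"]:
            # A rule names either another SG or a CIDR. Only 0.0.0.0/0 is
            # treated as an internet edge in this sketch (an assumption).
            if "source_sg" in rule:
                source = rule["source_sg"]
            elif rule.get("cidr") == "0.0.0.0/0":
                source = INTERNET
            else:
                continue
            graph.setdefault(source, set()).add(sg_id)
    return graph


def shortest_path(graph, start, target):
    """Breadth-first search; returns the shortest list of hops, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for nxt in sorted(graph.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical three-tier fleet plus one group only reachable from a private range.
security_groups = {
    "sg-web": {"ingress": [{"cidr": "0.0.0.0/0", "port": 443}]},
    "sg-app": {"ingress": [{"source_sg": "sg-web", "port": 8080}]},
    "sg-db": {"ingress": [{"source_sg": "sg-app", "port": 5432}]},
    "sg-batch": {"ingress": [{"cidr": "10.0.0.0/8", "port": 22}]},
}

graph = build_graph(security_groups)
path_to_db = shortest_path(graph, INTERNET, "sg-db")
path_to_batch = shortest_path(graph, INTERNET, "sg-batch")
print(path_to_db)     # ['internet', 'sg-web', 'sg-app', 'sg-db']
print(path_to_batch)  # None
```

No single rule here exposes `sg-db` to the internet; the path only appears once the edges are composed. The `None` result corresponds to the "reports clean" case for a segment with no reachable path.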

Four skills are built out: sqs-queue-auditor, iam-deceptive-escalation-auditor, sg-deceptive-reachability-auditor, and s3-estate-calibration-auditor. Each has fixture-based replay tests and a committed control-vs-treatment lift eval. The remaining two, terraform-plan-risk-reporter and github-actions-flake-reporter, are planned. kubectl-investigator, which is not listed in the table, stays as the methodology-shaped reference template: it shows the directory shape, the worked-example format, and the fixture-based replay tests that every skill above follows.
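
For the planned github-actions-flake-reporter, the table defines a flaky job as one that passes on rerun against an unchanged SHA. A minimal sketch of that one definition follows. The record shape (`job`, `sha`, `attempt`, `conclusion`) and the sample data are hypothetical, and failure clustering and duration regressions are not covered.

```python
from collections import defaultdict


def find_flaky_jobs(runs):
    """Return sorted job names that failed and then passed on the same commit SHA."""
    attempts = defaultdict(list)
    for run in runs:
        attempts[(run["job"], run["sha"])].append((run["attempt"], run["conclusion"]))

    flaky = set()
    for (job, _sha), history in attempts.items():
        failed_earlier = False
        # Walk attempts in order: a success only counts as a flake signal
        # if an earlier attempt on the same SHA failed.
        for _attempt, conclusion in sorted(history):
            if conclusion == "failure":
                failed_earlier = True
            elif conclusion == "success" and failed_earlier:
                flaky.add(job)
                break
    return sorted(flaky)


# Hypothetical run history.
runs = [
    {"job": "integration", "sha": "a1b2c3", "attempt": 1, "conclusion": "failure"},
    {"job": "integration", "sha": "a1b2c3", "attempt": 2, "conclusion": "success"},
    {"job": "lint", "sha": "a1b2c3", "attempt": 1, "conclusion": "failure"},
    {"job": "lint", "sha": "d4e5f6", "attempt": 1, "conclusion": "success"},
    {"job": "unit", "sha": "a1b2c3", "attempt": 1, "conclusion": "success"},
]

flaky_jobs = find_flaky_jobs(runs)
print(flaky_jobs)  # ['integration']
```

`lint` is not flagged: it went green only after the SHA changed, which looks like a fix, not a flake.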

Using a skill

Claude Code (recommended)

These skills ship as a plugin in Anyshift's Claude Code marketplace. In a Claude Code session:

/plugin marketplace add anyshift-io/claude-plugins
/plugin install sre-skills@anyshift

The skills are now loaded. The agent reaches for the right one whenever you ask something that maps to an incident, a change review, an oncall handover, a postmortem, or a reliability audit. Pull new skills and versions later with /plugin marketplace update anyshift.

Any other agent

Clone the repo and point your agent at the skill you want. Each skill directory is self-contained: the methodology, the worked examples, and the fixture-based replay tests live together, so you can run the skill against the fixtures before pointing it at your own infrastructure.

git clone https://github.com/anyshift-io/sre-skills.git

Every skill also documents its failure modes: where it is likely to be wrong, and where the agent should escalate to a human instead of acting.

Going deeper with Anyshift (optional)

The skills work standalone. Two optional layers add infrastructure context:

  • Anyshift MCP as a context primer. Skills can opt into richer context from the Anyshift MCP server. When the integration is wired up for a skill, that skill publishes a measured "with vs without" delta, so the added value is explicit rather than assumed.
  • Annie, pre-loaded. Running Anyshift's Annie agent gives you these skills already loaded, with your Terraform state, cloud inventory (AWS / GCP / Azure), and recent deploys wired in.

What each skill guarantees

  • Two worked examples drawn from real incidents or canonical scenarios.
  • Fixture-based replay tests that run without external credentials.
  • An explicit failure-modes section: where the skill is wrong, where the agent should escalate to a human.
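
A fixture-based replay test in the sense used above might look like the following sketch. It is an illustration only, not this repository's test layout: `audit_queue`, the finding codes, and the fixture contents are hypothetical. The attribute names and string-typed values follow the shape of SQS queue attributes, and 30 seconds is the SQS default visibility timeout. The retention check follows AWS guidance to keep a DLQ's retention longer than its source queue's.

```python
import json

SQS_DEFAULT_VISIBILITY_TIMEOUT = 30  # seconds


def audit_queue(attrs, dlq_attrs=None):
    """Return a list of hypothetical finding codes for one queue's attributes."""
    findings = []
    # RedrivePolicy is stored as a JSON string inside the attribute map.
    redrive = json.loads(attrs.get("RedrivePolicy", "{}"))
    if not redrive.get("deadLetterTargetArn"):
        findings.append("NO_DLQ")
    elif dlq_attrs is not None:
        # Flag a DLQ whose retention is not longer than the source queue's.
        if int(dlq_attrs["MessageRetentionPeriod"]) <= int(attrs["MessageRetentionPeriod"]):
            findings.append("DLQ_RETENTION_NOT_LONGER")
    timeout = int(attrs.get("VisibilityTimeout", SQS_DEFAULT_VISIBILITY_TIMEOUT))
    if timeout == SQS_DEFAULT_VISIBILITY_TIMEOUT:
        findings.append("VISIBILITY_TIMEOUT_DEFAULT")
    return findings


# Hypothetical fixture: recorded attributes for a queue and its DLQ,
# so the test needs no AWS credentials or network access.
FIXTURE = {
    "orders": {
        "MessageRetentionPeriod": "1209600",  # 14 days
        "VisibilityTimeout": "30",
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:orders-dlq",
            "maxReceiveCount": "5",
        }),
    },
    "orders-dlq": {
        "MessageRetentionPeriod": "345600",  # 4 days
        "VisibilityTimeout": "120",
    },
}


def test_replay_orders_queue():
    findings = audit_queue(FIXTURE["orders"], FIXTURE["orders-dlq"])
    assert findings == ["DLQ_RETENTION_NOT_LONGER", "VISIBILITY_TIMEOUT_DEFAULT"]


test_replay_orders_queue()
findings = audit_queue(FIXTURE["orders"], FIXTURE["orders-dlq"])
print(findings)  # ['DLQ_RETENTION_NOT_LONGER', 'VISIBILITY_TIMEOUT_DEFAULT']
```

Because the fixture is committed alongside the check, the expected findings are reproducible and can be reviewed like any other code change.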

Looking for more

For a curated index of SRE skills (ours and others), MCP servers, and reading, see anyshift-io/awesome-sre-skills.

Contributing

Contributions to the vendor-neutral skills are welcome. See CONTRIBUTING.md.

License

Apache 2.0.
