This repository was born out of the need to better monitor the health and uptime of an ever-growing environment of services. It contains everything required to bootstrap and manage my cloud infrastructure. It is based on the GitOps methodology and uses Argo CD as the GitOps delivery mechanism.
At a high level, this repo and its CaC workflow consist of two parts:
- `terraform` contains the stage 1 bootstrapping for the cluster nodes. This includes the partial infrastructure bootstrapping on OCI, installation of a base OS, and setup of a Kubernetes distribution. After completion of this stage, the cluster is ready and capable of running workloads.
- `argo` contains the final stage 2 GitOps cluster configuration. This includes everything running inside Kubernetes in the cluster and ranges from basic system infrastructure like Envoy Gateway, CCM and CSI to more user-style applications such as uptime monitoring apps. The Argo CD installation on the cluster is not yet performed automatically, but could easily be triggered from stage 1. The contained Argo CD applications are automatically installed and/or reconciled on the cluster without user interaction. After completion of this stage, the cluster is fully set up and performs its monitoring and alerting duties.

| Component | Purpose | Notes |
|---|---|---|
| terraform | Infrastructure Bootstrap | |
| Ubuntu Server 24.04 | Base Operating System | |
| k3s | k8s Distribution / Install Mechanism | stacked HA control planes |
| Argo CD | GitOps Automation inside the Cluster | |
| SOPS | Secrets Management | via ksops, using age rather than pgp |
| tailscale | Overlay Mesh VPN | |

This cluster runs 2 k3s server nodes with embedded etcd to fit within Oracle's Always Free tier (2 OCPUs / 12 GB RAM total). This has important implications:
- No fault tolerance: Embedded etcd requires a majority quorum (2 of 2 = both nodes). Losing either node loses quorum and the cluster becomes read-only / unavailable.
- Recovery: If a node is permanently lost, the remaining node must be re-bootstrapped as a new single-node cluster, or both nodes must be destroyed and recreated via `terraform destroy` + `terraform apply`.
- Accepted trade-off: Full recreate is fast (~5 min) and no persistent data requires migration. This is acceptable for a monitoring/status page workload.
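
The quorum arithmetic behind the "no fault tolerance" point above can be sketched in a few lines. This is only an illustration of the majority rule that etcd relies on. The function names `quorum` and `fault_tolerance` are made up for this example and are not part of k3s, etcd, or this repository.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """Number of members that can fail while a majority still remains."""
    return members - quorum(members)


if __name__ == "__main__":
    # Compare the 2-node setup described above with other cluster sizes.
    for n in (1, 2, 3, 5):
        print(f"{n} member(s): quorum={quorum(n)}, "
              f"tolerated failures={fault_tolerance(n)}")
```

With two members the quorum is 2, so the cluster tolerates zero failures, the same as a single-node cluster. A third member would be needed before one node could be lost without losing quorum.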

| Name | Purpose | Notes |
|---|---|---|
| OCI CCM / CSI | Oracle Cloud Infrastructure k8s Automation | |
| system-upgrade-controller | k8s Upgrade Controller | |
| kured | Node Reboot Daemon | |
| external-dns | DNS Management Automation | |
| cert-manager | Automated Certificate Management | Let's Encrypt via ACME DNS |
| Envoy Gateway | Gateway API & Ingress | |
| CloudNativePG | Cloud-Native PostgreSQL Operator | |
| reloader | Hot-Reload for ALL Workloads | |

| Name | Purpose | Notes |
|---|---|---|
| Gatus | Endpoint Monitor and Status Page | Monitor configuration as code |
