Role expectations
DevOps/DevSecOps/CloudOps/GitOps
Ability to write technical documentation: READMEs, tutorials, and installation and configuration guides. See docs.enclaive.cloud for examples.
Familiarity with the tech stack
Ops
Terraform
Ansible
Docker
Helm
Linux
Virtualization (KVM/QEMU, Proxmox, libvirt, OpenShift, Rancher)
Kubernetes (OpenShift, RKE, GKE, AKS, EKS)
Cloud: AWS, GCP, Azure, and on-premises
Challenge: “Ship a Service — End-to-End CI/CD on Managed Kubernetes”
Scenario
Our team owns a small HTTP API (a simple /healthz endpoint is enough). You must:
provision cloud infrastructure for a managed Kubernetes cluster,
containerize and test the app,
build a CI/CD pipeline that goes from commit → container → security checks → Helm deploy,
implement safe rollout and rollback,
add basic observability.
Choose one of Amazon EKS, Azure AKS, or Google GKE. If you do not have an account, sign up for the provider's free tier.
What we’re assessing
Terraform fluency for cloud & k8s provisioning
Ansible for configuring CI runners or build hosts (or image pre-bake)
Docker image design & best practices
Helm packaging & environment promotion
CI/CD pipeline orchestration and quality gates
Kubernetes rollout strategies & rollback
Secrets & IAM hygiene
Observability & troubleshooting
Clean architecture, reproducibility, and documentation
Requirements
1) Infrastructure (Terraform)
Create a minimal, production-sensible K8s baseline on EKS/AKS/GKE:
VPC/VNet with at least 2 subnets (multi-AZ / multi-zone).
Managed node pool (or Autopilot on GKE, but justify the choice).
Private container registry (ECR/ACR/Artifact Registry).
IAM/role assignments tightly scoped for the CI job to push images & deploy with kubectl/helm.
Outputs: kubeconfig (securely handled), registry URL, and cluster name.
Include a destroy path and document cost guardrails (e.g., small node sizes, TTL labels).
2) Build Host / Runner (Ansible)
Use Ansible to configure a self-hosted CI runner VM or to build a reusable image that includes:
docker/buildx, kubectl, helm, Terraform, and your cloud CLI.
Log in to the registry via OIDC or short-lived credentials.
Provide an Ansible playbook and inventory (local, cloud, or containerized runner).
Idempotence matters.
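One way to check idempotence in CI is to run the playbook twice and confirm that the second run reports no changes. The sketch below parses the PLAY RECAP section that ansible-playbook prints at the end of a run. The function names are hypothetical, and it assumes the default recap layout (host, colon, then counter=value pairs).

```python
import re

# Matches one host line of the PLAY RECAP, for example:
# "runner1 : ok=12 changed=0 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0"
RECAP_LINE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def parse_recap(output: str) -> dict[str, dict[str, int]]:
    """Return {host: {counter: value}} for every host line after PLAY RECAP."""
    hosts: dict[str, dict[str, int]] = {}
    in_recap = False
    for line in output.splitlines():
        if line.startswith("PLAY RECAP"):
            in_recap = True
            continue
        match = RECAP_LINE.match(line.strip()) if in_recap else None
        if match:
            pairs = (item.split("=") for item in match["counters"].split())
            hosts[match["host"]] = {key: int(value) for key, value in pairs}
    return hosts


def is_idempotent(second_run_output: str) -> bool:
    """True if the second run changed nothing and no host failed or was unreachable."""
    recap = parse_recap(second_run_output)
    return bool(recap) and all(
        counters.get("changed", 0) == 0
        and counters.get("failed", 0) == 0
        and counters.get("unreachable", 0) == 0
        for counters in recap.values()
    )
```

A CI step could capture the output of the second ansible-playbook run and exit non-zero when is_idempotent returns False.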
3) Application (Docker)
A tiny HTTP service (any language) with:
/healthz returns 200 and a JSON payload. The payload contains the value of an environment variable, e.g. SYS_ENV=helloworld.
Dockerfile must:
Use multi-stage builds.
Run as non-root.
Set the environment variable SYS_ENV=helloworld.
Use a minimal base image and define a sensible HEALTHCHECK.
Tag images with app: and app:main (or :latest for dev only if justified).
4) Helm Deployment
Create a Helm chart charts/app with:
Configurable replicas, resources, liveness/readiness probes.
values.dev.yaml and values.prod.yaml.
Ingress (or Gateway) + Service.
HorizontalPodAutoscaler (HPA) based on CPU (and optionally RPS/custom metrics if you like).
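To reason about how a CPU-based HPA will behave, it helps to work through the scaling formula from the Kubernetes documentation: desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). The sketch below is a simplified model of that formula only. It assumes the default tolerance of 0.1 and ignores pod readiness, missing metrics, and stabilization windows, all of which the real controller handles.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_cpu_pct: float,
    target_cpu_pct: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Simplified model of the HPA replica calculation for one CPU metric."""
    ratio = current_cpu_pct / target_cpu_pct
    # Within the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # The result is clamped to the minReplicas/maxReplicas bounds of the HPA.
    return max(min_replicas, min(max_replicas, wanted))
```

For example, 2 replicas averaging 90% CPU against a 60% target give a ratio of 1.5, so the model asks for 3 replicas.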
Implement a safe rollout strategy:
Pick one: rolling update with surge/unavailable limits, or canary/blue-green (Argo Rollouts acceptable, but keep it simple and documented).
Provide an automated rollback step triggered when health checks fail.
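The rollback requirement above can be sketched as a small wrapper around the Helm CLI: upgrade, check health, and roll back to the previous release if either step fails. This is one possible shape, not the expected solution. The function names, the 5m timeout, and the injected health_check callable are assumptions. With Helm 3, the --atomic flag on helm upgrade is an alternative that rolls back failed upgrades by itself.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]  # runs a command, returns its exit code


def run_command(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


def deploy_with_rollback(
    release: str,
    chart: str,
    values_file: str,
    namespace: str,
    health_check: Callable[[], bool],
    run: Runner = run_command,
) -> bool:
    """Upgrade the release; roll back if the upgrade or the health check fails."""
    upgrade = [
        "helm", "upgrade", "--install", release, chart,
        "--namespace", namespace, "-f", values_file,
        "--wait", "--timeout", "5m",
    ]
    if run(upgrade) == 0 and health_check():
        return True
    # Without a revision argument, helm rollback returns to the previous release.
    # On a failed first install there is no previous release to return to.
    run(["helm", "rollback", release, "--namespace", namespace, "--wait"])
    return False
```

Injecting the runner keeps the logic testable without a cluster. In a pipeline, a False return value would map to a non-zero exit code so the job is marked as failed.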
5) CI/CD Pipeline
Use GitHub Actions. Pipeline should include:
On Pull Request to main:
Lint & test app.
Docker build (no push), Trivy image scan (fail on high/critical).
Terraform fmt/validate/plan (no apply).
Helm lint and chart unit tests (helm-unittest or chart-testing).
IaC security scan (e.g., Checkov or tfsec) with non-zero exit for high issues.
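The image-scan gate in the pull-request checks above can be enforced by Trivy itself (its --severity and --exit-code flags). When a custom report is also wanted, a small script can read Trivy's JSON output. The sketch below assumes the report layout of a top-level Results list whose entries carry an optional Vulnerabilities list with a Severity field. Verify this against the Trivy version in use. The function names are hypothetical.

```python
from collections import Counter

BLOCKING_SEVERITIES = {"HIGH", "CRITICAL"}


def count_severities(report: dict) -> Counter:
    """Count findings per severity in a parsed Trivy JSON report."""
    counts: Counter = Counter()
    # Results or Vulnerabilities may be missing or null when nothing was found.
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            counts[vuln.get("Severity", "UNKNOWN")] += 1
    return counts


def gate_exit_code(report: dict) -> int:
    """Return 1 if any HIGH or CRITICAL finding exists, else 0."""
    counts = count_severities(report)
    blocking = sum(counts[severity] for severity in BLOCKING_SEVERITIES)
    return 1 if blocking else 0
```

A pipeline step could load the report with json.load and pass the result of gate_exit_code to sys.exit.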
On Merge to main:
Build & push image to registry with tags : and :main.
Terraform apply to ensure infra is reconciled.
Deploy to dev using Helm with values.dev.yaml.
Post-deploy smoke test: hit /healthz via a job or script; fail pipeline if non-200.
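The post-deploy smoke test can be a short script that polls /healthz until it returns 200 or the attempts run out. The sketch below uses only the standard library. The retry count, delay, and function names are assumptions to tune for the actual rollout time.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code, or 0 if the endpoint is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses are raised as HTTPError
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, or timeout


def smoke_test(
    url: str,
    attempts: int = 10,
    delay: float = 3.0,
    fetch: Callable[[str], int] = fetch_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the URL until it answers 200 or all attempts are used."""
    for attempt in range(1, attempts + 1):
        if fetch(url) == 200:
            return True
        if attempt < attempts:
            sleep(delay)
    return False
```

To fail the pipeline on a non-200 result, the calling script can end with sys.exit(0 if smoke_test(url) else 1).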
Promotion to prod (manual approval job):
Deploy to prod with values.prod.yaml.
Apply rollout strategy; verify health checks.
If failure, automated rollback to previous release.
Artifacts & reporting:
Upload SBOM (e.g., Syft) and scan results.
Publish deployment summary with image tag, chart version, and links/logs.
6) Secrets & IAM
Use cloud-native secret storage or HashiCorp Vault for CI secrets.
In the cluster, expose secrets to the app as environment variables (specifically SYS_ENV=helloworld) or as files, sourced via ExternalSecrets (bonus) or native Secret objects encrypted at rest. Explain the trade-offs.
Prefer OIDC-based auth for CI to cloud (no long-lived keys).
7) Observability
Expose Prometheus-style metrics endpoint in app (even a counter is fine).
Install minimal metrics stack:
Option A: kube-state-metrics + Prometheus (can be lightweight).
Option B: Cloud-native managed metrics (e.g., CloudWatch metrics for EKS).
Add basic logging guidance (e.g., structured logs; rely on cloud logs).
Provide a simple dashboard or kubectl query recipe to validate app health & HPA behavior.
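For the Prometheus-style metrics endpoint, even a single counter rendered in the text exposition format is enough. The sketch below is a dependency-free illustration. The metric name app_requests_total and the class name are hypothetical, and a real service would more likely use an official Prometheus client library.

```python
import threading

# Content type commonly used for the Prometheus text exposition format.
METRICS_CONTENT_TYPE = "text/plain; version=0.0.4; charset=utf-8"


class RequestCounter:
    """A thread-safe, monotonically increasing counter."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._value = 0
        self._lock = threading.Lock()

    def inc(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        with self._lock:
            self._value += amount

    def render(self) -> str:
        """Return the counter in the text exposition format."""
        with self._lock:
            value = self._value
        return (
            f"# HELP {self.name} {self.help_text}\n"
            f"# TYPE {self.name} counter\n"
            f"{self.name} {value}\n"
        )
```

A /metrics handler would call inc() on each request to the app and return render() with METRICS_CONTENT_TYPE as the Content-Type header.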
Deliverables
Repository with this structure:
├─ app/
│ ├─ src/… # simple HTTP server
│ ├─ tests/… # unit tests
│ ├─ Dockerfile
│ └─ README.md # how to run locally
├─ charts/
│ └─ app/… # Helm chart + values.dev.yaml + values.prod.yaml
├─ infra/
│ ├─ terraform/
│ │ ├─ main.tf # providers, cluster, node pool, registry, IAM
│ │ ├─ variables.tf
│ │ ├─ outputs.tf
│ │ └─ README.md
│ └─ ansible/
│   ├─ inventories/
│   ├─ roles/
│   ├─ playbooks/runner.yml
│   └─ README.md
├─ .github/workflows/ or .gitlab-ci.yml
├─ SECURITY.md # threat model & hardening notes
├─ OPERATIONS.md # runbooks: deploy, rollback, debugging, cleanup
└─ README.md # top-level overview & quickstart
Docs to include:
README.md: cloud chosen, prerequisites, how to run CI locally, how to authenticate, high-level flow diagram.
OPERATIONS.md:
Deploy: dev→prod.
Rollback: helm history/rollback or Argo Rollouts revert.
Troubleshooting: common kubectl commands, logs, events.
Cleanup: terraform destroy order & caveats.
SECURITY.md:
IAM roles/policies overview; why the granted permissions are least-privilege yet sufficient.
Secrets approach & rotation story.
Supply-chain controls: SBOM, image/IaC scanning, provenance (bonus: cosign).
Tutorial.md:
Topic: Automating the deployment and integration of a web service in a GKE/AKS/EKS Kubernetes cluster.
Write the tutorial in the style of docs.enclaive.cloud. Use Markdown.
Success Criteria (Scoring Rubric, 120 pts)
Terraform (20 pts)
Correct cluster, registry, and IAM (10)
Variables, modules, state handling, and destroy path (6)
Cost-aware and documented (4)
Ansible (10 pts)
Idempotent runner setup / golden image (6)
Clear inventory & docs (4)
Docker (15 pts)
Multi-stage, minimal, non-root, healthcheck (8)
Tagged images & caching strategy (4)
Unit tests wired into build (3)
Helm (20 pts)
Clean chart, sensible values, probes, resources (10)
HPA and ingress (6)
Rollout strategy implemented (4)
CI/CD (25 pts)
PR checks: tests, lint, scans, plan (10)
Main: build/push, deploy dev, smoke test (8)
Manual prod gate + rollback automation (7)
Security & Observability (10 pts)
OIDC or short-lived creds; secrets managed properly (5)
Metrics/logging accessible; basic dashboard or commands (5)
Docs & Tutorial (20 pts)
Clear READMEs, runbooks, diagrams (6)
Tutorial (14)