Kubernetes GitOps Platform - Complete EKS Infrastructure & Platform Toolkit

Production-ready AWS EKS cluster with complete GitOps platform toolkit, automated deployment, monitoring, and observability

Published on Jan 15, 2025

Reading time: 9 minutes.


Why does this matter?

  • Complete Infrastructure as Code: Provision entire AWS EKS cluster and all platform services with a single command
  • GitOps Automation: ArgoCD automatically deploys and manages all platform applications from Git repositories
  • Production-Ready Platform: Comprehensive observability stack with monitoring, logging, and testing tools
  • High Availability: Multi-AZ deployment with auto-scaling worker nodes and Horizontal Pod Autoscaler (HPA)
  • Zero-Downtime Deployments: Kubernetes rolling updates ensure continuous service availability
  • Complete Observability: Prometheus, Grafana, Loki, and Promtail for metrics, logs, and dashboards
  • SRE-Ready: Availability testing and sanity checks built in

How is automation accomplished?

  • Terraform Infrastructure: Complete AWS infrastructure provisioning (VPC, EKS, IAM, Security Groups) as code
  • Docker Containerization: Terraform runs in Docker for consistent execution across team members
  • Helm Charts: Application deployment automation with Helm templates and values management
  • ArgoCD GitOps: App-of-apps pattern automatically deploys entire platform toolkit from Git
  • Makefile Automation: One-command deployment (make deploy) for entire infrastructure stack
  • Auto-Scaling: HPA for pod scaling and AWS Auto Scaling Groups for worker node scaling
  • Automated Sync: ArgoCD continuously syncs Git changes to cluster with self-healing capabilities

Prerequisites

  • AWS Account: Programmatic access with IAM user
  • IAM Policies: AmazonEC2FullAccess, IAMFullAccess, AutoScalingFullAccess, AmazonEKSClusterPolicy, AmazonEKSWorkerNodePolicy, AmazonVPCFullAccess, AmazonEKSServicePolicy, AmazonEKS_CNI_Policy
  • Docker: Installed and running for Terraform containerization
  • AWS CLI: Configured with credentials (~/.aws/credentials or AWS_PROFILE)
  • Git: For cloning repositories

Source Code

Infrastructure Repository: https://github.com/Lforlinux/k8s-infrastructure-as-code
Platform Toolkit Repository: https://github.com/Lforlinux/k8s-platform-toolkit

How to provision the infrastructure

Quick Start Deployment

# Clone the infrastructure repository
git clone https://github.com/Lforlinux/k8s-infrastructure-as-code.git
cd k8s-infrastructure-as-code

# Configure AWS credentials
aws configure
# OR set AWS_PROFILE environment variable
export AWS_PROFILE=your-profile

# Deploy entire stack with one command
make deploy

Review the Terraform plan and type yes to proceed. The deployment includes:

  • VPC and networking components (subnets, route tables, internet gateway)
  • EKS cluster with managed node groups
  • Metrics Server for HPA
  • NodeJS application (via Helm)
  • ArgoCD for GitOps
  • Platform toolkit applications (automatically deployed via ArgoCD)

Deployment Workflow

EKS GitOps Workflow

The infrastructure repository deploys ArgoCD, which then automatically references the platform toolkit repository through the app-of-apps pattern, deploying all platform applications.
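
The root of the app-of-apps pattern is a single ArgoCD Application whose source path contains further Application manifests. A minimal sketch of what such a root app can look like is below; the repository URL matches this project, but the branch, path, and sync options are assumptions for illustration:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-toolkit            # hypothetical root "app of apps"
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/Lforlinux/k8s-platform-toolkit.git
    targetRevision: main            # branch name is an assumption
    path: argocd/apps               # directory of child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true                   # remove resources deleted from Git
      selfHeal: true                # revert manual drift back to Git state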

Architecture

Architecture Diagram

Infrastructure Components

  • AWS VPC: Multi-AZ network with public and private subnets
  • EKS Cluster: Managed Kubernetes control plane
  • Worker Nodes: Auto-scaling node groups in private subnets
  • Application Load Balancer: External access to services
  • Security Groups: Network security and access control
  • IAM Roles: Service accounts and permissions

Platform Services (via ArgoCD)

The platform toolkit repository provides:

Monitoring Stack

  • Prometheus: Metrics collection and alerting
  • Grafana: Visualization dashboards
  • kube-state-metrics: Kubernetes object metrics
  • node-exporter: Node-level system metrics

Logging Stack

  • Loki: Centralized log aggregation
  • Promtail: Log collection agent (DaemonSet)

Testing & Validation

  • Sanity Test: Automated health check testing for microservices
  • Availability Test: SRE-style availability and reliability testing

Demo Applications

  • Online Boutique: Google’s microservices demo (11 services)
  • NodeJS Application: Sample application deployed via Helm

Key Features

Infrastructure as Code

  • Terraform Modules: Reusable infrastructure components
  • Docker-Based Execution: Consistent Terraform version across the team
  • State Management: Remote state storage for team collaboration
  • Output Management: Automatic kubeconfig generation and access information

GitOps with ArgoCD

  • App-of-Apps Pattern: Hierarchical application management
  • Automated Sync: Continuous deployment from Git repositories
  • Self-Healing: Automatic reconciliation of cluster state
  • Multi-Repository: Infrastructure and platform toolkit separation
  • Sync Waves: Ordered deployment of dependent applications (see the annotated example below)
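
Sync waves and self-healing are declared directly on each child Application. A hedged example of one child app follows; the application name, path, and wave number are illustrative, not taken from the repositories:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: online-boutique
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1"   # applied after wave 0 apps are healthy
spec:
  project: default
  source:
    repoURL: https://github.com/Lforlinux/k8s-platform-toolkit.git
    targetRevision: main                # assumption
    path: application/online-boutique   # assumption
  destination:
    server: https://kubernetes.default.svc
    namespace: online-boutique
  syncPolicy:
    automated:
      selfHeal: true                    # reconcile manual changes automatically
    syncOptions:
      - CreateNamespace=true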

High Availability & Auto-Scaling

  • Multi-AZ Deployment: High availability across availability zones
  • Horizontal Pod Autoscaler: Automatic pod scaling based on CPU/memory (a manifest sketch follows this list)
  • Worker Node Auto-Scaling: AWS Auto Scaling Groups for node capacity
  • Load Balancing: Application Load Balancer for traffic distribution
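
The HPA itself is a small resource that targets a Deployment and scales on observed resource utilization, using readings from the Metrics Server deployed by Terraform. A minimal sketch, with the target name and thresholds chosen for illustration:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nodejs-app
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nodejs-app                 # target Deployment name is an assumption
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70     # scale out above 70% average CPU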

Observability

  • Metrics Collection: Prometheus scrapes metrics from all services
  • Log Aggregation: Centralized logging with Loki and Promtail
  • Dashboards: Pre-configured Grafana dashboards for Kubernetes and applications
  • Alerting: Prometheus alerting rules for proactive monitoring

Testing & Validation

  • Sanity Testing: Automated health checks for all microservices
  • Availability Testing: SRE metrics (uptime %, MTTR) with real user simulation
  • Performance Testing: k6 performance testing with Prometheus integration and Grafana dashboards

Platform Toolkit Applications

1. Monitoring Stack

Purpose: Comprehensive observability and metrics collection
Namespace: monitoring

  • Prometheus: Metrics collection, storage, and alerting
  • Grafana: Visualization with pre-configured dashboards
  • kube-state-metrics: Kubernetes object state tracking
  • node-exporter: Node-level system metrics

Grafana Online Boutique Dashboard

The Grafana dashboard provides real-time monitoring of the Online Boutique microservices, including CPU usage by pod, pod status metrics, and application logs for comprehensive observability.

2. Logging Stack

Purpose: Centralized log aggregation and analysis
Namespace: monitoring

  • Loki: Log aggregation server with Prometheus-inspired storage
  • Promtail: Log collection agent (DaemonSet) for all pods (see the configuration sketch below)
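
The heart of a Promtail setup is a Kubernetes service-discovery scrape config that tails every pod on the node and pushes to Loki. A trimmed sketch follows; the Loki URL assumes the in-cluster service in the monitoring namespace, and the label mappings are illustrative:

server:
  http_listen_port: 9080
positions:
  filename: /run/promtail/positions.yaml
clients:
  - url: http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod                    # discover pods running on this node
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod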

3. Sanity Test

Purpose: Automated health check testing
Namespace: sanity-test

  • Periodic health checks for all microservices every 60 seconds (see the CronJob sketch after this list)
  • Web UI dashboard with test results
  • REST API for programmatic access
  • Response time metrics and error tracking
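
The sanity test service here is a custom application, but the core idea can be sketched with stock Kubernetes primitives: a CronJob that curls a service's health endpoint once a minute and fails on a non-2xx response. The target URL below is illustrative:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: sanity-check
  namespace: sanity-test
spec:
  schedule: "* * * * *"              # every 60 seconds
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: check
              image: curlimages/curl:8.8.0
              args:
                - "-fsS"             # fail the job on HTTP errors
                - "http://frontend.online-boutique.svc.cluster.local/_healthz"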

Sanity Test Health Check Dashboard

The Sanity Test dashboard provides a comprehensive health check overview for all Online Boutique microservices, displaying total test runs, pass/fail status, and service health metrics in a Jenkins-like interface.

4. Availability Test

Purpose: SRE-style availability and reliability testing
Namespace: availability-test

  • Real user workflow simulation (add to cart, remove from cart)
  • Automated tests every 5 minutes
  • SRE metrics calculation (uptime %, MTTR)
  • Jenkins-like dashboard with green/red status

Availability Test Build Status

The Availability Test dashboard displays detailed build status and test case results, including complete user journey validation (visit → browse → add to cart → remove from cart) and microservices integration verification, providing comprehensive SRE-style reliability testing.
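
For reference, the two headline SRE metrics on this dashboard are typically computed as:

Uptime % = (successful checks / total checks) × 100
MTTR     = total time spent restoring service / number of incidents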

5. Online Boutique

Purpose: Microservices demonstration application
Namespace: online-boutique

  • 11 microservices (frontend, cart, checkout, payment, shipping, etc.)
  • Redis for cart storage
  • gRPC and REST API communication
  • Real-world microservices architecture patterns

Accessing the Cluster

Kubernetes Access

Option 1: Using kubectl

aws eks --region $(terraform output -raw region) update-kubeconfig --name $(terraform output -raw cluster_name)

Option 2: Using the kubeconfig file

If running make deploy locally, a kubeconfig.yaml file is created in the project directory. Use it with Kubernetes tools such as Lens, k9s, or other IDEs.

Application Access

View all LoadBalancer services:

kubectl get svc --all-namespaces -o wide | grep LoadBalancer

Available Services:

  • ArgoCD Server (argocd namespace): GitOps UI and API
  • NodeJS Application (default namespace): Main application
  • Grafana (monitoring namespace): Monitoring dashboards
  • Prometheus (monitoring namespace): Metrics collection
  • Microservices Demo Frontend (online-boutique namespace): Demo application
  • Availability Test (availability-test namespace): Testing service
  • Sanity Test (sanity-test namespace): Health check service

ArgoCD Access

ArgoCD is deployed with a LoadBalancer service. Access information is displayed after make deploy:

  • URL: Provided in deployment output
  • Username: admin
  • Password: Generated automatically (displayed in output)

To get the password manually:

kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

Performance Testing

The platform includes k6 performance testing capabilities with full Prometheus integration and Grafana dashboard visualization. k6 is a modern load testing tool that provides comprehensive performance metrics and real-time monitoring.

How k6 Performance Testing Works

The k6 performance testing framework runs as Kubernetes jobs and integrates with Prometheus for metrics collection. Tests are executed against the Online Boutique microservices application, covering both frontend HTTP endpoints (homepage, product pages) and backend health check endpoints. All performance metrics, including response times, error rates, and threshold violations, are automatically exported to Prometheus, where they can be visualized in real time through pre-configured Grafana dashboards.
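
A plausible shape for one of these test jobs is sketched below; the ConfigMap name, script path, and Prometheus URL are assumptions (and Prometheus must have its remote-write receiver enabled), but the grafana/k6 image and its experimental Prometheus remote-write output are standard k6 features:

apiVersion: batch/v1
kind: Job
metadata:
  name: k6-smoke-test
  namespace: performance-testing
spec:
  backoffLimit: 0                    # a failed test run should not retry
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: k6
          image: grafana/k6:0.50.0
          args: ["run", "-o", "experimental-prometheus-rw", "/scripts/smoke.js"]
          env:
            - name: K6_PROMETHEUS_RW_SERVER_URL
              value: http://prometheus-server.monitoring.svc.cluster.local/api/v1/write
          volumeMounts:
            - name: scripts
              mountPath: /scripts
      volumes:
        - name: scripts
          configMap:
            name: k6-test-scripts    # assumed ConfigMap holding the test scripts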

k6 Performance Test Dashboard

Running Performance Tests

Performance tests can be executed using the k6 test runner script with different test types:

# Run smoke test (basic functionality)
performance-testing/deployment/run-test.sh smoke

# Run load test (normal production load)
performance-testing/deployment/run-test.sh load

# Run stress test (find breaking point)
performance-testing/deployment/run-test.sh stress

# Run spike test (sudden traffic surge)
performance-testing/deployment/run-test.sh spike

Available Test Types

Smoke Test - Basic functionality verification

  • Duration: ~4 minutes
  • Load: 1 virtual user
  • Purpose: Verify basic app functionality

Load Test - Normal production load

  • Duration: ~16 minutes
  • Load: 50-100 virtual users
  • Purpose: Test under expected production load

Stress Test - Find application breaking point

  • Duration: ~40 minutes
  • Load: 100-500 virtual users (gradual increase)
  • Purpose: Discover maximum capacity and limits

Spike Test - Sudden traffic spike

  • Duration: ~6 minutes
  • Load: 10 → 500 → 1000 virtual users (sudden spikes)
  • Purpose: Test handling of sudden traffic surges

Viewing Test Results

View test execution logs:

kubectl logs job/k6-<test-type>-test -n performance-testing

Monitor pod scaling during tests:

kubectl get hpa
kubectl get pods -w

Prometheus Integration & Grafana Dashboard

All k6 test metrics are automatically exported to Prometheus, including:

  • Response Times: P50, P90, P95, P99 percentiles
  • Request Rates: Requests per second (RPS)
  • Error Rates: HTTP error percentages
  • Threshold Violations: Failed threshold checks
  • Virtual User Metrics: Active and total virtual users

The Grafana dashboard provides real-time visualization of all performance metrics, allowing you to monitor test execution, identify bottlenecks, and analyze application performance under different load conditions. Metrics are stored in Prometheus for historical analysis and trend monitoring.

Application Lifecycle Management

Zero-Downtime Deployments

The infrastructure supports zero-downtime deployments through Kubernetes rolling updates (a Deployment strategy sketch follows these steps):

  1. Update Application Code

    • Make changes to the NodeJS application in Nodejs-Docker/
    • Build and push Docker image to your registry
  2. Update Helm Chart

    • Create a feature branch
    • Update image tag in charts/helm-nodejs-app/values.yaml
    • Create a Pull Request
  3. Deploy Changes

    make deploy
    
    • Helm performs rolling updates
    • Kubernetes ensures zero downtime during deployment
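
Zero downtime ultimately hinges on the Deployment's rolling-update strategy plus a readiness probe, so old pods are only retired once replacements are serving traffic. A hedged sketch of the relevant Deployment fields (names, port, and probe path are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nodejs-app                   # name is an assumption
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0              # never drop below the desired replica count
      maxSurge: 1                    # start one extra pod before retiring an old one
  selector:
    matchLabels:
      app: nodejs-app
  template:
    metadata:
      labels:
        app: nodejs-app
    spec:
      containers:
        - name: app
          image: your-registry/nodejs-app:v2   # new tag from values.yaml
          readinessProbe:            # traffic shifts only after this passes
            httpGet:
              path: /healthz         # probe path is an assumption
              port: 3000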

GitOps Workflow (ArgoCD)

For GitOps-based deployments:

  1. Push application manifests to Git repository
  2. ArgoCD automatically syncs changes
  3. Rolling updates handled by Kubernetes
  4. Monitor deployment status in ArgoCD UI

Project Structure

Infrastructure Repository (k8s-infrastructure-as-code)

.
├── argocd/                 # ArgoCD application manifests
│   └── app-of-apps.yaml
├── charts/                 # Helm charts
│   └── helm-nodejs-app/
├── Nodejs-Docker/          # NodeJS application source
├── *.tf                    # Terraform configuration files
├── Makefile               # Automation scripts
└── README.md              # Project documentation

Platform Toolkit Repository (k8s-platform-toolkit)

.
├── application/          # Demo microservices applications
│   ├── k8s-demo/        # Platform dashboard application
│   └── online-boutique/ # Google's microservices demo
├── argocd/              # ArgoCD configuration and application definitions
│   ├── apps/            # Individual ArgoCD application manifests
│   └── install/         # ArgoCD installation manifests
├── availability-test/   # SRE availability testing application
├── dashboards/          # Grafana dashboard configurations
├── monitoring/          # Monitoring stack (Prometheus, Grafana, Loki)
└── sanity-test/         # Health check testing application

Repository Relationship

  • k8s-infrastructure-as-code: Contains complete Kubernetes infrastructure (EKS cluster, networking, security groups, IAM, etc.) and deploys ArgoCD
  • k8s-platform-toolkit: Referenced by the ArgoCD app-of-apps root application; holds all platform application manifests and source code

When the infrastructure repository deploys ArgoCD, it automatically references the platform toolkit repository through the app-of-apps pattern, which then deploys all platform applications defined there.

Troubleshooting

Cluster Access Issues

# Verify AWS credentials
aws sts get-caller-identity

# Check cluster status
aws eks describe-cluster --name <cluster-name> --region <region>

ArgoCD Access Issues

# Get ArgoCD password manually
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

# Check ArgoCD server status
kubectl get svc -n argocd argocd-server

Application Not Accessible

# Check service status
kubectl get svc

# Check pod status
kubectl get pods

# View pod logs
kubectl logs <pod-name>

ArgoCD Sync Issues

# View all applications
kubectl get applications -n argocd

# Check application sync status
kubectl describe application <app-name> -n argocd

# Manual sync
argocd app sync <app-name>

Security Notes

  • Ensure IAM policies follow least privilege principles
  • Rotate ArgoCD admin password after first login
  • Use AWS Secrets Manager for sensitive data
  • Enable VPC flow logs for network monitoring
  • Regularly update container images and dependencies
  • Implement network policies for pod-to-pod communication (see the default-deny sketch below)
  • Use RBAC for Kubernetes access control
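
As a starting point for the network-policy item above, a common pattern is a default-deny ingress policy per namespace, followed by explicit allows for known traffic. A minimal sketch (namespace chosen for illustration):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: online-boutique
spec:
  podSelector: {}                    # select every pod in the namespace
  policyTypes:
    - Ingress                        # no ingress rules listed, so all inbound is denied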

Future Enhancements

Planned Features

  • Service Mesh: Istio or Linkerd integration for advanced traffic management
  • CI/CD Integration: Jenkins/GitHub Actions pipeline automation
  • Multi-Cluster: Support for multiple EKS clusters
  • Disaster Recovery: Automated backup and restore procedures

Technical Improvements

  • Helm Chart Registry: Separate registry for Helm chart distribution
  • GitOps Automation: Automated PR-based deployments via GitHooks
  • Advanced Monitoring: Custom metrics and alerting rules
  • Security Hardening: Enhanced network policies and pod security standards

Contributing

Development Setup

  1. Fork the repositories
  2. Create feature branch: git checkout -b feature/your-feature
  3. Make changes and test locally
  4. Commit changes: git commit -m "Add your feature"
  5. Push to branch: git push origin feature/your-feature
  6. Create Pull Request

Code Standards

  • Terraform: Follow HashiCorp best practices and module structure
  • Kubernetes: Adhere to Kubernetes resource naming conventions
  • Helm: Follow Helm chart best practices and versioning
  • Documentation: Clear setup and troubleshooting guides

Conclusion

This Kubernetes GitOps Platform project demonstrates enterprise-grade infrastructure and platform operations practices, showcasing:

  • Complete Infrastructure as Code with Terraform for AWS EKS
  • GitOps Automation with ArgoCD app-of-apps pattern
  • Production-Ready Observability with Prometheus, Grafana, and Loki
  • SRE Best Practices with availability testing
  • High Availability with multi-AZ deployment and auto-scaling
  • Zero-Downtime Deployments through Kubernetes rolling updates

The project serves as both a functional Kubernetes platform and a comprehensive example of modern DevOps and GitOps practices, making it an excellent addition to any infrastructure engineer’s portfolio.

Live Demo: Kubernetes GitOps Platform
Infrastructure Repository: https://github.com/Lforlinux/k8s-infrastructure-as-code
Platform Toolkit Repository: https://github.com/Lforlinux/k8s-platform-toolkit

