## **Position Summary**
We are seeking a highly skilled **Site Reliability Engineer (SRE)** to design, build, automate, and operate scalable, secure, and highly available cloud\-native platforms. The ideal candidate will have strong expertise in **Kubernetes ecosystem technologies**, **Google Cloud Platform (GCP)**, **Infrastructure as Code (Terraform)**, **GitOps**, **Observability**, **Service Mesh**, and **Secrets Management**.
The SRE will work closely with Development, Platform Engineering, Security, and DevOps teams to ensure reliability, performance, scalability, and operational excellence across production environments.
## **Key Responsibilities**
### **Kubernetes Platform Engineering**
- Design, deploy, and manage large\-scale Kubernetes clusters in production environments.
- Administer and optimize Kubernetes networking using:
  + Cilium
  + Istio Service Mesh
  + Kubernetes Ingress Controllers
- Build highly available and resilient container platforms.
- Implement cluster lifecycle management, upgrades, scaling, and capacity planning.
- Troubleshoot complex Kubernetes infrastructure and application issues.
### **Cloud Infrastructure (GCP)**
- Design and operate cloud\-native infrastructure on Google Cloud Platform.
- Manage services such as:
  + GKE (Google Kubernetes Engine)
  + VPC Networking
  + IAM
  + Cloud Load Balancers
  + Cloud Storage
  + Monitoring and Logging services
- Ensure security, scalability, and cost optimization of cloud environments.
- Implement multi\-environment and multi\-region deployment strategies.
### **Infrastructure as Code (Terraform)**
- Develop and maintain reusable Terraform modules.
- Automate provisioning and management of cloud infrastructure.
- Implement infrastructure standards and governance.
- Maintain version\-controlled infrastructure repositories.
- Ensure repeatable, auditable, and scalable infrastructure deployments.
### **Kubernetes Package Management (Helm)**
- Create and maintain Helm charts for platform and application deployments.
- Standardize deployment practices across teams.
- Manage Helm repositories and release strategies.
- Support blue\-green, canary, and rolling deployment methodologies.
### **GitOps \& Continuous Delivery**
- Build and maintain GitOps workflows using ArgoCD.
- Automate application deployment pipelines.
- Implement environment promotion strategies.
- Maintain deployment compliance and auditability.
- Drive CI/CD best practices across engineering teams.
### **Secrets \& Service Discovery Management**
- Manage secrets, certificates, and application credentials using Vault.
- Implement secure secret injection patterns for Kubernetes workloads.
- Configure and maintain Consul for service discovery and service networking.
- Establish access control and security policies for sensitive workloads.
### **Monitoring, Observability \& Reliability Engineering**
- Build comprehensive observability solutions using:
  + Prometheus
  + Prometheus Operator
  + Grafana
  + Loki
  + Tempo
  + Alloy
  + Mimir
  + Pyroscope
- Define and implement:
  + Service Level Indicators (SLIs)
  + Service Level Objectives (SLOs)
  + Error Budgets
- Create dashboards, alerts, and operational runbooks.
- Conduct root cause analysis (RCA) and postmortems.
- Improve system reliability, performance, and operational visibility.
### **Incident Response \& Operations**
- Participate in on\-call rotations.
- Lead incident management during production outages.
- Troubleshoot infrastructure, networking, application, and platform issues.
- Develop automation to reduce operational toil.
- Create disaster recovery and business continuity procedures.
### **Automation \& Platform Engineering**
- Develop automation scripts and operational tooling.
- Improve platform self\-service capabilities.
- Drive reliability engineering best practices.
- Eliminate manual operational processes through automation.
## **Required Technical Skills**
### **Container \& Kubernetes Ecosystem**
- Kubernetes (Production\-grade administration)
- Cilium
- Istio Service Mesh
- Kubernetes Ingress Controllers
- Container Networking
- Cluster Security and RBAC
### **Cloud Platforms**
- Google Cloud Platform (GCP)
- GKE
- Cloud Networking
- IAM and Security Controls
### **Infrastructure as Code**
- Terraform
- Infrastructure Automation
- Configuration Management Concepts
### **Deployment \& GitOps**
- ArgoCD
- GitOps Methodologies
- GitLab
- CI/CD Pipelines
### **Secrets \& Service Networking**
- Vault
- Consul
### **Monitoring \& Observability**
- Prometheus
- Prometheus Operator
- Grafana
- Loki
- Tempo
- Alloy
- Mimir
- Pyroscope
### **Operating Systems \& Networking**
- Linux Administration
- TCP/IP
- DNS
- Load Balancing
- SSL/TLS
- Network Troubleshooting
## **Preferred Qualifications**
- Experience managing large\-scale Kubernetes platforms.
- Experience supporting mission\-critical production systems.
- Strong understanding of distributed systems concepts.
- Knowledge of cloud security best practices.
- Experience implementing SRE principles such as:
  + SLI/SLO/Error Budgets
  + Capacity Planning
  + Incident Management
  + Reliability Engineering
- Experience with multi\-cluster Kubernetes environments.
- Relevant certifications such as:
  + Certified Kubernetes Administrator (CKA)
  + Certified Kubernetes Security Specialist (CKS)
  + Google Cloud Professional Certifications
  + HashiCorp Terraform Associate
## **Experience**
- **5–10\+ years** of overall infrastructure/platform engineering experience.
- **3–5\+ years** of hands\-on Kubernetes production experience.
- Strong experience in cloud\-native platforms, observability, automation, and GitOps\-driven operations.