Required Skills

GCP, Kubernetes

Job Description

**Position Summary**
-------------------

We are seeking a highly skilled **Site Reliability Engineer (SRE)** to design, build, automate, and operate scalable, secure, and highly available cloud\-native platforms. The ideal candidate will have strong expertise in **Kubernetes ecosystem technologies**, **Google Cloud Platform (GCP)**, **Infrastructure as Code (Terraform)**, **GitOps**, **Observability**, **Service Mesh**, and **Secrets Management**.

The SRE will work closely with Development, Platform Engineering, Security, and DevOps teams to ensure reliability, performance, scalability, and operational excellence across production environments.

**Key Responsibilities**
-----------------------

### **Kubernetes Platform Engineering**

Design, deploy, and manage large\-scale Kubernetes clusters in production environments.
Administer and optimize Kubernetes networking using:

+ Cilium

+ Istio Service Mesh

+ Kubernetes Ingress Controllers

Build highly available and resilient container platforms.
Implement cluster lifecycle management, upgrades, scaling, and capacity planning.
Troubleshoot complex Kubernetes infrastructure and application issues.

### **Cloud Infrastructure (GCP)**

Design and operate cloud\-native infrastructure on Google Cloud Platform.
Manage services such as:

+ GKE (Google Kubernetes Engine)

+ VPC Networking

+ IAM

+ Cloud Load Balancers

+ Cloud Storage

+ Monitoring and Logging services

Ensure security, scalability, and cost optimization of cloud environments.
Implement multi\-environment and multi\-region deployment strategies.

### **Infrastructure as Code (Terraform)**

Develop and maintain reusable Terraform modules.
Automate provisioning and management of cloud infrastructure.
Implement infrastructure standards and governance.
Maintain version\-controlled infrastructure repositories.
Ensure repeatable, auditable, and scalable infrastructure deployments.

### **Kubernetes Package Management (Helm)**

Create and maintain Helm charts for platform and application deployments.
Standardize deployment practices across teams.
Manage Helm repositories and release strategies.
Support blue\-green, canary, and rolling deployment methodologies.

### **GitOps \& Continuous Delivery**

Build and maintain GitOps workflows using ArgoCD.
Automate application deployment pipelines.
Implement environment promotion strategies.
Maintain deployment compliance and auditability.
Drive CI/CD best practices across engineering teams.

### **Secrets \& Service Discovery Management**

Manage secrets, certificates, and application credentials using Vault.
Implement secure secret injection patterns for Kubernetes workloads.
Configure and maintain Consul for service discovery and service networking.
Establish access control and security policies for sensitive workloads.

### **Monitoring, Observability \& Reliability Engineering**

Build comprehensive observability solutions using:

+ Prometheus

+ Prometheus Operator

+ Grafana

+ Loki

+ Tempo

+ Alloy

+ Mimir

+ Pyroscope

Define and implement:

+ Service Level Indicators (SLIs)

+ Service Level Objectives (SLOs)

+ Error Budgets

Create dashboards, alerts, and operational runbooks.
Conduct root cause analysis (RCA) and postmortems.
Improve system reliability, performance, and operational visibility.

### **Incident Response \& Operations**

Participate in on\-call rotations.
Lead incident management during production outages.
Troubleshoot infrastructure, networking, application, and platform issues.
Develop automation to reduce operational toil.
Create disaster recovery and business continuity procedures.

### **Automation \& Platform Engineering**

Develop automation scripts and operational tooling.
Improve platform self\-service capabilities.
Drive reliability engineering best practices.
Eliminate manual operational processes through automation.

**Required Technical Skills**
----------------------------

### **Container \& Kubernetes Ecosystem**

Kubernetes (Production\-grade administration)
Cilium
Istio Service Mesh
Kubernetes Ingress Controllers
Container Networking
Cluster Security and RBAC

### **Cloud Platforms**

Google Cloud Platform (GCP)
GKE
Cloud Networking
IAM and Security Controls

### **Infrastructure as Code**

Terraform
Infrastructure Automation
Configuration Management Concepts

### **Deployment \& GitOps**

ArgoCD
GitOps Methodologies
GitLab
CI/CD Pipelines

### **Secrets \& Service Networking**

HashiCorp Vault
Consul

### **Monitoring \& Observability**

Prometheus
Prometheus Operator
Grafana
Loki
Tempo
Alloy
Mimir
Pyroscope

### **Operating Systems \& Networking**

Linux Administration
TCP/IP
DNS
Load Balancing
SSL/TLS
Network Troubleshooting

**Preferred Qualifications**
---------------------------

Experience managing large\-scale Kubernetes platforms.
Experience supporting mission\-critical production systems.
Strong understanding of distributed systems concepts.
Knowledge of cloud security best practices.
Experience implementing SRE principles such as:

+ SLI/SLO/Error Budgets

+ Capacity Planning

+ Incident Management

+ Reliability Engineering

Experience with multi\-cluster Kubernetes environments.
Relevant certifications such as:

+ Certified Kubernetes Administrator (CKA)

+ Certified Kubernetes Security Specialist (CKS)

+ Google Cloud Professional Certifications

+ HashiCorp Terraform Associate

**Experience**
--------------

**5–10\+ years** of overall infrastructure/platform engineering experience.
**3–5\+ years** of hands\-on Kubernetes production experience.
Strong experience in cloud\-native platforms, observability, automation, and GitOps\-driven operations.

Job Overview

Job type: Full-time
Work mode: Remote
Location: Anywhere in India
Posted: 19h ago
Source: Scraped
