Required Skills

JavaScript, Python, Java, AWS, Kubernetes, PostgreSQL

Job Description

Closing date for applications: 16/06/2026

Location: Gurugram, India

Job type: Permanent \| Contract type: Full Time

\#R\-00278946

**Join our digital revolution in NatWest Digital X**
---------------------------------------------------

In everything we do, we work to one aim: to make digital experiences that are effortless and secure.

So we organise ourselves around three principles: engineer, protect, and operate. We engineer simple solutions, we protect our customers, and we operate smarter.

**Job description**
------------------

This role is based in India and as such all normal working days must be carried out in India.

Join us as a Site Reliability Engineer

- In this key role, you’ll improve, drive, and embed non\-functional and operational characteristics such as availability, performance, efficiency, change management, monitoring, security, incident response, and capacity planning of our products and services
- You’ll enjoy significant stakeholder interaction, working in collaboration with engineers to ensure a principled approach to delivering change in a safe and secure way
- This is a chance to join an inclusive team with a collaborative ethos and a commitment to innovation and professional development
- We're offering this role at vice president level

**What you'll do**
-----------------

As a Senior Site Reliability Engineer, you’ll act as a hands‑on expert responsible for ensuring the reliability, availability and performance of critical production platforms.

You’ll lead the adoption of SRE practices, embedding resilience, observability and operational excellence into distributed systems running on AWS and Kubernetes. You’ll also take ownership of **24/7 production support models**, ensuring systems are highly available and incidents are effectively managed and learned from.

In addition to this, you’ll be:

- Designing and operating highly resilient AWS\-based Kubernetes platforms (EKS) aligned to enterprise standards
- Owning and improving **production reliability, availability, and SLA/SLO frameworks**
- Leading **incident management, escalation and 24/7 on\-call practices**, including post\-incident reviews
- Embedding **SRE principles** such as error budgets, toil reduction, and reliability engineering into delivery teams
- Implementing infrastructure and platform automation using Terraform and GitOps methodologies
- Driving **self\-healing, auto\-scaling and failure recovery mechanisms** using tools such as Karpenter
- Building secure, scalable networking and service communication (e.g. Cilium)
- Defining and operating observability platforms using **Grafana, Prometheus, Loki, Tempo**
- Partnering with DevOps and engineering teams to ensure production readiness and operational excellence
- Leading complex troubleshooting across distributed systems and cloud\-native environments
- Developing reusable “golden paths”, operational runbooks and reliability patterns

**The skills you'll need**
-------------------------

We’re looking for a highly experienced SRE with a strong background in operating large\-scale, business\-critical platforms and a passion for reliability engineering.

We’re also looking for:

- Deep expertise managing **production systems on AWS and Kubernetes (EKS)**
- Strong experience in **24/7 support models, incident management and on\-call leadership**
- Advanced knowledge of **SRE principles (SLIs, SLOs, error budgets, toil reduction)**
- Proficiency in Terraform, GitOps, and cloud automation practices
- Hands\-on experience with GitLab CI/CD and Argo CD
- Strong understanding of **Kubernetes networking, security and service mesh technologies**, ideally Cilium
- Experience scaling infrastructure using Karpenter and auto\-scaling strategies
- Expertise in observability tooling (Grafana, Prometheus, Loki, Tempo)
- Proven ability to troubleshoot and resolve complex, cross\-system production issues
- Experience operating in regulated or high\-security environments
- Strong leadership, mentoring, and stakeholder engagement capabilities
- Ability to balance reliability, risk, and delivery in a fast\-paced environment

**Welcome to our Gurugram hub**
------------------------------

Spanning 437,000 sq. ft., our campus in Gurugram features two state\-of\-the\-art towers – 1A and 2A at the Candor TechSpace in Sector 21\.

### **Key facts:**

- Surrounded by 28 acres
- Space for 4,100 colleagues
- Opened in 2010

### **Our tech stack**

Here are just some of the technologies we use.

### **Front end**

- JavaScript
- ReactJS
- AngularJS

### **Back end**

- Python
- Java
- Microsoft Dynamics

### **DevOps**

- AWS
- GitLab
- Google Cloud Platform

### **Data**

- Kafka
- Hadoop
- PostgreSQL
- Snowflake

Job Overview

Job type: Full-time
Work mode: On-site
Location: Faridabad

Site Reliability Engineer (AWS & Kubernetes), VP