Required Skills

Agile, Azure, GCP, Go, Java, Kubernetes, Python, SQL

Job Description

**Customer**

The Customer Team empowers organizations to build deeper relationships with customers through innovative strategies, advanced analytics, GenAI, transformative technologies, and creative design. We enable Deloitte client service teams to enhance customer experience and drive sustained growth and customer value creation and capture, through customer and commercial strategies, digital product and innovation, marketing, commerce, sales, and service. We are a team of strategists, data scientists, operators, creatives, designers, engineers, and architects, balancing business strategy, technology, creativity, and ongoing managed services to help solve the biggest problems that impact customers, partners, constituents, and the workforce. We also offer Business Process as a Service, enabling organizations to streamline operations and achieve greater efficiency through scalable, technology-enabled managed insights that guide ongoing transformation and operational excellence.

Position Summary

**Level:** Consultant Managed Service or equivalent

The Site Reliability Engineer (SRE) improves availability, latency, performance, efficiency, change safety, and resilience of production services on Microsoft Azure and Google Cloud Platform (GCP). The SRE defines and runs SLIs/SLOs/error budgets, builds reliability automation, leads incident response and blameless postmortems, and strengthens systems through resilience engineering and chaos experiments. The role uses industry observability platforms (e.g., Dynatrace, Splunk, Datadog) in addition to native cloud tooling to measure and improve customer outcomes.

**Work you’ll do:**

SLO/SLI & Error Budget Management

+ Define user-journey SLIs and measurable SLOs for critical services; translate reliability goals into engineering actions.

+ Operationalize error budgets to guide release risk decisions and reliability investment.

+ Run regular SLO reviews, publish reliability scorecards, and maintain service reliability roadmaps.

Observability Engineering (Dynatrace / Splunk / Datadog + Cloud-Native)

+ Design end-to-end observability across metrics, logs, traces, synthetics, and RUM (where applicable) mapped to SLIs.

+ Implement and govern telemetry standards (e.g., trace/metric conventions) and ensure coverage for critical paths.

+ Build actionable alerting (symptom-based), reduce noise, and improve on-call signal quality.

+ Create dashboards and investigations that connect platform signals to customer impact and SLO compliance.

Tools (examples)

+ Dynatrace: APM, distributed tracing, service flow, anomaly detection, SLO dashboards.

+ Splunk: log analytics, SIEM-adjacent investigations (when needed for prod incidents), correlation searches, alert tuning.

+ Datadog: APM, infra monitoring, logs, synthetics, SLO management, incident workflows.

+ Cloud-native: Azure Monitor / Log Analytics / Application Insights, GCP Cloud Monitoring / Logging / Trace.

Incident Response, On-Call, and Postmortems

+ Participate in on-call rotation; lead incident command for high-severity events.

+ Drive rapid mitigation (rollback/roll-forward), stakeholder comms, and stable recovery.

+ Facilitate blameless postmortems, identify systemic causes, and ensure corrective actions are implemented and verified.

Resilience, Capacity, Performance

+ Engineer reliability patterns: timeouts, retries (with jitter), circuit breakers, bulkheads, load shedding, graceful degradation.

+ Perform capacity planning, load testing, scaling strategy validation, and performance tuning aligned to SLOs.

+ Plan and test DR: define RTO/RPO, conduct failover tests and recovery drills.

Chaos Engineering

+ Design and run chaos experiments to validate resilience assumptions and reduce unknown failure modes.

+ Define hypotheses tied to SLOs (e.g., “regional dependency failure should degrade gracefully without breaching availability SLO”).

+ Implement controlled fault injection: dependency outages, latency/packet loss, CPU/memory pressure, pod/node termination, zonal failure simulations.

+ Establish safety guardrails: blast-radius limits, approvals, monitoring/abort conditions, and learning-focused postmortems.

+ Integrate game days into reliability programs and track reliability improvements from findings.

Toil Reduction & Reliability Automation

+ Identify and reduce toil via automation (auto-remediation, safe diagnostics, runbooks-as-code).

+ Build self-service operational tooling to improve mean time to detect/restore and reduce manual intervention.

+ Own/drive production readiness reviews and reliability acceptance criteria for new services.

Cloud Scope (Azure + GCP)

Azure (examples)

AKS, App Service, Functions, VM Scale Sets, Azure SQL/Cosmos DB, Event Hubs/Service Bus; resilience via Availability Zones, regional strategies, traffic management; telemetry via Azure Monitor / App Insights.

GCP (examples)

GKE, Cloud Run, Compute Engine, Cloud SQL/Spanner, Pub/Sub; resilience via multi-zone/region strategies and traffic management; telemetry via Cloud Monitoring/Logging/Trace.

Cross-Cloud

Standardize SLOs, incident practices, and observability conventions across Azure and GCP; manage reliability of shared dependencies (identity, DNS, certificates, third parties).

**The team:**

Our **Digital Foundry Operate & Innovations** (DFO&I) team partners with organizations to rapidly design, build, and scale digital products and experiences that drive business growth and elevate customer engagement. As a multidisciplinary group of strategists, designers, engineers, and operations specialists, we deliver end-to-end solutions, from initial concept and agile development to ongoing digital operations, enabling clients to experiment, iterate, and scale digital initiatives with confidence and agility. We support clients across domains such as strategy, commerce, marketing, sales, and service, helping them realize their digital ambitions through flexible, scalable teams. Our expertise spans the full digital lifecycle, including customer research, experience design, platform development, content production, and marketing automation. By bridging the gap between strategy and execution, we empower organizations to achieve measurable outcomes and deliver exceptional customer experiences in an ever-evolving digital landscape.

**Qualifications**

**Must Have Skills/Project Experience/Certifications:**

+ 3 to 6 years in SRE / Production Reliability for distributed, customer-facing systems.

+ Hands-on experience defining and operating SLIs/SLOs/error budgets.

+ Experience with Azure and GCP production workloads (especially AKS/GKE and managed services).

+ Strong incident response leadership and postmortem discipline.

+ Proficiency in at least one engineering language (Go/Python/Java/C#) for automation and tooling.

+ Practical experience with at least one enterprise observability platform: Dynatrace, Splunk, and/or Datadog.

**Preferred Skills:**

+ Kubernetes reliability engineering (autoscaling behavior, upgrades, networking, workload resiliency).

+ OpenTelemetry-based instrumentation and tracing practices.

+ Chaos engineering experience (game days, fault injection, experiment design and safety controls).

+ Regulated environment operations (auditability/change controls) while preserving SRE principles.

**Education:**

BE/B.Tech/M.C.A./M.Sc (CS) degree or equivalent from an accredited university.

**Location:**

Bengaluru/Hyderabad/Pune/Chennai

**Our purpose**

Deloitte’s purpose is to make an impact that matters for our people, clients, and communities. At Deloitte, purpose is synonymous with how we work every day. It defines who we are. Our purpose comes through in our work with clients that enables impact and value in their organizations, as well as through our own investments, commitments, and actions across areas that help drive positive outcomes for our communities.

**Our people and culture**

Our inclusive culture empowers our people to be who they are, contribute their unique perspectives, and make a difference individually and collectively. It enables us to leverage different ways of thinking, ideas and perspectives, and bring more creativity and innovation to help solve our clients’ most complex challenges. This makes Deloitte one of the most rewarding places to work.

**Professional development**

At Deloitte, professionals have the opportunity to work with some of the best and discover what works best for them. Here, we prioritize professional growth, offering diverse learning and networking opportunities to help accelerate careers and enhance leadership skills. Our state-of-the-art DU: The Leadership Center in India, located in Hyderabad, represents a tangible symbol of our commitment to the holistic growth and development of our people. Explore DU: The Leadership Center in India.

**Benefits to help you thrive**

At Deloitte, we know that great people make a great organization. Our comprehensive rewards program helps us deliver a distinctly Deloitte experience that empowers our professionals to thrive mentally, physically, and financially, and to live their purpose. To support our professionals and their loved ones, we offer a broad range of benefits. Eligibility requirements may be based on role, tenure, type of employment, and/or other criteria. Learn more about what working at Deloitte can mean for you.

**Recruiting tips**

From developing a standout resume to putting your best foot forward in the interview, we want you to feel prepared and confident as you explore opportunities at Deloitte. Check out recruiting tips from Deloitte recruiters.

Requisition code: 353600

SRE-Consultant, Managed Services- Customer-DF&I