Position Description: We are looking for an experienced Availability Manager to join our team. The ideal candidate will be responsible for ensuring that IT services consistently meet agreed availability targets and align with current and future business needs. You will own the end-to-end Availability Management process, working closely with Service Owners, Incident, Problem, Change and Capacity Managers, infrastructure teams, application support, and vendors to proactively monitor, measure, analyse, and improve the availability of critical services. The role requires a strong understanding of ITIL practices, monitoring and observability tools, resilience engineering, and the ability to translate service performance data into actionable improvement plans.

Job Title: Availability Manager
Position: Senior System Engineer/Lead Analyst
Experience: 7-13 Years
Category: Senior System Engineer/Lead Analyst
Shift: US Shift Time
Main location: India, Karnataka, Bangalore, Electronic City

Key Responsibilities

1. Availability Management Process Ownership
* Own and manage the Availability Management process aligned with ITIL best practices
* Define availability strategies, policies, and standards
* Ensure services meet agreed SLAs, OLAs, and underpinning contracts

2. Service Availability Monitoring & Reporting
* Monitor system and application availability across environments
* Define and track availability KPIs and SLAs
* Produce dashboards and reports using tools such as ServiceNow and monitoring tools (Dynatrace, AppDynamics, SolarWinds)

3. Availability Planning & Design
* Design availability models for new and existing services
* Conduct capacity and resilience planning
* Ensure high availability (HA), redundancy, and failover mechanisms are in place

4. Incident & Problem Management Integration
* Work closely with Incident and Problem Management teams
* Analyse outages and identify root causes impacting availability
* Drive permanent fixes to reduce downtime

5. Risk & Resilience Management
* Identify single points of failure (SPOFs)
* Conduct risk assessments and recommend mitigation strategies
* Ensure disaster recovery (DR) and business continuity plans are aligned

6. Continual Service Improvement (CSI)
* Identify trends and recurring availability issues
* Recommend improvements to enhance uptime and performance
* Drive automation and predictive monitoring

Key Skills & Competencies

Technical Skills
* Strong knowledge of ITIL (Availability, Incident, Problem, Capacity Management)
* Experience with monitoring tools: Dynatrace, AppDynamics, SolarWinds, Nagios
* Experience with ITSM tools such as ServiceNow
* Understanding of infrastructure (servers, networks, cloud platforms)

Analytical Skills
* Strong data analysis and trend identification
* Ability to interpret availability reports and system metrics
* Root cause analysis and problem-solving mindset

Soft Skills
* Strong stakeholder communication
* Ability to work across cross-functional teams
* Proactive and preventive mindset

KPIs / Success Metrics
* Service availability (%) vs SLA
* Number and duration of outages
* Mean Time Between Failures (MTBF)
* Mean Time to Restore Service (MTRS/MTTR)
* Reduction in recurring availability issues

Qualifications
* Bachelor's degree in IT, Engineering, or related field
* ITIL Certification (v3/v4) preferred
* 5–10 years of experience in IT Service Management

Preferred Experience
* Experience in enterprise or global IT environments
* Exposure to cloud platforms (Azure, AWS, GCP)
* Experience with high-availability and disaster recovery design

**Your future duties and responsibilities**

* Define, implement, and continually improve the Availability Management process in line with ITIL best practices and client SLAs/OLAs/UCs.
* Monitor, measure, analyse, and report on the availability, reliability, maintainability, and serviceability of IT services and supporting components.
* Produce and maintain the Availability Plan, reflecting current and future business needs, and ensure it aligns with the Service Level and Capacity Management processes.
* Proactively identify single points of failure, availability risks, and improvement opportunities; drive remediation through risk assessments, CFIA, FTA, and SOA techniques.
* Investigate and lead root cause analysis for major availability-impacting incidents and chronic issues, and ensure preventive actions are tracked to closure.
* Define and validate availability and recovery requirements for new and changed services, and participate in design reviews, Change Advisory Boards (CAB), and major release readiness assessments.
* Establish, track, and report KPIs such as service availability %, MTBF, MTRS, MTBSI, and unplanned downtime against agreed targets.
* Collaborate with Incident, Problem, Change, Capacity, and Continuity Managers to ensure an integrated approach to service quality and resilience.
* Engage with internal stakeholders, customers, and third-party vendors to review service performance, agree on improvement actions, and present availability dashboards in governance forums.
* Drive continual service improvement (CSI) initiatives, contribute to RCA reports, post-incident reviews, and produce executive-level availability reports.
* Ensure availability requirements are documented, maintained, and accessible in the Service Knowledge Management System (SKMS)/CMDB.

Must-Have Skills:

* Strong working knowledge of ITIL v3/v4 Service Management framework, with proven experience in Availability Management (ITIL Foundation mandatory; Intermediate / Specialist certification preferred).
* Hands-on experience designing and operating Availability Management processes in large, complex enterprise environments.
* Experience with monitoring, observability, and APM tools such as Dynatrace, AppDynamics, Splunk, ServiceNow ITOM, SCOM, Nagios, Prometheus, or Grafana.
* Strong experience with ServiceNow (or equivalent ITSM platforms) for incident, problem, change, and availability reporting workflows.
* Solid understanding of infrastructure components (servers, storage, network, databases, middleware, cloud) and how they impact end-to-end service availability.
* Proven ability to perform availability analysis techniques such as Component Failure Impact Analysis (CFIA), Fault Tree Analysis (FTA), Service Outage Analysis (SOA), and risk assessments.
* Strong data analysis and reporting skills, with the ability to build dashboards and present trends, KPIs, and improvement recommendations to senior stakeholders.
* Excellent problem-solving, analytical, and decision-making skills, especially under pressure during major incidents.
* Strong communication, stakeholder management, and collaboration skills, with the ability to engage technical teams and business leadership.

Good-to-Have Skills:

* Experience with cloud platforms such as AWS, Azure, or GCP and understanding of cloud-native availability and resilience patterns (HA, DR, multi-region, auto-scaling).
* Exposure to Site Reliability Engineering (SRE) practices, SLO/SLI/Error Budget concepts, and chaos engineering.
* Experience with IT Service Continuity Management (ITSCM) and Disaster Recovery planning and testing.
* Familiarity with automation and scripting (PowerShell, Python, or Shell) for availability reporting and monitoring integration.
* ITIL 4 Managing Professional, ITIL Specialist – Drive Stakeholder Value, or equivalent advanced certifications.
* Experience working in a 24x7 global delivery model…