Required Skills

kubernetes, tensorflow, pytorch, nlp, recruitment

Job Description

Technical Solutions Cons

This role has been designed as ‘Hybrid’ with an expectation that you will work on average 2 days per week from an HPE office.

**Who We Are:**

Hewlett Packard Enterprise is the global edge-to-cloud company advancing the way people live and work. We help companies connect, protect, analyze, and act on their data and applications wherever they live, from edge to cloud, so they can turn insights into outcomes at the speed required to thrive in today’s complex world. Our culture thrives on finding new and better ways to accelerate what’s next. We know varied backgrounds are valued and succeed here. We have the flexibility to manage our work and personal needs. We make bold moves, together, and are a force for good. If you are looking to stretch and grow your career our culture will embrace you. Open up opportunities with HPE.

**Job Description:**

**HPE Operations** is our innovative IT services organization. It provides the expertise to advise, integrate, and accelerate our customers’ outcomes from their digital transformation. Our teams collaborate to transform insight into innovation. In today’s fast-paced, hybrid IT world, being at business speed means overcoming IT complexity to match the speed of actions to the speed of opportunities. Deploy the right technology to respond quickly to market possibilities. Join us and redefine what’s next for you.

**What you’ll do:**

We are seeking a Subject Matter Expert (SME) – Admin, Operate & Manage (HPE PCAI & AI Factory Solutions) to manage and optimize HPE’s next-generation AI infrastructure platforms. The ideal candidate will have deep hands-on expertise in AI, HPC, and GPU-accelerated environments, with strong knowledge of HPE Ezmeral, NVIDIA AI Enterprise, containerized workloads, and automation frameworks. This role focuses on the operational stability, lifecycle management, and continuous improvement of large-scale Private Cloud for AI (PCAI) and AI Factory deployments.

**Key Responsibilities:**

1. **Platform Administration**
   * Administer and maintain HPE PCAI and AI Factory environments, ensuring optimal uptime and performance.
   * Manage compute nodes (HPE DL380a, DL325, Cray XD670), GPU clusters (NVIDIA L40S/H100/H200), and InfiniBand NDR networks.
   * Administer virtualization and container platforms such as vSphere, RHEL/RHOS, Ezmeral Runtime Enterprise, Kubernetes, and Rancher Harvester.
   * Perform configuration, patching, version upgrades, and firmware updates across hardware and software layers.
2. **Operational Monitoring & Incident Management**
   * Proactively monitor system health using DCGM, NetQ, Grafana, and Exivity dashboards.
   * Handle alerts, performance anomalies, and incidents across GPU, network, and storage layers.
   * Lead root cause analysis (RCA) and corrective action plans to prevent recurring issues.
   * Maintain operational documentation, runbooks, and incident logs.
3. **Lifecycle & Configuration Management**
   * Manage cluster lifecycle through Ansible, AWX, HPE Performance Cluster Manager (HPCM), and SLURM.
   * Oversee automation for provisioning, scaling, and patch management of compute and containerized workloads.
   * Manage configuration changes, infrastructure templates, and version baselines in production and staging environments.
4. **AI Platform & Software Operations**
   * Operate HPE Ezmeral Unified Analytics, Data Fabric, and AI Essentials platforms.
   * Support NVIDIA AI Enterprise (NVAIE) components including NIMs, NeMo frameworks, and RAPIDS runtime.
   * Manage and monitor AI/ML workloads (LLM, NLP, Computer Vision, Chatbots) on containerized clusters.
   * Ensure smooth operation of development tools like Jupyter, Spark, Airflow, MLflow, Kubeflow, and Ray.
5. **Storage & Data Operations**
   * Administer VAST, WEKA, and Alletra MP storage solutions for file, object, and distributed storage.
   * Monitor storage performance, replication, and capacity utilization.
   * Coordinate with storage engineering teams for performance optimization and capacity planning.
6. **Security, IAM & Compliance**
   * Implement and maintain Keycloak for authentication and role-based access control.
   * Ensure adherence to compliance, audit, and governance standards for AI workloads.
   * Support user and service account provisioning, credential management, and access reviews.
7. **Continuous Improvement & Knowledge Enablement**
   * Optimize automation workflows to reduce manual intervention and improve service response time.
   * Drive service health reviews, operational dashboards, and SLA compliance reporting.
   * Conduct enablement sessions for L1/L2 teams and act as the final escalation point for operational issues.
   * Collaborate with HPE Engineering for patch validation, release readiness, and operational feedback.

**Required Skills & Technical Expertise:**

Core Infrastructure Skills

* Administration of HPE DL380a, DL325, Cray XD670, and GPU-based compute environments.
* Strong knowledge of NVIDIA GPU stack, InfiniBand NDR, and Spectrum-X switches.
* Experience in managing VAST, WEKA, or Alletra MP storage systems.

Software & Platform Operations

* Virtualization: vSphere, RHEL, Ezmeral Runtime Enterprise
* Containers: Kubernetes, Rancher Harvester, KubeSphere, Morpheus
* Automation: Ansible, AWX, NetBox, HPCM, SLURM
* Observability: Grafana, NetQ, Exivity, DCGM
* Security: Keycloak, IAM integrations

AI/ML Platform Administration

* Experience in HPE Ezmeral Unified Analytics and Data Fabric operations
* Familiarity with NVIDIA AI Enterprise, NIMs, NeMo, and Triton Inference Server
* Working knowledge of TensorFlow, PyTorch, Spark, Kubeflow, MLflow, and Jupyter

**Preferred Certifications:**

* HPE ASE / Master ASE (Compute, Storage, or Ezmeral)
* NVIDIA Certified Professional / NVAIE Certification
* RHCE / Kubernetes Administrator (CKA) / VMware VCP

**Soft Skills:**

* Strong analytical and troubleshooting capabilities.
* Excellent communication and collaboration skills across global teams.
* Ability to lead operations improvement initiatives and mentor support engineers.
* Focused on reliability, scalability, and service excellence.

**For Internal Job Movement:**

* Approval of the employee's current manager is required.
* Employees are expected to notify their manager prior to an interview.
* Employees on a Performance Improvement Plan are not eligible to apply.
* Minimum level should be EXP if applying as part of Internal Job Posting.

**Why Join Us:**

* Work on next-generation AI infrastructure operations and automation.
* Be part of a global team managing HPE’s AI Factory and PCAI platforms supporting large-scale AI workloads.
* Opportunity to contribute to service innovation and continuous improvement initiatives in AI infrastructure management.

**What you need to bring:**

* Bachelor’s / Master’s Degree in Computer Science, IT, or equivalent field.
* 8+ years of IT infrastructure administration experience, including 3+ years in AI/HPC or GPU-based environments.
* Proven experience in platform operations, monitoring, and lifecycle management of enterprise-grade AI and HPC environments.
* …

Job Overview

Job type: Full-time
Work mode: On-site
Location: Bengaluru
Posted: 1d ago
Source: Scraped
