We are seeking a highly skilled Senior AI Solutions Engineer responsible for provisioning and deploying solutions that support AI training and inferencing workloads. The ideal candidate has a strong understanding of Kubernetes, IaC (Terraform), CI/CD pipelines (ArgoCD, Jenkins), MLOps, LLMOps (LLM pipelines), virtualization, inference/model serving, and compute, storage and networking within a data center; a strong grasp of Gen AI, LLMs, Machine Learning, and Deep Learning; and hands\-on experience deploying Kubernetes in both virtualized and bare metal environments. Knowledge of operating systems, virtualization, container orchestration, configuration management, automation, distributed systems, and artificial intelligence and their capabilities is a must.
Maintain thorough documentation for processes, platform architecture, system configurations, and troubleshooting steps.
+ Bachelor’s degree in Computer Science, Information Technology, or a related field (or equivalent work experience).
+ 5\+ years’ experience provisioning and administering container orchestration platforms to support mission\-critical AI workloads.
+ Experience working on projects involving compute, network and storage components within a data center.
+ Experience writing and formatting high\-level and low\-level technical documentation for proposed solutions.
+ Experience in AIOps and coordinating with the platform support team.
+ Understanding of machine learning, deep learning, neural networks, and foundation models.
+ Understanding of AI training and fine\-tuning workflows, inference pipelines, and feature engineering.
+ Hands\-on knowledge of deploying, configuring and maintaining container orchestration tools such as Red Hat OpenShift, RKE2, or upstream Kubernetes.
+ Working knowledge of container fundamentals: container networking and storage volumes, as well as building and deploying Docker images.
+ Working knowledge of various types of operating systems, such as Unix, Linux and Windows.
+ Understanding of integration with observability, monitoring (Prometheus, Grafana), and logging tools.
+ Familiarity with scripting, Python, Ansible, Terraform, Git, and CI/CD pipelines.
+ Understanding of vGPU, pass\-through, MIG, or container\-based GPU orchestration options.
+ Familiarity with Agile and DevOps ways of working.
+ Strong stakeholder management and communication skills.
+ Familiarity with the Kubernetes ecosystem, including Helm charts, operators, and container registries (e.g. Quay).
+ Experience with automation tools for infrastructure provisioning and configuration such as Ansible and Terraform.
+ Experience with monitoring and logging tools such as Prometheus, Grafana, Zabbix or similar.
+ Understanding of firewalls, security policies, NAT, VPN tunnels, RBAC, TLS, PKI and certificates.
+ Understanding of distributed systems requirements and design (scalability, fault tolerance, HA).
+ Understanding of Gen AI, LLMs, RAG pipelines, and relevant use cases.
+ Familiarity with TensorFlow, PyTorch, RAPIDS, and other GPU\-accelerated libraries.
+ Professional\-level certifications.
Hands\-on experience with Machine Learning and Deep Learning frameworks, GPU virtualization, and inference/model serving.
This position may require evening and weekend work for time\-sensitive project implementations.