We're hiring a Full Stack Software Engineer to build the infrastructure that powers our AI agents and ML systems end\-to\-end, from UX/UI design and fine\-tuning foundation models to shipping production\-grade agent harnesses. You'll work across the stack: designing UX and building UI in ReactJS/TS, building MLOps pipelines, customizing LLMs, and deploying scalable agent systems on Kubernetes. This role sits at the intersection of UX design, ML engineering, platform engineering, and applied AI.
- Design UX and build UI for Agentic Ops
- Design and build agent harnesses in Python — the runtime scaffolding that enables AI agents to perceive, reason, plan, and act reliably
- Develop and maintain a robust MLOps framework using Kubeflow and complementary tooling (MLflow, Argo, Airflow, or similar) to orchestrate training, evaluation, and deployment workflows
- Fine\-tune foundation LLMs using techniques such as LoRA/QLoRA, SFT, and RLHF; manage datasets, training runs, and evaluation pipelines
- Deploy and operate services on Kubernetes, including model serving, autoscaling, and observability
- Build and integrate AI agents using modern agent frameworks (LangGraph, CrewAI, AutoGen, LlamaIndex, or similar)
- Apply software engineering rigor — SOLID principles, secure coding, static analysis, code reviews, and CI/CD — across all deliverables
- Bachelor’s or Master’s degree in Engineering and 8\+ years of experience in Python development, including building and supporting production systems
- Hands\-on experience working with agent\-based or agentic systems, using at least one framework such as LangGraph, CrewAI, AutoGen, LangChain, or LlamaIndex
- Exposure to designing or contributing to MLOps pipelines, and familiarity with tools like Kubeflow
- Practical experience in fine\-tuning large language models (for example, open\-source models like Llama, Mistral, Qwen, or similar)
- Experience deploying containerized applications on Kubernetes, including areas like Helm, operators, networking, and resource management
- Familiarity with at least one major cloud platform (AWS, GCP, or Azure), including services for compute, storage, identity and access management (IAM), and machine learning
- Understanding of software engineering practices such as modular design (SOLID principles), design patterns, secure coding practices, static analysis tools (for example, mypy, ruff, Bandit, SonarQube), and testing approaches (unit and integration testing)
- Exposure to distributed training approaches, using tools such as DeepSpeed, FSDP, or Accelerate
- Familiarity with vector databases, retrieval\-augmented generation (RAG) systems, and evaluation frameworks for language models
- Experience working with model serving solutions such as vLLM, TGI, KServe, or Triton