ID: 59403
7 \- 9 Years
1 Opening
Pune
### **Role description**
Site Reliability Engineering (SRE) is a discipline that combines software and systems engineering for building and running large\-scale, distributed, fault\-tolerant systems. SRE ensures that internal and external services meet or exceed reliability and performance expectations while adhering to engineering principles. SRE is also an engineering approach to building and running production systems \- we engineer solutions to operational problems. Our SREs are responsible for overall system operation, and we use a breadth of tools and approaches to solve a broad set of problems. We follow practices such as limiting time spent on operational work, blameless postmortems, and proactive identification and prevention of potential outages. Our SRE culture of diversity, intellectual curiosity, problem solving and openness is key to its success.
Our team brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big, and take risks in a blame\-free environment. We promote self\-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn, grow and take pride in our work.

- Manage system uptime across cloud\-native (AWS, GCP) and hybrid architectures.
- Build infrastructure as code (IaC) patterns that meet security and engineering standards using one or more technologies (Terraform, scripting with cloud CLI, and programming with cloud SDK).
- Build CI/CD pipelines for build, test and deployment of application and cloud architecture patterns, using platform (Jenkins) and cloud\-native toolchains.
- Build automated tooling that deploys service requests and pushes changes into production.
- Build comprehensive, detailed runbooks to detect, remediate and restore services.
- Solve problems and triage complex distributed architecture service maps.
- Be on call for high\-severity application incidents and improve runbooks to reduce MTTR.
- Lead blameless availability postmortems and own the calls to action to remediate recurrences.


- BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics), or equivalent job experience required.
- 7\-10 years of experience in software engineering, systems administration, database administration, and networking.
- 4\+ years of experience developing and/or administering software in public cloud.
- Experience in monitoring infrastructure and application uptime and availability to ensure functional and performance objectives.
- Hands\-on experience working in GCP infrastructure services and provisioning automated infrastructure using Terraform.
- Rebuilding GCP VM instances using automated Terraform and Jenkins pipelines.
- Provisioning GCP resources (e.g., GCE, GKE resources, storage, network components) using automated pipelines.
- Configuring monitoring and dashboards for the microservices using Stackdriver, Datadog and AppDynamics.
- Creating new automation scripts and updating the existing automation framework using Terraform and Shell/Python scripting.
- Setting up IAM policy bindings through automation and troubleshooting issues with IAM policies in GCP and AWS.
- Implementing custom roles and binding custom roles in GCP projects.
- Automating blue/green deployment strategies in GCP environments.
- On call with PagerDuty for production incidents.
- Hands\-on experience in setting up CI/CD pipelines using Git and Jenkins.
- Good experience in creating and updating Helm charts for GKE resources.
- Demonstrable cross\-functional knowledge of systems, storage, networking, security and databases.

What could set you apart

An ability to demonstrate successful performance of our Success Profile skills, including:

- DevSecOps \- Leads DevSecOps operational practices and designs solutions that improve resilience of products/services. Designs, codes, verifies, tests, documents and modifies complex programs/scripts and integrated software services. Leads exploration of new software development methods, tools, and techniques. Continuously looks for opportunities to improve standard processes and tools to achieve a well\-engineered result. Conducts reviews of overall team performance and works directly with colleagues to improve team performance.
- Operational Excellence \- Drives work plans for short\-term assignments of moderate complexity, typically contained within their own function. Establishes the processes to monitor and measure systems against key metrics to ensure availability of systems. Reviews and recommends new ways of working to make processes run smoother and faster.
- Systems Thinking \- Ensures knowledge of best practices and how systems integrate with others to improve their own work and the work of less experienced colleagues. Assesses technology trends and uses that knowledge to make recommendations on improving upon the defined expectations of systems availability.

- Technical Communication/Presentation \- Articulates complex messages and their impacts to stakeholders to build support and agreement. Demonstrates strong written and verbal communication skills, and the ability to tailor them to specific audiences. Works with others to achieve results and proactively addresses sources of conflict and emotion, with a focus on the best solution.
- Troubleshooting \- Applies a methodical approach to routine and moderately complex issue definition and resolution. Initiates and coordinates actions to investigate and resolve problems in systems, processes and services. Reviews and approves problem fixes/remedies. Plans and coordinates the implementation of agreed remedies. Ensures that patterns and trends are assessed and makes recommendations for improved system reliability.

### **Skills**
Site Reliability Engineering, Terraform, Cloud SDK, CI/CD, DevSecOps, Amazon Web Services
### **About UST**
UST is a global digital transformation solutions provider. For more than 20 years, UST has worked side by side with the world’s best companies to make a real impact through transformation. Powered by technology, inspired by people and led by purpose, UST partners with their clients from design to operation. With deep domain expertise and a future\-proof philosophy, UST embeds innovation and agility into their clients’ organizations. With over 30,000 employees in 30 countries, UST builds for boundless impact—touching billions of lives in the process.