Posted on: 17/07/2025
Job Requirements:
What You'll Do:
- Bridging the gaps between the core infra, security, QA and development teams.
- Owning the end-to-end availability, performance and capacity of applications and their infrastructure, and creating/maintaining the respective observability with Prometheus/New Relic/ELK/Loki.
- Providing 24x7 infra & app support, building processes and documenting "tribal" knowledge around that support.
- Mentoring and training L1 engineers and continually improving app and infra support processes.
- Managing application deployment & GKE platforms, and automating and improving development and release processes.
- Creating, managing and maintaining datastores & data platform infra using IaC.
- Owning and onboarding new applications with the production readiness review process.
- Managing the SLO/Error Budgets/Alerts and performing root cause analysis for production errors.
- Working with Core Infra, Dev and Product teams to define SLO/Error Budgets/Alerts.
- Working with the Dev team to have an in-depth understanding of the application architecture and its bottlenecks.
- Identifying observability gaps in application & infrastructure and working with stakeholders to fix them.
- Managing outages, performing detailed RCA with developers and identifying ways to prevent recurrence.
- Automating toil and repetitive work.
What We're Looking For:
- 6+ years of experience managing high-traffic, large-scale microservices and infrastructure, with excellent troubleshooting skills.
- Experience troubleshooting, managing and deploying containerized environments using Docker/containerd and Kubernetes is a must.
- Must be proficient with Helm and have experience with a service mesh such as Istio or Linkerd.
- Must be very hands-on in managing and troubleshooting the Kubernetes environment.
- Extensive experience with Linux administration and a good understanding of the various Linux kernel subsystems (memory, storage, network, etc.).
- Extensive experience with DNS, TCP/IP, UDP, gRPC, routing and load balancing.
- Expertise in GitOps, Infrastructure as Code tools such as Terraform, and configuration management tools such as Chef, Puppet, SaltStack or Ansible.
- Expertise in Google Cloud (GCP) and/or other relevant Cloud Infrastructure solutions like AWS or Azure.
- Experience building CI/CD pipelines with tools such as Jenkins, GitLab, Spinnaker or Argo.
- Experience with multiple datastores is a plus (Kafka/RabbitMQ, Redis, Elasticsearch).
- Must be proficient in at least one of the DevOps scripting languages: Python or Go.
- A collaborative spirit with the ability to work across disciplines to influence, learn and deliver.
Posted in
DevOps / SRE
Functional Area
DevOps / Cloud
Job Code
1514727