Posted on: 09/07/2025
Responsibilities:
- Bridging the gaps between the core infra, security, and development teams.
- Owning the end-to-end availability, performance, and capacity of applications and their infrastructure, and building and maintaining the corresponding observability with Datadog/New Relic/ECS.
- Providing 24x7 infra and app support while building processes and documenting tribal knowledge.
- Managing application deployments and AWS ECS platforms; automating and improving development and release processes.
- Creating, managing, and maintaining data stores and data platform infra using IaC.
- Owning and onboarding new applications with the production readiness review process.
- Managing SLOs, error budgets, and alerts, and performing root cause analysis for production errors.
- Working with the Dev team to have an in-depth understanding of the application architecture and its bottlenecks.
- Identifying observability gaps in application and infrastructure, and working with stakeholders to fix them.
- Managing outages by performing detailed RCA with developers and identifying ways to prevent recurrence.
- Automating toil and repetitive work.
Requirements:
- 4 to 6 years of experience in managing large-scale microservices and infrastructure with excellent troubleshooting skills.
- Experience in troubleshooting, managing, and deploying containerized environments using Docker/containers; ECS is a must.
- Must be very hands-on in managing and troubleshooting the AWS environment.
- Extensive experience with Linux administration and a good understanding of the various Linux kernel subsystems (memory, storage, network, etc.).
- Good experience with DNS, TCP/IP, UDP, gRPC, routing, and load balancing.
- Expertise in GitOps, Infrastructure as Code tools such as Terraform, and configuration management tools such as Chef, Puppet, SaltStack, and Ansible.
- Experience working with Cloud Infrastructure solutions like AWS.
- Experience in building CI/CD pipelines.
- Experience with multiple data stores is a plus (Redis, Elasticsearch).
- Must be proficient in at least one DevOps scripting language: Python, Ruby, or Go.
Posted in: DevOps / SRE
Functional Area: DevOps / Cloud
Job Code: 1510034