Posted on: 23/09/2025
Key Responsibilities:
- Own the availability, scalability, and performance of production systems and services.
- Design and manage distributed systems and microservices architectures at scale.
- Develop and implement incident response strategies, perform root cause analysis, and write actionable postmortems.
- Drive improvements in infrastructure automation, CI/CD pipelines, and deployment strategies.
- Collaborate with cross-functional teams including engineering, product, and QA to embed SRE best practices.
- Implement observability tools (e.g., Prometheus, Grafana, ELK, Datadog) to monitor system performance and proactively detect issues.
- Manage and optimize cloud infrastructure on AWS, including services such as EC2, ELB, AutoScaling, S3, CloudFront, and CloudWatch.
- Utilize Infrastructure-as-Code tools such as Terraform, CloudFormation, or Pulumi for provisioning and maintaining infrastructure.
- Apply strong Linux, networking, load balancing, and security principles to ensure platform resilience.
- Leverage Docker and Kubernetes for container orchestration and scalable deployments.
- Build internal tools and automation using Python, Go, or Bash scripting.
- Support event-driven architectures leveraging Kafka or RabbitMQ for high-throughput, real-time systems.
- Proactively contribute to reliability-focused architecture and design discussions.
Required Skills & Experience:
- Minimum 3 years of experience leading SRE, DevOps, or Infrastructure teams.
- Proven track record managing distributed systems and microservices at scale.
- Deep understanding of Linux systems, networking fundamentals, load balancing, and infrastructure security.
- Strong hands-on experience with AWS services: EC2, ELB, AutoScaling, CloudFront, S3, and CloudWatch.
- Expert-level knowledge of Docker and Kubernetes in production environments.
- Proficient with Infrastructure-as-Code tools: Terraform, CloudFormation, or Pulumi.
- Hands-on experience with monitoring and observability tools: Prometheus, Grafana, ELK Stack, or Datadog.
- Strong scripting or programming skills in Python, Go, Bash, or similar languages.
- Familiarity with Kafka or RabbitMQ for event-driven and messaging architectures.
- Excellent incident management skills, including triage, RCA, and communication.
- Ability to thrive in fast-paced environments and adapt to changing priorities.
Preferred Qualifications:
- Experience in startup or high-growth environments.
- Contributions to open-source DevOps or SRE tools are a plus.
- Certifications in AWS, Kubernetes, or other cloud-native technologies are advantageous.
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1550420