Posted on: 08/12/2025
Description:
Key Responsibilities:
- Lead the design and development of highly scalable, reliable, and secure infrastructure on AWS/GCP.
- Own infrastructure architecture across microservices, streaming systems, storage, networking, and distributed systems.
- Build, improve, and scale CI/CD pipelines, GitOps workflows, and automated deployment frameworks.
- Drive observability strategy: monitoring, logging, and alerting using Prometheus, Grafana, ELK/EFK, CloudWatch, etc.
- Oversee infrastructure supporting data engineering, ML workloads, and real-time analytics.
- Implement industry best practices around reliability, performance tuning, and cost optimization.
- Review design documentation, contribute code when required, debug live issues, and support incident resolution.
- Guide infra teams on Kubernetes deployments, network routing, cloud security, IaC, and container lifecycle management.
- Build automation and tooling using Python, Go, Java, or similar languages.
- Lead, mentor, and grow high-performing DevOps, SRE, Platform, or Infra teams.
- Foster a culture of ownership, learning, and engineering excellence.
- Collaborate cross-functionally with product engineering, data engineering, and ML teams.
- Drive hiring, onboarding, skill development, and performance management.
- Ensure uptime, SLAs, and platform reliability for a large-scale consumer platform.
- Manage incident response, root-cause analysis, and preventive actions.
- Own budgets, cloud cost optimization, and capacity planning.
- Advocate for automation-first and infrastructure-as-code practices.
Required Skills & Qualifications:
- 5-12 years of experience in infrastructure, DevOps, SRE, or platform engineering.
- 2+ years of leadership or engineering management experience.
- Deep expertise in AWS/GCP cloud services.
- Strong experience with Docker, Kubernetes, and orchestration systems.
- Expertise with Infrastructure-as-Code tools such as Terraform, Pulumi, or AWS CDK.
- Strong hands-on knowledge of streaming and messaging platforms like Kafka, RabbitMQ, Spark Streaming, etc.
- Experience with distributed systems, high-scale architecture, and real-time systems.
- Proven experience building and scaling CI/CD pipelines.
- Strong foundation in monitoring, observability, and logging systems.
- Working knowledge of programming languages such as Python, Go, or Java, and frameworks such as Django or Spring.
- Experience managing infra-heavy, data-focused, or platform engineering teams.
Preferred / Nice-to-Have:
- Experience in OTT, streaming, or consumer internet platforms.
- Exposure to ML infrastructure, feature stores, and data governance.
- Contributions to open-source infrastructure or data tooling.
- Strong engineering community presence (conferences, blogs, open-source, meetups).
Posted in: DevOps / SRE
Functional Area: Engineering Management
Job Code: 1586516