We are seeking an experienced Senior DevOps Engineer to design, build, operate, and continuously optimize highly scalable cloud infrastructure on AWS. This role is hands-on and ownership-driven, focused on ensuring platform reliability, performance, security, and cost efficiency at scale. The ideal candidate has proven experience managing infrastructure for high-traffic B2C products and understands the challenges of running systems that serve millions of users with unpredictable load patterns.

You will be responsible for end-to-end DevOps practices, including cloud architecture, CI/CD automation, monitoring, security, scalability planning, and incident response. This role requires strong technical depth, proactive risk identification, and the ability to work closely with backend, frontend, and product engineering teams to support rapid feature delivery without compromising stability.

Key Responsibilities :

1. Cloud Infrastructure - AWS (Primary Focus) :

- Architect, deploy, and manage scalable, secure, and highly available infrastructure using core and advanced AWS services such as EC2, ECS/EKS, Lambda, S3, CloudFront, RDS, ELB/ALB, VPC, IAM, and Route 53.

- Design infrastructure that supports high concurrency, low latency, and fault tolerance for consumer-facing workloads.

- Continuously optimize cloud costs, performance, and resource utilization across development, staging, and production environments.

- Ensure infrastructure is resilient to failures through redundancy, autoscaling, and disaster recovery strategies.

2. CI/CD Automation :

- Design, build, and maintain robust CI/CD pipelines using tools such as Jenkins, GitHub Actions, or GitLab CI.

- Automate build, test, and deployment workflows for microservices, backend APIs, and supporting systems.

- Implement deployment strategies such as blue/green and canary releases to enable zero-downtime production rollouts.

- Collaborate with engineering teams to streamline release cycles and reduce deployment-related risks.

3. Observability & Monitoring :

- Implement comprehensive logging, metrics collection, and alerting using tools such as Grafana, Prometheus, the ELK stack, CloudWatch, New Relic, or equivalent platforms.

- Establish proactive monitoring to detect anomalies, performance degradation, and service disruptions before they impact users.

- Build real-time dashboards that provide visibility into system health, application performance, and traffic behavior during peak usage.

- Conduct post-incident analysis and continuously improve observability based on production learnings.

4. Security, Compliance & Risk Identification :

- Conduct regular risk assessments to identify vulnerabilities across cloud architecture, IAM access policies, secrets management, and network exposure.

- Implement security best practices including VPC isolation, least-privilege IAM roles, WAF rules, firewall configurations, and secure SSL/TLS management.

- Ensure secure handling of credentials, API keys, and sensitive data using appropriate secrets management solutions.

- Actively participate in incident response, root cause analysis, and remediation planning for security or availability issues.

5. Scalability & Reliability Engineering :

- Analyze traffic patterns and usage trends, including peak hours, weekends, and event-driven spikes.

- Identify scalability bottlenecks across microservices, caching layers, CDN distribution, and database workloads.

- Design and implement solutions to support rapid growth and sudden traffic surges without service degradation.

- Perform capacity planning, load testing, and stress testing to ensure the platform is prepared for 10x or higher traffic growth.

6. Database & Storage Support :

- Administer and optimize MongoDB for read-heavy, low-latency production workloads.

- Design and maintain backup, recovery, and replication strategies to ensure data durability and availability.

- Work closely with backend teams to improve query performance, indexing strategies, and overall database efficiency.

- Monitor database health and proactively address performance or scaling challenges.

7. Automation & Infrastructure as Code :

- Implement Infrastructure as Code using tools such as Terraform, CloudFormation, or Ansible.

- Automate repetitive infrastructure and operational tasks to ensure consistency, reliability, and faster provisioning.

- Maintain version-controlled infrastructure definitions to support auditability and repeatability across environments.

Required Skills & Experience :

Technical Must-Haves :

- 5+ years of DevOps or SRE experience in cloud-native, product-based environments.

- Strong hands-on experience with AWS services and production-grade architectures.

- Expertise in building and maintaining Jenkins-based CI/CD pipelines.

- Solid experience managing MongoDB in production systems.

- Strong understanding of networking concepts including VPCs, subnets, routing, NAT, and security groups.

- Proficiency in scripting with Bash (or another Unix shell) or Python.

- Experience with incident management, root cause analysis, and risk identification.

Nice to Have :

- Experience working on high-traffic, content-heavy, or streaming-based platforms.

- Familiarity with Docker, Kubernetes, EKS, and container orchestration patterns.

- Understanding of CDN behavior, caching strategies, and media delivery pipelines.

Personality & Mindset :

- Strong sense of ownership and accountability for system reliability and performance.

- Proactive problem-solver who anticipates scaling and operational challenges before they occur.

- Comfortable collaborating with cross-functional teams in fast-paced product environments.