
Sails Software - Senior Site Reliability Engineer - Cloud Infrastructure

SAILS SOFTWARE SOLUTIONS PRIVATE LIMITED
Vishakhapatnam/Vizag
6 - 8 Years

Posted on: 12/11/2025

Job Description

Job Summary :

We are looking for an experienced and driven Senior Site Reliability Engineer (SRE) to architect, implement, and maintain robust cloud infrastructure.

This role demands a deep understanding of AWS, Kubernetes, and ECS, along with the ability to build scalable, secure, and highly available infrastructure from scratch.

The ideal candidate will be a strong advocate for DevOps principles, automation, and reliability, and will possess the skills to support and optimize complex microservices-based architectures.

Key Responsibilities :

Infrastructure Design & Implementation :

- Design and build highly scalable, fault-tolerant, and secure cloud infrastructure using AWS, Kubernetes, and ECS.

- Lead efforts in infrastructure as code (IaC) using tools like Terraform or CloudFormation.

- Develop and enforce best practices for infrastructure provisioning, security, and cost optimization.

System Reliability & Performance :

- Ensure availability, performance, scalability, and security of production systems.

- Implement observability strategies including monitoring, logging, and alerting using tools such as Prometheus, Grafana, ELK, or Datadog.

- Analyse system performance metrics and proactively identify potential issues and bottlenecks.

DevOps & Automation :

- Build and maintain CI/CD pipelines to streamline code deployments across environments.

- Drive automation in infrastructure provisioning, configuration management, and operational tasks.

- Ensure repeatable and reliable deployments using containers and orchestration tools like Kubernetes and ECS.

Service Management :

- Own the SRE lifecycle, including incident management, postmortems, root cause analysis, and runbook creation.

- Collaborate closely with development and QA teams to ensure seamless microservices integration, deployment, and lifecycle management.

- Maintain service-level objectives (SLOs), service-level agreements (SLAs), and error budgets.

Security & Compliance :

- Implement and enforce cloud security best practices for networking, identity and access management, and data protection.

- Support audits, compliance assessments, and vulnerability remediation.

- Monitor for security anomalies and work with security teams to respond to threats.

Technical Skills :

- 6+ years of hands-on experience in Site Reliability Engineering, DevOps, or Cloud Engineering.

- Expertise in AWS services such as EC2, S3, RDS, IAM, VPC, Lambda, and CloudWatch.

- Strong knowledge of Kubernetes and container orchestration best practices.

- Experience managing services on Amazon ECS (Fargate or EC2).

- Proficient in infrastructure-as-code tools like Terraform, CloudFormation, or Pulumi.

- Skilled in scripting languages such as Python, Bash, or Go.

- Solid grasp of networking, load balancing, DNS, and firewall rules in cloud environments.

- Deep understanding of microservices architectures, API gateways, and service meshes.

Soft Skills :

- Proven leadership and cross-functional collaboration skills.

- Strong problem-solving and incident-resolution mindset.

- Clear communication, documentation, and stakeholder reporting abilities.

- Passion for continuous improvement and automation.

Preferred Qualifications :

- AWS certifications such as AWS Certified DevOps Engineer - Professional, AWS Certified Solutions Architect - Professional, or equivalent.

- Familiarity with service meshes like Istio or Linkerd.

- Experience with serverless architectures and event-driven systems.

- Knowledge of regulatory and compliance frameworks (SOC 2, ISO 27001, GDPR) in cloud environments.

