Position : Senior Site Reliability Engineer (SRE)

Experience : 10+ Years

Location : Remote

Please Note : Candidates should be ready to work 24x7 rotational shifts and provide on-call support, with weekly rotations.

About the Role :

We are seeking a Senior Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of large-scale production systems.

This role demands strong technical expertise, ownership, and strategic thinking, with a focus on automation, monitoring, and operational excellence.

You will design and implement SRE practices for our customers, drive improvements in availability and performance, and work closely with development teams to build resilient systems that scale.

Key Responsibilities :

- Design, implement, and maintain highly available and scalable systems on AWS and Azure.

- Define, configure, and report on SLIs (Service Level Indicators), SLOs (Service Level Objectives), and SLAs (Service Level Agreements).

- Conduct performance analysis, capacity planning, and load testing to identify and resolve bottlenecks.

- Drive system and tooling improvements: implement monitoring stacks, tracing tools, and CI/CD pipelines.

- Develop and maintain automation frameworks and tools (Python, Go, Java) to eliminate manual tasks (toil).

- Manage infrastructure using Infrastructure as Code (IaC) tools such as Terraform or Ansible.

- Enhance and maintain CI/CD pipelines for reliable, secure deployments.

- Lead incident response efforts to reduce MTTD and MTTR, and conduct blameless postmortems and root cause analyses (RCAs).

- Participate in on-call rotations to resolve production issues quickly.

- Collaborate with development teams to influence system design for improved reliability, operability, and security.

- Configure and manage monitoring and observability stacks (Prometheus, Grafana, ELK/Loki).

- Develop error budgets and build dashboards for reliability reporting.

- Write scripts to automate repetitive operational tasks and improve overall efficiency.

Required Skills & Experience :

- Deep knowledge of Linux/Unix administration, troubleshooting, and performance tuning.

- Strong understanding of networking fundamentals (TCP/IP, DNS, load balancing).

- Hands-on experience with AWS and/or Azure cloud platforms.

- Expertise in Infrastructure as Code (IaC) tools such as Terraform, CloudFormation, or Ansible.

- Proficiency in Docker and Kubernetes for containerization and orchestration.

- Experience with monitoring tools (Prometheus, Grafana) and logging solutions (ELK Stack: Elasticsearch, Logstash, Kibana).

- Familiarity with CI/CD pipelines (Azure DevOps, Jenkins, GitLab CI) and Git version control.

- Strong scripting and programming skills in Python, Go, or Java.

- Excellent problem-solving, analytical, and collaboration skills.

What We Offer :

- Opportunity to work with cutting-edge infrastructure and cloud technologies.

- Ownership of critical reliability and automation initiatives.

- Collaborative work environment focused on learning and innovation.