Responsibilities:

- The Site Reliability Engineering (SRE) team is responsible for the reliability, scalability, stability, and performance of systems and services.

- They work with cross-functional teams to design, build, and maintain systems, and they troubleshoot issues when they arise.

- They bridge the gap between development and operations teams.

- They work closely with business teams to define Service Level Objectives (SLOs) and Service Level Agreements (SLAs) for critical systems.

- They also monitor and maintain the uptime of these systems in line with the defined SLOs and SLAs.

- They deploy and manage monitoring tools to gain insights into system health and performance.

- They analyze performance, identify bottlenecks, and implement solutions to improve a system's scalability and latency.

- They develop scripts, tools, and automation frameworks to reduce manual effort in deployment, monitoring, and scaling.

- They work with development teams to design and implement observability practices such as logging, metrics, and tracing.

- They aim to diagnose and troubleshoot issues proactively.

- They create actionable alerts on monitoring systems to ensure a rapid response to potential production incidents.

- They forecast resource needs and provision adequately for current and future demand.

- They design and execute chaos experiments to test the system's resilience to failure.

- They own, define, and implement the Disaster Recovery (DR) processes for systems.

- They also conduct planned and unplanned mock DR drills to test for response preparedness during production incidents.

- They ensure that security best practices are followed and implemented during the design and operations of systems.

- They also own and maintain documentation of processes, playbooks, and systems.

- They publish KPI reports and other system health updates on a regular basis to the business.

Requirements:

- Must-have - Bachelor's degree, preferably in CS or a related field, or equivalent experience.

- Must-have - 12+ years of overall IT experience.

- Must-have - 7+ years of proven work experience as a Senior Site Reliability Engineer or a similar position.

- Must-have - 5+ years of AWS Cloud experience, with an AWS certification such as Certified DevOps Engineer, SysOps, or Security.

- Must-have - 3+ years of experience using a broad range of AWS technologies (e.g. EC2, RDS, ELB, S3, VPC, CloudWatch, and monitoring tools) to develop and maintain an AWS-based cloud solution, with an emphasis on cloud security best practices.

- Must-have - 2+ years of experience in CDN and/or Cache systems like Fastly, Akamai, CloudFront, etc.

- Proven understanding of and strong experience with cloud deployments (AWS/Docker/Kubernetes).

- Knowledge of infrastructure-as-code (IaC) provisioning tools and languages such as Terraform, Chef, Ansible, Shell, Groovy, and Python.

- Experience with monitoring systems such as CloudWatch, New Relic, Datadog, Splunk, and the ELK stack.

- Experience managing cloud network resources (AWS preferred), such as CloudWatch, VPC, URL proxies, PrivateLink, DNS, ACLs, firewalls, and C2S access points.

- Platform or application engineering and operational knowledge of CI/CD tooling such as GitHub Actions or Jenkins.

- Experience with other tooling such as Jira, Bitbucket, Jenkins, Fortify, SonarQube, Nexus, and Nexus IQ.

- Experience with configuration automation tools like Puppet/Ansible/Chef/Salt.

- Scripting skills: strong scripting (e.g. Bash and Python) and automation skills.

- Operating systems: Windows and Linux system administration.

- Problem solving: ability to analyze and resolve complex infrastructure, resource, and application deployment issues.

- Strong attention to detail. Excellent verbal and written communication skills.

- Strong documentation skills.

Good to Have:

- Experience with Terraform/Ansible/Chef/Puppet.

- Experience with GitHub Actions.

- Experience with CloudFront, Fastly.

- Oversees team members performing these functions.

- Anticipates problems and future technical needs and takes necessary steps to address issues.

- Works primarily in server-side technologies and is comfortable with client-side work when required.

- Enthusiastically follows technology trends and software engineering best practices.

Posted By

Tanmay

Co-Founder at Dash Hire

Posted in

DevOps / SRE

Functional Area

Site Reliability Engineering

Job Code

1537762
