- Proven ability to reduce operational toil, improve system resilience, and partner closely with product and engineering teams to embed reliability by design.

We are looking for a highly skilled Senior Site Reliability Engineer (SRE) to join our engineering organization.

- As a senior member of the team, you will play a key role in designing, building, and operating highly scalable, reliable, and secure systems across cloud and on-prem environments.

- You will partner closely with product engineering, DevOps, security, and platform teams to drive reliability, improve developer velocity, and advance operational excellence.

- This role requires hands-on experience with large-scale distributed systems, deep expertise in automation and infrastructure engineering, and a passion for reducing toil through code.

Key Responsibilities :

Reliability & Performance :

- Ensure availability, resilience, scalability, and performance of production systems

- Define, implement, and enforce SLIs, SLOs, and error budgets

- Conduct capacity planning, load testing, and performance tuning

Automation & Operations Engineering :

- Automate manual operational tasks via tooling, scripts, and platform services

- Develop infrastructure as code (IaC) for cloud and on-premises environments

- Implement CI/CD improvements and production-safe rollout strategies (blue/green, canary, feature toggles)

Observability & Monitoring :

- Build, manage, and improve logging, metrics, tracing, and alerting

- Implement proactive monitoring strategies to detect issues before they impact customers

- Own incident management processes, including postmortems and runbooks

Security & Compliance :

- Integrate security controls into pipelines and runtime environments

- Enforce least-privilege access, secret management, and vulnerability remediation

- Partner with SecOps to ensure compliance in regulated environments

Collaboration & Coaching :

- Work daily with engineering and DevOps teams to improve system reliability

- Mentor junior team members on design, reliability, cloud systems, and operational excellence

- Advocate SRE principles across engineering teams

Incident Response & Continuous Improvement :

- Lead incident triage and recovery

- Drive blameless post-incident reviews and systemic fixes

- Reduce MTTR through tooling, automation, and resilient architectures

Key Skills & Experience :

- 7 to 10 years of experience in SRE/Systems Engineering roles (required)

- Expertise in Linux-based systems and distributed architectures

- Proficiency in one or more programming/scripting languages (Python, Go, Bash, Java, or similar)

Hands-on experience with :

- Kubernetes (managed or self-hosted on-prem)

- Docker and container ecosystems

- Infrastructure automation tools (Terraform, Helm, etc.)

- CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins, Azure DevOps, etc.)

- At least one major cloud provider (AWS / Azure / GCP)

Strong understanding of :

- Networking concepts (DNS, load balancers, VPC, firewalls, NAT, routing)

- Observability stacks (Prometheus/Grafana, ELK, Splunk, OpenTelemetry, New Relic, Datadog)

- Experience running production systems at scale

Preferred :

- Experience with on-prem infrastructure, VMware, or hybrid-cloud environments

- Database reliability knowledge (PostgreSQL, MySQL, NoSQL stores such as MongoDB, caching systems)

Experience with :

- Distributed messaging (Kafka, RabbitMQ, SNS/SQS, etc.)

- Zero-downtime deployments

Background in :

- FinOps optimization

- Resiliency patterns (circuit breakers, retries, autoscaling)

- Certification(s) in cloud platforms or Kubernetes