Posted on: 29/01/2026
Job Description :
- Senior Site Reliability Engineer with 7-10 years of experience building and operating highly available, scalable, and secure distributed systems.
- Strong background in cloud-native platforms, Kubernetes, automation, observability, and reliability engineering.
- Proven ability to reduce operational toil, improve system resilience, and partner closely with product and engineering teams to embed reliability by design.
We are looking for a highly skilled Senior Site Reliability Engineer (SRE) to join our engineering organization.
- As a senior member of the team, you will play a key role in designing, building, and operating highly scalable, reliable, and secure systems across cloud and on-prem environments.
- You will partner closely with product engineering, DevOps, security, and platform teams to drive reliability, improve developer velocity, and advance operational excellence.
- This role requires hands-on experience with large-scale distributed systems, deep expertise in automation and infrastructure engineering, and a passion for reducing toil through code.
Key Responsibilities :
Reliability & Performance :
- Ensure availability, resilience, scalability, and performance of production systems
- Define, implement, and enforce SLIs, SLOs, and error budgets
- Conduct capacity planning, load testing, and performance tuning
Automation & Operations Engineering :
- Automate manual operational tasks via tooling, scripts, and platform services
- Develop infrastructure as code (IaC) for cloud and on-premises environments
- Implement CI/CD improvements and production-safe rollout strategies (blue/green, canary, feature toggles)
Observability & Monitoring :
- Build, manage, and improve logging, metrics, tracing, and alerting
- Implement proactive monitoring strategies to detect issues before they impact customers
- Own incident management processes including postmortems and runbooks
Security & Compliance :
- Integrate security controls into pipelines and runtime environments
- Enforce least-privilege access, secret management, and vulnerability remediation
- Partner with SecOps to ensure compliance in regulated environments
Collaboration & Coaching :
- Work daily with engineering and DevOps teams to improve system reliability
- Mentor junior team members on design, reliability, cloud systems, and operational excellence
- Advocate SRE principles across engineering teams
Incident Response & Continuous Improvement :
- Lead incident triage and recovery
- Drive blameless post-incident reviews and systemic fixes
- Reduce MTTR through tooling, automation, and resilient architectures
Key Skills & Experience :
- 7-10 years of experience in SRE/Systems Engineering roles (required)
- Expertise in Linux-based systems and distributed architectures
- Proficiency in one or more programming/scripting languages: Python, Go, Bash, Java, or similar
Hands-on experience with :
- Kubernetes (managed or self-hosted on-prem)
- Docker and container ecosystems
- Infrastructure automation tools (Terraform, Helm, etc.)
- CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins, Azure DevOps, etc.)
- Cloud experience with at least one major provider (AWS / Azure / GCP)
Strong understanding of :
- Networking concepts (DNS, load balancers, VPC, firewalls, NAT, routing)
- Observability stacks (Prometheus/Grafana, ELK, Splunk, OpenTelemetry, New Relic, Datadog)
- Experience running production systems at scale
Preferred :
- Experience with on-prem infrastructure, VMware, or hybrid-cloud environments
- Database reliability knowledge (PostgreSQL, MySQL, NoSQL stores such as MongoDB, caching systems)
Experience with :
- Distributed messaging (Kafka, RabbitMQ, SNS/SQS, etc.)
- Zero-downtime deployments
Background in :
- FinOps optimization
- Resiliency patterns (circuit breakers, retries, autoscaling)
- Certification(s) in cloud platforms or Kubernetes
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1607136