Posted on: 18/08/2025
Role : Site Reliability Engineer.
Location : Pune (on-site).
Experience : 3+ years.
Ideal candidate : Someone with experience setting up an in-house monitoring platform with a 99.99% uptime SLA using VictoriaMetrics and Prometheus across multiple regions.
Site Reliability Engineer, Zoop.
The Opportunity :
We're seeking a Senior Site Reliability Engineer to elevate and standardize our reliability engineering practices. This role offers the opportunity to shape and optimize SRE practices in a high-growth fintech environment while working with cutting-edge technologies and critical identity verification services.
Key Responsibilities :
Standardization & Optimization :
- Assess and standardize existing monitoring and observability practices across New Relic and Prometheus.
- Refine and formalize SLIs/SLOs for all solution offerings.
- Optimize current alerting strategies to improve signal-to-noise ratio.
- Document and standardize incident management processes.
- Create comprehensive runbooks for all critical services.
Reliability Engineering :
- Drive improvements to achieve and maintain 99.95% uptime for critical services.
- Optimize API response times to strengthen our "Fastest Platform" positioning.
- Implement advanced chaos engineering practices.
- Enhance existing automation and self-healing capabilities.
- Standardize disaster recovery and business continuity procedures.
Infrastructure Excellence :
- Optimize our GCP/Kubernetes infrastructure (and AWS where applicable) for enhanced reliability.
- Standardize Infrastructure as Code (IaC) practices across teams.
- Identify and automate remaining manual operational tasks.
- Build advanced tooling for monitoring, deployment, and troubleshooting.
- Drive cloud cost optimization initiatives.
- Prepare for potential self-hosting scenarios, including operating Grafana, Prometheus, VictoriaMetrics, and log stacks such as Loki and Elastic.
Security & Compliance :
- Ensure all reliability practices meet ISO 27001:2022, ISO 27017:2015, ISO 27018:2019, ISO 27701:2019, and SOC 2 Type II requirements (with a pragmatic, risk-based approach).
- Enhance security monitoring and anomaly detection.
- Standardize secure CI/CD practices across the organization.
- Implement comprehensive audit and compliance reporting.
Collaboration & Process Improvement :
- Partner with the Platform team to enhance and standardize existing SRE workflows.
- Collaborate with 50+ developers to strengthen reliability culture.
- Lead blameless post-mortems and drive systematic improvements.
- Establish SRE best practices and knowledge-sharing sessions.
- Build a roadmap for eventual SRE team expansion.
Technical Requirements :
Must-Have Skills :
- Experience : 3+ years in SRE, DevOps, or similar roles with a focus on standardizing and scaling practices.
- Cloud Expertise : Deep hands-on experience with Google Cloud Platform (GCP) and Amazon Web Services (AWS).
- Container Orchestration : Advanced Kubernetes and Docker skills in production environments.
- Programming : Proficiency in at least two of Go, Python, and TypeScript, plus strong shell-scripting abilities.
- Operating Systems : Expert-level Linux knowledge and tuning.
- Monitoring : Expert-level knowledge of Prometheus and New Relic.
- IaC : Strong experience with Terraform or similar tools.
- Process Excellence : Proven track record of standardizing SRE practices.
Preferred Qualifications :
- Experience in fintech, banking, or other high-security environments.
- Knowledge of ISO 27001, SOC 2, and related compliance requirements.
- Experience optimizing API reliability at scale (millions of requests/day).
- Background in maturing existing SRE practices.
- Familiarity with identity verification or fraud detection systems.
- GCP Professional Cloud Architect or DevOps Engineer certification.
- Experience running self-hosted observability stacks (Grafana, Prometheus, VictoriaMetrics, Loki, Elastic).
Posted in : DevOps / SRE.
Functional Area : Site Reliability Engineering.
Job Code : 1531634.