Zoop.One - Site Reliability Engineer - DevOps

ZOOP
Pune
3 - 5 Years
3.9 (12+ Reviews)

Posted on: 18/08/2025

Job Description

Role : Site Reliability Engineer.

Location : Pune (on-site).

Experience : 3+ years.

We are looking for someone with experience setting up an in-house monitoring platform with a 99.99% uptime SLA using VictoriaMetrics and Prometheus across multiple regions.

Site Reliability Engineer at Zoop.

The Opportunity :

We're seeking a Senior Site Reliability Engineer to elevate and standardize our reliability engineering practices. This role offers the opportunity to shape and optimize SRE practices in a high-growth fintech environment while working with cutting-edge technologies and critical identity verification services.

Key Responsibilities :

Standardization & Optimization :

- Assess and standardize existing monitoring and observability practices across NewRelic and Prometheus.

- Refine and formalize SLIs/SLOs for all solution offerings.

- Optimize current alerting strategies to improve signal-to-noise ratio.

- Document and standardize incident management processes.

- Create comprehensive runbooks for all critical services.

Reliability Engineering :

- Drive improvements to achieve and maintain 99.95% uptime for critical services.

- Optimize API response times to strengthen our "Fastest Platform" positioning.

- Implement advanced chaos engineering practices.

- Enhance existing automation and self-healing capabilities.

- Standardize disaster recovery and business continuity procedures.

Infrastructure Excellence :

- Optimize our GCP/Kubernetes infrastructure (and AWS where applicable) for enhanced reliability.

- Standardize Infrastructure as Code (IaC) practices across teams.

- Identify and automate remaining manual operational tasks.

- Build advanced tooling for monitoring, deployment, and troubleshooting.

- Drive cloud cost optimization initiatives.

- Prepare for potential self-hosting scenarios, including operating Grafana, Prometheus, VictoriaMetrics, and log stacks such as Loki and Elastic.

Security & Compliance :

- Ensure all reliability practices meet ISO 27001:2022, ISO 27017:2015, ISO 27018:2019, ISO 27701:2019, and SOC 2 Type II requirements (with a pragmatic, risk-based approach).

- Enhance security monitoring and anomaly detection.

- Standardize secure CI/CD practices across the organization.

- Implement comprehensive audit and compliance reporting.

Collaboration & Process Improvement :

- Partner with the Platform team to enhance and standardize existing SRE workflows.

- Collaborate with 50+ developers to strengthen reliability culture.

- Lead blameless post-mortems and drive systematic improvements.

- Establish SRE best practices and knowledge-sharing sessions.

- Build a roadmap for eventual SRE team expansion.

Technical Requirements :

Must-Have Skills :

- Experience : 3+ years in SRE, DevOps, or similar roles with a focus on standardizing and scaling practices.

- Cloud Expertise : Deep hands-on experience with Google Cloud Platform (GCP) and Amazon Web Services (AWS).

- Container Orchestration : Advanced Kubernetes and Docker skills in production environments.

- Programming : Proficiency in at least two of Go, Python, and TypeScript, plus strong shell-scripting abilities.

- Operating Systems : Expert-level Linux knowledge and tuning.

- Monitoring : Expert-level knowledge of Prometheus and NewRelic.

- IaC : Strong experience with Terraform or similar tools.

- Process Excellence : Proven track record of standardizing SRE practices.

Preferred Qualifications :

- Experience in fintech, banking, or other high-security environments.

- Knowledge of ISO 27001, SOC 2, and related compliance requirements.

- Experience optimizing API reliability at scale (millions of requests/day).

- Background in maturing existing SRE practices.

- Familiarity with identity verification or fraud detection systems.

- GCP Professional Cloud Architect or DevOps Engineer certification.

- Experience running self-hosted observability stacks (Grafana, Prometheus, VictoriaMetrics, Loki, Elastic).

