Posted on: 21/01/2026
Description:
We are seeking a highly motivated and experienced senior site reliability engineer (SRE) to join our engineering team. As an SRE, you will be responsible for the reliability, availability, scalability, and performance of our applications and infrastructure. You will collaborate closely with software developers, platform engineers, and other team members to design, provision, build, and maintain systems that are scalable, secure, and highly available.
Responsibilities:
- Architect and lead the design of scalable, reliable infrastructure solutions.
- Implement strategies for high availability, scalability, and low-latency performance.
- Define service-level objectives (SLOs) and service-level indicators (SLIs) to track performance and reliability.
- Drive incident management by identifying root causes and providing long-term solutions.
- Mentor junior engineers and foster a collaborative, learning-focused environment.
- Design advanced monitoring and alerting systems for proactive system management.
- Architect and optimize network topologies (hybrid cloud, multi-cloud, and on-prem) to support ultra-low-latency trading and compliance-driven workloads.
- Configure and manage cloud and on-prem networking components (VPCs, Shared VPCs, Private Service Connect, Cloud NAT, and Global Load Balancers) for secure and compliant transaction flows.
- Implement secure connectivity solutions (VPNs, Interconnect, Direct Connect, and service meshes) to meet fintech regulatory requirements and standards.
- Develop and maintain DNS, load-balancing, and traffic-routing strategies to ensure millisecond-level latency for real-time transactions.
- Evolve Infrastructure as Code (IaC) practices and principles to automate infrastructure provisioning.
- Collaborate on reliability roadmaps, performance benchmarks, and disaster recovery plans tailored for low-latency and high-throughput workloads.
- Manage Kubernetes clusters at scale, integrating service meshes like Istio or Linkerd.
- Implement chaos engineering principles to strengthen system resilience.
- Influence technical direction, reliability culture, and organizational strategies.
Requirements:
- 6 to 9 years of experience in SRE, DevOps, or system architecture roles with large-scale production systems.
- Extensive experience managing and scaling high-traffic, low-latency fintech systems, ensuring reliability, compliance, and secure transaction processing.
- Proven expertise in the networking stack, with hands-on experience in BGP, OSPF, DNS, HTTP(S), TCP/IP, MPLS, and VPN protocols.
- Advanced knowledge of GCP networking (VPC design, Shared VPC, Private Service Connect, Global Load Balancers, Cloud DNS, Cloud NAT, Network Intelligence Center, and Service Mesh).
- Strong background in managing complex multi-cloud environments (AWS, GCP, Azure) with a focus on secure and compliant architectures in regulated industries.
- Hands-on expertise in Terraform and Infrastructure-as-Code (IaC) for repeatable, automated deployments.
- Expertise in Kubernetes, container orchestration, and microservices, with production experience in regulated fintech environments.
- Advanced programming and scripting skills in Python, Go, or Java, applied to automation, risk reduction, and financial system resilience.
- Proficiency with monitoring and logging tools (Prometheus, Mimir, Grafana, Loki) to ensure real-time visibility into trading, payments, and transaction flows.
- Strong understanding of networking, load balancing, and DNS management across multi-cloud and hybrid infrastructures.
- Experience implementing end-to-end observability solutions (metrics, logs, and traces) to monitor and optimize transaction throughput while adhering to latency SLAs.
- Leadership skills with experience mentoring teams, fostering a culture of reliability, and partnering with cross-functional stakeholders in product teams.
- Strong communication, critical thinking, and incident management abilities, especially in high-stakes production incidents involving customer transactions.
- Bachelor's or master's degree in computer science or engineering, or equivalent experience.
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1604412