PhonePe - System Engineer - Site Reliability

PhonePe Private Limited

Bangalore

5 - 7 Years

Skills: Site Reliability, DevOps, NoSQL, Docker, Kubernetes, MariaDB, Percona Server

Posted on: 24/08/2025

Job Description

Site Reliability Engineer (SRE) - System Engineer

About PhonePe Limited

PhonePe, headquartered in India, launched its flagship digital payments app in August 2016. As of April 2025, PhonePe serves over 600 million registered users and has a merchant acceptance network of more than 40 million. It processes 330+ million transactions daily, with an annualized Total Payment Value (TPV) of INR 150+ lakh crore, making it India's largest digital payments platform.

Beyond payments, PhonePe has expanded into financial services (Insurance, Lending, Wealth) and consumer tech businesses (Pincode - hyperlocal commerce, Indus AppStore - localized Android app marketplace), aligned with its vision to democratize access to money and services for every Indian.

Culture @ PhonePe

At PhonePe, we empower people to own their work end-to-end from Day 1. Our teams solve complex problems at scale, often building frameworks from scratch to support the world's largest fintech infrastructure. We value speed, trust, and impact, creating an environment where you can bring your best self every day. If you're passionate about building platforms that touch millions of lives and thrive in a culture of ownership, learning, and collaboration, PhonePe is the place for you.

Role Overview:

- We are seeking a Site Reliability Engineer (SRE) - System Engineer to maintain and scale one of the world's largest fintech infrastructures. You will be responsible for ensuring high availability, performance, and reliability of mission-critical systems, working at the intersection of Linux, cloud, networking, and automation.

- This role demands deep technical expertise, hands-on problem-solving, and the ability to operate at scale in a 24x7 always-available environment.

Key Responsibilities:

- Manage and optimize large-scale Linux/Unix infrastructure for maximum reliability and performance.

- Implement, monitor, and maintain always-up, always-available IT operations across distributed systems.

- Troubleshoot and resolve complex issues related to system performance, networking, and database availability.

- Collaborate with engineering teams to automate deployments, scaling, and monitoring using open-source and cloud tools.

- Maintain strong network security and performance using IP networking, iptables, and IPsec.

- Work with MySQL databases (MariaDB/Percona preferred) and ensure replication, clustering, and backup integrity.

- Develop automation solutions using Perl, Python, or Golang, and manage systems with tools like SaltStack.

- Participate in on-call rotation to ensure rapid response to production issues.

- Contribute to incident management, capacity planning, and disaster recovery strategies.

Required Skills & Experience:

- Proficiency in Linux/Unix administration, with 5+ years of hands-on experience.

- Strong understanding of computer networking, including protocols, firewalls, and VPNs.

- Hands-on experience with cloud services and private cloud environments.

- Working knowledge of MySQL (MariaDB/Percona) databases.

- Strong scripting/coding skills in Perl, Python, or Golang.

- Proven track record in automation, configuration management, and infrastructure scalability (SaltStack or similar).

- Excellent communication skills with the ability to work cross-functionally.

Desirable Skills:

- Experience with KVM/QEMU for Linux-based virtualization.

- Familiarity with containers and container orchestration (Docker, Kubernetes).

- Knowledge of NoSQL databases (Aerospike preferred).

- Hands-on experience with Galera Cluster for MySQL replication.

- Exposure to data center operations and infrastructure buildouts.