PhonePe - Site Reliability Engineer - Big Data Technologies

PhonePe Private Limited
Bangalore
3 - 8 Years

Posted on: 10/12/2025

Job Description


About the Role :


This role is responsible for managing and maintaining complex, distributed big data ecosystems. It ensures the reliability, scalability, and security of large-scale production infrastructure. Key responsibilities include automating processes, optimizing workflows, troubleshooting production issues, and driving system improvements across multiple business verticals.


Roles and Responsibilities :


- Manage, maintain, and support incremental changes to Linux/Unix environments.


- Lead on-call rotations and incident responses, conducting root cause analysis and driving postmortem processes.


- Design and implement automation systems for managing big data infrastructure, including provisioning, scaling, upgrades, and patching clusters.


- Troubleshoot and resolve complex production issues, identifying root causes and implementing mitigation strategies.


- Design and review scalable and reliable system architectures.


- Collaborate with teams to optimize overall system/cluster performance.


- Enforce security standards across systems and infrastructure.


- Set technical direction, drive standardization, and operate independently.


- Ensure availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning.


- Respond to, analyze, and resolve system outages and disruptions, and implement measures to prevent similar incidents from recurring.


- Develop tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience.


- Monitor and optimize system performance and resource usage, identify and address bottlenecks, and implement best practices for performance tuning.


- Collaborate with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle.


- Stay informed of industry technology trends and innovations, and actively contribute to the organization's technology communities.


- Develop and enforce SRE best practices and principles.


- Align cross-functional teams on priorities and deliverables.


- Drive automation to enhance operational efficiency.


- Adopt new technologies as the need arises and define architectural recommendations for new tech stacks.


Skills Required :


- Over 4 years of experience managing and maintaining distributed big data ecosystems.


- Strong expertise in Linux, including IP networking, iptables, and IPsec.


- Proficiency in scripting/programming with languages like Perl, Golang, or Python.


- Hands-on experience with the Hadoop ecosystem and related big data tools (HDFS, HBase, Airflow, YARN, Ranger, Kafka, Pinot).


- Familiarity with open-source configuration management and deployment tools such as Puppet, Salt, Chef, or Ansible.


- Solid understanding of networking, open-source technologies, and related tools.


- Excellent communication and collaboration skills.


- DevOps tools : SaltStack, Ansible, Docker, Git.


- SRE logging and monitoring tools : ELK Stack, Grafana, Prometheus, OpenTSDB, OpenTelemetry.


Good to Have :


- Experience managing infrastructure on public cloud platforms (AWS, Azure, GCP).


- Experience in designing and reviewing system architectures for scalability and reliability.


- Experience with observability tools to visualize and alert on system performance.


- Experience with petabyte-scale data migrations and large-scale upgrades.

