This role is responsible for managing and maintaining complex, distributed big data ecosystems. It ensures the reliability, scalability, and security of large-scale production infrastructure. Key responsibilities include automating processes, optimizing workflows, troubleshooting production issues, and driving system improvements across multiple business verticals.

Roles and Responsibilities:

- Manage, maintain, and support incremental changes to Linux/Unix environments.

- Lead on-call rotations and incident responses, conducting root cause analysis and driving postmortem processes.

- Design and implement automation systems for managing big data infrastructure, including provisioning, scaling, upgrades, and patching clusters.

- Troubleshoot and resolve complex production issues while identifying root causes and implementing mitigating strategies.

- Design and review scalable and reliable system architectures.

- Collaborate with teams to optimize overall system/cluster performance.

- Enforce security standards across systems and infrastructure.

- Set technical direction, drive standardization, and operate independently.

- Ensure availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning.

- Respond to, analyze, and resolve system outages and disruptions, and implement measures to prevent similar incidents from recurring.

- Develop tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience.

- Monitor and optimize system performance and resource usage, identify and address bottlenecks, and implement best practices for performance tuning.

- Collaborate with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle.

- Stay informed of industry technology trends and innovations, and actively contribute to the organization's technology communities.

- Develop and enforce SRE best practices and principles.

- Align across functional teams on priorities and deliverables.

- Drive automation to enhance operational efficiency.

- Adopt new technologies as the need arises and define architectural recommendations for new tech stacks.

Skills Required:

- Over 4 years of experience managing and maintaining distributed big data ecosystems.

- Strong expertise in Linux, including IP networking, iptables, and IPsec.

- Proficiency in scripting/programming with languages like Perl, Golang, or Python.

- Hands-on experience with the Hadoop stack and related ecosystem tools (HDFS, HBase, Airflow, YARN, Ranger, Kafka, Pinot).

- Familiarity with open-source configuration management and deployment tools such as Puppet, Salt, Chef, or Ansible.

- Solid understanding of networking, open-source technologies, and related tools.

- Excellent communication and collaboration skills.

- DevOps tools: SaltStack, Ansible, Docker, Git.

- SRE logging and monitoring tools: ELK Stack, Grafana, Prometheus, OpenTSDB, OpenTelemetry.

Good to Have:

- Experience managing infrastructure on public cloud platforms (AWS, Azure, GCP).

- Experience in designing and reviewing system architectures for scalability and reliability.

- Experience with observability tools to visualize and alert on system performance.

- Experience with petabyte-scale data migrations and large-scale upgrades.

Posted by: Careers, Recruiter at PhonePe Private Limited

Posted in: DevOps / SRE

Functional Area: Site Reliability Engineering

Job Code: 1588337
