Description :

Role Overview :

We are seeking an experienced Site Reliability Engineer (SRE) with strong hands-on expertise in Kubernetes, Python, and Linux.

The ideal candidate will be responsible for ensuring the reliability, scalability, security, and performance of distributed systems and production workloads.

Mandatory requirement : Candidates must have a minimum of 4 years of experience in Python, Kubernetes, and SRE.

Profiles that do not meet this requirement must be rejected.

Key Responsibilities :

- Design, build, and maintain scalable and reliable production systems using SRE principles.

- Deploy, manage, and optimize workloads on Kubernetes clusters (networking, storage, deployments, scaling, troubleshooting).

- Develop Python automation scripts/tools to improve system efficiency, observability, and reliability.

- Implement CI/CD practices, system monitoring, disaster recovery, and incident management processes.

- Perform root cause analysis (RCA) and ensure that post-incident reviews and preventive actions are carried out.

- Work with cross-functional teams to drive automation and reduce manual intervention.

- Improve system reliability through performance tuning, capacity planning, and automated alerts.

- Build and maintain Linux-based production environments.

Essential Skills :

- Kubernetes : Networking, storage, deployments, cluster operations, troubleshooting

- Python : Strong scripting and automation experience (minimum 4 years)

- Linux : Administration, configuration, system performance & debugging

- SRE experience : On-call handling, RCA, reliability engineering, performance, scalability

Good to Have Skills :

- Logging & monitoring tools such as Grafana, Loki, Dynatrace

- Experience with containerization tools (Docker)

- Exposure to cloud platforms (AWS / GCP / Azure)

Soft Skills :

- Excellent analytical and debugging skills

- Strong communication & documentation ability

- Ownership mindset with a focus on continuous improvement