Job Description :

We are seeking a highly skilled Principal Site Reliability Engineer to join our team.

The ideal candidate will have a Bachelor's or Master's degree in Computer Science, Information Technology, or a related field (or equivalent experience) and 15+ years of experience in DevOps, Infrastructure, or Site Reliability Engineering roles.

Additionally, the candidate should have 4+ years in a senior or principal-level capacity driving SRE or reliability automation initiatives, and a proven track record of designing and scaling large, distributed, cloud-native platforms.

Telecom domain experience is a plus.

Skills :

- Deep expertise in AWS (EKS, EC2, RDS, IAM, VPC, Kafka, CloudWatch, API Gateway, Lambda, WAF, KMS) and Kubernetes container orchestration on EKS.

- Deep expertise in Helm charts.

- Hands-on experience with APM tools (Elastic APM preferred).

- Expertise in Terraform, Jenkins, Bitbucket, and Python/Bash/Go scripting for automation.

- Strong understanding of SLO/SLI frameworks, error budgets, and observability design.

- Familiarity with AIOps, chaos engineering, and event-driven automation.

- Proven experience in performance optimization, capacity planning, and resilience testing.

- Excellent documentation and system design communication skills.

Accreditation/certifications/licenses :

- AWS Certified Solutions Architect - Professional or AWS Certified DevOps Engineer - Professional.

- Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD).

Preferred :

- SRE Foundation / Google SRE / Dynatrace Performance Professional / Elastic Certified Engineer.