Job Description

About the Role :

We are looking for a versatile and hands-on Senior Site Reliability Engineer (SRE) to join our engineering team. This role combines responsibilities across DevOps, SRE, and Application Support. You will play a critical role in improving system reliability, automation, observability, and incident response across our infrastructure and application stack.

You will work closely with software engineers, infrastructure teams, and product owners to ensure our systems are resilient, observable, scalable, and secure.

Key Responsibilities :

- Design, implement, and maintain CI/CD pipelines using GitHub Actions and GitOps tools (FluxCD/ArgoCD).

- Administer and operate Kubernetes clusters (AKS preferred) and manage deployments with Helm/Kustomize.

- Define and monitor SLAs, SLOs, and SLIs to drive system reliability and performance improvements.

- Own and improve observability using Grafana, Prometheus, Loki, Tempo, Mimir, and related tools.

- Set up and manage infrastructure as code with Terraform or Terragrunt across cloud environments (Azure focus).

- Participate in incident response, perform root cause analysis, and implement permanent resolutions.

- Collaborate with application developers to debug and tune microservices (Node.js, Go, or similar).

- Perform release readiness checks, deployment validations, and post-release monitoring.

- Automate routine operations and build tools to improve team productivity and reliability posture.

Required Skills & Experience :

- 4-6 years of experience in SRE, DevOps, or production support roles, preferably in cloud-native environments.

- Proven hands-on experience with Microsoft Azure cloud services.

- Proficiency in managing Kubernetes clusters (AKS preferred) and using Helm or Kustomize for resource deployment.

- Strong knowledge of CI/CD pipelines, particularly using GitHub Actions.

- Deep understanding of GitOps practices using FluxCD or ArgoCD.

- Experience managing and provisioning infrastructure with Terraform or Terragrunt.

- Familiarity with observability stacks :

a. Monitoring : Prometheus, Mimir

b. Logging : Loki

c. Tracing : Tempo

d. Dashboards : Grafana

- Ability to troubleshoot and debug issues across microservices-based architectures, especially in Node.js or Go ecosystems.

- Excellent analytical and incident management skills with a proactive mindset.

Nice to Have :

- Exposure to other clouds like AWS or GCP.

- Knowledge of service mesh (Istio/Linkerd), security best practices, and zero-downtime deployments.

- Experience with on-call rotations and working in 24x7 environments.

- Familiarity with feature flagging, canary deployments, and chaos engineering tools.

