Job Description

About the job :

Come work at a place where innovation and teamwork come together to support the most exciting missions in the world!

As a Senior Site Reliability Engineer (SRE), you will be responsible for the reliability, scalability, and observability of our DevOps ecosystem.

This includes CI/CD systems, Kubernetes clusters, infrastructure automation, and telemetry platforms.
You will work closely with development, QA, and operations teams to build resilient systems and ensure continuous improvement of reliability standards.

Key Responsibilities :

- Own and manage DevOps components and tooling across 100+ production environments.

- Administer, scale, and optimize Kubernetes clusters used for application and infrastructure workloads.

- Implement and maintain observability stacks including Prometheus, OpenTelemetry (OTel), Elasticsearch, and ClickHouse for metrics, tracing, and log analytics.

- Ensure high availability of CI/CD pipelines and automate infrastructure provisioning using Terraform and Ansible.

- Build alerting, monitoring, and dashboarding systems to proactively detect and resolve issues.

- Lead root cause analysis for incidents and drive long-term stability improvements.

- Collaborate with engineering teams to design systems that are reliable, secure, and observable by default.

- Participate in on-call rotations and lead incident response efforts when necessary.

- Advise the cloud platform team on improving the reliability of production systems and scaling them as needed.

- Participate in the development process by supporting new features, services, and releases, and maintain an ownership mindset for the cloud platform technologies.

- Expertise in one of the following programming languages: Java, Python, or Go.

- Proficient in writing bash scripts.

- Good understanding of SQL and NoSQL systems.

- Good understanding of systems programming (network stack, file system, OS services).

- Strong hands-on experience with Ansible.

- Ability to automate day-to-day activities.

Required Skills & Experience :

- 5+ years of experience in SRE, DevOps, or Infrastructure Engineering roles.

- Expertise in Kubernetes: deployment, scaling, troubleshooting, and operations in production.

- Strong Linux systems background and scripting skills (Python, Bash, or Go).

- Hands-on experience with CI/CD tools such as Jenkins, GitLab CI, or similar.

- Infrastructure-as-Code skills with tools like Terraform, Ansible, or equivalent.

- Solid knowledge of observability tools, including:

  - Prometheus for monitoring and alerting

  - OpenTelemetry (OTel) for tracing and telemetry

  - Elasticsearch and ClickHouse for log storage and analytics

  - AppDynamics

- Experience with containerization (Docker) and orchestration at scale.

- Familiarity with cloud platforms (AWS, GCP, or Azure) and hybrid-cloud architecture.

- Ability to debug and tune system performance under production load.

