Posted on: 07/09/2025
About the job:
Come work at a place where innovation and teamwork come together to support the most exciting missions in the world!
As a Senior Site Reliability Engineer (SRE), you will be responsible for the reliability, scalability, and observability of our DevOps ecosystem.
This includes CI/CD systems, Kubernetes clusters, infrastructure automation, and telemetry platforms.
You will work closely with development, QA, and operations teams to build resilient systems and ensure continuous improvement of reliability standards.
Key Responsibilities:
- Own and manage DevOps components and tooling across 100+ production environments.
- Administer, scale, and optimize Kubernetes clusters used for application and infrastructure workloads.
- Implement and maintain observability stacks including Prometheus, OpenTelemetry (OTel), Elasticsearch, and ClickHouse for metrics, tracing, and log analytics.
- Ensure high availability of CI/CD pipelines and automate infrastructure provisioning using Terraform and Ansible.
- Build alerting, monitoring, and dashboarding systems to proactively detect and resolve issues.
- Lead root cause analysis for incidents and drive long-term stability improvements.
- Collaborate with engineering teams to design systems that are reliable, secure, and observable by default.
- Participate in on-call rotations and lead incident response efforts when necessary.
- Advise the cloud platform team on improving the reliability of production systems and scaling them as needed.
- Participate in the development process by supporting new features, services, and releases, and maintain an ownership mindset for the cloud platform technologies.
- Expertise in at least one of the following programming languages: Java, Python, or Go.
- Proficient in writing bash scripts.
- Good understanding of SQL and NoSQL systems.
- Good understanding of systems programming (network stack, file system, OS services).
- Strong hands-on experience with Ansible.
- Ability to automate day-to-day activities.
Required Skills & Experience:
- 5+ years of experience in SRE, DevOps, or Infrastructure Engineering roles.
- Expertise in Kubernetes: deployment, scaling, troubleshooting, and operations in production.
- Strong Linux systems background and scripting skills (Python, Bash, or Go).
- Hands-on experience with CI/CD tools such as Jenkins, GitLab CI, or similar.
- Infrastructure-as-Code skills with tools like Terraform, Ansible, or equivalent.
- Solid knowledge of observability tools, including:
  - Prometheus for monitoring and alerting
  - OpenTelemetry (OTel) for tracing and telemetry
  - Elasticsearch and ClickHouse for log storage and analytics
  - AppDynamics
- Experience with containerization (Docker) and orchestration at scale.
- Familiarity with cloud platforms (AWS, GCP, or Azure) and hybrid-cloud architecture.
- Ability to debug and tune system performance under production load.
Posted in: DevOps / SRE
Functional Area: DevOps / Cloud
Job Code: 1541958