Posted on: 08/07/2025
About the Role:
We are seeking an experienced Site Reliability Engineer (SRE) with expertise in the Apache stack, Kubernetes, and the Elastic ecosystem. The ideal candidate will design, deploy, manage, and optimize scalable, reliable infrastructure, with a focus on Elastic Cloud on Kubernetes (ECK) and observability systems.
This role blends DevOps, infrastructure management, and observability engineering. You'll support critical services and applications while ensuring uptime, reliability, and performance through automation, monitoring, and proactive incident response.
Key Responsibilities:
- Handle cluster upgrades, scaling, high availability, and disaster recovery for Elasticsearch, Logstash, Kibana, and Beats.
- Implement infrastructure as code using tools such as Helm, Terraform, or Ansible.
- Design and manage end-to-end observability stacks using the Elastic Stack, integrated with tools such as Prometheus and Grafana.
- Manage and fine-tune log ingestion pipelines using Logstash, Beats, and Elastic Agent.
- Build and maintain dashboards, alerts, and automated responses to incidents.
- Implement and maintain CI/CD pipelines (e.g., Jenkins, GitHub Actions) for deploying infrastructure and applications.
- Automate system maintenance tasks to improve reliability and reduce manual effort.
- Collaborate on incident management, postmortems, and continuous improvements to system reliability.
- Design and implement data ingestion pipelines for real-time and batch processing into the Elastic Stack.
- Work closely with Kafka to ingest and transform log/event data streams.
- Develop resilient integrations with Apache-based applications and other open-source systems.
- Ensure secure configuration and access control for Elastic components.
- Configure and manage Kubernetes networking elements (e.g., load balancers, firewalls, VPCs).
- Implement monitoring for compliance, performance, and auditability.
Required Skills and Qualifications:
- Proven experience managing the Elastic Stack: Elasticsearch, Logstash, Kibana, Beats, and Fleet.
- Hands-on experience with Elastic Cloud on Kubernetes (ECK) deployment, scaling, and tuning.
- Solid experience in Kubernetes, Docker, and container orchestration.
- Experience with Kafka integration for real-time data ingestion.
- Proficiency with Linux systems administration, shell scripting, and performance tuning.
- Strong understanding of networking concepts (DNS, load balancing, firewall rules, VPC).
- Familiarity with monitoring tools such as Prometheus and Grafana.
- Knowledge of CI/CD pipelines, automation, and version control systems (e.g., Jenkins, Git).
- Ability to debug complex distributed systems and troubleshoot performance bottlenecks.
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1509710