
Job Description

About the Role:


We are seeking an experienced Site Reliability Engineer (SRE) with expertise in the Apache stack, Kubernetes, and the Elastic ecosystem. The ideal candidate will design, deploy, manage, and optimize scalable, reliable infrastructure, with a focus on Elastic Cloud on Kubernetes (ECK) and observability systems.

This role blends DevOps, infrastructure management, and observability engineering. You'll support critical services and applications while ensuring uptime, reliability, and performance through automation, monitoring, and proactive incident response.


Key Responsibilities:


- Deploy, configure, and manage Elastic Cloud on Kubernetes (ECK) clusters across environments.

- Handle cluster upgrades, scaling, high availability, and disaster recovery for Elasticsearch, Logstash, Kibana, and Beats.

- Implement infrastructure as code using tools such as Helm, Terraform, or Ansible.

- Design and manage end-to-end observability stacks using Elastic Stack and integrations with Prometheus, Grafana, etc.

- Manage and fine-tune log ingestion pipelines using Logstash, Beats, and Elastic Agent.

- Build and maintain dashboards, alerts, and automated responses to incidents.

- Implement and maintain CI/CD pipelines (e.g., Jenkins, GitHub Actions) for deploying infrastructure and applications.

- Automate system maintenance tasks to improve reliability and reduce manual effort.


- Collaborate on incident management, postmortems, and continuous improvements to system reliability.

- Design and implement data ingestion pipelines for real-time and batch processing into the Elastic stack.

- Work closely with Kafka to ingest and transform log/event data streams.

- Develop resilient integrations with Apache-based applications and other open-source systems.

- Ensure secure configuration and access control for Elastic components.

- Configure and manage Kubernetes networking elements (e.g., load balancers, firewalls, VPCs).

- Implement monitoring for compliance, performance, and auditability.


Required Skills and Qualifications:


- 6-10 years of experience in Site Reliability Engineering, DevOps, or Infrastructure roles.

- Proven experience managing Elastic Stack: Elasticsearch, Logstash, Kibana, Beats, Fleet.

- Hands-on experience with Elastic Cloud on Kubernetes (ECK) deployment, scaling, and tuning.

- Solid experience in Kubernetes, Docker, and container orchestration.

- Experience with Kafka integration for real-time data ingestion.

- Proficiency with Linux systems administration, shell scripting, and performance tuning.

- Strong understanding of networking concepts (DNS, load balancing, firewall rules, VPC).

- Familiarity with monitoring tools such as Prometheus and Grafana.

- Knowledge of CI/CD pipelines, automation, and version control systems (e.g., Jenkins, Git).

- Ability to debug complex distributed systems and troubleshoot performance bottlenecks.

