Job Description :

Zycus is looking for a Site Reliability Engineer (SRE) with deep expertise in Kubernetes, automation, and Linux systems.

The ideal candidate will have hands-on experience in deploying, administering, and optimizing large-scale production systems, with a strong focus on microservices architecture and on automation, performance, and reliability across our SaaS platform.

Roles And Responsibilities :

- System Reliability & Uptime : Ensure high availability, performance, and reliability of applications and infrastructure.

- Kubernetes & Cluster Management : Deploy, administer, and maintain Kubernetes clusters, managing scaling, upgrades, and troubleshooting.

- Microservices Management : Handle the deployment, monitoring, and scaling of microservices in distributed environments.

- Incident Management : Respond to production incidents, perform root cause analysis, and implement long-term fixes to prevent recurrence.

- Automation & Infrastructure as Code (IaC) : Automate repetitive tasks, infrastructure provisioning, and deployment workflows using tools like Ansible and Terraform.

- Monitoring & Observability : Implement and maintain monitoring tools (e.g., Prometheus, Grafana, Datadog) to track system health and application performance.

- Performance Optimization : Analyze system performance, identify bottlenecks, and optimize resources for better efficiency.

- Disaster Recovery & Backup : Design and implement backup and disaster recovery (DR) strategies for business continuity.

- Capacity Planning : Forecast infrastructure needs based on performance trends and business growth to ensure scalability.

- Security & Compliance : Ensure infrastructure and applications meet security standards and compliance requirements.

- Collaboration with Dev & Ops Teams : Work closely with development and operations teams to improve deployment pipelines, release processes, and system reliability.

- Documentation : Maintain clear and detailed documentation of systems, processes, and incident reports for knowledge sharing and compliance.

- Continuous Improvement : Identify opportunities for improving system architecture, deployment strategies, and automation workflows.

- Cloud Infrastructure Management : Manage cloud services (AWS, GCP, Azure) for resource optimization, cost management, and automation.

- On-Call Support : Participate in on-call rotations to handle urgent production issues and ensure rapid recovery.

Job Requirements :

Experience : 5 to 12 years.

Technical skills : as listed below.

Must Have :

Kubernetes Expertise :

- Hands-on experience with installing and provisioning Kubernetes clusters.

- Deep understanding of core Kubernetes components such as CRI, CNI, etcd, CoreDNS, and kube-proxy.

- Strong knowledge of Kubernetes internal networking, service discovery, and ingress management.

Kubernetes Distributions :

- Hands-on experience with different Kubernetes provisioners and distributions.

Kubernetes Cluster Administration :

- Experience in administering production Kubernetes clusters, including backup and disaster recovery (DR) strategies.

- Familiarity with cluster health monitoring and troubleshooting issues.

Monitoring Tools : Exposure to monitoring tools such as Prometheus, Grafana, Datadog, or AppDynamics.

Automation & Scripting :

- Strong programming skills in Python, Shell, or a similar language.

- Hands-on experience with Infrastructure-as-Code (IaC) tools such as Terraform or Ansible.

- Cloud automation experience, ideally with AWS or other major cloud platforms.

Operating Systems : Hands-on experience with Linux system administration.

Microservices : Experience with microservices architecture and managing more than 50 microservices simultaneously.

Good To Have Skills :

- Experience with OpenShift virtualization in production environments.

- Knowledge of AWS EKS, Rancher, or other Kubernetes distributions.

- CKA (Certified Kubernetes Administrator) certification or equivalent.

- Experience in performance tuning of RHEL, CentOS, and Ubuntu systems.

- Familiarity with DevSecOps practices, container security, and compliance frameworks.