We are seeking an experienced and dynamic Cloud DevOps Engineer to join our team. The ideal candidate will have a proven track record in deploying, managing, and optimizing large-scale Kubernetes clusters in public cloud environments such as AWS or GCP. This role is perfect for someone who thrives in a fast-paced, cloud-native environment and has a deep understanding of cloud infrastructure, Infrastructure as Code (IaC), automation, and observability tools.

As a Cloud DevOps Engineer, you will be responsible for building, deploying, and maintaining robust infrastructure solutions in the cloud while leveraging automation to ensure scalability, reliability, and performance of our systems.

Key Responsibilities :

Kubernetes Cluster Management :

- Deploy, manage, and monitor large-scale Kubernetes clusters on public cloud platforms (AWS or GCP).

- Implement and maintain Kubernetes-based containerized applications, ensuring high availability, performance, and security.

Cloud Infrastructure Management :

- Architect, deploy, and maintain cloud infrastructure using AWS or GCP.

- Ensure infrastructure follows best practices for scalability, availability, and fault tolerance.

- AWS or GCP Solution Architect certification is a plus.

Infrastructure as Code (IaC) & Automation :

- Develop and manage infrastructure automation using Terraform (Terraform certification expected) to ensure repeatable, versioned infrastructure deployments.

- Implement automation using Python for managing and orchestrating infrastructure and services.

- Ensure consistency and standardization across the cloud infrastructure using CI/CD pipelines.

Monitoring and Observability :

- Design and implement monitoring solutions using tools like Prometheus, Grafana, Splunk, and the ELK stack.

- Set up and manage alerting, logging, and dashboarding to ensure high visibility into system health and performance.

- Implement OpenTelemetry (OTel) for observability and performance monitoring.

Distributed Systems & Clusters :

- Manage and optimize distributed systems and databases, including Cassandra, Kafka, Elasticsearch, MongoDB, ZooKeeper, Redis, etc.

- Ensure data integrity, replication, and fault tolerance for large-scale clusters.

CI/CD & Pipeline-as-Code :

- Implement and maintain CI/CD pipelines using tools like Jenkins, Spinnaker, GitLab, Argo, Artifactory, Helm, and Ansible.

- Automate deployment workflows and streamline the integration of code into production environments.

Certification & Expertise :

- Apache Kafka and/or Cassandra Administrator certification is highly desired.

- Apply SRE (Site Reliability Engineering) best practices to improve system reliability, availability, and performance.

- Hands-on experience with SignalFx for monitoring and distributed tracing is a plus.

Required Skills & Qualifications :

- Cloud Platforms : Extensive experience in managing infrastructure and services in AWS or GCP.

- Kubernetes : Strong expertise in deploying, scaling, and managing Kubernetes clusters in cloud environments.

- Infrastructure-as-Code (IaC) : Proficient in Terraform and Python for automating and managing infrastructure.

- Monitoring & Observability : Hands-on experience with Prometheus, Grafana, Splunk, and the ELK stack for monitoring, alerting, and logging.

- Distributed Systems : Expertise in managing distributed clusters such as Cassandra, Kafka, MongoDB, Elasticsearch, ZooKeeper, Redis, etc.

- CI/CD Tools : Knowledge of CI/CD tools and frameworks, including Jenkins, GitLab, Argo, Artifactory, Helm, and Ansible.

- Certifications : AWS/GCP Solution Architect certification and Apache Kafka/Cassandra Administrator certification are preferred.

- OpenTelemetry (OTel) : Experience with observability tools, especially OTel and SignalFx, to track system performance.

- SRE Practices : Experience in applying Site Reliability Engineering (SRE) principles to ensure high availability and performance.

Desirable Skills :

- Familiarity with Helm charts for Kubernetes.

- Experience working in Agile, CI/CD-driven development environments.

- Ability to troubleshoot complex systems across multiple layers, including networking, application, and infrastructure.

Posted in: DevOps / SRE

Functional Area: DevOps / Cloud

Job Code: 1540123
