Posted on: 04/09/2025
Confidential Job Posting
This role is from a verified company that prefers not to disclose its name at this stage.
Job Description :
We are seeking an experienced and dynamic Cloud DevOps Engineer to join our team. The ideal candidate will have a proven track record in deploying, managing, and optimizing large-scale Kubernetes clusters in public cloud environments such as AWS or GCP. This role is perfect for someone who thrives in a fast-paced, cloud-native environment and has a deep understanding of cloud infrastructure, Infrastructure as Code (IaC), automation, and observability tools.
As a Cloud DevOps Engineer, you will be responsible for building, deploying, and maintaining robust infrastructure solutions in the cloud while leveraging automation to ensure scalability, reliability, and performance of our systems.
Key Responsibilities :
Kubernetes Cluster Management :
- Deploy, manage, and monitor large-scale Kubernetes clusters on public cloud platforms (AWS or GCP).
- Implement and maintain Kubernetes-based containerized applications, ensuring high availability, performance, and security.
Cloud Infrastructure Management :
- Architect, deploy, and maintain cloud infrastructure using AWS or GCP.
- Ensure infrastructure follows best practices for scalability, availability, and fault tolerance.
- AWS or GCP Solution Architect certification is a plus.
Infrastructure as Code (IaC) & Automation :
- Develop and manage infrastructure automation using Terraform (Terraform certification preferred) to ensure repeatable, versioned infrastructure deployments.
- Implement automation using Python for managing and orchestrating infrastructure and services.
- Ensure consistency and standardization across the cloud infrastructure using CI/CD pipelines.
Monitoring and Observability :
- Design and implement monitoring solutions using tools like Prometheus, Grafana, Splunk, and the ELK stack.
- Set up and manage alerting, logging, and dashboarding to ensure high visibility into system health and performance.
- Implement OpenTelemetry (OTel) for observability and performance monitoring.
Distributed Systems & Clusters :
- Manage and optimize distributed systems and databases, including Cassandra, Kafka, Elasticsearch, MongoDB, ZooKeeper, Redis, etc.
- Ensure data integrity, replication, and fault tolerance for large-scale clusters.
CI/CD & Pipeline-as-Code :
- Implement and maintain CI/CD pipelines using tools like Jenkins, Spinnaker, GitLab, Argo, Artifactory, Helm, and Ansible.
- Automate deployment workflows and streamline the integration of code into production environments.
Certification & Expertise :
- Apache Kafka and/or Cassandra Administrator certification is highly desired.
- Apply SRE (Site Reliability Engineering) best practices to improve system reliability, availability, and performance.
- Hands-on experience with SignalFx for monitoring and distributed tracing is a plus.
Required Skills & Qualifications :
- Cloud Platforms : Extensive experience in managing infrastructure and services in AWS or GCP.
- Kubernetes : Strong expertise in deploying, scaling, and managing Kubernetes clusters in cloud environments.
- Infrastructure as Code (IaC) : Proficient in Terraform and Python for automating and managing infrastructure.
- Monitoring & Observability : Hands-on experience with Prometheus, Grafana, Splunk, and the ELK stack for monitoring, alerting, and logging.
- Distributed Systems : Expertise in managing distributed clusters such as Cassandra, Kafka, MongoDB, Elasticsearch, ZooKeeper, Redis, etc.
- CI/CD Tools : Knowledge of CI/CD tools and frameworks, including Jenkins, GitLab, Argo, Artifactory, Helm, and Ansible.
- Certifications : AWS/GCP Solution Architect certification and Apache Kafka/Cassandra Administrator certifications are preferred.
- OpenTelemetry (OTel) : Experience with observability tools, especially OTel and SignalFx, to track system performance.
- SRE Practices : Experience in applying Site Reliability Engineering (SRE) principles to ensure high availability and performance.
Desirable Skills :
- Familiarity with Helm charts for Kubernetes.
- Experience working in Agile, CI/CD-driven development environments.
- Ability to troubleshoot complex systems across multiple layers, including networking, application, and infrastructure.
Posted in: DevOps / SRE
Functional Area: DevOps / Cloud
Job Code: 1540123