Posted on: 17/07/2025
Role Overview:
We are seeking a highly capable Infrastructure as Code (IaC) Engineer to lead the design, implementation, and management of automated infrastructure provisioning for high-performance AI data centers.
This role is central to orchestrating compute, network, storage, and virtualization layers using modern IaC tools across on-premises and hybrid cloud environments.
The ideal candidate will play a strategic role in enabling scalable and repeatable deployment pipelines that support GPU clusters, AI model training environments, and containerized platforms such as Kubernetes and OpenShift.
Key Responsibilities:
- Orchestrate and manage multi-layer automation across compute (GPU/CPU), networking (VXLAN, EVPN, BGP), storage (NVMe, object, parallel file systems), and virtualization platforms (KVM, VMware, OpenShift).
- Develop reusable Terraform modules, Ansible playbooks, and YAML templates to define infrastructure in version-controlled environments.
- Automate deployment of Kubernetes clusters and integrate with GPU operators for training and inference pipelines.
- Build and maintain CI/CD pipelines to deploy, test, and manage infrastructure changes using tools like GitLab CI/CD, Jenkins, or ArgoCD.
- Integrate with monitoring and observability stacks (Prometheus, Grafana, DCGM) for automated infrastructure validation and health monitoring.
- Work closely with AI/ML platform teams to align infrastructure deployment with model training, data pipelines, and security policies.
- Ensure compliance with security and operational standards through policy-as-code and drift detection mechanisms.
Required Skills & Experience:
- Proficiency in Terraform and Ansible, in scripting languages such as Python and Bash, and in YAML-based configuration.
- Experience automating infrastructure in GPU-intensive environments supporting AI/ML workloads.
- Strong understanding of networking (VXLAN, EVPN, BGP, RoCE) and virtualization platforms (OpenShift, VMware, KVM).
- Familiarity with Kubernetes, Helm, Operators, and container orchestration frameworks.
- Exposure to storage automation for AI data lakes (e.g., Ceph, BeeGFS, Lustre, or S3-compatible storage).
- Experience with CI/CD tools (GitLab CI/CD, Jenkins, ArgoCD, Flux) in IaC pipelines.
Preferred Certifications:
- Red Hat Certified Specialist in Ansible Automation
- CKA (Certified Kubernetes Administrator) or equivalent
- Cloud certifications (AWS, Azure, or GCP preferred for hybrid orchestration)
Posted in: DevOps / SRE
Functional Area: DevOps / Cloud
Job Code: 1514029