Job Description

Role Overview :

We are seeking a highly capable Infrastructure as Code (IaC) Engineer to lead the design, implementation, and management of automated infrastructure provisioning for high-performance AI data centers.
This role is central to orchestrating compute, network, storage, and virtualization layers using modern IaC tools across on-premises and hybrid cloud environments.
The ideal candidate will play a strategic role in enabling scalable and repeatable deployment pipelines that support GPU clusters, AI model training environments, and containerized platforms such as Kubernetes and OpenShift.


Key Responsibilities :


- Design and implement IaC frameworks to automate the provisioning and configuration of data center infrastructure for AI workloads.

- Orchestrate and manage multi-layer automation across compute (GPU/CPU), networking (VXLAN, EVPN, BGP), storage (NVMe, object, parallel file systems), and virtualization platforms (KVM, VMware, OpenShift).

- Develop reusable Terraform modules, Ansible playbooks, and YAML templates to define infrastructure in version-controlled environments.

- Automate deployment of Kubernetes clusters and integrate with GPU operators for training and inference pipelines.

- Build and maintain CI/CD pipelines to deploy, test, and manage infrastructure changes using tools like GitLab CI/CD, Jenkins, or ArgoCD.

- Integrate with monitoring and observability stacks (Prometheus, Grafana, DCGM) for automated infrastructure validation and health monitoring.

- Work closely with AI/ML platform teams to align infrastructure deployment with model training, data pipelines, and security policies.


- Ensure compliance with security and operational standards through policy-as-code and drift detection mechanisms.


Required Skills & Experience :


- 5+ years of experience in infrastructure automation or SRE roles with hands-on IaC deployment.

- Proficiency in Terraform and Ansible, scripting languages such as Python and Bash, and YAML-based configuration.

- Experience automating infrastructure in GPU-intensive environments supporting AI/ML workloads.

- Strong understanding of networking (VXLAN, EVPN, BGP, RoCE) and virtualization platforms (OpenShift, VMware, KVM).

- Familiarity with Kubernetes, Helm, Operators, and container orchestration frameworks.

- Exposure to storage automation for AI data lakes (e.g., Ceph, BeeGFS, Lustre, or S3-compatible storage).

- Experience with CI/CD tools (GitLab CI/CD, Jenkins, ArgoCD, Flux) in IaC pipelines.


Preferred Certifications :


- HashiCorp Certified: Terraform Associate

- Red Hat Certified Specialist in Ansible Automation

- CKA (Certified Kubernetes Administrator) or equivalent

- Cloud certification from AWS, Azure, or GCP (preferred for hybrid orchestration)

