Posted on: 02/12/2025
Description:
Azure Cloud Architect (L3)
Experience: 8-16 Years
Location: Noida (Onsite)
Notice Period: Immediate Only
Job Description:
- Cloud Architect with experience in Microsoft Azure infrastructure and AI operations, delivering secure, scalable, and cost-efficient cloud solutions.
- Skilled in managing Azure infrastructure to support business teams in leveraging cloud and AI services efficiently, ensuring security, compliance, and governance best practices.
- Proficient in provisioning and managing Azure Virtual Machines, containerized workloads (AKS, Azure Container Apps, Container Registry), SQL databases, Storage accounts, Key Vault, IAM, and Application Gateway and Load Balancer configurations.
- Experienced in automation using PowerShell and Azure CLI to streamline user management, operational tasks, and governance processes, and in implementing FinOps practices for cost monitoring, optimization, and budget governance.
- Skilled in Azure Resource Graph (ARG) to gain deep insights into resources, usage patterns, and compliance.
- Experienced in log management, monitoring, and alert configuration to ensure operational reliability, proactive issue resolution, and transparency.
- Collaborates effectively with cross-functional teams to troubleshoot complex infrastructure issues and enforce security best practices.
- Design and implement scalable AI/ML compute infrastructure (GPU clusters, HPC, distributed training systems).
- Build cloud/on-prem hybrid architectures using AWS, Azure, GCP, or on-prem GPU farms.
- Implement MLOps and CI/CD pipelines tailored for model training, deployment, and lifecycle management.
- Optimize GPU/TPU/AI accelerator utilization, scaling, auto-scheduling, and multi-tenant resource allocation.
- Implement observability across compute, storage, networking, and model performance (Prometheus/Grafana/ELK).
- Ensure high availability, cost efficiency, and performance tuning for large-scale AI/LLM workloads.
- Manage and optimize ML platforms and orchestrators such as Kubernetes, Kubeflow, MLflow, Ray, Airflow, Databricks, or Vertex AI.
- Deploy, configure, and maintain containerized ML workloads using Docker/K8s.
- Implement model serving APIs using technologies such as Triton Inference Server, TorchServe, or custom microservices.
- Build and maintain scalable data pipelines for training and inference.
- Implement high-speed storage systems for model training (e.g., NFS, Lustre, Ceph, object storage).
- Manage feature stores, vector databases, and embeddings infrastructure.
- Implement security controls for AI environments (RBAC, IAM, network isolation).
- Ensure compliance with data governance, lineage, privacy, and policy standards.
- Manage secure access for research teams and automated agents.
- Automate provisioning using IaC (Terraform, Pulumi, Ansible, CloudFormation).
- Develop automation for GPU cluster lifecycle, upgrades, patching, and inference workloads.
- Support reproducible model training through containerization and versioning.
- Deep understanding of GPU systems: CUDA, NCCL, MIG, and multi-node training.
- Experience with Kubernetes, Docker, and distributed compute frameworks.
- Hands-on exposure to MLOps tools and ML lifecycle workflows.
- Proficiency with Python, Bash, automation, DevOps pipelines, and IaC.
- Experience with observability, monitoring, and logging frameworks.
- Understanding of data engineering fundamentals.
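For candidates preparing for this role, the multi-tenant GPU resource allocation mentioned above can be sketched as a simple greedy best-fit placement policy in Python. This is a minimal illustrative sketch, not the implementation of any scheduler named in this posting; the `Node` class, `schedule` function, and the sample node/job data are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Node:
    """A GPU node in the cluster (hypothetical model for illustration)."""
    name: str
    total_gpus: int
    used_gpus: int = 0

    @property
    def free_gpus(self) -> int:
        return self.total_gpus - self.used_gpus

def schedule(jobs, nodes):
    """Greedy best-fit placement: put each job on the candidate node
    with the fewest free GPUs that still fits it, which reduces
    fragmentation across tenants. Jobs with no fitting node map to None
    (pending)."""
    placement = {}
    for job_name, gpus_needed in jobs:
        candidates = [n for n in nodes if n.free_gpus >= gpus_needed]
        if not candidates:
            placement[job_name] = None  # no capacity: job stays queued
            continue
        node = min(candidates, key=lambda n: n.free_gpus)
        node.used_gpus += gpus_needed
        placement[job_name] = node.name
    return placement

# Example usage with hypothetical data:
# nodes = [Node("gpu-a", 8), Node("gpu-b", 4)]
# schedule([("train", 4), ("infer", 2), ("big", 8)], nodes)
# "train" lands on gpu-b (tightest fit), "infer" on gpu-a,
# and "big" is left pending because no node has 8 free GPUs.
```

Real schedulers (e.g., the Kubernetes scheduler with device plugins) add preemption, affinity, and fair-share quotas on top of a placement policy like this.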
Posted in: DevOps / SRE
Functional Area: DevOps / Cloud
Job Code: 1583722