Posted on: 10/07/2025
Job Description :
Key Responsibilities :
workloads.
- Lead the deployment and integration of HPC systems across on-premise, hybrid, and multi-cloud environments.
- Work with cross-functional teams to define and implement HPC strategies, performance tuning, and capacity planning.
- Deploy and manage parallel file systems such as GPFS and Lustre, and optimize storage solutions for large-scale workloads.
- Implement and maintain cluster management and job scheduling tools like Slurm, Torque, LSF, or PBSPro.
- Evaluate and integrate networking technologies such as InfiniBand, RDMA, and other high-throughput interconnects for maximum performance.
- Enable and support containerized HPC environments using Docker, Kubernetes, and Singularity.
- Collaborate with stakeholders to understand workloads and recommend architecture changes or new technologies.
- Develop documentation, including architecture designs, operational guides, and system configurations.
- Stay current with emerging HPC technologies, trends, and best practices.
Required Qualifications & Skills :
- 7+ years of experience designing and managing HPC systems and environments.
- Proven experience in :
1. HPC architecture and cluster design
2. Cloud-based HPC solutions (AWS, Azure, GCP)
3. On-premises and private cloud HPC implementations
- Expertise in parallel computing technologies such as MPI and OpenMP.
- Hands-on experience with high-performance file systems such as GPFS and Lustre.
- Familiarity with job schedulers such as Slurm, Torque, PBSPro, or LSF.
- Proficiency in containerization and orchestration tools such as Docker, Kubernetes, and Singularity.
- Strong knowledge of networking protocols and performance tuning (TCP/IP, InfiniBand, RDMA).
- Experience in at least one programming language: C, C++, Fortran, Python, or Java.
- Exposure to multi-vendor hardware environments and experience working in multi-cloud settings.
Soft Skills :
- Ability to explain technical concepts to both technical and non-technical stakeholders.
- Strong problem-solving and analytical thinking capabilities.
- Ability to thrive under pressure in a fast-paced and complex environment.
- Team-oriented mindset with leadership potential.
Preferred (Nice to Have) :
- Experience supporting research, simulation, or AI/ML workloads on HPC clusters.
Posted in: DevOps / SRE
Functional Area: IT Infrastructure Services
Job Code: 1509960