Posted on: 05/08/2025
Location : INDIA (CANDIDATE MUST BE COMFORTABLE RELOCATING TO THE UAE)
Job Summary :
You will play a key role in ensuring the performance, reliability, and scalability of compute and storage infrastructure. The role includes managing incident response, service requests, and changes across HPC environments in managed service settings.
Roles and Responsibilities :
- Configure and optimize InfiniBand and Ethernet switches, routers, and interconnects.
- Ensure high availability, redundancy, and fault tolerance in HPC systems.
- Deploy and maintain HPC clusters, monitor job scheduling, and ensure optimal system health.
- Troubleshoot compute node hardware/software issues and implement performance improvements.
- Maintain storage systems (Ceph, Vast Data, Lustre, GPFS, NFS, GlusterFS) with fast, reliable access from clusters.
- Configure and manage InfiniBand fabrics; upgrade firmware and monitor performance.
- Use tools like Grafana, Prometheus, Ganglia, and UFM for cluster and network monitoring.
- Work closely with researchers and data scientists to support HPC/AI workloads.
- Assist in debugging, tuning, and optimizing distributed applications.
- Create and maintain HLD and LLD documentation.
Required Experience :
- Strong background in data center operations: servers, switches, routers, and storage.
- Proficient in NVIDIA/Mellanox (Cumulus Linux) switch configuration and troubleshooting.
- Hands-on experience with monitoring tools : Prometheus, Grafana, Elastic Observability.
- Experience with HPC schedulers : SLURM, PBS, or Torque.
- Kubernetes environment setup and maintenance experience.
- Familiar with ML and data science workflows in HPC/AI environments.
- Strong Linux administration experience.
Skills & Knowledge :
- Proficiency in distributed storage and file systems.
- Expertise in diagnosing and resolving complex infrastructure issues.
- Collaborative team player with strong communication skills.
- Capable of documenting and designing complex systems architecture.
Qualifications :
Certifications (Preferred) :
- Cisco Certified Network Associate (CCNA)
- AWS Certified Solutions Architect
Posted By
Vinod T Pilla
HR & Admin Manager at Meridian Placements Services Private Limited
Last Active: 6 Aug 2025
Posted in
DevOps / SRE
Functional Area
IT Infrastructure Services
Job Code
1524902