About the Role :
- We are seeking an exceptional and highly motivated HPC System R&D Engineer to join our team.
- In this role, you will be instrumental in developing and demonstrating next-generation HPC technologies, specifically focusing on their deployment and scalability across both on-premise and cloud infrastructures.
- You will identify limitations in current CPU and GPU-based solutions for large-scale AI deployments and architect innovative distributed frameworks and system-level solutions to overcome these challenges.
- Your work will directly impact the development of KLA's future tools, enabling breakthroughs in process control.
Responsibilities :
- Critically analyze existing HPC solutions based on CPU and GPU clusters to pinpoint bottlenecks and limitations in deploying AI-based solutions at scale on both on-premise and cloud infrastructures.
- Design, develop, and implement distributed frameworks and system-level solutions that enable seamless scaling of image processing and AI workloads from single GPUs to multi-node clusters with numerous GPUs.
- Focus on the challenges and opportunities of deploying HPC workloads in cloud environments, including resource management, scalability, and cost optimization.
- Install, benchmark, and rigorously evaluate pre-release hardware (CPUs, GPUs, interconnects, etc.) to assess its suitability for next-generation KLA tools, including identifying or developing relevant workloads for early-stage evaluation and prototyping.
- Conduct in-depth performance analysis of hardware and software stacks, identifying areas for optimization and improvement.
- Build functional prototypes and demonstrations of developed technologies on on-premise testbed clusters, paving the way for their integration into future KLA tools.
Qualifications :
- Master's or PhD in Computer Science, Electrical Engineering, or a closely related field.
- Exceptional Bachelor's degree holders with significant relevant experience and an extraordinary track record will also be considered.
- Deep understanding of operating systems (Linux internals preferred), computer networks (high-speed interconnects like InfiniBand), and high-performance computing applications.
- Strong mental model of the architecture of modern distributed systems, including a comprehensive understanding of CPUs, GPUs, and various hardware accelerators.
- Proven experience with the deployment and scaling of deep-learning frameworks such as TensorFlow and PyTorch on large-scale on-premise or cloud infrastructures.
- Strong background in modern and advanced C++ concepts (including parallel programming paradigms).
- Excellent scripting skills in Bash, Python, or similar languages for automation, system administration, and data analysis.
- Good verbal and written communication skills to effectively collaborate with a diverse team and present technical findings.
Things to Make Us Go Wow!
- Experience with the development and training of deep learning models using frameworks such as TensorFlow and PyTorch.
- Experience with building or significantly contributing to open-source operating systems and the software stack on pre-release hardware.
- Solid understanding of container infrastructure such as Docker or Singularity, and container orchestration platforms like Kubernetes for managing HPC workloads in the cloud.
- Active participation in C++ standards bodies or similar technical communities.
- Deep understanding and practical experience with HPC services offered by major cloud providers (AWS, Azure, GCP).
- Demonstrated expertise in performance profiling, tuning, and optimization of HPC applications.
- In-depth knowledge of high-speed interconnect technologies like InfiniBand and their impact on distributed application performance.
Location :
- Noida, Uttar Pradesh, India (Based on the current location)
Benefits :
- Opportunity to be at the forefront of HPC and AI innovation for a world-leading technology company.
- Work with a team of extraordinary engineers and researchers.
- Access to state-of-the-art on-premise and cloud HPC infrastructure.
- Make a significant impact on the future of semiconductor manufacturing.
Posted in : DevOps / SRE
Functional Area : IT Infrastructure Services
Job Code : 1533411